Study Shows Companies Struggle with Big Data Management Performance Issues Because of Bad Data
|Stuart Parkerson in Big Data Thursday, June 23, 2016|
StreamSets has announced the results of a survey that examined the impact of bad data on data management performance. The survey was conducted by Dimensional Research and included responses from 314 data management professionals globally.
The primary research goal was to capture how companies manage the flow of big data. The research also investigated and documented current tools’ capabilities, data quality and efforts to maintain big data pipelines and infrastructure.
The survey revealed pervasive data pollution: companies struggle with a range of data performance management issues, from the inability to stop bad data from entering their stores to keeping data flows operating effectively. Ninety percent of respondents reported that bad data flows into their data stores, while just 12 percent consider themselves good at the key aspects of data flow performance management.
Among the findings:
- Ensuring data quality is the most common challenge faced when managing big data flows (68 percent).
- 74 percent of organizations reported currently having bad data in their stores, despite cleansing data throughout the data lifecycle.
- While 69 percent of organizations consider the ability to detect diverging data values in flow as "valuable" or "very valuable," only 34 percent rated themselves as "good" or "excellent" at detecting those changes.
- Only 12 percent of respondents rated themselves as "good" or "excellent" across all five performance management areas: detecting pipeline-down events, throughput degradation, error rate increases, data value divergence and personally identifiable information (PII) violations.
- Respondents felt weakest at detecting performance degradation (44 percent rated themselves "good" or "excellent"), error rate increases (44 percent) and divergent data (34 percent).
- Detecting a "pipeline down" event was the only area where a large majority (66 percent) felt positively about their capabilities.
- Regarding pipeline changes driven by data drift, 85 percent said that unexpected changes to data structure or semantics create a substantial operational impact. Over half (53 percent) reported that they have to alter each data flow pipeline several times a month, with 23 percent making changes several times a week or more.
- Nearly two-thirds of respondents use ETL/data integration tools, and 77 percent use hand coding to design their data pipelines.
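To make the "detecting divergent data values in flow" finding concrete, here is a minimal sketch (not from the survey or from StreamSets' product) of one way such a check can work: comparing a new batch of values against a baseline using a crude z-score on the batch mean. The function name, threshold and sample values are illustrative assumptions.

```python
from statistics import mean, stdev

def detect_value_drift(baseline, batch, threshold=3.0):
    """Flag a batch whose mean diverges from the baseline mean by more
    than `threshold` baseline standard deviations (a simple z-score)."""
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        # Constant baseline: any different batch mean counts as drift.
        return mean(batch) != mu
    z = abs(mean(batch) - mu) / sigma
    return z > threshold

baseline = [10.0, 11.0, 9.5, 10.5, 10.0, 9.8, 10.2]
steady = [10.1, 9.9, 10.3]    # values near the baseline: no drift
drifted = [42.0, 45.0, 41.5]  # values far from the baseline: drift

print(detect_value_drift(baseline, steady))   # False
print(detect_value_drift(baseline, drifted))  # True
```

Real pipelines track many fields and use richer statistics (distribution tests, null-rate and schema checks), but the survey's point stands either way: most organizations rate themselves poorly at exactly this kind of detection.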
Read more: https://19ttqs47cfw33zkecq3dz58m-wpengine.netdna-s...