3/4/2016 10:21:08 AM
How StreamSets Simplifies Setting Up New Ingest Pipelines
App Developer Magazine

Big Data



Friday, March 4, 2016

Stuart Parkerson


We recently visited with Girish Pancha, CEO of StreamSets, to talk about how the company’s open source software is used to build and operate reliable ingest pipelines. StreamSets’ Data Collector is a low-latency ingest infrastructure tool that provides the ability to create continuous data ingest pipelines using a drag-and-drop UI within an integrated development environment (IDE).

ADM: How is StreamSets different from other data services and solutions available in the same space?

Pancha: For developers, StreamSets greatly simplifies setting up new ingest pipelines. The service makes it very simple to plug in a new component, test it, then place it into production, without having to write custom code. It runs on organizations’ existing clusters and operates 100 percent in-memory, so it scales in lockstep with the rest of a customer’s infrastructure.

For operators, StreamSets provides complete control over, and visibility into, customers’ Big Data pipelines. The StreamSets Data Collector gives visibility to data in motion by assessing the quality of both the data flow and the data itself at each step of the ingest pipeline, testing data for anomalous conditions, alerting on these conditions and re-routing for automated cleansing. 
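The idea of testing records for anomalous conditions and re-routing them for cleansing can be sketched in plain Python. This is an illustration only and does not use any StreamSets API; the function names (`route_records`, `missing_or_negative_amount`) and the `amount` field are hypothetical.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

Record = Dict[str, Any]


def route_records(
    records: Iterable[Record],
    is_anomalous: Callable[[Record], bool],
) -> Tuple[List[Record], List[Record]]:
    """Split records into a clean stream and a stream set aside for cleansing."""
    clean: List[Record] = []
    to_cleanse: List[Record] = []
    for record in records:
        if is_anomalous(record):
            to_cleanse.append(record)
        else:
            clean.append(record)
    return clean, to_cleanse


def missing_or_negative_amount(record: Record) -> bool:
    """Hypothetical check: 'amount' must be present, numeric and non-negative."""
    amount = record.get("amount")
    return not isinstance(amount, (int, float)) or amount < 0


incoming = [{"amount": 5}, {"amount": -1}, {"id": 3}]
clean, to_cleanse = route_records(incoming, missing_or_negative_amount)
print(len(clean), len(to_cleanse))  # prints: 1 2
```

In a real pipeline the second stream would feed a cleansing step or trigger an alert rather than being held in a list.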

Other offerings in this space cannot inspect the data in real time, which means that they cannot provide early warning or automated handling, and they have limited value in helping organizations detect and get to the root cause of data accuracy problems. StreamSets’ solution can provide an early warning when data structure or schema starts to drift. Data drift often means that the data delivered for downstream analysis is incorrect.
 
Ultimately, the StreamSets Data Collector can detect data drift whereas other solutions cannot.

ADM: What is data drift and how is it impacting organizations?

Pancha: Data drift is defined as the unpredictable, unannounced and unending mutation of data schema and semantics caused by the constant changes in systems generating the data at the source. It’s a natural consequence of an increasing reliance on “other people’s systems” for data generation. These systems are locally optimized upstream; however, upstream optimization impacts downstream data integrity. The following data integrity issues often arise:

- Data loss: The data no longer fits what organizations expect. 

- Data corrosion: The meaning has changed but downstream systems are not aware of the semantic drift.
  
When organizations experience data drift, it impacts them in multiple ways. They pass bad data to consuming apps, which then generate incorrect insights. When incorrect insights arise, organizations’ trust in the whole data operation can start to erode. Data engineers and scientists spend too much time finding and fixing these data issues (“janitorial” or “fire-fighting” tasks), which adds cost and makes those employees less available to meet new business requirements.

ADM: How does StreamSets solve this issue of data drift? 

Pancha: As StreamSets watches the data flows, it can inspect the actual data records, detecting anomalies as data drifts from historical patterns. Then the StreamSets user can decide how to address the drift, perhaps by transforming the data in-stream so it arrives consumable despite the drift, or adjusting downstream applications to the new reality.
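As a rough illustration of detecting drift from a historical pattern, the sketch below compares the shape of an incoming record against a baseline shape. It is plain Python under simplifying assumptions, not StreamSets' actual detection logic; `infer_shape`, `detect_drift` and the sample fields are hypothetical. It catches only structural drift (missing, new or retyped fields), not changes in meaning.

```python
from typing import Any, Dict, List


def infer_shape(record: Dict[str, Any]) -> Dict[str, str]:
    """Map each field name to the name of its Python type."""
    return {field: type(value).__name__ for field, value in record.items()}


def detect_drift(baseline: Dict[str, str], record: Dict[str, Any]) -> List[str]:
    """Describe how a record's shape differs from the baseline shape."""
    shape = infer_shape(record)
    findings: List[str] = []
    for field in sorted(baseline.keys() - shape.keys()):
        findings.append(f"missing field: {field}")
    for field in sorted(shape.keys() - baseline.keys()):
        findings.append(f"new field: {field}")
    for field in sorted(baseline.keys() & shape.keys()):
        if baseline[field] != shape[field]:
            findings.append(
                f"type change: {field} {baseline[field]} -> {shape[field]}"
            )
    return findings


# Baseline learned from a historical record; the new record has drifted.
baseline = infer_shape({"user_id": 42, "country": "US"})
drifted = {"user_id": "42", "region": "EMEA"}
for finding in detect_drift(baseline, drifted):
    print(finding)
```

Each finding could then drive one of the responses described above: an in-stream transform that restores the expected shape, or a change to the downstream application.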
 
ADM: Who can benefit from using StreamSets and what are the advantages?

Pancha: StreamSets can be thought of as general purpose ingest infrastructure. Any enterprise that ingests data, whether streaming or batch, can benefit from StreamSets’ solution. Data engineers get a visual UI to connect data sources to destinations and apply a variety of built-in transformations to sanitize the data while it is in motion. Customers can also script custom transforms. StreamSets is resistant to data drift because, unlike traditional ETL tools, it doesn’t rely on schema and uses a standard record format to facilitate complete visibility into the data flow and automated handling of drift conditions.

Ultimately, customers make their data engineers more productive by reducing hand coding; they receive an ingest system that reliably delivers clean, ready-to-use data; and they gain the ability to monitor data flow in real time, including early warning and the ability to diagnose each step in the pipeline when trouble occurs.

ADM: What projects or use cases do organizations use StreamSets for?

Pancha: StreamSets provides the highest value in cases where customers have numerous evolving source endpoints that deliver data that must arrive in a high-quality state for immediate analysis. Organizations that adopt StreamSets value its ability to both simplify pipeline development and provide visibility and control over the operation of those pipelines.

As an example, Lithium uses the StreamSets Data Collector with its messaging fabric to enable near-real-time data flow. StreamSets is a binding agent that helps them un-silo, transform and route their data across hundreds of nodes with visibility and control. 

Another example is Cisco, which uses StreamSets as part of its InterCloud offering and values its ability to automatically handle infrastructure changes and to provide intelligent monitoring and dynamic shaping of internal operational logs and multi-datacenter ingestion logs.

Four patterns we see in the market are: ingestion to Hadoop, ingestion to big data search tools, ingestion of log data, and a code-free way to produce to and consume from Kafka.

ADM: How and why did you come up with this solution? Has anything like it been available in the past?

Pancha: My partner, Arvind Prabhakar, and I came up with StreamSets because we felt that the legacy technologies were not addressing the pain points big data users were experiencing as they struggled to onboard data. We are just now getting to a maturity level where enterprises are taking on how to operationalize Big Data ingest, and not just treating it as a series of projects.

This maturity was achieved previously for traditional data warehouses via data integration tools, but those solutions are not fit for purpose in a real-time big data world plagued by data drift.






Read more: https://streamsets.com/



