
How Apache Kafka is Fundamentally Changing the Streaming of Big Data


Tuesday, January 12, 2016

Richard Harris

LinkedIn, Netflix and Uber are just a few of the companies using Apache Kafka, an open source messaging system that facilitates real-time data streams.

We recently visited with Jay Kreps, co-founder and CEO of Confluent, which he founded with other members of the team that built Kafka at LinkedIn. Jay is the original author of several open source projects, including Apache Kafka, Apache Samza, Voldemort, and Azkaban.

With the Confluent platform, the team has created a stream data platform that helps other companies leverage Kafka to access enterprise data as real-time streams. In mid-December, Confluent announced the general availability of Confluent Platform 2.0, which adds new features to enable the operation of Apache Kafka as a centralized stream data platform.

ADM: What is streaming data?

Kreps: Think of it as a way of dealing with data that treats it as a continuous, unbounded stream. A surprising amount of data processing today is still built in a once-a-day, batch fashion, where data is collected into some store - say, a Hadoop cluster or data warehouse - and processed periodically. This batch-oriented approach is quite limiting for many modern applications, which are necessarily much more real-time.

The ecosystem growing around streaming data and stream processing is an attempt to take a lot of the advantages of batch processing systems and extend them into a low-latency domain that updates continuously in milliseconds rather than once a day.

ADM: Who is using Kafka? How?

Kreps: Kafka is a popular open source system for managing real-time streams of data from websites, applications and sensors. It is now used as fundamental infrastructure by thousands of companies, ranging from LinkedIn, Netflix and Uber to Cisco, Microsoft and Goldman Sachs.

Kafka’s influence and impact continue to grow. Today it powers every part of LinkedIn’s business, where it has scaled to more than 1.2 trillion messages per day. It also drives Microsoft’s Bing, Ads and Office operations at more than 1 trillion messages per day.

ADM: How are organizations benefitting from Kafka? What problem is it solving?

Kreps: There are two fundamental patterns of usage for Kafka. The first is as a big globally distributed data pipeline that captures data from logs, sensors, databases, and so on, and makes this data available to the rest of the organization. This lets any service, business unit, database or application publish a stream of “messages” or “events” into Kafka without needing to concern itself with which systems might consume this stream.
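
To make the publishing side concrete, here is a minimal sketch of a Java producer, assuming Kafka's standard Java client; the broker address, topic name and event contents are placeholders. The point to notice is that the producer publishes events without knowing anything about the systems that will consume them.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class EventPublisher {
        public static void main(String[] args) {
            // Minimal producer configuration; the broker address is a placeholder.
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

            try (Producer<String, String> producer = new KafkaProducer<>(props)) {
                // Publish an event to the "page-views" topic (a placeholder name).
                // The producer neither knows nor cares who will consume it.
                producer.send(new ProducerRecord<>("page-views", "user42", "viewed /pricing"));
            }
        }
    }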

We see people building Kafka-based data pipelines to feed Hadoop clusters and data warehouses, to feed real-time monitoring systems, to synchronize data between disparate business units, and to feed data back and forth between on-premise apps and public cloud deployments.

The strength of Kafka as a data pipeline is that it is easy to scale and covers the full set of demands for data transport. It is low latency, so it can serve production applications and real-time monitoring and analytics systems that need results quickly.

Kafka scales horizontally on commodity hardware so it can support massive log, sensor and event streams. It offers strong consistency and fault-tolerance guarantees that make it suitable for critical data streams. It can easily persist large volumes of data, which allows it to support batch systems like Hadoop clusters or data warehouses. 

There were previous systems that covered individual aspects of this - that is, they might be low latency, fault tolerant, scalable in storage or capable of high throughput - but nothing that did all of these together. The result was that companies often ended up with several of these systems jury-rigged together.

The second use for Kafka is something this type of real-time data pipeline enables, namely real-time stream processing. Having a real-time stream in Kafka allows applications to tap into this data pipeline and react to the stream of events in real time. These stream processing applications may execute simple business logic or complex machine learning algorithms. They may be built as stand-alone applications or using a stream processing framework such as Storm or Spark Streaming. 

We see users building stream processing applications to detect fraud, munge together disparate data sources for storage in other data systems, or handle any of hundreds of other business-specific uses.
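
As a sketch of what a simple stand-alone stream processing application can look like, the loop below consumes payment events from one topic, applies a trivial stand-in for real business logic, and republishes suspicious events to an alerts topic. The broker address, topic names, group id and the isSuspicious check are all hypothetical.

    import java.util.Arrays;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class FraudDetector {
        public static void main(String[] args) {
            Properties consumerProps = new Properties();
            consumerProps.put("bootstrap.servers", "broker1:9092"); // placeholder broker
            consumerProps.put("group.id", "fraud-detector");        // hypothetical group id
            consumerProps.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            consumerProps.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

            Properties producerProps = new Properties();
            producerProps.put("bootstrap.servers", "broker1:9092");
            producerProps.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            producerProps.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
                 Producer<String, String> producer = new KafkaProducer<>(producerProps)) {
                consumer.subscribe(Arrays.asList("payments"));       // hypothetical input topic
                while (true) {
                    // React to each new event as it arrives in the stream.
                    ConsumerRecords<String, String> records = consumer.poll(1000);
                    for (ConsumerRecord<String, String> record : records) {
                        if (isSuspicious(record.value())) {
                            producer.send(new ProducerRecord<>(
                                "fraud-alerts", record.key(), record.value()));
                        }
                    }
                }
            }
        }

        // Hypothetical stand-in for real fraud-scoring logic.
        private static boolean isSuspicious(String payment) {
            return payment.contains("amount=9999");
        }
    }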

ADM: How will the Confluent Platform help developers better integrate Kafka within their organization?

Kreps: The founders of Confluent gained a lot of experience building out a massive data pipeline and stream processing infrastructure at LinkedIn and in working with some of Silicon Valley’s biggest tech companies to build similar infrastructure. What we found was that all these big deployments ended up building lots of customized supporting tools around Kafka. We learned a lot of best practices watching these in-house Kafka stacks evolve.

The Confluent Platform is our effort to rebuild that complete streaming data platform as an off-the-shelf product. It complements Kafka with a set of open source components that help you roll it out in your organization with a minimum of in-house tool development. It includes support for managing data formats, clients and REST access, connectors to common data systems, and so on.

ADM: What are some of the new features for developers?

Kreps: There is a range of new features aimed at developers, including a new Java consumer that simplifies the development, deployment, scaling and maintenance of consumer applications using Kafka; a supported C client with fully featured producer and consumer implementations; and more than 500 other individual performance, operational and management improvements.
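
A minimal sketch of the new Java consumer follows; the broker address, group id and topic are placeholders. One concrete simplification worth noting: the new consumer talks only to the Kafka brokers, which handle group coordination and partition assignment, rather than requiring a direct ZooKeeper connection.

    import java.util.Arrays;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class StreamReader {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // placeholder broker
            props.put("group.id", "reader-group");          // hypothetical group id
            props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                // Subscribe and poll; the consumer group protocol takes care of
                // balancing partitions across application instances.
                consumer.subscribe(Arrays.asList("page-views")); // placeholder topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(1000);
                    for (ConsumerRecord<String, String> r : records)
                        System.out.printf("%s offset=%d %s=%s%n",
                            r.topic(), r.offset(), r.key(), r.value());
                }
            }
        }
    }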

ADM: What are the benefits for enterprises? How does the latest update help them?

Kreps: Version 2.0 of Confluent Platform has several new features that specifically address the unique security and authentication needs of large enterprises that must handle highly sensitive personal, financial or medical data, and operate multi-tenant environments with strict quality of service standards. 

Features include data encryption capabilities; authentication and authorization, allowing access control with permissions set on a per-user or per-application basis; and configurable quotas that allow throttling of reads and writes to ensure quality of service.
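
For a flavor of what this looks like from the client side, here is a sketch of the SSL-related client settings, assuming the standard Java clients; the port, paths and passwords are placeholders, and the matching broker-side listener, certificate and ACL setup is assumed to be in place. These properties would simply be merged into a producer or consumer configuration like the ones shown earlier.

    import java.util.Properties;

    public class SecureClientConfig {
        public static Properties sslProps() {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9093");  // placeholder SSL listener
            // Encrypt traffic between the client and the brokers.
            props.put("security.protocol", "SSL");
            // Trust store used to verify the brokers' certificates (placeholder path).
            props.put("ssl.truststore.location", "/etc/kafka/client.truststore.jks");
            props.put("ssl.truststore.password", "changeit");  // placeholder password
            // Key store presented to the broker when client authentication
            // is required (placeholder path and password).
            props.put("ssl.keystore.location", "/etc/kafka/client.keystore.jks");
            props.put("ssl.keystore.password", "changeit");
            return props;
        }
    }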

ADM: What types of customers benefit most directly from the new robust security features?

Kreps: All enterprise customers will benefit from these new security features, but large financial services firms and healthcare IT companies in particular will now be able to leverage Kafka for real-time insights while keeping security and compliance requirements top of mind. 

ADM: How will organizations use Kafka connectors? Are there any organizations already working to develop connectors? 

Kreps: Kafka Connect is a new connector-driven data integration feature that facilitates large-scale, real-time data import and export for Kafka, enabling developers to easily integrate various data sources with Kafka without writing code. It offers a 24/7 production environment, including automatic fault tolerance; transparent, high-capacity scale-out; and centralized management - meaning more efficient use of IT staff. We’ve already seen a huge amount of interest, with several companies beginning development around this framework even before its first release.
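
To illustrate the no-code experience, here is a sketch of a connector configuration using the simple file source connector that ships with Kafka Connect; the file path and topic name are placeholders. Run with the standalone worker script (bin/connect-standalone.sh, alongside a worker configuration file), it streams lines appended to a local file into a Kafka topic.

    # file-source.properties: stream lines of a local file into a Kafka topic.
    # The file path and topic name below are placeholders.
    name=local-file-source
    connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
    tasks.max=1
    file=/var/log/app.log
    topic=app-logs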

ADM: Where do you see the evolution of the Kafka ecosystem one year from now?  

Kreps: Kafka is growing into its role at the center of the streaming ecosystem. We at Confluent will be working on improving the ecosystem of clients and connectors that help people plug into Kafka, helping to improve the ecosystem for stream processing, and helping to advance the state of operational tools for running Kafka at scale.



Read more: http://www.confluent.io/
