Big Data

Ben Markham Chats About Being a Big Data Developer With Apache Hadoop

Wednesday, August 17, 2016

We recently sat down with Ben Markham, Big Data Architect at Xiilab and ODPi member, to talk about developing sensor data solutions, working with Hadoop, what the big data market looks like in South Korea, and what it is like to develop for a South Korean company.

ADM: You are a developer for Xiilab. What do you do for them?

Markham: I develop applications with Spring and Java EE that integrate with commonly used Apache Hadoop distributions. Hortonworks Data Platform is used the most, but these applications can integrate with other Apache Hadoop distributions. I've built applications demonstrating the power of IoT, for example taking raw data and displaying it on graphs, or providing algorithms that predict future data, such as how much your electric bill might be based on how much energy your house and its appliances are using.
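As a rough illustration of the kind of prediction Markham mentions, the Java sketch below projects a bill by scaling the average consumption observed so far to a full billing period. It is not Xiilab's algorithm: the class name `BillEstimator`, the hourly kWh readings, the 720-hour period, and the flat price per kWh are all assumptions made for the example.

```java
import java.util.List;

/** Hypothetical helper: projects a bill from partial meter readings. */
final class BillEstimator {

    /**
     * Naive extrapolation (not a trained model): average the hourly
     * consumption seen so far and scale it to the whole billing period.
     *
     * @param hourlyKwh     energy used in each hour observed so far
     * @param hoursInPeriod total hours in the billing period (assumed input)
     * @param pricePerKwh   assumed flat tariff per kWh
     */
    static double projectBill(List<Double> hourlyKwh, int hoursInPeriod, double pricePerKwh) {
        if (hourlyKwh.isEmpty()) {
            throw new IllegalArgumentException("need at least one reading");
        }
        double total = 0.0;
        for (double kwh : hourlyKwh) {
            total += kwh;
        }
        double averagePerHour = total / hourlyKwh.size();
        return averagePerHour * hoursInPeriod * pricePerKwh;
    }

    public static void main(String[] args) {
        // Four hypothetical hourly readings, a 30-day period, 0.25 per kWh.
        List<Double> readings = List.of(0.5, 1.0, 1.5, 1.0);
        System.out.println(projectBill(readings, 720, 0.25));
    }
}
```

A real application would likely account for time-of-day patterns and tiered tariffs; the flat average here only shows the shape of the calculation.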

ADM: At Xiilab you work with sensor data solutions. Tell me more about working with that technology.

Markham: Sensors are a fun and interesting piece of technology to work with. The most difficult aspect of working with sensors is figuring out what to do with the raw data that comes from the sensor so you can provide accurate results for whatever it is you are making. You have to decide things such as how frequently you want to ingest data, how to store it, and how long you want to store it. These days, streaming data is all the rage, so now you have different streaming options to choose from. All of these seemingly simple decisions can ruin an application and its desired results if they are not thought out and planned carefully.
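One way to picture the ingest-frequency and storage trade-off is downsampling: raw readings are averaged into fixed time windows before they are stored. The Java sketch below is only an illustration under assumed inputs (timestamps in epoch seconds, a hypothetical `SensorDownsampler` class, and a caller-chosen window size); it is not a description of how Xiilab handles sensor data.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: averages raw sensor readings into fixed time windows. */
final class SensorDownsampler {

    /**
     * Groups readings by window start (epoch seconds) and averages each group.
     * A larger window means less data to store, at the cost of detail.
     */
    static Map<Long, Double> averageByWindow(long[] epochSeconds, double[] values, long windowSeconds) {
        if (epochSeconds.length != values.length) {
            throw new IllegalArgumentException("timestamps and values must have the same length");
        }
        if (windowSeconds <= 0) {
            throw new IllegalArgumentException("window size must be positive");
        }
        // Each accumulator holds {sum, count} for one window.
        Map<Long, double[]> accumulators = new TreeMap<>();
        for (int i = 0; i < values.length; i++) {
            long windowStart = Math.floorDiv(epochSeconds[i], windowSeconds) * windowSeconds;
            double[] acc = accumulators.computeIfAbsent(windowStart, k -> new double[2]);
            acc[0] += values[i];
            acc[1] += 1;
        }
        Map<Long, Double> averages = new TreeMap<>();
        for (Map.Entry<Long, double[]> entry : accumulators.entrySet()) {
            averages.put(entry.getKey(), entry.getValue()[0] / entry.getValue()[1]);
        }
        return averages;
    }

    public static void main(String[] args) {
        long[] timestamps = {0, 30, 60, 90};
        double[] readings = {1.0, 3.0, 10.0, 20.0};
        // With 60-second windows, four raw readings collapse into two stored values.
        System.out.println(averageByWindow(timestamps, readings, 60));
    }
}
```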

ADM: What is the biggest challenge you face as a developer of big data technology?

Markham: Choosing a specific API for development is the biggest challenge I come across. Take IoT, for example: do I use Storm, Kafka, Spark Streaming, or NiFi? The same goes for security. Ranger? Knox? Sentry? Which one do I use? I need to know which API is the best to use so I don’t have to make changes later down the road.

ADM: You are involved in ODPi. How will standardizing Apache Hadoop distributions affect your daily work?

Markham: Standardizing Apache Hadoop distributions would give me a bit of relief as a developer. I put a lot of time and effort into working with many different distributions because they are all different. You have to make sure APIs work across multiple Hadoop distributions. It’s exhausting to learn and care about. Focusing on just one distribution model would allow me to focus more on the applications I am creating for my company. Having a standard would also help the innovation of Apache Hadoop distributions because it would encourage many great minds to come together as one instead of everyone doing their own thing for niche situations.

ADM: You mentioned it is important to ensure APIs work across multiple Apache Hadoop distributions. Why is this important?

Markham: Right now, there’s no way to ensure an application will run across all distributions. A developer will have to do all kinds of things to make sure their app is portable across distributions.

Let’s take Apache Hive, for example. You may have to alter your Apache Hive queries because some distributions only ship MapReduce and not Apache Tez. If it’s just a couple of queries, then it’s not so bad, but you wish you could write them one time and not worry about it.
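To make the Hive example concrete, here is a minimal Java sketch of the kind of workaround a developer might write. It prefixes a query with Hive's `hive.execution.engine` setting, falling back to MapReduce (`mr`) when Tez is not available. The class name `HiveEngineSelector` and the idea of passing in a set of available engines are assumptions for illustration; how an application actually discovers what a given distribution ships is left open.

```java
import java.util.Set;

/** Hypothetical helper: adapts a Hive query to the engines a distribution ships. */
final class HiveEngineSelector {

    /** Prefers Tez when the (assumed) set of available engines includes it, else MapReduce. */
    static String chooseEngine(Set<String> availableEngines) {
        return availableEngines.contains("tez") ? "tez" : "mr";
    }

    /** Prepends a SET statement for hive.execution.engine to the query text. */
    static String withEngine(String query, Set<String> availableEngines) {
        return "SET hive.execution.engine=" + chooseEngine(availableEngines) + ";\n" + query;
    }

    public static void main(String[] args) {
        // Table and column names are made up for the example.
        String query = "SELECT sensor_id, AVG(reading) FROM sensor_data GROUP BY sensor_id;";
        System.out.println(withEngine(query, Set.of("mr")));
        System.out.println(withEngine(query, Set.of("mr", "tez")));
    }
}
```

Engine selection is only one of the differences an application may need to absorb, and this kind of glue code is what a common runtime standard would make unnecessary.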

ADM: You mentioned Apache Hadoop features can vary significantly depending on the distribution provider. How do they vary?

Markham: I think the most significant difference between Apache Hadoop distributions is open source vs. open core. On the one hand, you have open source, which is backed by a worldwide community that shares its knowledge and innovation. On the other hand, you have open core, which only a niche community understands, and training for this community is very limited. As a developer who builds on top of Apache Hadoop, making sure my applications run easily on the open core distributions is difficult because the answers to the questions I come up against are not easily available. This slows my development.

ADM: Xiilab is based in South Korea. What is the big data market like over there?

Markham: Apache Hadoop and other big data technologies are still very new in South Korea. There’s so much room for growth and it is so exciting to see people learning more about Apache Hadoop. There aren’t many companies using Apache Hadoop to its full potential, so we at Xiilab provide lectures to educate people and open their eyes to the larger big data world. Being the only ODPi member from South Korea, Xiilab tries to be a strong representative of the initiative and is a driving force for Apache Hadoop education in South Korea.

ADM: You mentioned lectures. What kind of people are attending the lectures and what materials are you teaching?

Markham: We give lectures for university students or anyone who wants to learn about Apache Hadoop. Like I said before, Apache Hadoop is still very new to South Korea, so we try to teach the basics of Apache Hadoop. When I give lectures, I like to tell the history of Apache Hadoop, how there came to be many different distributions, and how they can vary significantly depending on the distribution provider, which is why ODPi is important. I encourage students to go home and start experimenting with Hortonworks Sandbox because it’s free and allows them to have a legitimate Apache Hadoop cluster in their hands.

ADM: What do you see Hadoop and Big Data looking like 2 years from now?

Markham: With 5 Apache Hadoop distributions becoming ODPi Runtime compliant, I see a full standard for all distributions emerging in 2 years. From this, I see applications for YARN or IoT becoming much easier to develop. I would like to see all the great big data minds coming together as a community to focus on further innovating on one of the most innovative pieces of technology: Apache Hadoop.