Time series data guide for app developers

Friday, September 2, 2022

In this developer's guide to time-series data, Michael Gargiulo explains what every developer should know about time-series data, the popular data workload disrupting the database market. He also shares how time-series data allows us to develop deeper and more voluminous information and knowledge.

Time-series databases are one of the fastest-growing database categories today. As the cloud ushers in ever more "self-service" and innovation brings more computational power and storage capacity, the volume of data continues to increase, and many developers are trying to wrap their heads around how to handle time-series data at the database layer. Developers understand that time-series data is the fuel for some of the most interesting deep analytics and real-time data features they are trying to build into their applications, but it is also tricky to identify and handle, and it can be very expensive to get wrong.

So let’s take a look at what time-series data is, what makes these workloads different, and what requirements developers need to understand to be successful.

Time-stamped data is everywhere, all around us

Time-series data is everywhere; it is truly ubiquitous in industry and surrounds all of us. From Wall Street to Main Street, almost everyone uses or creates time-series data throughout the day. Whether you're driving to work in your car, riding the subway, listening to Spotify, or simply turning off your lights, time-series data is being generated and recorded. Time-series data helps us blend the physical with the digital world to improve efficiency, productivity, safety, health, and so many more applications that benefit every aspect of our lives. But what exactly is it?

"Time-series data" is often simply described as data with a timestamp. Expanding on this a bit, we can define "time-series data" as a sequence of metrics collected over time for one individual entity.

Easy enough, but you may be asking yourself: isn't that just about all data?

Well, you may be correct at a root level, but a time-series workload and its requirements involve more than raw time-stamped data points.

Let's look closer at an example.

In finance, you may want to record the closing price of an index, such as the NASDAQ, over time. A simplified version of this dataset may look like this:

time	NASDAQ closing price
2022-07-01 16:00:00.000	11566.50
2022-07-02 16:00:00.000	12000.98
2022-07-03 16:00:00.000	11789.55
2022-07-04 16:00:00.000	12036.24

This is an example of time-series data. It is one-dimensional, and the data points apply to a single entity, the NASDAQ, collected over time. Time is the primary dimension, and the timestamp uniquely identifies each data point in the series. However, it's unlikely that you are collecting data for just one entity; it's more likely that developers will be working with data spanning multiple entities and dimensions. This is referred to as "panel data", also commonly called "cross-sectional time-series data". It is ultimately just made up of many time series and thus can simply be defined as multi-dimensional time-series data.

Imagine now a more modern use case where we want to collect data about every security listed on the NASDAQ throughout the day to build an application that allows users to track and chart price movements and trading volume in real-time.

time	security symbol	trade price	volume
2022-07-01 11:05:00.000	MSFT	212.50	5
2022-07-01 11:05:01.000	MSFT	212.49	1
2022-07-01 11:05:01.000	AMZN	105.55	10
2022-07-01 11:06:00.000	IBM	36.54	1000
2022-07-01 11:06:01.000	AMZN	105.52	1

Our dataset is now much bigger, wider, and more powerful for users. It is multi-dimensional: time and security symbol are two distinct dimensions. The security symbol identifies a series, and the combination of symbol and time identifies each data point within it.

Just defining "time-series data" and "panel data" is not enough for us to truly understand everything that time-series data workloads encompass. Another important concept is "cardinality", which is defined as the number of unique series. One-dimensional time-series data forms a single series ordered by time, while in panel data each distinct combination of the other dimensions or metadata, such as a security symbol, identifies its own series. As we increase the amount of data, cardinality will often naturally increase too, which can be a demanding aspect of time-series data that requires special handling at the database layer.
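To make cardinality concrete, here is a minimal Python sketch that groups the sample trade rows from the table above into series and counts them. It assumes the security symbol is the only non-time dimension, so each symbol forms one series; the function name group_by_series is purely illustrative.

```python
from collections import defaultdict

# Rows copied from the trade table above:
# (time, security symbol, trade price, volume)
trades = [
    ("2022-07-01 11:05:00.000", "MSFT", 212.50, 5),
    ("2022-07-01 11:05:01.000", "MSFT", 212.49, 1),
    ("2022-07-01 11:05:01.000", "AMZN", 105.55, 10),
    ("2022-07-01 11:06:00.000", "IBM", 36.54, 1000),
    ("2022-07-01 11:06:01.000", "AMZN", 105.52, 1),
]


def group_by_series(rows):
    """Group rows by series key (assumed here to be the security symbol)."""
    series = defaultdict(list)
    for time, symbol, price, volume in rows:
        series[symbol].append((time, price, volume))
    return dict(series)


series = group_by_series(trades)
cardinality = len(series)  # number of unique series

print(cardinality)     # 3
print(sorted(series))  # ['AMZN', 'IBM', 'MSFT']
```

With more dimensions (for example, exchange plus symbol), the series key would become a tuple of those dimensions, and cardinality would grow with the number of distinct combinations.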

While "panel data" and "time-series data" are distinctly classified types of data, the two terms are often used interchangeably because they have many commonalities, especially in their workload characteristics, properties, and usage patterns, that unify them at the database layer and distinguish them from traditional workloads.

Time series databases

Time-series databases have been around for quite some time, so why are they continuing to skyrocket in popularity now?

At its core, it all goes back to our desire for information and knowledge, which can only be satisfied by collecting and analyzing more data. The more time-series data we collect, the more deeply we can analyze it and draw insights from how data changes over time. Time-series data is collected at low-level granularity, its continuous nature often means it is also high-volume, and it continues to grow rapidly as data points are collected. Ideally, it is collected with the intent to provide as much detail as possible when we ask questions about our data. This low-level granularity further compounds its complexity for developers, but it opens a wealth of possibilities for its usage and end-state. For example, sensors on a rocket may collect metrics such as temperature every nanosecond, and sensors on an F1 car collect dozens of metrics to track and analyze car performance during and after race laps. As we collect more data points with rich detail, we can draw more meaningful insights from the changes occurring, not just by identifying historical patterns but ultimately by forecasting and predicting future outcomes, and that is where the true value of time-series data lies.

Time-series databases have brought the attention, usability, and technological innovation needed to efficiently manage and process all of this high-volume, low-level granularity time-series data.

However, do we need them?

Different workload properties and demands

We constantly ask challenging questions of a database that stores time-series data.

Typically, we want to retrieve all of the information over some period of time for a given series so that we can drill down to meaningful insights about the changes over time. One example is calculating the exponential moving average for a security, which can help us potentially predict when the price of that security may rise or fall and by how much. This is where dealing with time-series data gets interesting.
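As an illustration of the kind of question asked of a series, the sketch below computes an exponential moving average over the sample NASDAQ closing prices from the first table. It uses one common formulation, with smoothing factor alpha = 2 / (span + 1) and the first observation as the seed; other conventions exist, and the function name and the choice of span are assumptions made for this example.

```python
def exponential_moving_average(values, span):
    """Return the EMA of values, seeded with the first observation.

    Uses ema[t] = alpha * values[t] + (1 - alpha) * ema[t - 1],
    with alpha = 2 / (span + 1) (a common convention, assumed here).
    """
    if span < 1:
        raise ValueError("span must be at least 1")
    alpha = 2 / (span + 1)
    ema = []
    for value in values:
        if not ema:
            ema.append(value)  # seed with the first data point
        else:
            ema.append(alpha * value + (1 - alpha) * ema[-1])
    return ema


# Closing prices from the sample table, in time order.
closes = [11566.50, 12000.98, 11789.55, 12036.24]

# span=3 gives alpha=0.5, so each new close gets half the weight.
ema = exponential_moving_average(closes, span=3)
for close, smoothed in zip(closes, ema):
    print(f"{close:10.2f} {smoothed:10.2f}")
```

Note that the calculation walks the whole series in time order, which is exactly the access pattern a time-series workload needs the database to serve efficiently.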

Before we look more closely at the properties of time-series data workloads, it's helpful to understand a very brief history of traditional relational database systems, which still dominate the industry.

Relational databases have traditionally been designed to optimize for what is known as OLTP (online transaction processing).

OLTP is focused on a "current-state" view of data. It mostly ignores the time dimension and is built around the fundamental concept of transactions, which are managed via CRUD operations, particularly UPDATE. Under OLTP, data is mutable. Think about a bank ATM or a cash register at the store: these are transactional operations in which entities are typically updated with the most recent transactions to show the latest information, for example, your current checking account balance.

While a bank ATM or cash register might sound like time-series data, their use case and workload requirements are different.

Let's look at some more traditional time-series examples to see why:

Sensor data - A smartwatch or any generic IoT device may record metrics like your heart rate, sleep patterns, steps traveled, etc. Your dataset would include these metrics and be continuously generated over time, identified by time and likely some other metadata such as a customer identifier or sensor identifier.

Financial tick data - As discussed above, streaming trade data about every security listed on the NASDAQ throughout the day may include a stock name/identifier, price, trade volume, etc.

Fleet monitoring - Data is collected from a sensor on an automobile and may contain an automobile identifier, a time, and metrics such as current speed, GPS location, fuel level, etc.

In all of these examples, we are recording every transaction that occurs; data is always appended.

While a bank ATM will process every transaction and thus record it, it is focused on the current-state view, and data is primarily updated. ATM data is not necessarily meant to track the moving average of your last 10 transactions or your predicted balance next week, both common questions we may ask of time-series data in a traditional time-series use case.
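The contrast between the two models can be sketched in a few lines of Python. The class names, timestamps, and amounts below are hypothetical, and a real system would persist data rather than hold it in memory; the point is only that the current-state model overwrites history while the append-only model keeps it available for questions such as a moving average.

```python
from dataclasses import dataclass, field


@dataclass
class CurrentStateAccount:
    """OLTP style: each transaction UPDATEs a single balance in place."""
    balance: float = 0.0

    def apply(self, amount):
        self.balance += amount  # the previous state is overwritten


@dataclass
class AppendOnlyLedger:
    """Time-series style: every transaction is appended, never modified."""
    entries: list = field(default_factory=list)

    def apply(self, timestamp, amount):
        self.entries.append((timestamp, amount))

    def balance(self):
        return sum(amount for _, amount in self.entries)

    def moving_average(self, n):
        """Average amount of the last n transactions."""
        if n < 1 or not self.entries:
            raise ValueError("need n >= 1 and at least one entry")
        recent = self.entries[-n:]
        return sum(amount for _, amount in recent) / len(recent)


# Hypothetical transactions: (timestamp, amount)
transactions = [
    ("2022-07-01 09:00", 100.0),
    ("2022-07-01 12:30", -20.0),
    ("2022-07-02 08:15", -30.0),
    ("2022-07-03 17:45", 50.0),
]

account = CurrentStateAccount()
ledger = AppendOnlyLedger()
for timestamp, amount in transactions:
    account.apply(amount)
    ledger.apply(timestamp, amount)

print(account.balance)           # 100.0
print(ledger.balance())          # 100.0
print(ledger.moving_average(2))  # 10.0
```

Both models agree on the current balance, but only the ledger can answer the moving-average question, because the account has discarded the individual transactions.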

Putting it all together, we can see some very clear differences in the characteristics of OLTP and time-series workloads.

Time-series data workloads generally share common properties regardless of use case or industry:

1. Data normally arrives in time-order
2. Data is mostly immutable; it is primarily appended (INSERT)
3. Data is indexed on time (it is always a primary dimension) and usually also associated with multiple other dimensions such as identifying keys (stock symbol, sensor identifier, customer identifier, etc.)
4. Data is often high in volume

Of course, data can sometimes arrive out of order due to network issues, and in IoT scenarios it may be more common to have to reload late data, but these cases are the exception rather than the norm.
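The properties listed above, together with the out-of-order exception, suggest a simple in-memory structure: append when data arrives in time order, and fall back to a sorted insert for the occasional late point. The following Python sketch shows the idea under those assumptions; the class name and the integer timestamps are illustrative, and it is not how any particular database implements ingestion.

```python
import bisect


class TimeOrderedSeries:
    """Keeps one series sorted by timestamp, optimized for in-order appends."""

    def __init__(self):
        self.timestamps = []    # always sorted ascending
        self.values = []        # values[i] belongs to timestamps[i]
        self.late_arrivals = 0  # how many points arrived out of order

    def insert(self, timestamp, value):
        # Fast path: the common case, data arrives in time order.
        if not self.timestamps or timestamp >= self.timestamps[-1]:
            self.timestamps.append(timestamp)
            self.values.append(value)
            return
        # Slow path: a late point; find its sorted position and shift.
        index = bisect.bisect_right(self.timestamps, timestamp)
        self.timestamps.insert(index, timestamp)
        self.values.insert(index, value)
        self.late_arrivals += 1


series_store = TimeOrderedSeries()
# Hypothetical readings as (epoch seconds, value); timestamp 3 arrives late.
for timestamp, value in [(1, 70.0), (2, 71.5), (4, 73.0), (3, 72.0), (5, 74.5)]:
    series_store.insert(timestamp, value)

print(series_store.timestamps)     # [1, 2, 3, 4, 5]
print(series_store.late_arrivals)  # 1
```

The fast path is a constant-time append, while the slow path has to shift elements, which is one reason late data is more expensive to handle than in-order data.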

These properties distinguish time-series data workloads from other workloads and, as you have probably guessed or already know, performance is paramount when dealing with them. However, the performance considerations and requirements associated with these workloads aren't necessarily (and most often are not) the same as those of transactional processing (OLTP) or pure analytical processing (OLAP). Time-series workloads fall somewhere in the middle, and often it's a matter of choosing a solution that can optimize for the specific requirements of your use case. For example, many time-series use cases are focused on application analytics and have real-time query latency requirements, particularly regarding recent data (which is often more important than historical data). Businesses want to know what is happening now and use insights from the past to make decisions: picture a recommendations application providing discounts to customers who frequently abandon carts but have active items in their current cart. For these use cases, read performance is paramount when choosing a solution.

Do I need a time series database?

If collecting more and more time-series data is to be more than just a byproduct of technological innovation, then to keep developing knowledge, drawing actionable insights, and predicting future outcomes from this low-level granularity data, we need systems built for the unique performance considerations of time-series data workloads.

As with most other data types or workloads, when choosing a database there are some core performance considerations that developers typically find themselves analyzing and trying to optimize for:

Read
Write
Scalability (cardinality handling)
Storage efficiency

Ultimately, most databases store their information on disks, and disks store their information in blocks of, for example, 4, 8, or 32 kilobytes. Because disks store information in blocks, all the layers above the disks (disk caches, operating systems, file systems, and database software) are designed to deal with blocks of data larger than a typical time-series record.

When we try to store time-series data in a traditional system built for OLTP, each record is stored sequentially in a block on disk as it comes in. To find the records, we need a couple of indexes. One index covers every data point and allows us to find it, delete it, and, if necessary, replicate it. The other index contains the series and the time, which allows us to find a range of time for a particular series by its uniquely identifying keys, for example, the security symbol and time. These are relatively large indexes, with an entry for each of the small data points, and the data on disk ends up neither particularly well organized nor efficiently stored.

The problem is amplified when you come to read this information back, because you have to read the data block by block; that is how the underlying hardware works. For each block we read, we pull out the few records we need and essentially discard the rest. This means that reading is a much more intensive process than it needs to be.
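A toy model makes this read amplification easier to see. The Python sketch below counts how many blocks must be read to fetch one series when records sit on disk in arrival order versus when they are clustered by series and time. All the numbers (100 series, 50 readings each, 100 records per block) are hypothetical, and real storage engines are considerably more sophisticated than this.

```python
def blocks_touched(layout, series_id, records_per_block):
    """Count the distinct blocks holding at least one record of series_id.

    layout is the on-disk order of records as (series_id, time_step) tuples.
    """
    return len({
        position // records_per_block
        for position, (sid, _time_step) in enumerate(layout)
        if sid == series_id
    })


# Hypothetical workload sizes, chosen only to keep the arithmetic simple.
NUM_SERIES = 100
POINTS_PER_SERIES = 50
RECORDS_PER_BLOCK = 100

# Arrival order: at every time step, one record arrives for each series.
arrival_order = [
    (sid, t) for t in range(POINTS_PER_SERIES) for sid in range(NUM_SERIES)
]
# Clustered order: the same records sorted by series, then by time.
clustered_order = sorted(arrival_order)

scattered = blocks_touched(arrival_order, 7, RECORDS_PER_BLOCK)
clustered = blocks_touched(clustered_order, 7, RECORDS_PER_BLOCK)
print(scattered, clustered)  # 50 1
```

In this model, reading one series from the arrival-order layout touches all 50 blocks and uses only 1 record out of every 100 read, while the clustered layout answers the same query from a single block.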

Time-series databases look to solve this problem by optimizing for the unique characteristics of time-series data workloads and ultimately their usage patterns. Time is a first-class citizen, meaning that data should be ordered by time by default. Since time-series workloads are generally append-only, if data is ordered by time we are normally writing to the end of the data, which has unique benefits for read performance on recent data, as systems can generally keep that data cached in memory.
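The benefit of keeping data ordered by time can be illustrated with a short Python sketch: when timestamps are sorted, a time-range query becomes two binary searches and one contiguous slice, and the most recent points are simply the tail of the data. The function names and sample values are assumptions made for this example rather than any database's actual API.

```python
import bisect


def range_query(timestamps, values, start, end):
    """Return (timestamp, value) pairs with start <= timestamp < end.

    timestamps must be sorted ascending, with values[i] matching timestamps[i].
    """
    low = bisect.bisect_left(timestamps, start)
    high = bisect.bisect_left(timestamps, end)
    return list(zip(timestamps[low:high], values[low:high]))


def latest(timestamps, values, n):
    """Return the n most recent points, which sit at the end of the data."""
    if n < 1:
        return []
    return list(zip(timestamps[-n:], values[-n:]))


# Hypothetical series: timestamps in epoch seconds, already in time order.
timestamps = [10, 20, 30, 40, 50]
values = [1.0, 2.0, 3.0, 4.0, 5.0]

print(range_query(timestamps, values, 20, 40))  # [(20, 2.0), (30, 3.0)]
print(latest(timestamps, values, 2))            # [(40, 4.0), (50, 5.0)]
```

Because new points are appended at the end, the tail that latest reads is also the region most recently written, which is why it is a natural candidate for caching in memory.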

A time-series database, or any database built to handle time-series data, needs to handle large volumes of data on both write and read while optimizing its underlying storage layer to efficiently store large amounts of low-level granularity time-series data. However, optimizing for all four areas (read, write, storage efficiency, and cardinality handling) is not truly possible; there must always be some trade-offs. When choosing a system, developers should focus on optimizing in the best possible way for their use case, as there isn't a true one-size-fits-all solution. To choose the correct solution, a developer must understand not only how the time-series data will be generated but also as many of its workload properties as possible, such as growth, data shape (is it extremely wide panel data, or does it need a flexible schema?), and how it will be consumed today and potentially tomorrow.

One of the easiest ways to understand what solution to choose is to look at what systems are solving for when it comes to the characteristics of time-series workloads discussed above. OLTP data is normally updated and randomly accessed over different keys; OLTP systems have historically been built for write optimization and consistency around a current-state model. Time-series data, however, is almost always appended, and its usefulness is only truly gained by analyzing the data. It is read-heavy, but it can also be write-heavy; reads, however, are typically more important to optimize for. Recording all of the data is not useful if you can't access it efficiently, not only to draw insights and conclusions but also to take action and drive innovation.

Historically, time-series databases or solutions within traditional databases were built to optimize write performance by incorporating concepts such as LSM trees, but as we continue to advance and work with more time-series data, it's becoming clear that a shift in optimization preference is happening: read performance is, and should continue to be, a preferred optimization along with storage efficiency and cardinality handling. Time-series databases, along with traditional relational and non-relational databases, have continued to build more and more solutions that aim to bridge the gap between OLTP and OLAP, where time-series data workloads often live as hybrid "analytics" workloads. This shift has also brought much attention to column-store databases and added another complexity for developers in choosing a system for time-series data. While a use case can be either write-heavy or read-heavy, most often time-series workloads will be both, or should conceptually be considered both, which again emphasizes the importance of understanding the usefulness of the data and its end-state within your use case.

So while a dedicated time-series database may not be necessary as technology continues to advance, such databases exist for the fundamental reason that time-series data is ever-growing, demands careful attention to performance, and needs time to be treated as a first-class citizen. Time-series databases look to solve the problems that traditional database models have with this data and to eliminate or reduce the limitations that traditional transactional systems have built in.

Time-series data is the natural evolution of the information age, its innovation, and its data growth. Over time, all workloads may become time-series workloads: as we gain more power to store and process data, we will store and process more of it.

Summary

Ultimately, time-series data is just like any other data in the sense that its usefulness is gained by turning raw data into information and ultimately into knowledge and wisdom. As innovation and technology advance, so does our thirst for information, which, boiled down to its core, is all about data. Data is the fundamental building block that makes up everything. Time-series data gives more breadth and depth to raw data and allows us to develop deeper and more voluminous information and knowledge, which is ultimately what makes it different.

In the past, a current-state view of the world was the norm, but with innovation comes a desire for more. We, as people, will never stop expanding and pushing the boundaries of what we know. It all comes down to data and, born into an age that is data-obsessed, our pursuit of knowledge will continue to drive innovation. Time-series data is how we move along on that journey. A steady-state view of the world is no longer exciting or insightful. The predictive power and rich information that time-series data can provide expand our horizons and possibilities; it is the natural link in an ever-expanding chain that attempts to answer ever more complex questions.

This content is made possible by a guest author, or sponsor; it is not written by and does not necessarily reflect the views of App Developer Magazine's editorial staff.