Author: Jeremiah Evans, Sr. Application Developer
At Zirous, we know that your data is key to your success. We also know that the data landscape is changing, and that keeping up with that change is vital to keeping your organization vibrant. As organizations rely more on data-driven, quantitative analytics to make better decisions, more and more customers ask us how to keep pace with these changes.
That brings us to the elephant in the room – everyone knows that Big Data is key to continued success, but the path from where you are to where you want to be is often unclear. There’s a lot to consume – both in terms of data and in changes to your process. Just remember, how do you eat an elephant?
One bite at a time.
In this article we will briefly outline how Zirous can help navigate from where you are to where you want to be. To understand the solution we need to understand the challenge. Our challenge is defining Big Data. Let’s break it up into three main categories: Volume, Velocity, and Variety. These “3 V’s” of Big Data help categorize the types of changes, and the various ways we must adapt to the new realities of data. We will cover this in brief today, but stay tuned for a more in-depth treatment of each topic!
Volume is what most people probably think of when they think “Big Data”: so much data! Images of the massive data centers of Google or Facebook come to mind. But as anyone familiar with their own data knows, you don’t have to be a Google or a Facebook for your data to strain the capabilities of a traditional database architecture. Even well-tuned, high-end databases can begin underperforming when tasked with what are becoming routine analytic queries on an average amount of data.
Ask yourself, “What questions am I not asking because there’s too much data to sift through?”
One of the realities driving the increase in volume is the increase in velocity, or the speed with which data is being generated and collected. Every smartphone and connected device is generating thousands of data points every day, even every hour. A traditional batch-driven system is fine when you’re only getting your data once a day, but that’s no longer the case. What if you could respond in real time to changing conditions based on streams of data coming from social media, an assembly line, or a fleet of connected vehicles?
Ask yourself, “What decisions could I be making better if I were seeing the current state of the system, and not the state from yesterday, or a week ago?”
From spreadsheets to databases, flat files to PDFs, you see data in dozens of different formats every day. This variety is a pain point for many of our clients, often requiring hours and hours of manual effort to transform new data into a format where it can be joined up with the data already in their system. How do you tell what data is being duplicated, or what data is missing, when you can’t compare it all? How long does it take to add additional data to your system, as you map it to your existing warehouse?
Ask yourself, “What data am I missing because it’s not tracked when it doesn’t fit into my existing data models?”
Eating the Elephant
As data flows through your systems, it typically filters through four phases: Land, Standardize, Model, and Analyze. In a traditional data architecture, Landing, Standardization, and Modeling are usually done at once, because the incoming data needs to “look right” before it can be stored in a traditional data model. This is referred to as “Schema on Write.” Leveraging the Hadoop concept of “Schema on Read,” we can separate these into distinct, more easily managed steps.
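To make the “Schema on Read” idea concrete, here is a minimal Python sketch. It is not Hadoop code and not the tooling used on any client project; the source names, field names, and sample payloads are all hypothetical. It only illustrates the principle: data is landed exactly as received, and a schema is applied at the moment it is read.

```python
import csv
import io
import json

# Hypothetical landing zone: payloads are stored exactly as they arrived.
RAW_LANDING_ZONE = [
    {"source": "vendor_csv", "payload": "id,amount\n1,19.99\n2,5.00\n"},
    {"source": "internal_json", "payload": '[{"order_id": 3, "total": "7.50"}]'},
]


def read_orders(landing_zone):
    """Apply a common schema while reading; landed payloads are never modified."""
    for item in landing_zone:
        if item["source"] == "vendor_csv":
            # The vendor sends CSV with its own column names.
            for row in csv.DictReader(io.StringIO(item["payload"])):
                yield {"order_id": int(row["id"]), "amount": float(row["amount"])}
        elif item["source"] == "internal_json":
            # The internal process emits JSON with different field names.
            for rec in json.loads(item["payload"]):
                yield {"order_id": int(rec["order_id"]), "amount": float(rec["total"])}


orders = list(read_orders(RAW_LANDING_ZONE))
print(orders)
```

Because the standardizing logic lives in the reader rather than in the load step, a new source can be landed first and given a reading rule later.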
For example, one of our current clients has data from multiple internal and external sources they need to put in their Hadoop platform. Some of it is from external vendors who may or may not agree to changes in the format of data, and some of it is created using internal processes that are well established and not subject to change. Typically, one would expect a complex process to begin bringing each type of data into the platform, including development of a complete data model and detailed data mapping.
Instead, we were able to begin connecting to these data sources as-is while their architect worked on the master model. We were able to focus on just the technical requirements of getting the data, leaving the complex transformation for later when we could do it all in one place. As their architect provided us with various models of the data, we were able to simply develop views to the underlying data that didn’t require us to transform it first, or require all of the data to be formatted the same way.
At several points, the architect discovered that a vendor was not providing the data they expected, or the end consumers indicated a need for a different data set. This sort of requirement volatility can often derail a team’s productivity as they scramble to modify all of the data mappings and architecture models, but since we had never changed the raw data, we were able to quickly refine the views to meet the realities of the incoming data and the requirements of those end users.
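The view-based approach described above can be sketched in a few lines. This example uses Python's built-in sqlite3 module as a small stand-in for a Hadoop SQL layer, which is an assumption made purely for illustration; the table, column, and view names are hypothetical. It shows a view being redefined after a requirements change while the raw landed rows stay untouched.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Land the vendor feed as-is: untyped text columns, no renaming (hypothetical data).
conn.execute("CREATE TABLE raw_vendor_feed (c1 TEXT, c2 TEXT, c3 TEXT)")
conn.executemany(
    "INSERT INTO raw_vendor_feed VALUES (?, ?, ?)",
    [("1001", "widget", "4"), ("1002", "gadget", "7")],
)

# First model from the architect: only an id and a name are needed.
conn.execute(
    """CREATE VIEW inventory AS
       SELECT CAST(c1 AS INTEGER) AS item_id, c2 AS item_name
       FROM raw_vendor_feed"""
)
first = conn.execute("SELECT * FROM inventory ORDER BY item_id").fetchall()

# Requirements change: consumers now want quantities and upper-case names.
# Only the view is redefined; the raw rows are not reloaded or rewritten.
conn.execute("DROP VIEW inventory")
conn.execute(
    """CREATE VIEW inventory AS
       SELECT CAST(c1 AS INTEGER) AS item_id,
              UPPER(c2) AS item_name,
              CAST(c3 AS INTEGER) AS quantity
       FROM raw_vendor_feed"""
)
second = conn.execute("SELECT * FROM inventory ORDER BY item_id").fetchall()
raw_after = conn.execute("SELECT * FROM raw_vendor_feed ORDER BY c1").fetchall()

print(first)
print(second)
print(raw_after)
```

The design point is that the transformation lives in the view definition, so changing the model is a matter of editing one query rather than reprocessing the data.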
This flexibility means that you can start landing new data in your platform before you know all of the specifics, and you can quickly evolve your model, since you’re not throwing any of the incoming data away. At the same time, you can begin working on your analysis requirements, and work backwards into the data you already have, tweaking instead of rewriting when you realize you need different data.
This is just one example of how we’re handling the Big Data elephant. In upcoming posts we’ll dive deeper into those “3 V’s”, looking at specific challenges and how our phased approach to Big Data implementation can help you get from where you are to where you want to be. Data is changing, and you can change with it – one piece at a time. Bon appétit!