Data Lakes Closing the Gap Between Enterprises and Big Data

The idea of Big Data is great, but what happens when enterprises can’t effectively access and use the Big Data they’ve been investing in? That’s the issue a large percentage of companies are facing today. According to a recent Forrester survey, 67% of companies cannot access Big Data, 59% of those that can access it can’t integrate it, and 56% of companies claim that the update process for using Big Data is too slow to be profitable. Those numbers can make many companies think twice about how fantastic Big Data truly is. Big Data will continue to be a large part of enterprises’ success, though, and projections still suggest that the amount of data available worldwide will grow to something like 44 zettabytes in the next five years. The question now is which data preparations companies can feasibly implement in the coming months to successfully access, integrate, and use Big Data to its full potential.

Data lakes are one of the tools being proposed to close this gap between Big Data and enterprises. A data lake is a tool for managing a company’s data that contrasts with the traditional data warehouse. A data warehouse is where a company keeps its data (both historical and current), and the data in it is rather refined: it is organized by subject area, highly structured, and always has a defined purpose within the company. Data lakes are an alternative to data warehouses, and on the surface they accomplish the same objective: they store data for a company. Data lakes are more robust than their warehouse counterparts, however. They accept any and all data (structured and unstructured), do not transform the data they receive, and do not require that an end use be defined for it. The data lake design already sounds more conducive to Big Data.

The major differences between data warehouses and data lakes are data lakes’ ability to retain all data, support all data types and users, adapt to change, and provide faster results. Because all data in a data warehouse must have a specific end purpose, a large amount of data that could have been stored (and possibly become useful two or three years down the road) must be discarded, which makes it difficult to store and access useful information from Big Data. Data warehouses also accept only structured data, whereas data lakes accept both structured and unstructured data, which suits the way Big Data works. Data lakes also do a better job of letting employees from different parts of a company access data and use it effectively (e.g., data analysts and data scientists can use data just as effectively as operational users). Data warehouses are by definition more structured than data lakes, which makes data lakes more adaptable when the enterprise’s structure changes. As a result of these differences, data lakes tend to produce results faster. The right people can easily access any and all data they need, and because the data is in its raw form, no time is spent up front transforming it from one type to another.
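To make the contrast concrete, here is a minimal Python sketch of the two storage styles described above. It is only an illustration under stated assumptions: the schema, the function names, and the sample records are all hypothetical, and real warehouses and data lakes are far more involved. The warehouse function rejects anything that does not match its fixed structure, while the lake function keeps every item as received and leaves interpretation to whoever reads it later.

```python
import json
from typing import Any

# Hypothetical fixed structure a warehouse table might enforce (illustrative only).
WAREHOUSE_SCHEMA = {"order_id": int, "amount": float}


def load_into_warehouse(raw_items: list[str]) -> list[dict[str, Any]]:
    """Keep only items that parse and match the fixed schema exactly."""
    table = []
    for item in raw_items:
        try:
            record = json.loads(item)
        except json.JSONDecodeError:
            continue  # unstructured data is discarded
        if not isinstance(record, dict) or set(record) != set(WAREHOUSE_SCHEMA):
            continue  # missing or unexpected fields are discarded
        if all(isinstance(record[name], kind) for name, kind in WAREHOUSE_SCHEMA.items()):
            table.append(record)
    return table


def load_into_lake(raw_items: list[str]) -> list[str]:
    """Store every item exactly as received, with no validation or transformation."""
    return list(raw_items)


def read_orders_from_lake(lake: list[str]) -> list[dict[str, Any]]:
    """Interpret the raw items only when a specific question is asked."""
    orders = []
    for item in lake:
        try:
            parsed = json.loads(item)
        except json.JSONDecodeError:
            continue  # still stored in the lake, just not relevant to this query
        if isinstance(parsed, dict) and "order_id" in parsed:
            orders.append(parsed)
    return orders


# Hypothetical sample data: two structured records and one free-text note.
raw_items = [
    '{"order_id": 1, "amount": 19.99}',
    '{"order_id": 2, "amount": 5.0, "coupon": "SPRING"}',
    "customer called to ask about delivery times",
]

warehouse = load_into_warehouse(raw_items)
lake = load_into_lake(raw_items)

print(len(warehouse), "record(s) kept by the warehouse")
print(len(lake), "item(s) kept by the lake")
print(len(read_orders_from_lake(lake)), "order(s) readable from the lake")
```

In this toy setup the record with the extra coupon field and the free-text note never reach the warehouse, but both remain in the lake, where a later reader can still decide how to use them.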

In general, data lakes help companies better access the data available to them and integrate it into their business. Data lakes offer an affordable, feasible way to address the problems many companies reported in the Forrester survey mentioned earlier. Companies such as Microsoft are building and selling products like Microsoft Azure so businesses can use data lakes and begin unlocking the potential of Big Data. Other solutions and explanations can be found in Blue Granite’s free eBook, “Understanding Data Lakes in a Modern Data Architecture.” If companies switch from data warehouses to data lakes, there will undoubtedly be a learning curve, and the solution will not be perfect. Jeff Frick of theCUBE warns that “the myth of pumping all the data into the lake and sprinkling some pixie dust… just doesn’t hold water. At the end of the day, it’s a tool.” Companies need to use data lakes as the robust tool they are and figure out how to replace the idea of “sprinkling pixie dust” with tangible solutions.


Rebecca Seasholtz

Rebecca is a senior Materials Science and Engineering major at Georgia Tech. She specializes in soft materials (i.e. plastics and textiles) and has also worked extensively with functional materials for electrical applications. Rebecca is originally from Grayson, GA and likes to spend her free time running, cycling, drinking coffee, or hanging around the campus house of a ministry she attends at Georgia Tech. Contact Rebecca at [email protected]