If you’re someone like me, you enjoy structure, neatness, and simplicity.
But in some cases, it’s best to step back and allow organized chaos to unfold. This is the basis of something called a data lake.
A data lake is a repository for structured, unstructured, and semi-structured data. Data lakes differ from data warehouses in that they store data in its rawest form, without requiring it to be converted and analyzed first.
In simpler terms, all types of data that are generated by both humans and machines can be loaded into a data lake for classification and analysis later on.
Data warehouses, on the other hand, require data to be properly structured before any work can get done.
To get a deeper understanding of data lakes and why they’re the optimal candidate for housing big data, it’s important to dive into what makes them so different from data warehouses.
Both data lakes and data warehouses are repositories for data. That’s about the only similarity between the two. Now, let’s touch on some of the key differences:
James Dixon, founder and Chief Technology Officer of Pentaho, coined the term “data lake” after providing an analogy differentiating data lakes from data warehouses.
“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state,” said Dixon. “The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
So, how are data lakes capable of storing such vast and diverse amounts of data? What is the underlying architecture of these massive repositories?
Data lakes are built upon a schema-on-read data model. A schema is essentially the skeleton of a database outlining its model and how data will be structured within it. Think of a blueprint.
The schema-on-read data model means you can load your data in the lake as-is without having to worry about its structure. This allows for much more flexibility.
Data warehouses, on the other hand, are built on a schema-on-write data model. This is a much more traditional model for databases.
Every set of data, every relationship, and every index in the schema-on-write data model must be clearly defined ahead of time. This limits flexibility, especially when adding in new sets of data or features that could potentially create gaps within the database.
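To see the difference in practice, here is a minimal sketch using only Python's standard library. The event records, field names, and the `clicks` table are all made up for illustration; an in-memory SQLite table stands in for a warehouse and a plain list of raw JSON lines stands in for a lake. It is a toy under those assumptions, not how any particular product works.

```python
import json
import sqlite3

# Hypothetical raw events; the field names are made up for illustration.
RAW_EVENTS = [
    '{"user_id": "ana", "action": "click", "page": "/home"}',
    '{"user_id": "ben", "action": "call", "duration_sec": 42}',
]


def load_schema_on_write(raw_events):
    """Warehouse style: the table is defined first, and records that
    do not fit it are rejected at load time."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE clicks ("
        "user_id TEXT NOT NULL, action TEXT NOT NULL, page TEXT NOT NULL)"
    )
    rejected = []
    for line in raw_events:
        record = json.loads(line)
        values = (record.get("user_id"), record.get("action"), record.get("page"))
        try:
            conn.execute("INSERT INTO clicks VALUES (?, ?, ?)", values)
        except sqlite3.IntegrityError:
            # A missing field becomes NULL and violates the NOT NULL rule.
            rejected.append(record)
    stored = conn.execute("SELECT user_id, action, page FROM clicks").fetchall()
    conn.close()
    return stored, rejected


def read_with_schema(raw_events, fields):
    """Lake style: raw lines are kept untouched; a schema (here, a list
    of field names) is applied only when the data is read."""
    return [
        {field: record.get(field) for field in fields}
        for record in map(json.loads, raw_events)
    ]


stored, rejected = load_schema_on_write(RAW_EVENTS)
call_view = read_with_schema(RAW_EVENTS, ["user_id", "duration_sec"])
```

In this sketch the phone-call event never makes it into the predefined table, while the raw lines keep it available for any schema a later reader chooses to apply.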
The schema-on-read data model acts as the backbone of a data lake, but the processing framework (or engine) is how data actually gets loaded into one.
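Two kinds of processing frameworks “ingest” data into data lakes: batch processing, which works through data in large collected chunks, and stream processing, which handles data continuously as it arrives.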
Hadoop, Apache Spark, and Apache Storm are among the more commonly used big data processing tools which are capable of either batch or stream processing.
Some tools are particularly useful for processing unstructured data such as sensor activity, images, social media posts, and internet clickstream activity. Other tools prioritize processing speed and usefulness with machine learning programs.
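The batch-versus-stream distinction mentioned above can be sketched in a few lines of plain Python. This does not use Hadoop, Spark, or Storm, and the events and function names are hypothetical; it only illustrates the two processing styles on a tiny clickstream.

```python
from collections import Counter

# Hypothetical clickstream events, used only for illustration.
EVENTS = [
    {"action": "click"},
    {"action": "scroll"},
    {"action": "click"},
]


def batch_count(events):
    """Batch: wait for the full collection, then process it in one pass."""
    return Counter(event["action"] for event in events)


def stream_count(event_source):
    """Stream: update the result as each event arrives and emit it right away."""
    running = Counter()
    for event in event_source:
        running[event["action"]] += 1
        yield dict(running)


batch_result = batch_count(EVENTS)
stream_results = list(stream_count(iter(EVENTS)))
```

Both approaches end with the same totals. The difference is that the streaming version has a usable answer after every single event, while the batch version has nothing to show until the whole set has been collected.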
Once the data is processed and ingested into the data lake, it’s time to make use of it.
Data warehouses rely on structure and clean data, whereas data lakes allow data to be in its most natural form. This is because advanced analytic tools and mining software take in raw data and transform it into useful insight.
Big data analytics will dive into a data lake in an attempt to uncover patterns, market trends, and customer preferences to help businesses make informed predictions faster. This is done through four different analyses.
Data mining is defined as “knowledge discovery in databases,” and is how data scientists uncover previously unseen patterns and truths through various models.
For example, a clustering analysis is a type of data mining technique that can be applied to a data set within a data lake. It groups large amounts of data together based on their similarities.
Through data visualization tools, data mining helps clear up the chaotic nature of unstructured, raw forms of data.
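To make the clustering idea above concrete, here is a minimal sketch of k-means, one common clustering technique, written in plain Python for one-dimensional values. The session durations and starting centroids are invented for illustration, and real work would normally lean on a tested library rather than a hand-rolled loop like this one.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Group numbers around the nearest centroid, then move each centroid
    to the mean of its group, and repeat."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for value in values:
            nearest = min(
                range(len(centroids)),
                key=lambda i: abs(value - centroids[i]),
            )
            clusters[nearest].append(value)
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical session durations in minutes.
durations = [1, 2, 3, 20, 21, 22]
centers, groups = kmeans_1d(durations, centroids=[0, 10])
```

With these made-up numbers, the short sessions and the long sessions end up in two separate groups, which is the kind of similarity-based grouping a clustering analysis produces.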
Data lakes may be flexible, scalable, and quick to load, but these benefits come at a price.
Ingesting unstructured data often goes hand in hand with a lack of data governance and of processes that ensure the right data is being looked at. For most businesses – especially those that have yet to adopt big data – having unorganized, uncleaned data isn’t an option.
Neglecting the metadata or the processes that keep the data lake in check can lead to something called a data swamp. You wouldn’t go swimming in a swamp, would you?
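One simple way to picture the metadata side of this is a catalog that records what each object in the lake is and who owns it. The sketch below is hypothetical from top to bottom: the `CatalogEntry` fields, the paths, and the `find_uncataloged` helper are assumptions made for illustration, not part of any specific data lake tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """Hypothetical metadata kept for one object in the lake."""
    path: str
    source: str
    owner: str
    description: str


def find_uncataloged(lake_paths, catalog):
    """Return the objects in the lake that have no metadata entry."""
    cataloged = {entry.path for entry in catalog}
    return sorted(path for path in lake_paths if path not in cataloged)


# Hypothetical lake contents and catalog.
lake_paths = ["raw/calls/2020-01.wav", "raw/clicks/2020-01.json", "raw/tmp/dump.bin"]
catalog = [
    CatalogEntry("raw/clicks/2020-01.json", "website", "web-team", "Clickstream events"),
    CatalogEntry("raw/calls/2020-01.wav", "call-center", "support-team", "Call recordings"),
]
orphans = find_uncataloged(lake_paths, catalog)
```

Objects that show up in a check like this are the ones nobody can explain later, which is how a lake starts turning into a swamp.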
There’s also the issue of data security.
Data lakes are a fairly new concept in IT, which means some of the tools are still working out the security kinks. One of these kinks is ensuring only the right people have access to sensitive data loaded into the lake.
But like any new technology, these issues will resolve with time.
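As a rough illustration of the access concern above, the sketch below checks a user's roles against sensitivity tags attached to a data set. The tags, roles, and `can_read` function are all hypothetical; real platforms provide their own access control features, and this only shows the basic idea.

```python
# Hypothetical policy: which roles may read data carrying each sensitivity tag.
POLICY = {
    "pii": {"privacy_officer", "data_steward"},
    "financial": {"finance_analyst", "data_steward"},
}


def can_read(user_roles, dataset_tags, policy=POLICY):
    """Allow access only if the user holds an approved role for every
    restricted tag on the data set. Untagged data is readable by anyone."""
    for tag in dataset_tags:
        allowed_roles = policy.get(tag)
        if allowed_roles is not None and not (set(user_roles) & allowed_roles):
            return False
    return True


analyst_ok = can_read({"finance_analyst"}, {"financial"})
analyst_blocked = can_read({"finance_analyst"}, {"financial", "pii"})
```

Here the analyst can read purely financial data but is turned away as soon as the data set also carries the `pii` tag.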
Despite some of the challenges of data lakes, the fact remains that more than 80 percent of all data is unstructured. As more businesses turn to big data for future opportunities, the application of data lakes will rise.
Unstructured data like social media posts, phone call recordings, and clickstream activity contain valuable information that cannot be held in data warehouses.
While data warehouses are strong in structure and security, big data simply needs to be unconfined so it can flow freely into data lakes.
Devin is a former Content Marketing Specialist at G2, who wrote about data, analytics, and digital marketing. Prior to G2, he helped scale early-stage startups out of Chicago's booming tech scene. Outside of work, he enjoys watching his beloved Cubs, playing baseball, and gaming. (he/him/his)