- Understand what Data Enrichment is and its importance
- Understand the circumstances in which you need Data Enrichment
- Understand the difference between Data Enrichment and Data Collection/Aggregation
- Learn about how to use LogDNA in conjunction with Data Enrichment
Collecting data is excellent. Collecting data and then enriching it is even better.
Indeed, especially in modern environments where analyzing trends or problems requires correlating multiple data sources, data enrichment has become an essential step for deriving real value from data.
This article explains what data enrichment means, how it works, and which specific features to look for when integrating enrichment into your data analysis and management strategy. It also walks through examples of how LogDNA simplifies data enrichment.
Data Enrichment, Defined
Data enrichment is the process of contextualizing data from one source with data collected from additional sources.
Data enrichment can involve the intermixing of data collected internally with information gleaned from external sources. For example, an engineering team might enrich an internal network performance data set with data from an Internet Service Provider about network performance issues in the ISP’s infrastructure. In this case, the enrichment would help the team determine which network issues were triggered by problems within internal network resources, and which stemmed from ISP-owned resources that the team doesn’t control.
Data enrichment could also entail the correlation of multiple internal data sources. For example, a team may enrich data about application performance with data that records CI/CD operations to determine how the deployment of a new application update impacted application performance.
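The CI/CD scenario above can be sketched in a few lines. This is a minimal illustration, not LogDNA functionality: the data, field names, and the ten-minute correlation window are all hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical sample data: application response-time measurements and
# CI/CD deployment events. Field names are illustrative, not a real schema.
perf_samples = [
    {"time": datetime(2023, 5, 1, 10, 0), "response_ms": 120},
    {"time": datetime(2023, 5, 1, 10, 5), "response_ms": 130},
    {"time": datetime(2023, 5, 1, 10, 20), "response_ms": 480},
]
deployments = [
    {"time": datetime(2023, 5, 1, 10, 15), "version": "v2.3.1"},
]

def enrich_with_deployments(samples, deploys, window=timedelta(minutes=10)):
    """Attach to each performance sample any deployment that happened
    within `window` before the measurement was taken."""
    enriched = []
    for s in samples:
        related = [d["version"] for d in deploys
                   if timedelta(0) <= s["time"] - d["time"] <= window]
        enriched.append({**s, "recent_deployments": related})
    return enriched

result = enrich_with_deployments(perf_samples, deployments)
```

Here, the degraded 480 ms sample picks up the `v2.3.1` deployment as context, while the earlier healthy samples do not: each data point keeps its original fields but gains a pointer to the correlated event from the second source.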
Why Is Data Enrichment Important?
The main benefit of data enrichment is that it makes it easier to identify relevant trends and root causes of problems within complex IT environments.
Without data enrichment, engineers have to evaluate each data set independently, often meaning that they take shots in the dark to understand complex trends or identify the root cause of issues. For example, if an application’s response time degrades, and you are only looking at logs from the application, it’s typically hard to determine whether the root cause of the performance issue lies in the network, or the cloud environment hosting the application. The application logs alone won’t contain context about what is happening within these other layers of your stack.
Likewise, it can be challenging to understand the full scope of a trend or problem when looking at a single, unenriched data source. Let’s assume that you use a hybrid cloud (including private and public infrastructure), but separately analyze the infrastructure logs from your private and public resources. As a result, you may not know whether a performance problem with one infrastructure component impacts the entire hybrid cloud environment, or just one part.
However, by correlating different data types, it becomes much easier to identify interrelated patterns that span multiple layers of your stack, or other components of your environment. Ultimately, this translates to less guesswork in analyzing data, and reduces the time it takes to glean insights from the data.
Do You Need Data Enrichment?
Data enrichment isn’t strictly necessary for every data set or data analytics workflow. Teams that support relatively simple environments may understand what is happening within them by analyzing just a single data set.
Similar conditions may hold if, for example, your business relies mainly on SaaS apps that an external vendor manages. In that case, the data available to you from the application will typically be limited, because you won’t have access to the underlying infrastructure logs, and probably not the application logs, either. You can track basic data like application availability metrics, but you probably won’t have much reason – or ability – to enrich that data with data from other sources.
On the other hand, any business that operates in a complex, multi-component environment stands to benefit significantly from data enrichment. As described in the previous section, enriched data makes it easier and faster to determine what matters within data sets, and map data to what is happening within an environment. Factors like this are why data enrichment has become an integral part of many data analytics strategies for modern businesses, which tend to use complex cloud architectures and deploy various applications and resources across them.
Data Enrichment vs. Data Collection and Aggregation
It can be easy to confuse data enrichment with other parts of the data management and analytics process – especially data collection and aggregation. Although there are similarities between each of these operations, they are distinct processes:
- Data collection is the process of gathering relevant information from an IT environment. You need to collect data from multiple sources (or find external data sources already prepared for you) to perform data enrichment. But enrichment is a separate step.
- Data aggregation is the compilation of multiple data sources into a single data set or database. The main difference between data aggregation and data enrichment is that aggregation results in a new data set with little indication of how the original data sources relate to each other. In contrast, data enrichment preserves the distinctions between different data sources, while at the same time highlighting how those sources correlate.
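The distinction between aggregation and enrichment can be made concrete with a toy example. This is an illustrative sketch with hypothetical data, not a real pipeline: aggregation merges entries into one set, while enrichment keeps each source labeled and records how entries relate.

```python
# Two hypothetical data sources: web server events and database events.
web_events = [{"t": 1, "msg": "GET /checkout 500"}]
db_events = [{"t": 1, "msg": "connection pool exhausted"}]

# Aggregation: combine everything into one data set. Once entries are
# merged and re-sorted, the relationship between sources is not recorded.
aggregated = sorted(web_events + db_events, key=lambda e: e["t"])

# Enrichment: keep the source distinction and explicitly capture how
# entries correlate (here, simply by sharing a timestamp).
enriched = [
    {"source": "web", **w,
     "related_db_events": [d["msg"] for d in db_events if d["t"] == w["t"]]}
    for w in web_events
]
```

In the enriched form, the web error still reads as a web event, but it now carries the correlated database symptom alongside it, which is exactly the context the aggregated list loses.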
Data Enrichment Techniques: The LogDNA Example
Theoretically, you could enrich data manually by linking disparate data sources by hand. You could, for example, read through two log files side-by-side, manually noting points at which seemingly related events occur in each log at similar times. But this approach requires a lot of time and effort and is impractical at scale. It also leaves a lot of room for human error and oversight.
A better data enrichment strategy is to leverage tools, such as LogDNA, that automatically identify commonalities between logs. LogDNA offers several relevant features in this regard.
Intelligent Log Parsing
For instance, you could use LogDNA’s auto-parsing to search for a given hostname or IP address within multiple logs. In this way, you could determine how activity involving a specific host within a network log correlates with activity by the same host in a server log. Because LogDNA auto-parsing supports more than two dozen different log types and formats, it’s beneficial in situations where you need to identify relationships within logs from different clouds or applications.
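The idea behind cross-log host correlation can be sketched as follows. This is a hand-rolled illustration of the concept, not LogDNA’s parser; the log lines and formats are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical raw lines from two logs in different formats.
network_log = [
    "2023-05-01T10:00:02 src=10.0.0.5 dst=10.0.0.9 status=RETRANSMIT",
]
server_log = [
    "May  1 10:00:03 web01 sshd: connection from 10.0.0.5 port 51234",
]

IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def index_by_ip(lines):
    """Map each IP address found in a set of log lines to those lines."""
    index = defaultdict(list)
    for line in lines:
        for ip in IP_RE.findall(line):
            index[ip].append(line)
    return index

net_idx = index_by_ip(network_log)
srv_idx = index_by_ip(server_log)

# Hosts appearing in both logs are candidates for cross-log correlation.
shared_hosts = set(net_idx) & set(srv_idx)
```

The intersection surfaces `10.0.0.5` as a host that shows retransmissions in the network log and a connection in the server log at nearly the same moment, which is the kind of relationship a parsing tool lets you find without reading both files by hand.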
Parsing also allows you to enrich logs based on timestamps, log levels, and tags across disparate logs. For example, if an event appears in one log but not another, parsing these fields can help you determine whether the event was absent from the second log because of a difference in log level, rather than because the system generating the second log never recorded the event at all.
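That log-level reasoning can be sketched directly. The level hierarchy, data, and function below are hypothetical, purely to illustrate the check once both logs have been parsed into structured entries.

```python
# Illustrative severity ordering for parsed log entries.
LEVELS = {"DEBUG": 10, "INFO": 20, "WARN": 30, "ERROR": 40}

# Hypothetical parsed entries: system A records DEBUG and above,
# while system B only records INFO and above.
log_a = [{"level": "DEBUG", "event": "cache_miss"},
         {"level": "ERROR", "event": "disk_full"}]
log_b = [{"level": "ERROR", "event": "disk_full"}]
log_b_min_level = "INFO"

def explain_absence(entry, other_log, other_min_level):
    """For an event from one log, explain its status in the other log:
    present, filtered out by log level, or genuinely absent."""
    if any(e["event"] == entry["event"] for e in other_log):
        return "present"
    if LEVELS[entry["level"]] < LEVELS[other_min_level]:
        return "filtered by log level"
    return "genuinely absent"
```

Under these assumptions, the `cache_miss` event is missing from the second log only because it sits below that log’s minimum level, whereas a missing event at ERROR severity would point to a real gap worth investigating.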
Kubernetes Enrichment
Kubernetes is an example of a platform where log enrichment is especially valuable, given the many components of Kubernetes and the difficulty of collecting and analyzing logs from all of them.
To make Kubernetes logging easier, LogDNA automates Kubernetes log collection from across all layers of the Kubernetes stack and provides a purpose-built Kubernetes enrichment feature. LogDNA Kubernetes enrichment displays Kubernetes events and metrics alongside your other logs. This makes it far simpler to determine how activity in Kubernetes orchestration services relates to activity within your applications or host infrastructure than analyzing Kubernetes logs and other logs separately and trying to piece together the relationships between them.
Although straightforward data analysis workflows may not require data enrichment, enrichment techniques are critical for quickly making sense of what is happening in complex, multi-layered environments. Data enrichment multiplies the value of your raw data. It makes it easier to get to the root cause of performance issues or identify broad trends that would not be evident from analyzing one log in isolation.