In the beginning, there was the Log – or to be a bit more precise, there were application logs. At least that’s how it was in the early days of application development, when raw log data itself was more often than not the point where troubleshooting began.
Now, of course, the starting point for troubleshooting with cloud-based applications is much more likely to be an automatically generated alert, or an indication on a monitoring dashboard that something isn’t quite right. But the troubleshooting road still leads to those application logs – where once they were the starting point, now they’re the destination.
Why is this true, and what can you do to get the most out of your logs as a troubleshooting resource? We’ll be looking at those questions in this post.
Logs, the Old-Fashioned Way
Back in the Good Old Days of application development (which were not that long ago, and to be honest, not that good), alerts typically consisted of after-the-fact evidence of problems – such as system breakdowns, security breaches, or performance degradation – with little or no advance warning.
To be ahead of the game, you needed to keep a close eye on application logs for signs of trouble. But with no real-time analysis and without a dashboard full of graphic visualizations, log analysis more often than not meant using command-line tools to convert raw log data to text format, then using search commands or visually scanning through the output.
You had to know what you were looking at, and what you were looking for – not an easy task when the evidence of impending trouble could consist of anomalous patterns of user access or resource use over a relatively long period of time. It was often easier and less time-consuming to wait until something went wrong, and then dig into the logs in the hope of tracing the problem back to its origin.
Monitoring Makes a Difference
Things have changed considerably since then, and when it comes to monitoring and analyzing logs and other indications of system behavior, they have definitely changed for the better. If you use first-rate monitoring tools, there’s a good chance that you’ll see trouble coming before it strikes; and if a failure does occur, you’ll likely be able to contain the problem at an early stage and minimize the damage.
But to be useful, monitoring tools must point the way to specific problems, and ultimately to specific instances of those problems, so that you can determine their nature, trace their origin, and fix them. This is obvious for security issues (who broke in, when, and where), but it is also true of seemingly more general problems, such as performance (which services get overloaded, where do the requests come from, and when does it happen).
To get the most out of cloud application monitoring, you need to use both monitoring tools and key metrics to move rapidly from indicating that an issue may exist to pinpointing its specifics. And that (spoiler alert!) will ultimately lead you back to logs.
Transactions, Traces, and Spans
How does this work? Consider the basic challenge of tracking down a software issue on a multi-user system with distributed services: It isn’t enough just to identify the type of issue and look for its origin. You need to identify the transactions that led to the problem, and then follow them back to where the problem began.
The trail of a transaction through the system is its trace. As the name implies, this trail consists of operations that are identifiable as part of the transaction. Each of these operations is in turn represented by a span – a set of timestamped records identifying the actions that make up the operation. Operations can have child operations; these are represented in the trace as child spans, giving the trace a treelike structure.
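To make that treelike structure concrete, here is a minimal Python sketch of spans linked into a trace. The field names (`span_id`, `parent_id`, and so on) are illustrative, loosely modeled on common tracing conventions rather than any particular vendor’s schema:

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Span:
    span_id: str
    operation: str
    start_ms: int                 # offset from trace start, in ms
    duration_ms: int
    parent_id: str | None = None  # None marks the root span
    children: list[Span] = field(default_factory=list)

def build_trace(spans: list[Span]) -> Span:
    """Link spans into a tree by parent_id and return the root span."""
    by_id = {s.span_id: s for s in spans}
    root = None
    for s in spans:
        if s.parent_id is None:
            root = s
        else:
            by_id[s.parent_id].children.append(s)
    return root

# A hypothetical three-span transaction: one root, two child operations
spans = [
    Span("a", "HTTP GET /checkout", 0, 120),
    Span("b", "auth-service.verify", 5, 30, parent_id="a"),
    Span("c", "db.query", 40, 70, parent_id="a"),
]
trace = build_trace(spans)
# trace is the root span, with the two child spans nested beneath it
```

The tree shape is exactly what a trace view renders: the root span at the top, child operations indented and time-offset beneath their parents.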
Note that a trace is not aggregate data abstracted from log records; it is the trail of a specific transaction, and it consists of individual spans with associated log data. This means that a monitoring and analytics service that allows you to visualize traces and drill down to specific spans at a detailed level has already done most of the heavy lifting in terms of log analysis.
Putting Traces to Work
Sumo Logic’s Traces feature, for example, relies largely on the same collectors and agents used by other Sumo monitoring services. With it, you can quickly produce a table of traces using one or more filter-driven queries based on factors such as error count, duration, span count, or the involvement of specific services. The table itself lists key items of information for each trace, including start time, duration (with a graphic breakdown by service), root service, span and error counts, and HTTP status. This is useful in and of itself, but it’s basically a starting point. From here, you can drill down to individual traces, spans, and ultimately, logs.
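The filtering logic behind such a table can be sketched in a few lines of Python. This is not Sumo Logic’s API – the trace summaries and filter parameters below are invented for illustration – but it shows the kind of narrowing a filter-driven trace query performs:

```python
# Hypothetical trace summaries, as a trace table might hold them
traces = [
    {"trace_id": "t1", "root_service": "checkout", "duration_ms": 120,
     "spans": 14, "errors": 0, "http_status": 200},
    {"trace_id": "t2", "root_service": "checkout", "duration_ms": 2300,
     "spans": 41, "errors": 3, "http_status": 500},
    {"trace_id": "t3", "root_service": "search", "duration_ms": 95,
     "spans": 6, "errors": 0, "http_status": 200},
]

def filter_traces(traces, min_errors=0, min_duration_ms=0, root_service=None):
    """Keep only the traces that satisfy every given filter criterion."""
    return [
        t for t in traces
        if t["errors"] >= min_errors
        and t["duration_ms"] >= min_duration_ms
        and (root_service is None or t["root_service"] == root_service)
    ]

# Traces that are both slow and failing – the ones worth drilling into
slow_failing = filter_traces(traces, min_errors=1, min_duration_ms=1000)
```

Combining filters this way is what turns thousands of traces into the handful that actually merit a closer look.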
When you click on a trace in the table, you will see a detailed, time-based view of the trace, with individual, color-coded and labeled spans shown in sequence, including duration and parent-child relationships. Spans that have errors are flagged visually for immediate recognition; you can also filter the view to display only spans with errors.
A display such as this can tell you a considerable amount at a glance. Along with clearly marked error spans, unusually long durations and atypical parent-child span chains are easy to detect. This, in turn, can provide you with valuable clues about possible sources of trouble, and about which spans require further investigation.
To drill down to a span, you click on its image in the Trace View window. This brings up a panel with detailed information about both the span and its context. At this point, you’re already deep into log data, but it is so clearly focused and so well-organized (under three separate tabs: Summary, Metadata, and Infrastructure) that it bears little resemblance to the raw logs of the Not-So-Good Old Days.
This data includes detailed information about not only the span itself, but also the cloud, image, and Kubernetes environment, including parent span, container, pod, and host IDs. It also includes detailed data about key infrastructure elements, along with links for troubleshooting the selected element using Sumo Logic’s analytics and visualization features. And that brings us back down to the logs themselves:
If you do need to do a log search, the span information panel includes links to targeted log searches (for both trace and span IDs), and Kubernetes-based infrastructure log searches. The troubleshooting links for key infrastructure elements also include log searches targeted to those elements. You can also use Sumo Logic’s extensive search features to run queries based on detailed information on the span, or for multiple traces and spans.
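A trace- or span-targeted log search boils down to filtering structured log records on their ID fields. The sketch below is generic Python, not Sumo Logic query syntax, and the record fields are invented for illustration – but it captures why an ID-targeted search is so much faster than scanning raw logs:

```python
# Hypothetical structured log records, each tagged with trace and span IDs
logs = [
    {"ts": "12:00:01Z", "trace_id": "t2", "span_id": "s7",
     "level": "ERROR", "msg": "db timeout"},
    {"ts": "12:00:01Z", "trace_id": "t2", "span_id": "s3",
     "level": "INFO", "msg": "request received"},
    {"ts": "12:00:02Z", "trace_id": "t9", "span_id": "s1",
     "level": "INFO", "msg": "healthcheck"},
]

def logs_for(logs, trace_id, span_id=None):
    """Narrow a log stream to one trace, optionally to one span within it."""
    return [
        rec for rec in logs
        if rec["trace_id"] == trace_id
        and (span_id is None or rec["span_id"] == span_id)
    ]

trace_logs = logs_for(logs, "t2")        # every record for trace t2
error_log = logs_for(logs, "t2", "s7")   # just the record for span s7
```

Because every record carries its trace and span IDs, the search lands directly on the handful of lines that matter, rather than everything the application ever logged.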
So, yes, it may still come down to logs. But how you get there makes a difference. When you use a sophisticated, full-featured monitoring service to drill down from traces through spans to closely targeted identifiers in your application logs, the process is fast, eliminating the time-consuming (and often maddening) bottlenecks of an untargeted raw-log search. These days, it’s easy to tame your logs, and not be tamed by them.