The day structured logging was deployed into production was the happiest day of my career.
Picture this scene:
I’m working my first technical job after spending 11 years in the Marine Corps, two years on the family farm, and 4 months in a Galvanize Data Science bootcamp (which was just like Marine boot camp, but with more hoodies). It’s 23:00 on a Thursday night and I’m on the phone with one of our company’s major customers. Their integration with our service isn’t working, and it’s my job to figure out why.
Things are not going well.
For starters, I have no idea if the problem’s on their end, our end, or somewhere in between. The tech I’m on the phone with can’t figure out where the SDK logs are being stored, so there’s no chance of finding errors on their end.
I’ve got access to our centralized logging system, but that isn’t exactly doing me any favors. The customer’s server is buried behind who knows how many different enterprise firewalls, so searching by source IP is out of the question. Also, while I’m sure the log messages will make perfect sense to our developers in the morning, as far as I’m concerned, they may as well be PC LOAD LETTER repeated over and over again.
I know better than to play the “Sorry, I’m the new guy...” card, but I’m quickly running out of options.
Eventually we cobble together enough bits and pieces of evidence that conditional probability points to it being a problem on their end. I stumble off to bed with the imprint of the computer screen still on my eyes.
When I wake up, I can’t shake the feeling that there has to be a better way. Like any Marine, I share my problems openly and honestly. One-on-one, after hours, over beer, with a buddy who happens to work on the company’s security team. Coincidentally, it syncs up with an initiative his team is working on for in-app security visibility. We go back and forth on what our various needs are and make a plan to meet up with Mike, one of the lead developers on the Engineering team, who happened to write another blog post on structured logging just last week.
Lo and behold, when we sit down with Mike, it turns out that structured logging has been on his wishlist for a while now and he’s been kicking around ways to implement it in Log4j.
We’ve established a need that product, engineering, and security are all actively interested in? This is starting to feel like we might be onto something.
The three of us sit down and hammer out a logging structure based on our personal wishlists as well as various sources that we find on the web. Hat tip to Enterprise Ready’s Audit Logs post.
We settle on the following structure, a combination of developer-set variables and things that the logging system will implement automagically.
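As a rough illustration of that split, and not our actual schema, here is a minimal Python sketch using the standard `logging` module. Every field name (`client_id`, `action`, `sdk`, `source`), the `fields` key, and the logger name are hypothetical, chosen only to show developer-set values sitting next to ones the logging system fills in by itself.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        entry = {
            # Filled in automatically by the logging system.
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "source": f"{record.module}.{record.funcName}:{record.lineno}",
            "message": record.getMessage(),
        }
        # Developer-set fields, passed via extra={"fields": {...}} (hypothetical convention).
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("structured-demo")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
log.info(
    "report submission failed",
    extra={"fields": {"client_id": "acme-corp", "action": "submit_report", "sdk": "python"}},
)
print(buffer.getvalue().strip())
```

The developer only supplies the message and the tags that describe the business context; the timestamp, level, and code location come along for free, which is what makes it possible to trace a log line back to the code that produced it.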
If developers are the gods of a system, then having structured logging makes me feel like a legitimate demigod.
For starters, I can home in on customer data with what feels like superhuman speed. Once I find the logs, I can understand them quickly, find the code which generated them, identify the developer responsible for the area where any errors are occurring, and figure out if the same error is impacting any of our other customers. Compared to what I had been struggling with, this feels like a form of near omniscience.
Those powers come across to our customers too. Being able to verify a customer issue in seconds (sometimes before they even know it’s happening) sets the tone for whatever the remediation process is going to be - it lets them know that we’ve got their back and are capable of understanding and fixing whatever’s gone wrong. This rollout has saved me hours and hours per week, and has created a lot of goodwill with the folks who keep us in business.
And it’s not just me: the superpowers of structured logging are transferable.
The logging structure enables non-technical members of our product team to handle support calls that previously had to wait to be escalated. Our customer support and sales people can dive into the centralized logging system and pull up dashboards and filters to identify and isolate issues without having to know the ins and outs of our system’s search query language. Issues that Customer Success and Sales do need to escalate arrive with links to the exact same searches and filtered dashboards, giving the technical team a leg up on dealing with the problem.
On the security side, the tagging system makes it possible to slice security events in dozens of ways - by IP, client, action, application, API call, SDK, etc. The structure also enables anomaly detection built into the application itself, which makes it possible to alert on unusual or unexpected events, like calls to protected internal APIs without a session hash. If anything unusual happens, the security team can go straight to the developers responsible for the code to begin investigating.
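To make the slicing and alerting idea concrete, here is a small Python sketch that works on already-parsed structured events. It is an illustration under assumptions, not our production code: the event fields (`internal`, `session_hash`, and the rest), the sample values, and both helper functions are hypothetical.

```python
from collections import Counter

# Hypothetical structured security events; every field name and value is illustrative.
EVENTS = [
    {"ip": "203.0.113.7", "client": "acme-corp", "action": "login",
     "api_call": "/v1/session", "sdk": "python", "internal": False,
     "session_hash": "a1b2c3"},
    {"ip": "203.0.113.7", "client": "acme-corp", "action": "submit_report",
     "api_call": "/v1/reports", "sdk": "python", "internal": False,
     "session_hash": "a1b2c3"},
    {"ip": "198.51.100.24", "client": "globex", "action": "read_config",
     "api_call": "/internal/config", "sdk": "java", "internal": True,
     "session_hash": None},
]


def slice_by(events, field):
    """Count events per distinct value of one tag (ip, client, action, ...)."""
    return Counter(event.get(field) for event in events)


def missing_session(events):
    """Return calls to protected internal APIs that carry no session hash."""
    return [e for e in events if e.get("internal") and not e.get("session_hash")]


print(slice_by(EVENTS, "client"))
for event in missing_session(EVENTS):
    print("ALERT:", event["ip"], event["api_call"])
```

Because every event carries the same tags, the same two helpers work for any field, and a rule like "internal call without a session hash" becomes a one-line filter instead of a regular expression over free-form text.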
There are a lot of wins here for everyone involved, and I’d like to thank Mike and the rest of the team for their hard work, insight, and willingness to listen to the new guy.
This is Part 2 of a 2-Part series on Structured Logging. Read the first blog post here.
Interested in learning more about TruSTAR Engineering? Drop us a line at firstname.lastname@example.org.