How Innocent SQL Statements Create Multiple Sources of Truth
Mar 26, 2025
There are usually a few different ways to write a SQL statement and produce the exact same output. But while the output is the same, the statements aren’t equal. In fact, the way that you write SQL can reinforce bad architectural practices and introduce bugs down the road. To avoid these issues, it’s important to understand how good SQL can become bad architecture.
Here’s one example that I’ve seen over and over again across my 15 years of leading analytics teams at Amazon, eBay, VMware, and other companies.
Background
When I was at eBay, one of my partner analytics teams owned a metric for calculating losses. The calculation wasn’t technically complex, but there were 9 different fields that had to be pulled into the calculation. It was the equivalent of “x + y - z + (a*2)”. Again, nothing that can’t easily be handled in SQL. When the team originally created this formula, I’m assuming that they did what any analyst or engineer likely would have done. They wrote this formula as part of their “select” statement in SQL, generated their output, and went about their day.
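As a rough illustration (the table and column names here are invented, and the formula is simplified to four of the nine fields), the calculation likely lived directly inside the SELECT statement, something like:

```sql
-- Hypothetical sketch: the loss formula embedded directly in a query.
-- Table and column names are invented for illustration only.
SELECT
    transaction_id,
    -- the "x + y - z + (a*2)"-style formula, inlined in the SELECT
    item_cost + shipping_cost - recovered_amount + (chargeback_fee * 2)
        AS total_loss_dollars
FROM fraud_transactions;
```

Nothing about this query is wrong on its own. The trouble starts when that expression gets copied.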
Given that this team owned the calculation of total dollars lost, the calculation became a metric used in all of their standard reports. The team repeated that same formula in each of the SQL statements that generated their Excel reports. They also used the formula in their custom calculations inside of Tableau. To an outside observer, and possibly to the team, everything probably appeared to be working very well. So, let’s fast forward a bit.
I was working in a different team and I needed to provide an estimate of dollars saved through my fraud mitigation practices. I did what hopefully any analyst would do. I contacted the source team and asked them to explain how they calculated the total lost dollars for their organization. They gave me about 15 minutes of education and then emailed me about 20 lines of SQL code to use in my analysis and future reports. Piece of cake. Unfortunately, unbeknownst to the team, this had issues written all over it.
A Common Practice
On the surface, there probably doesn’t seem to be anything wrong with this practice. Almost every analytics team operates this way, so how could it be wrong? Compared to other situations, this might actually seem like a great practice.
For example, large companies employ so many people that it can be hard to find the owner of the data you’re looking at. Worse, employee turnover is often significant, and sometimes as an analyst you’ll come to find that there isn’t really an owner of the data or code at all. Technically someone might own it, but in plenty of cases the “owner” has little knowledge of the data.
This challenge of being able to quickly find a knowledgeable owner frequently leads to data analysts blindly using data and tables that they find. But in this case, we actually spoke with a human, giving us confidence that we had the right query. However, the accuracy of the code wasn’t the problem. The problem had to do with what was almost guaranteed to happen in the future.
Anticipating a Future State
Things change. There’s no getting around change in your data environment. Tables are deprecated, fields are renamed, and code is changed. But with SQL, sometimes your old code will keep running after these changes and sometimes it won’t. It just depends on the change. This means that one of two things can happen when a change breaks your logic.
First, the change could cause every related SQL script to break and throw errors when it runs. These aren’t fun issues to wake up to: your team has a standard report that was working perfectly just the other day, and now it’s suddenly broken and you don’t know why, because you didn’t change the code on your side. It’s not fun to deal with, but at least this isn’t as bad as the second possibility.
Second, if the change doesn’t cause an error to be thrown, that doesn’t mean you don’t have a bug. It just means that you’re likely going to have a hard time ever realizing that you have one. And when you do realize that something is wrong, you’re likely going to have an even harder time figuring out where the bug came from. Trust me when I tell you that you’d rather have a query that throws an error and is completely broken. But you really don’t want a bug in the first place.
As an analyst or engineer, you won’t know when you’re going to need to make a change but you should anticipate that you’ll need to at some point. And when you do make a change, you’re going to need to avoid breaking things or causing headaches for other people as well.
The partner team that I was working with had built a custom formula into their report. This was perfectly acceptable if it was a one-time need. But the moment that the code was copied and pasted into another script or dashboard, this should have been a red flag. In the software development world, the best practice is to always keep your code DRY - Don’t Repeat Yourself.
By duplicating the code and formula, the team had introduced the risk of multiple sources of truth. If an update was ever required, the team would have to change their code in multiple locations. And each of those changes would require quality assurance (QA) checks. Or at least they should require QA checks, though not every team treats that as standard practice. And all of that assumes you can locate every place the code was deployed.
If you’re a software developer reading this, you’re probably thinking, “That’s easy, just grep the code repository.” Unfortunately, rarely do analytics teams use a code repository. If they have one, chances are it’s not consistently used. And if the repository isn’t consistent it means that you have gaps. Now, you’re back to square one with not knowing all the places where this code could be located.
Talk about a nightmare. And this is what analytics teams must contend with, at least if they’re thinking about the broader ecosystem, whenever they need to make a change. Yet these potential impacts are frequently overlooked.
In my situation, where I was borrowing the other team’s formula, how would I ever know if that team changed their code? How would that team ever know that I was using their code? Technically, there are a few ways to determine this, but there are almost always gaps in those approaches. Ideally, the code is architected to avoid these problems to begin with.
Building a Better System
There are two things the team could have done to mitigate the risk of the formula changing, which, in this case, eventually did happen.
First, they could have rebuilt their source table, or had a data engineering team rebuild it. Instead of writing the formula in SQL queries, the formula could have been written into an ELT (extract-load-transform) script, with the result stored as a dedicated field in the database table. With this method, any time the formula changes, an analyst or engineer only has to change the code in one location.
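Sketched in SQL (again, table and column names are hypothetical), the transform step would compute the loss once and persist it, so every downstream report simply reads the stored column:

```sql
-- Hypothetical transform step in an ELT script: compute the loss once
-- and store it as a column. Names are invented for illustration only.
CREATE TABLE fraud_transactions_enriched AS
SELECT
    t.*,
    t.item_cost + t.shipping_cost - t.recovered_amount
        + (t.chargeback_fee * 2) AS total_loss_dollars
FROM fraud_transactions AS t;

-- Downstream reports select the precomputed field instead of
-- repeating the formula:
SELECT SUM(total_loss_dollars)
FROM fraud_transactions_enriched;
```

If the formula ever changes, only the transform script is edited; every report picks up the new definition automatically.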
Another option would have been for the team to create a View. If rebuilding the database table wasn’t an option, they could have written a View with their formula included. This would have acted as a layer over the top of the table. Again, this would ensure that there would only be one place where the formula would need to be changed.
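A minimal sketch of that view, using the same hypothetical names as before, might look like:

```sql
-- Hypothetical view: the formula now lives in exactly one place,
-- layered over the top of the source table.
CREATE VIEW v_fraud_losses AS
SELECT
    transaction_id,
    item_cost + shipping_cost - recovered_amount + (chargeback_fee * 2)
        AS total_loss_dollars
FROM fraud_transactions;

-- Reports and dashboards query the view, never the raw formula:
SELECT SUM(total_loss_dollars)
FROM v_fraud_losses;
```

Unlike the rebuilt table, a view typically requires no change to the underlying data pipeline, which can make it the easier option when the analytics team lacks permissions to modify ELT jobs.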
Why Didn’t These Things Happen?
I can’t say what the exact reason was as to why the team didn’t build their environment this way. But I can share three key reasons why this tends to occur.
The first reason is the gap of knowledge and training. In order to suggest these solutions, analysts must first be aware that there are potential problems related to their work. If an individual analyst is working in an isolated manner, thinking only about their code and not the impact of the team and systems, then it wouldn’t be reasonable to expect the analyst to create a solution to a problem that they don’t even know exists.
This gap of knowledge and training is the primary reason why I see issues like this. Across my 25 years in the tech space and 15 years leading analytics teams at Amazon, eBay, VMware, and others, I’ve yet to encounter analytics teams that have been empowered with the necessary training and knowledge to avoid these issues.
The second reason has to do with architecture. In some environments, analysts are restricted from creating these solutions. In some environments a data engineering team would be required to update an ELT script or even create a View. Even if the analytics team has the ability to make such changes, those changes would generally be avoided because additional key architectural components are frequently missing from the analytics environment. And without those components, making changes could be risky and could actually create multiple sources of truth!
The third reason has to do with culture. To avoid creating multiple sources of truth and to create a well-architected environment, the company must have a culture of knowledge sharing and a company-wide analytics mindset. While analysts are always part of a team, code is not necessarily written for the benefit of other analysts and engineers. Instead, much of it is written with only the individual analyst’s use in mind. Rarely are all pieces of code and related assets built with a broader company perspective.
Conclusion
To avoid creating multiple sources of truth and costly bugs, it’s important to create an analytics environment that minimizes potential issues. To take the first step of building such an environment, begin by elevating the knowledge and skills of your analytics team.
By simply being aware of these potential issues, your team will be armed with knowledge to ask the right questions and seek out guidance. But the right culture must also exist.
The second step is to build a culture where data teams care about the entire data ecosystem, not just their individual work. Analysts will need support to deliver the right results, even when pressured to build quickly at the expense of best practices.
When you build the proper analytics environment and culture, you’ll increase the efficiency and effectiveness of your organization. You’ll also make inroads towards solving issues of multiple sources of truth in your data environment.
Brandon Southern, MBA, is a Sr. Director of Analytics and the founder of Analytics Mentor. With a 25-year career in the tech industry, Brandon has excelled in diverse roles encompassing analytics, software development, project management, and more. He has led analytics teams at Amazon, eBay, VMware, GameStop, and more.
You can learn more about Brandon and Analytics Mentor at http://www.analyticsmentor.io/