Queueing at Redox part 3: Idea to implementation
Sep 13, 2021
Our entire organization is dedicated to solving problems slowing down the frictionless adoption of technology in healthcare. Within engineering at Redox we have developed a few strategies for breaking down complex, nebulous problems and solving them quickly, safely, and effectively.
Occasionally a particularly tenacious issue pops up that requires the entire toolbox to be thrown at it. A recent occurrence of this at Redox involved our patented queueing system, which we developed to handle large amounts of asynchronous traffic between our customers. For a long time we had occasional performance issues within this system, triggered by a large backup or spike in traffic.
Today we’ll discuss some of the details of this particular issue, along with the various strategies we used to break it down and chip away at it.
Background
In order to understand this issue, let's take a quick look at how our queueing system works and why it exists. The short rationale for why we don’t use a traditional off-the-shelf FIFO queueing system is that “one at a time” is simply too slow. If we were to process messages for our customers one at a time, it would be impossible to keep up. We also can’t process truly in parallel, because order is very important with healthcare data. If a patient is discharged before they are admitted, things get weird. This is the problem from which our custom queueing solution was born. It enables us to perform partial concurrent, or parallel, processing on our messages, then line messages back up sequentially for the stages of processing where order matters. This means slower, non-order-dependent stages, such as parsing XML, happen in parallel, while sending translated messages to the destination customer happens in FIFO order.
In our system today we have hundreds of queues. Each “queue” represents one half of a message's journey, either from a customer to us or from us to a customer. We also split out separate queues for each data model a customer uses, since FIFO only needs to be enforced on a per-data-model basis.
To implement this queueing system we use a humble Postgres database. All of our messages are inserted into the same table, and workers of different “flavors” search for and perform work on relevant messages as they move through various statuses and complete both the concurrent and sequential stages of processing. If you haven’t read part 1 of this series yet, it takes a deeper dive into the reasons for, and implementation of, our parallel FIFO queue, which will help provide a bit more context for this story.
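As a rough illustration of the shape of that pattern, here is a minimal sketch of a worker claiming a batch of messages for the concurrent stage from a single table. The table name, column names, status values, and the use of node-postgres are assumptions made for the example, not our actual schema or code.

import { Pool } from 'pg';

const pool = new Pool(); // connection settings come from the usual PG* environment variables

// Claim a batch of messages that are ready for the concurrent (order-independent)
// stage. FOR UPDATE SKIP LOCKED lets many workers pull from the same table
// without contending for the same rows.
async function claimConcurrentBatch(batchSize: number) {
  const { rows } = await pool.query(
    `UPDATE messages
        SET status = 'processing_concurrent'
      WHERE id IN (
              SELECT id
                FROM messages
               WHERE status = 'ready_for_concurrent'
               LIMIT $1
                 FOR UPDATE SKIP LOCKED)
      RETURNING id, queue_id, payload`,
    [batchSize]
  );
  return rows;
}

Sequential-stage workers follow the same general idea, but they also have to respect per-queue ordering, which is where things get trickier.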
The problem
Generally, our queueing system is able to keep up with traffic from our customers in essentially real time, with very low latency and infrequent backing up of traffic. However, there are a few reasons outside of Redox’s control that message queues for one or more customers may build up a depth, or backlog, to be processed. This could happen when a destination is down or encountering an error, or when a customer has to perform a backload to bring a new connection online, among other reasons. The database backing our queueing system had been performing very well during our normal daily traffic, but performance would degrade significantly if we ended up with a large backup of messages.
How we tackle problems
Experimentation
Our queueing system is composed of several different processing steps that each message moves through on its way to completion. Each of these steps comes with its own query or queries powering it, and even during load testing it was not always obvious which queries within the system were the main offenders when it came to our performance issues. We would see the database CPU spike, which would make queries slower, but a chicken-and-egg problem developed: we weren’t sure if the queries were slow because of high CPU or if the CPU was high because of slow queries. One of our engineers came up with the idea of doing a load testing round where we isolated each phase of our processing before moving on to the next, similar to how ships going through the Panama Canal must wait in a lock before moving on to the next part of the canal. During this experiment, we were able to more accurately identify which specific stages of processing were putting the most load on the database. For a deeper dive into the story behind the test, be sure to check out part 2 of our queueing series.
Level up
With the issue in the back of our minds, many on our team started to utilize our Learning Fridays, days when engineers can engage in whatever best supports individual or team growth, to level up our understanding of Postgres query performance optimization. This level-up has paid dividends for the team, as the learnings have been applied multiple times even outside of this particular issue.
Using some of the skills we had learned, we started to look for ways to improve the problematic queries we had identified during our Panama Canal experiment. We were able to rearrange our sequential message finding query so that performance no longer degraded with queue depth, but instead with the total number of queues we have on the database. This is much more stable than queue depth and easier for us to control if needed in the future. Unfortunately, we still weren't able to crack the code on the concurrent query, which was our chief offender.
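We won’t share the production query here, but to give a sense of the rearrangement, here is a hypothetical sketch of a “next sequential message per queue” lookup whose cost scales with the number of queues rather than their depth. It reuses the pool and the made-up schema from the sketch above; the queues table, sequence_number column, and status values are illustrative assumptions.

// Hypothetical shape of a "find the next sequential message per queue" query.
// The lateral join looks at only the head of each queue (one indexed lookup
// per queue), so the cost tracks how many queues exist, not how deep they are.
async function findNextSequentialMessages() {
  const { rows } = await pool.query(
    `SELECT next_msg.*
       FROM queues q
      CROSS JOIN LATERAL (
            SELECT m.id, m.queue_id, m.sequence_number
              FROM messages m
             WHERE m.queue_id = q.id
               AND m.status = 'ready_for_sequential'
             ORDER BY m.sequence_number
             LIMIT 1) AS next_msg`
  );
  return rows;
}

An index along the lines of (queue_id, status, sequence_number) is the kind of thing that keeps each of those per-queue lookups cheap.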
We need answers
As the number of connections on our network grew, so did the frequency and impact of these incidents. We were running out of small tweaks, and also out of time to keep iterating, so we opened ourselves up to more intrusive options. These tend to be solutions that take longer to implement, carry more risk, or generally involve larger changes to how our system works. We had all sorts of theories on how we could re-architect to eliminate this issue, but all of them took substantial enough effort that we couldn’t risk taking the time to implement one just to find out that it did nothing to solve our current problem, or that it introduced a new one. This predicament led us to come up with a new tool for problem-solving.
Hacking away
Typically when we’re approaching a project and we aren’t quite sure how we want to accomplish our end goal, we’ll start with what we call a “Design Alternatives” document. This consists of us writing down our ideas for ways to implement a solution, followed by some analysis of the pros and cons of each approach and some discussion on which way we’d like to proceed. With this particular issue, however, we had extremely high uncertainty in our ideas, each requiring extensive testing or POC development, so we didn’t feel comfortable with our typical design process.
Roughly equivalent to a hackathon, we decided to dedicate a full day for our entire team so that each person, or pair, could either create a proof of concept for, or take a deep-dive analysis into, one of the ideas we thought had promise. The day before, we met for a short while to brainstorm and describe the solution each of us wanted to look into. The day of, each person spent as much time as they could focusing on just that idea, with a meeting in the afternoon for us to come together, discuss progress, potentially pivot or abandon ideas, or come up with solutions to pitfalls we were running into. By the end of the day we had eliminated a few ideas for various reasons, such as the complexity of the solution or its ineffectiveness at solving the problem as we had hoped, but more importantly, we had identified two solutions that did show some promise.
Iterate! Iterate! Iterate!
Once again, creativity struck. We had an idea for yet another solution to our woes. This one was the same in principle as one of our favorites, but was a significantly simpler implementation and didn't introduce any new tools into the fray. Utilizing Learning Friday once again, we were able to create a POC of this idea and prove whether it had promise. In this particular case, the POC was a raging success, proving just as effective as our more complicated solution, but much, much simpler to bring to reality, and with no added infrastructure costs.
Light at the end of the tunnel
What was the idea?
Because we couldn’t figure out a way to make our concurrent query perform well when faced with significant message depth, we did the next best thing: we made it impossible for there to be a high depth of messages waiting for concurrent processing! This may sound like sidestepping the problem because, to be honest, it is. By creating a holding pen for excess messages and only letting out a controlled number at a time, we were able to maintain the performance of our concurrent query without introducing a significant amount of latency to processing. We also built in circuit-breaker-style logic so that the holding pen is only used if the depth of a particular queue starts to grow above a certain threshold, ensuring that our normal traffic is not complicated or slowed down by this extra stage in processing. Once that threshold is met, instead of putting new messages directly into the queue as “ready for concurrent,” we place them in the metaphorical holding pen. We then have a very low-overhead process that simply checks for messages in the holding pen and, if there’s room in the queue, moves them over.
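To make the holding pen a little more concrete, here is a simplified sketch of the enqueue-time circuit breaker and the low-overhead release process, continuing the hypothetical schema and node-postgres setup from the earlier sketches. The threshold value and status names are made up for illustration.

const HOLDING_PEN_THRESHOLD = 10_000; // illustrative threshold, not our real setting

// At enqueue time: if the queue is already backed up past the threshold,
// park the message in the holding pen instead of making it visible to the
// concurrent-stage workers.
async function enqueueMessage(queueId: string, payload: unknown) {
  const { rows } = await pool.query(
    `SELECT count(*) AS depth
       FROM messages
      WHERE queue_id = $1
        AND status = 'ready_for_concurrent'`,
    [queueId]
  );
  const status =
    Number(rows[0].depth) >= HOLDING_PEN_THRESHOLD ? 'held' : 'ready_for_concurrent';
  await pool.query(
    `INSERT INTO messages (queue_id, status, payload) VALUES ($1, $2, $3)`,
    [queueId, status, JSON.stringify(payload)]
  );
}

// Low-overhead release process: whenever there is room under the threshold,
// let the oldest held messages back into the queue in order.
async function releaseHeldMessages(queueId: string) {
  const { rows } = await pool.query(
    `SELECT count(*) AS depth
       FROM messages
      WHERE queue_id = $1
        AND status = 'ready_for_concurrent'`,
    [queueId]
  );
  const room = HOLDING_PEN_THRESHOLD - Number(rows[0].depth);
  if (room <= 0) return;

  await pool.query(
    `UPDATE messages
        SET status = 'ready_for_concurrent'
      WHERE id IN (
              SELECT id
                FROM messages
               WHERE queue_id = $1
                 AND status = 'held'
               ORDER BY sequence_number
               LIMIT $2
                 FOR UPDATE SKIP LOCKED)`,
    [queueId, room]
  );
}

Running something like releaseHeldMessages on a short interval keeps latency low in the common case while capping the number of rows the concurrent query ever has to consider.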
Happily ever after
With an idea we had confidence in, we did one final design pass to flesh out the details of our chosen solution. With our cards created, estimated, and prioritized, we went on our way and got the project knocked out. The immense power of Learning Friday was on grand display for this project. We did discover a workable solution through our organized design process, but giving our engineers the flexibility to act on a hunch and spend some time POC-ing wacky ideas led us to a slight modification of that idea, one that avoided adding an additional, complex dependency to our infrastructure and, on top of that, took the project from an estimated 3-5 sprints down to 1. Our depth-induced queueing issues are no more, and peace was brought upon the land for 1000 years... Or at least until the next problem pokes its head up.
This is part three of a series. Check out part one and part two to learn more.