Your System Isn’t Slow – Your Bottleneck Isn’t Optimized

bmckay959

I’ve been working on a fairly large ETL project with around 300B rows in a table and roughly 6TB of compressed data. I built the system using Laravel and ran into some interesting technical challenges that I hadn’t faced before.

Honestly, I didn’t plan to write this, but after chatting with Ruben over at Spatie, he encouraged me to share. So here goes.

Some of the issues we’ve come across are disk saturation, ElastiCache serverless shard limits, large database tables and more. Most of these issues are terribly hard to test and predict prior to pushing to production, so you deploy your updates only to watch your system come to a screeching halt. I’ve only done that about 100 times during this project. Think of this as me trying to help you not make the same mistakes that I made.

What is a bottleneck?

In its simplest form, a bottleneck is any part of your system that is slower than the rest, creating a backlog. It could be:

  • Database inserts
  • Disk I/O saturation
  • CPU limits
  • External API constraints

The list is really limitless and depends on each system, but I would like to emphasize the ‘creating a backlog’ comment from above.

Why not just fix the bottleneck?

To be clear, fixing or optimizing the bottleneck should be your first move. But sometimes you hit what I call the ‘reasonable limit’.

Sure, you could improve performance by 1% by spending an extra $5,000 per month on a server upgrade. But that’s not a reasonable tradeoff for our system, though Amazon would make that tradeoff in a heartbeat!

Sometimes, the bottleneck is completely out of your control, like an external API’s rate limit. In those cases, optimization isn’t even an option.

Optimizing the wrong thing doesn’t fix the problem

This one stings a bit…

I’ve optimized multiple parts of the system, only to realize the amazingly awesome optimization I made didn’t actually improve the speed of the system. Why? Because the bottleneck was still the bottleneck.

Think of this like a highway: you upgraded from 4 to 8 lanes so more cars could get through, but the exit ramp was the actual issue, so no meaningful improvement was made. Cars were still backed up because you optimized the wrong thing.

When that happens, it usually means you’re not measuring the right parts of your system. You’re flying blind.

In our case, I built a Grafana dashboard to monitor CPU, memory, and disk stats. Then I used Spatie’s Prometheus package to track application-specific metrics:

  • How many files just got added?
  • What type of files?
  • How big are they?
  • What state are they in?

This was helpful because it allowed me to spot patterns in the process. I would notice my job throughput dropping significantly at times, only to return to normal an hour or two later. It turned out the files being processed during those windows were simply larger on average: the same number of bytes was being processed, just spread across fewer files. Had I only been tracking file counts and not file sizes, I would have spent time trying to optimize that file type, only to discover that no optimization was actually needed.
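For reference, registering gauges like these with spatie/laravel-prometheus looks roughly like the sketch below. The gauge names, the `files` table, and its `status`/`size_in_bytes` columns are all made up for illustration, and the exact fluent API may vary between package versions, so treat this as a sketch rather than a drop-in snippet.

```php
// Inside the PrometheusServiceProvider that the package publishes.
// Table, column, and gauge names are illustrative only.
use Illuminate\Support\Facades\DB;
use Spatie\Prometheus\Facades\Prometheus;

public function register(): void
{
    // How many files are waiting to be processed?
    Prometheus::addGauge('etl_pending_files')
        ->value(fn () => DB::table('files')->where('status', 'pending')->count());

    // How many bytes do those files represent? Tracking bytes alongside
    // counts is what revealed the "throughput drop" was just bigger files.
    Prometheus::addGauge('etl_pending_bytes')
        ->value(fn () => (int) DB::table('files')->where('status', 'pending')->sum('size_in_bytes'));
}
```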

Keep your bottlenecks busy and in optimal conditions

Let’s go back to that external API example. Say we have a hard limit of 60 requests per minute, but need to send 600 requests.

Laravel’s first-party queue system really helps here. We push the 600 requests to the background, then ensure we send at most 60 requests per minute (OK, maybe 55 to be safe, but you get the point). Now we know those jobs will take about 10 minutes to process. No big deal: we pushed the bottleneck to the background to work at max throughput while the other parts of the system keep working as well.
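Here’s a minimal sketch of what that rate-limited job could look like, using Laravel’s Redis throttle helper. The job name, endpoint, and payload are placeholders invented for the example.

```php
<?php

namespace App\Jobs;

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Support\Facades\Http;
use Illuminate\Support\Facades\Redis;

class SendApiRequest implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable;

    public function __construct(public array $payload)
    {
    }

    public function handle(): void
    {
        // Allow at most 55 of these jobs to run per 60 seconds across all
        // workers; anything over the limit gets released back onto the queue.
        Redis::throttle('external-api')
            ->allow(55)
            ->every(60)
            ->then(function () {
                Http::post('https://api.example.com/endpoint', $this->payload);
            }, function () {
                $this->release(15); // try again in about 15 seconds
            });
    }
}
```

Dispatch all 600 jobs up front with `SendApiRequest::dispatch($payload)` and the throttle keeps the external API, our bottleneck, fed at its maximum allowed rate while the rest of the system carries on.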

In our ETL system, each file type takes a different amount of time to process 100 records:

  • File type “A”: 10 seconds
  • File type “B”: 30 seconds

This variation affects our database bottleneck. Type A pushes data fast enough to max out throughput, but type B uses only ~30% of the database capacity.

Scaling up workers for type B to hit max throughput causes problems when type A comes back, suddenly overwhelming the database.

Decoupling steps to smooth throughput

Initially, we used a single job to extract the records and insert them into the database. This works well enough for most systems and is a perfectly reasonable design decision when starting out. However, as our system scaled, this single-step process made it difficult to keep our database running consistently at maximum throughput. Some files would cause the database inserts to run at maximum throughput, while others ran them at only around 10%.

Our solution was to decouple these operations into two separate jobs. The first job extracts the records and writes them to a CSV file. The second job imports that CSV into the database. This allowed us to run the extraction jobs aggressively in parallel without overwhelming the database, and it made the database imports predictable, consistently running at maximum throughput regardless of the original file type. The hard work has already been done in the extraction job.
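A condensed sketch of the two jobs might look like this (both classes shown in one block for brevity). The class names, queue name, table, and columns are placeholders, and the real file-type-specific extraction logic is elided.

```php
<?php

namespace App\Jobs;

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Support\Facades\DB;

class ExtractFileToCsv implements ShouldQueue
{
    use Dispatchable, Queueable;

    public function __construct(public string $sourcePath, public string $csvPath)
    {
    }

    public function handle(): void
    {
        $out = fopen($this->csvPath, 'w');

        // The expensive, file-type-specific parsing lives here and can be
        // scaled out aggressively without touching the database.
        foreach ($this->extractRecords($this->sourcePath) as $record) {
            fputcsv($out, $record);
        }

        fclose($out);

        // Hand the cheap, uniform work to the import queue.
        ImportCsvIntoDatabase::dispatch($this->csvPath)->onQueue('imports');
    }

    private function extractRecords(string $path): iterable
    {
        // ... file-type-specific parsing elided ...
        yield from [];
    }
}

class ImportCsvIntoDatabase implements ShouldQueue
{
    use Dispatchable, Queueable;

    public function __construct(public string $csvPath)
    {
    }

    public function handle(): void
    {
        $in = fopen($this->csvPath, 'r');
        $batch = [];

        while (($row = fgetcsv($in)) !== false) {
            $batch[] = ['value_a' => $row[0], 'value_b' => $row[1]];

            // Insert in fixed-size chunks so every import job hits the
            // database with the same, predictable load.
            if (count($batch) === 1000) {
                DB::table('records')->insert($batch);
                $batch = [];
            }
        }

        if ($batch !== []) {
            DB::table('records')->insert($batch);
        }

        fclose($in);
        unlink($this->csvPath);
    }
}
```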

I like to think of this as optimizing to the bottleneck, not around it.

Yes, doing this delays the first record getting into the database, but it significantly brings forward the last record being written. That was the real goal.

When to optimize versus just throwing more resources at the problem

It can be hard to know exactly when to do which, especially if you are the sole developer/owner of the project. I really think of it as a pendulum that swings back and forth. To start, you should probably just upgrade the 2GB RAM, 1 CPU server. No big deal, you went from $5/month to $15/month. But at some point the $15/month becomes $1,500/month and the next upgrade is $3,000/month. Now, spending a week on an optimization that lets you stay at $1,500/month really starts to make sense. So you work on the optimization, it works great, and performance improves. But, unfortunately, the next optimization isn’t quite as easy and will take 8 weeks to build, so upgrading the server starts to look like a good choice again.

You won’t get it right every time, but if you compare the value of your time against the cost, a simple spreadsheet can help clarify what makes sense.
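As a toy version of that comparison, using the server prices above and an assumed (made-up) developer cost of $4,000 per week:

```php
$monthlySavings = 3000 - 1500;  // staying on the smaller server
$weeklyDevCost  = 4000;         // assumption, plug in your own number

$oneWeekFix   = (1 * $weeklyDevCost) / $monthlySavings; // ~2.7 months to break even, so optimize
$eightWeekFix = (8 * $weeklyDevCost) / $monthlySavings; // ~21 months to break even, so just upgrade
```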

Final thoughts

A bottleneck isn’t inherently bad; every system has one. What matters is making sure your biggest bottleneck is always working under its most optimal conditions. That’s where your optimizations should be focused.

Track your throughput before and after an optimization to see whether it does what you hoped it would do. I have made optimizations that added complexity to the system (e.g. a second database) that I ended up rolling back: even though parts of the system were faster, the end result didn’t actually improve, and the added complexity was only going to become a very nasty technical debt that I didn’t want to pay interest on.

Don’t be a hero. You don’t need 50% gains every time. Try to get 10% more throughput with the first refactor, then another 5% later. Incremental changes let you optimize up to the ‘reasonable limit’ without over-engineering the problem.

I hope this helps and don’t hesitate to reach out with any questions.