When Clean Architecture Hits AWS Infrastructure Reality

We migrated our monolith to serverless and it felt clean. Then CloudFormation started failing and the reason wasn't what we expected.

3 hours ago   •   2 min read

By Hannes Michael
Photo of a Strangler Fig by David Clode / Unsplash

What happens when your clean architecture hits infrastructure reality

It seemed like the right call

It all started with a huge Plotly Dash application written in Python that one of our partners had developed internally.

1 app - 3 files - 6000 lines of code.

We knew this would be a beast, but we also knew we had the tools to fight it: the strangler fig pattern and a good understanding of architecture in the serverless world. So the mission was clear:

  1. Identify modules in the monolith
  2. Pull out the modules into serverless AWS Lambda functions and put them behind an API Gateway
  3. Replace the old application module by module
  4. Tear down the old application

This felt like the right thing to do, and it resulted in cleanly separated, modular code with easy-to-extract, Swagger-compatible documentation.
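The four steps above can be sketched in a few lines of Python. This is a minimal in-process illustration of the strangler fig idea, not our actual setup: in the real system the routing is done by API Gateway and the new handlers are Lambda functions. The class `StranglerRouter` and the module names `reports` and `charts` are hypothetical.

```python
from typing import Callable, Dict, Iterable

Handler = Callable[[dict], dict]


class StranglerRouter:
    """Routes a request to the new implementation once its module is migrated."""

    def __init__(self, legacy: Handler) -> None:
        self._legacy = legacy
        self._migrated: Dict[str, Handler] = {}

    def migrate(self, module: str, handler: Handler) -> None:
        # Step 3: replace the old application module by module.
        self._migrated[module] = handler

    def handle(self, module: str, request: dict) -> dict:
        # Anything not migrated yet still falls through to the monolith.
        handler = self._migrated.get(module, self._legacy)
        return handler(request)

    def can_tear_down_legacy(self, all_modules: Iterable[str]) -> bool:
        # Step 4 is only safe once every identified module has a new home.
        return all(module in self._migrated for module in all_modules)


def monolith(request: dict) -> dict:
    # Stand-in for the old Dash application (hypothetical).
    return {"served_by": "monolith", **request}


def reports_lambda(request: dict) -> dict:
    # Stand-in for a module extracted into a Lambda function (hypothetical).
    return {"served_by": "lambda", **request}


router = StranglerRouter(legacy=monolith)
router.migrate("reports", reports_lambda)
```

Callers only ever talk to the router, so the monolith can shrink behind it without anyone noticing until the last module is gone.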

Then things started

So we created a CDK infrastructure stack that defines the API Gateway. Each function is wired to an endpoint and fully described, so the API export contains all possible responses and expected requests.
This gave us an easy-to-deploy CloudFormation stack, which initially deployed quite quickly because CloudFormation can create multiple resources at the same time.

As more and more features moved to API Gateway, deployments started failing, seemingly at random. After some investigation, we tracked it down to the following error:

Too Many Requests (Service: ApiGateway, Status Code: 429, ...) 

Status on Resource type AWS::ApiGateway::Model

The culprit

To be honest, we were quite surprised that CloudFormation does not respect the limits of API Gateway. So we checked the API Gateway service quotas:

"CreateResource - 5 requests per second". That's far too low, we can't increase it, AND CloudFormation does not respect it?! "That can't be. That must be a bug!", we thought, and apparently quite a lot of other developers think the same, because this issue on GitHub has been open since 2022. There we found out that we don't have a code problem. It's an infrastructure orchestration problem on the AWS side.

The fix that worked... sort of

In this thread, we also found a potential solution that seemed rather ugly but worked for some: chain the resources so that each one depends on the previous one, which forces CloudFormation to create them one after another instead of in parallel. This is considerably slower, but it never hits the quota limit.
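As a rough sketch, the chaining idea looks like this in Python. It is not our production code. It assumes the objects behave like CDK v2 constructs, meaning each one exposes `node.add_dependency(...)`. The helper `chain_sequentially` and the `FakeConstruct` stand-ins are hypothetical and exist only so the example runs without the CDK installed.

```python
from typing import Iterable, List


def chain_sequentially(constructs: Iterable) -> List:
    """Make every construct depend on the one before it.

    With such a chain in place, the resources can only be created
    one at a time, in list order, instead of all in parallel.
    """
    ordered = list(constructs)
    for previous, current in zip(ordered, ordered[1:]):
        # Assumption: CDK v2 style API, construct.node.add_dependency(*deps)
        current.node.add_dependency(previous)
    return ordered


class FakeNode:
    """Hypothetical stand-in that records dependencies like a CDK node would."""

    def __init__(self) -> None:
        self.dependencies: list = []

    def add_dependency(self, *deps) -> None:
        self.dependencies.extend(deps)


class FakeConstruct:
    """Hypothetical stand-in for a construct such as an API Gateway model."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.node = FakeNode()


models = chain_sequentially(
    FakeConstruct(name) for name in ["ModelA", "ModelB", "ModelC"]
)
```

In a real stack, you would pass in the actual model and resource constructs instead of the fakes. The first element keeps no dependency, and every later one waits for its predecessor.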

We created a ticket to look into this further and accepted slower deployments as the price of guaranteed stability.

With this solution, the deployment time went from 1 minute to 5, and it has since crept up to 8–9 minutes as the API grew. It's a number that only moves in one direction.
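A toy model makes the trade-off visible. It is a simplified sliding-window check, not how API Gateway actually meters requests, and the numbers (40 resources, one create call every 0.5 seconds in the chained case) are made up for illustration. Only the limit of 5 requests per second comes from the quota above.

```python
from collections import deque
from typing import Iterable


def count_throttled(request_times: Iterable[float], limit_per_second: int = 5) -> int:
    """Count requests rejected by a simple one-second sliding window."""
    accepted = deque()
    throttled = 0
    for t in sorted(request_times):
        # Forget accepted requests that are at least one second old.
        while accepted and t - accepted[0] >= 1.0:
            accepted.popleft()
        if len(accepted) < limit_per_second:
            accepted.append(t)
        else:
            throttled += 1
    return throttled


# Hypothetical deployment with 40 resources.
resource_count = 40

# Parallel: every create call is fired at the same instant.
parallel_times = [0.0] * resource_count

# Chained: one create call at a time, assumed 0.5 s apart.
chained_times = [i * 0.5 for i in range(resource_count)]

throttled_parallel = count_throttled(parallel_times)
throttled_chained = count_throttled(chained_times)

# Time at which the last chained call starts. It grows linearly with the API.
chained_duration = chained_times[-1]
```

In this model the parallel burst loses most of its calls to throttling, while the chain loses none but needs time proportional to the number of resources. That is why the deployment time keeps growing with every endpoint we add.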

Final thoughts

The real lesson here isn't about API Gateway quotas. It's that CloudFormation and the services it deploys to are maintained by different teams with different priorities. When those worlds don't align, the gap lands in your backlog.

For now, we're living with the slower deployments. But the 8–9 minute deployment times are already pushing us toward a different approach entirely: one that involves letting a second strangler fig take root. More on that in the next article.
