When Clean Architecture Hits AWS Infrastructure Reality

We migrated our monolith to serverless and it felt clean. Then CloudFormation started failing and the reason wasn't what we expected.

a month ago • 2 min read

By Hannes Michael

Photo of a Strangler Fig by David Clode / Unsplash

What happens when your clean architecture hits infrastructure reality

It seemed like the right call

It all started with a huge Plotly Dash application written in Python that one of our partners had developed internally.

1 app - 3 files - 6000 lines of code.

We knew this would be a beast, but we also knew we had the tools to fight it: the strangler fig pattern and a good understanding of architecture in the serverless world. So the mission was clear:

Identify modules in the monolith
Pull out the modules into serverless AWS Lambda functions and put them behind an API Gateway
Replace the old application module by module
Tear down the old application

This felt like the right thing to do and resulted in cleanly separated, modular code with easy-to-extract, Swagger-compatible documentation.

Then things started

So we created a CDK infrastructure stack that defines the API Gateway. Each function is wired to an endpoint and fully described, so that the API export contains all expected requests and possible responses.
This gave us an easy-to-deploy CloudFormation stack, which initially deployed quite quickly because CloudFormation can create multiple resources at the same time.

As more and more features moved to API Gateway, deployments started failing, seemingly at random. After some investigation, we arrived at the following error:

Too Many Requests (Service: ApiGateway, Status Code: 429, ...)

Status on Resource type AWS::ApiGateway::Model

The culprit

To be honest, we were quite surprised that CloudFormation does not respect the limits of API Gateway. So we checked the API Gateway service quotas:

"CreateResource - 5 requests per second". That's very low, we can't increase it, AND CloudFormation does not respect it?! "It can't be. That must be a bug!", we thought, and apparently quite a lot of other developers think the same, because this issue on GitHub has been open since 2022. There we found out that we don't have a code problem. It's an infrastructure orchestration problem on the AWS side.
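To make the quota more tangible, here is a toy model of a 5-requests-per-second limit. It is only a sketch: the sliding one-second window, the function name `count_throttled` and the request timings are assumptions for illustration, not a description of how API Gateway or CloudFormation actually behave internally.

```python
from collections import deque


def count_throttled(request_times, limit_per_second=5):
    """Count requests rejected by a simple sliding one-second window limit.

    This is an assumed, simplified throttling model, not API Gateway's
    real algorithm.
    """
    window = deque()  # timestamps of accepted requests within the last second
    throttled = 0
    for t in sorted(request_times):
        # Forget accepted requests that are at least one second old.
        while window and t - window[0] >= 1.0:
            window.popleft()
        if len(window) < limit_per_second:
            window.append(t)
        else:
            throttled += 1
    return throttled


# Hypothetical timings: 20 create calls fired at the same instant ...
parallel_burst = [0.0] * 20
# ... versus the same 20 calls spread out at four per second.
spread_out = [i * 0.25 for i in range(20)]

print(count_throttled(parallel_burst))
print(count_throttled(spread_out))
```

In this model a small stack stays under the limit even when everything is created at once, which matches our experience that the failures only showed up once the API had grown.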

The fix that worked... sort of

In this thread, we also found a potential solution that seemed rather ugly but worked for some: create a chain of deployed resources, so that each resource is only created after the previous one has finished. This is several times slower, but never hits the quota limit.
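The chaining idea can be sketched on a plain CloudFormation template, using the standard `DependsOn` attribute. This is a minimal illustration, not the exact code from the GitHub thread or from our stack: the helper `chain_resources` and the logical IDs in the example template are hypothetical, and the resources are trimmed down to their `Type`.

```python
import copy


def chain_resources(template, resource_type="AWS::ApiGateway::Model"):
    """Return a copy of a template dict in which every resource of the given
    type depends on the previous resource of that type."""
    chained = copy.deepcopy(template)
    previous_id = None
    for logical_id, resource in chained.get("Resources", {}).items():
        if resource.get("Type") != resource_type:
            continue
        if previous_id is not None:
            # DependsOn may be a single string or a list of logical IDs.
            depends_on = resource.get("DependsOn", [])
            if isinstance(depends_on, str):
                depends_on = [depends_on]
            if previous_id not in depends_on:
                depends_on = depends_on + [previous_id]
            resource["DependsOn"] = depends_on
        previous_id = logical_id
    return chained


# Hypothetical, heavily trimmed template (real resources also need Properties).
template = {
    "Resources": {
        "Api": {"Type": "AWS::ApiGateway::RestApi"},
        "UserModel": {"Type": "AWS::ApiGateway::Model"},
        "OrderModel": {"Type": "AWS::ApiGateway::Model", "DependsOn": "Api"},
        "ErrorModel": {"Type": "AWS::ApiGateway::Model"},
    }
}

chained = chain_resources(template)
print(chained["Resources"]["ErrorModel"]["DependsOn"])
```

With the chain in place, CloudFormation has to create the models one after another instead of in parallel. In a CDK app the same effect is usually achieved by adding a dependency from each construct to the previous one, for example via `node.add_dependency`, rather than by editing the template.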

We created a ticket to look into this further and accepted slower deployments as the price of stability.

With this solution, the deployment time went from 1 minute to 5, and has since crept up to 8–9 minutes as the API grew. It's a number that only moves in one direction.

Final thoughts

The real lesson here isn't about API Gateway quotas. It's that CloudFormation and the services it deploys to are maintained by different teams with different priorities. When those worlds don't align, the gap lands in your backlog.

For now, we're living with the slower deployments. But the 8–9 minute deployment times are already pushing us toward a different approach entirely, one that involves letting a second strangler fig take root. More on that in the next article.
