AWS Well-Architected at first glance - Part 1 - Operational Excellence

The AWS Well-Architected Framework gives cloud architects six pillars with guidelines for building resilient, high-performing and secure infrastructure for their workloads. Its design principles and best practices help you design and run your workloads on a solid foundation.

6 months ago   •   3 min read

By Johann


The six pillars

  • Operational excellence
  • Security
  • Reliability
  • Performance efficiency
  • Cost optimization
  • Sustainability

Incorporating these pillars into your projects helps you overcome the challenges of building stable and efficient systems that meet your expectations and requirements.


Operational Excellence pillar

At Amazon, we define operational excellence as a commitment to build software correctly while consistently delivering a great customer experience. It contains best practices for organizing your team, designing your workload, operating it at scale, and evolving it over time. Operational excellence helps your team to focus more of their time on building new features that benefit customers, and less time on maintenance and firefighting. To build correctly, we look to best practices that result in well-running systems, a balanced workload for you and your team, and most importantly, a great customer experience. – AWS documentation

The Operational Excellence pillar helps you get new features and bug fixes to your customers faster and more reliably. You achieve this by following the design principles and best practices.


Design Principles

  • Perform operations as code: Operations on your cloud infrastructure and applications should be done with code. The benefit of this is that your changes can be version controlled, the operations are repeatable and they can be automated.
  • Make frequent, small, reversible changes: Your design should be loosely coupled and planned for scaling. With automated deployments and tests, small changes can be deployed more rapidly, and because the code is tracked in a version control system, they can be reverted. This shortens your cycle time.
  • Refine operations procedures frequently: As your workloads change, you might also need to change your operations. Review these changes with the team and iterate if necessary.
  • Anticipate failure: Test for and identify failures that could arise, then address them with the small, reversible changes mentioned above. Also check whether your monitoring detects these failures and whether your failure-response procedures are effective.
  • Learn from all operational failures: Apply the lessons learned from failures and operational events. AWS advises sharing the gained knowledge across all teams and, if possible, the entire organization.
  • Use managed services: Reduce the burden of managing services and focus on your own workloads by utilizing the managed services of AWS wherever possible. Build your procedures around these services.
  • Implement observability for actionable insights: Gain a greater understanding of your workload behavior by implementing telemetry and tracking the performance, reliability, cost and health of your infrastructure. This helps you make informed decisions about where and what to improve.
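
To make the first few principles more concrete, here is a minimal Python sketch of an operation expressed as code: one small change is applied, checked, and automatically rolled back if the check fails. It does not call any AWS API. The names `Workload`, `ReversibleChange`, `apply_change` and the `pool_size` setting are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ReversibleChange:
    """One small change to a single setting (hypothetical model)."""
    key: str
    new_value: str


@dataclass
class Workload:
    """Hypothetical stand-in for a workload's live configuration."""
    config: Dict[str, str] = field(default_factory=dict)
    history: List[str] = field(default_factory=list)


_MISSING = object()  # marks "this key did not exist before the change"


def apply_change(
    workload: Workload,
    change: ReversibleChange,
    is_healthy: Callable[[Workload], bool],
) -> bool:
    """Apply one change and roll it back if the health check fails."""
    previous = workload.config.get(change.key, _MISSING)
    workload.config[change.key] = change.new_value

    if is_healthy(workload):
        workload.history.append(f"applied {change.key}={change.new_value}")
        return True

    # Anticipate failure: restore the exact previous state.
    if previous is _MISSING:
        del workload.config[change.key]
    else:
        workload.config[change.key] = previous
    workload.history.append(f"rolled back {change.key}")
    return False


def pool_is_sane(workload: Workload) -> bool:
    """Example health check: the pool size must stay within 1..100."""
    return 1 <= int(workload.config.get("pool_size", "1")) <= 100


if __name__ == "__main__":
    app = Workload(config={"pool_size": "10"})
    apply_change(app, ReversibleChange("pool_size", "20"), pool_is_sane)
    apply_change(app, ReversibleChange("pool_size", "5000"), pool_is_sane)
    print(app.config, app.history)
```

Because the procedure is plain code, it can live in version control, be reviewed like any other change, and be run repeatedly by an automated pipeline. In a real setup the health check would query your monitoring rather than inspect a dictionary.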

Best Practices

Operational excellence in the cloud is achieved in four areas:

  • Organization: It helps to have a common understanding of the entire workload and of the role each team and individual plays in it. Shared business goals, backed by clearly defined priorities, set the direction for business success. Review the priorities regularly, as goals may shift over time: keep evaluating internal and external customer needs, compliance and governance requirements where they apply, and the balance of benefits and risks. These reviews should also cover your organizational culture.
  • Prepare: You have to fully understand your workloads and how they are expected to behave. This understanding shows you which metrics are important to watch, so that you can provide insight into the status of your workloads and see where external dependencies are located.
    • Implement observability
    • Design for operations
    • Mitigate deployment risks
    • Operational readiness and change management
  • Operate: The metrics that you defined in the Prepare step also help you measure whether you achieved your business goals. Use these metrics to understand the health of your workload and operations, and to make informed decisions and respond appropriately when business outcomes are at risk or may become so.
    • Utilizing workload observability
    • Understanding operational health
    • Responding to events
  • Evolve: Use the knowledge gained from your metrics and the decisions that you made to continuously improve over time. Implement frequent, small, incremental changes and evaluate their success. Share your knowledge within the team and, if possible, across the entire organization, and keep iterating.
    • Learn, share, and improve
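
The Operate step can be sketched in the same spirit: take a metric defined during Prepare and turn it into a health status that tells you when a business outcome is at risk. The following Python example is only an illustration under assumptions. The `ErrorRateTarget` thresholds, the `Health` states and the `assess` function are hypothetical, and the samples are plain `(failed, total)` request counts rather than data from a real monitoring service.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence, Tuple


class Health(Enum):
    HEALTHY = "healthy"
    AT_RISK = "at risk"
    BREACHED = "breached"


@dataclass(frozen=True)
class ErrorRateTarget:
    """Hypothetical business target for the request error rate."""
    warn_at: float    # error fraction at which the outcome is at risk
    breach_at: float  # error fraction at which the outcome is missed


def error_rate(failed: int, total: int) -> float:
    """Fraction of failed requests; 0.0 when there was no traffic."""
    return failed / total if total else 0.0


def assess(samples: Sequence[Tuple[int, int]], target: ErrorRateTarget) -> Health:
    """Aggregate (failed, total) samples and compare against the target."""
    failed = sum(f for f, _ in samples)
    total = sum(t for _, t in samples)
    rate = error_rate(failed, total)

    if rate >= target.breach_at:
        return Health.BREACHED
    if rate >= target.warn_at:
        return Health.AT_RISK
    return Health.HEALTHY


if __name__ == "__main__":
    target = ErrorRateTarget(warn_at=0.01, breach_at=0.05)
    print(assess([(1, 100), (3, 100)], target).value)
```

Separating the "at risk" state from the "breached" state reflects the idea of responding before an outcome is actually missed. The thresholds themselves should come from your own business goals.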

This concludes the exploration of the Operational Excellence chapter of the AWS Well-Architected Framework. It should help you start to understand how to design, manage, and optimize your AWS workloads.

If you want to learn more, see the AWS Well-Architected Tool.

Operational Excellence is just one aspect of the AWS Well-Architected Framework, and there is still much more to explore! The upcoming posts continue the journey through the other chapters of the framework, giving you a comprehensive understanding of the best practices for architecting systems on AWS.

See you in the next AWS Well-Architected post!
