How We Work in Cloud Platform
Our Cloud Platform team has a lofty goal: take a product that works for a handful of people and scale its infrastructure to serve tens of millions of active players. Then repeat the process for all our other products and services, including our Big Data infrastructure, in a way that is reliable, maintainable, scalable and secure.
This is no easy feat when you handle more than 5 million API requests per minute and manage petabytes of data. It requires a set of robust practices and processes so that all services are structured in a similar way. That is why the Cloud Platform team embraces a few practices that keep our processes consistent and repeatable.
It is essential that we have an internal framework for deploying and maintaining services in a uniform way. To achieve this, we rely on a common set of technologies and services, such as AWS, Consul and Prometheus, along with a number of newer tools, across all the services we manage. This shared architecture provides the foundation on which all our products are built.
At the same time, we collaborate closely with developers to apply DevOps practices wherever possible, so that everyone shares the same mindset. For that reason, the Cloud Platform team also acts as an ambassador for DevOps adoption within Peak. Our vision is to give software engineers the tools and processes they need to own the services they build. Developers work with the Cloud Platform team on infrastructure code bases and tackle issues together in a blameless way.
We are also responsible for managing the entire database stack. Although MySQL is our RDBMS of choice for user-facing products, we have experience with a wide range of storage layers, such as NoSQL and NewSQL databases. As with every other part of our architecture, we face challenges that only become apparent at this scale, one of which is MySQL sharding. The whole picture changes when hundreds of thousands of concurrent queries hit your databases, and we need to make sure we can scale even further in the long run.
Failures are inevitable at this scale. Our infrastructure is a living, breathing organism: we never settle, and we are always looking for more efficient, performant, manageable and secure solutions to serve our users better. This scale and agility require a comprehensive approach to monitoring. We use several technologies that help us monitor our services in real time and also let us perform deep-dive failure analysis. As part of our shared responsibility model, our developers also take on-call duties, so they can contribute directly to resolving incidents. At the end of the day, we are all in the same boat, with the same ultimate purpose.
It is not possible to deal with all these challenges without a long-term vision. We achieve all of this with a dedicated team that turns this vision into action. The members of the Cloud Platform team share an understanding of their goals and work towards them regardless of the products and services they manage.
They do all of this on their own terms, with everyone prioritizing the issues they see as most important. While keeping our services up and running, we are always working on the next step and optimizing our processes. For example, our traffic has been increasing, yet the effort required to manage our systems has stayed the same for the past few years. This is because we have been constantly improving our automation and refining our practices to create more value for the people who engage with our products and services. Everything we do aims to give them a better experience, and they are at the heart of our approach.