T-Mobile Handles 1M+ Transactions per Day on Kubernetes
The need for containers as a service
In 2016–2017, T-Mobile rebuilt its monolithic architecture using Pivotal Platform. By shifting to a microservices-based architecture, the company was able to scale up its apps, cutting production time from 7 months to just a few days. In addition, bug fixing is now done within a day with zero downtime.
Although Pivotal Platform was a success at T-Mobile, some container orchestration challenges remained. According to James Webb and Brendan Aye of T-Mobile, the company needed a solution that would serve as a common standard for running vendor-supplied Docker containers. Furthermore, there was a lack of persistent storage and limited management of non-HTTP/HTTPS traffic. Finally, stateful application management had to be external.
At KubeCon 2018, James and Brendan listed the requirements T-Mobile had for the container-as-a-service layer. From the perspective of the platform teams, the organization needed:
- high availability at every level (a control plane, worker nodes, authentication/authorization)
- automated operations and deployment of a control plane and cluster builds
- zero downtime life cycle management across upgrades, OS patching, and infrastructure maintenance
- LDAP integration
- API configurability
As for the DevOps team, T-Mobile expected a native support experience with out-of-the-box certificates and load balancing, replication within and across availability zones, cross-cluster replication, etc. In addition, container orchestration had to provide TCP ingress and centralized logging/metrics.
Recently, Jeffrey Kelly and Dormain Drewitz (Pivotal) hosted a podcast discussing how Kubernetes helped T-Mobile to address the challenges and meet the requirements. In this interview, Mohammad Salman and Matt Murphy (T-Mobile) along with Ryan Meharg (Altoros) shared their experience in making this possible.
Building automation pipelines
Initially, Mohammad and Matt were exploring how to integrate Kubernetes with T-Mobile’s development cycle. As a two-person team, they found it challenging to learn new technologies while also managing their other responsibilities. According to Matt, they did not have time for actual Kubernetes support, as he and Mohammad were too busy going through public Git repos in order to learn Concourse CI. Most things were done manually. Some pipelines had been set up, but the team still needed to tie everything together.
Ryan had prior experience working with T-Mobile on Pivotal Platform, so he was brought back in for the organization’s adoption of Kubernetes. For this purpose, the company utilized Pivotal Container Service (PKS)—a tool enabling operators to deploy, run, and provision enterprise-grade Kubernetes clusters.
“We have a really good relationship with Ryan and Altoros, especially in the Pivotal Application Service (PAS) world and using Concourse. The nice thing about platform automation is that it’s the same tool used in PAS, as well as PKS. So, it was a real good fit to have Altoros help us out to get the PKS stuff off the ground.” —Matt Murphy, T-Mobile
With support from Altoros, the team at T-Mobile got help with learning and developing on Kubernetes. For six months, Matt and Mohammad pair-programmed with Ryan to facilitate both training and knowledge transfer.
“Doing the whole pair programming is good for knowledge and skills transfer. We’ve picked up small things that are helpful from anything like RESTful APIs to Bash scripting. If they need to add even more automation in the future, they can now build it themselves.” —Ryan Meharg, Altoros
Equipped with their knowledge of Kubernetes, Matt and Mohammad are spreading PKS awareness in T-Mobile. They now host regular PKS 101 classes at T-Mobile to ensure that both their customers and their developers know what it is and how it works.
“We just had a PKS 101 class in our headquarters, where 40 developers showed up. We helped them to deploy a simple application on our Kubernetes cluster, which is running on PKS. The experience has been really good. We have seen large improvements since last year.” —Mohammad Salman, T-Mobile
The next vital step for the collaborating teams was to achieve automation, thus removing human error and minimizing the effort needed to deploy new clusters. By enabling PKS-based pipelines, T-Mobile reduced the amount of time spent on keeping both the code and platform up-to-date. It was a crucial achievement, as Pivotal had really fast release cycles, with security patches for the operating system, as well as stemcells, coming out every other week.
“When you’re doing things manually, it’s prone to human error. When T-Mobile is growing so rapidly, it’s easy for technical debt to spiral out of control. We just want to keep things fast and deploy things as quickly as possible. With PKS pipelines, there’s been massive improvements in creating this platform automation product.” —Ryan Meharg, Altoros
Furthermore, whenever a pipeline is run, it builds out a full cluster ready for the developers to go on instantly. If they need to rebuild a cluster, it already includes logging tools, a dashboard, the cluster role bindings, as well as persistent storage. A performance tool is also used to make sure that the best practices and standards are met on each Kubernetes cluster.
“As soon as we run a pipeline on a Kubernetes cluster, we make sure that the cluster is ready for the customers to deploy their workload on it, and that we don’t need any manual work done. The whole point of the pipeline was to deliver a ready Kubernetes cluster with all the third-party components on top of the PKS layer.” —Mohammad Salman, T-Mobile
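To illustrate the idea of a pipeline that delivers a "ready" cluster, here is a minimal Python sketch of a post-build check. It is not T-Mobile's implementation: the article names only the categories of add-ons (logging tools, a dashboard, cluster role bindings, persistent storage), so the component labels, the `ClusterState` type, and both helper functions are hypothetical.

```python
from dataclasses import dataclass, field

# Add-on categories the article says every pipeline-built cluster includes.
# The string labels are hypothetical identifiers chosen for this sketch.
REQUIRED_COMPONENTS = (
    "logging",
    "dashboard",
    "cluster-role-bindings",
    "persistent-storage",
)


@dataclass
class ClusterState:
    """Hypothetical record of what a pipeline run has installed so far."""
    name: str
    installed: set = field(default_factory=set)


def missing_components(cluster: ClusterState) -> list:
    """Return the required add-ons not yet installed, in declaration order."""
    return [c for c in REQUIRED_COMPONENTS if c not in cluster.installed]


def is_ready(cluster: ClusterState) -> bool:
    """A cluster counts as ready only when no required add-on is missing."""
    return not missing_components(cluster)


if __name__ == "__main__":
    partial = ClusterState("sandbox-01", {"logging", "dashboard"})
    print(partial.name, "ready:", is_ready(partial))
    print("still missing:", missing_components(partial))
```

A pipeline's final step could fail the build whenever `missing_components` returns a non-empty list, so that no manual follow-up is needed before developers start deploying.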
The team at T-Mobile came up with a shared-nothing architecture, which is regionally distributed. Each region comprises three availability zones, and each zone, in its turn, is a single rack with independent networking, computing, and storage. Maximum capacity for each region is set to 55 terabytes of memory, 2 petabytes of storage, and 2,200 cores.
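The regional limits above lend themselves to a quick back-of-the-envelope calculation. The Python sketch below estimates how much capacity a region retains when one of its three availability zones is lost. It assumes capacity is spread evenly across zones, which the article does not state. The function name and dictionary keys are made up for illustration.

```python
# Regional maximums quoted in the article.
REGION_MAX = {"memory_tb": 55.0, "storage_pb": 2.0, "cores": 2200.0}

# Each region comprises three availability zones (from the article).
ZONES_PER_REGION = 3


def surviving_capacity(region, zones=ZONES_PER_REGION, failed=1):
    """Capacity left after `failed` zones are lost.

    Assumes resources are divided evenly across zones (an assumption,
    not something the article confirms).
    """
    if not 0 <= failed <= zones:
        raise ValueError("failed must be between 0 and the number of zones")
    share = (zones - failed) / zones
    return {resource: amount * share for resource, amount in region.items()}


if __name__ == "__main__":
    for resource, amount in surviving_capacity(REGION_MAX).items():
        print(f"{resource}: {amount:.2f}")
```

Under that even-split assumption, losing a single rack leaves two thirds of the region's memory, storage, and cores available.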
Running in two data centers, there are separate regions for production and non-production. Near-near and near-far deployment strategies were enabled for applications with a data center preference. Global server load balancing is available for active-active and active-passive cross-region deployments.
Millions of transactions on 24 Kubernetes clusters
While automation in any capacity can be seen as a positive thing, it’s easier to advocate for when actual results are shown. In T-Mobile’s case, Matt, Mohammad, and Ryan shared some situations where automation was of great help. The first example was being able to quickly deliver on the needs of developers looking to deploy an application.
“Automation makes us very efficient, when a developer team comes and says we have a new product coming out, and we’re going to need three clusters by tomorrow. It’s something that’s actually very realistic, and we could spin up and deliver to them.” —Matt Murphy, T-Mobile
On another note, automation enables T-Mobile to quickly roll out upgrades to its Kubernetes environments. With automation, the telco provider was able to upgrade from PKS v1.1.5 to v1.3.5 with minimal manual work. The new version includes features that will come in handy for certain Kubernetes cluster configurations.
Automation also helps organizations to scale up with a small team, added Ryan. In this case, it meant that Matt and Mohammad, two DevOps engineers, can operate and manage T-Mobile’s 40 clusters, which are running on 12 foundations. These clusters have thousands of containers and pods for hundreds of developers, running hundreds of applications.
“It takes longer to automate something, but the long-term benefits pay off in not having to repeat tedious processes over and over again. If you’re a Linux system admin, that’s probably the worst thing you can be doing.” —Ryan Meharg, Altoros
Automation also makes it possible to roll out security updates without any interruptions. Matt shared that a few months ago, a serious Common Vulnerabilities and Exposures (CVE) entry came out that could have exposed a lot of customer data. With automation, T-Mobile was able to upgrade all of the environments quickly without any interruption, negating any issue the CVE could have potentially caused.
“In the old days, we would have had thousands of boxes to patch. We would have had to put a group together to make sure everything was done. Now, we literally had this running on 5–6 pipelines, and we were done. When something needs to be patched immediately, this allows us to do that quite well.” —Matt Murphy, T-Mobile
With Pivotal Platform, T-Mobile is now able to support 300+ million transactions in production (around 10,000 transactions per second) on more than 34,000 application instances (containers). Running 30 cf push commands per day, the company has gained a 40% increase in release velocity. All of these are run and maintained by fewer than 12 operators.
Four of T-Mobile’s mission-critical systems are already running 100% on PKS. These include order management, retail store, and call center apps, as well as maps.t-mobile.com. At the container-as-a-service level, 24 single- and multi-tenant clusters ensure the company is able to handle a million production transactions daily.
Lessons learned and future plans
Throughout the entire project, there were some key takeaways that Matt, Mohammad, and Ryan wanted to share:
- Use a sandbox cluster for rebuilding on a regular basis. This helps to ensure that everything else is up-to-date software-wise.
- Developers should follow the Twelve-Factor App methodology in order to allow patches to be rolled in seamlessly.
- Use naming standards when automating. This makes it easier when trying to find something for troubleshooting purposes.
- Have more than one production environment to prevent disruptions caused by critical issues.
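The naming-standards lesson can be enforced in automation with a small validator. The Python sketch below assumes a hypothetical convention of `<env>-<region>-<purpose>-<two-digit index>`. The article does not describe T-Mobile's actual naming scheme, so the pattern, the allowed environment names, and the `parse_cluster_name` helper are all illustrative.

```python
import re

# Hypothetical convention: <env>-<region>-<purpose>-<two-digit index>,
# for example "prod-west-orders-01". Not taken from the article.
NAME_PATTERN = re.compile(
    r"^(?P<env>prod|nonprod|sandbox)"
    r"-(?P<region>[a-z]+)"
    r"-(?P<purpose>[a-z][a-z0-9]*)"
    r"-(?P<index>\d{2})$"
)


def parse_cluster_name(name: str) -> dict:
    """Split a cluster name into its parts, or raise ValueError if it
    does not follow the (hypothetical) convention."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"cluster name does not follow the convention: {name!r}")
    return match.groupdict()


if __name__ == "__main__":
    print(parse_cluster_name("prod-west-orders-01"))
```

Rejecting nonconforming names at pipeline time keeps every cluster searchable by environment, region, or purpose when troubleshooting.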
Platform automation is about building blocks that an organization can use and expand on with the goal of having everything automated, so that small teams can run an entire platform, which essentially runs itself.
Moving forward, T-Mobile is exploring how to provide hosted data services on Kubernetes. The company is also investigating how to manage microservices with a service mesh, such as Istio, and exploring the use of Knative as an abstraction layer.
Check out the full podcast for more.
In this video from KubeCon 2018, James Webb and Brendan Aye of T-Mobile share the company’s experience of building and scaling Kubernetes on-premises.
In the podcast recorded in 2019, Matt Murphy, Mohammad Salman, Ryan Meharg, Dormain Drewitz, and Jeff Kelly discuss platform automation at T-Mobile.
- T-Mobile Slashes Production Time from 7 Months to Days with Cloud Foundry
- K8s Meets PCF: Pivotal Container Service from Different Perspectives
- Automating Deployment of Pivotal Container Service on AWS
About the experts