T-Mobile Slashes Production Time from 7 Months to Days with Cloud Foundry

A monolithic architecture was rebuilt with Pivotal Cloud Foundry, microservices, and containers to handle 40 million daily transactions.
Why read this?
Use Case for Cloud Foundry:

A major telco uses Cloud Foundry to speed up the development life cycle and achieve scalability of its applications.

Business or Technical Result:
  • A 1,000-function monolith was renovated by introducing a microservices-based architecture.
  • Production cycles were slashed from seven months to days.
  • The new platform auto-scales and handles 40 million calls a day.
  • Bug fixing is done within a day with zero downtime.
Lessons learned:

Monitoring is critical for avoiding major disasters and for responding to operational issues at short notice. Transferring ownership to participants reduces cross-organizational conflict and empowers team members to develop, test, and run their own code. Management buy-in is essential.

What else is in the stack?

Java, Spring Cloud Services, LeoFS, MySQL, RabbitMQ, Apigee Edge, Kubernetes, Swagger

Cool fact about deployment:

After onboarding began in May 2016, T-Mobile was able to launch its platform into production in early July of that year. By August 2016, 100% of production traffic had been moved to Pivotal Cloud Foundry.

Company Description:

T-Mobile US, Inc. is a major wireless network operator in the United States. Its headquarters are located in Bellevue, Washington, in the Seattle metropolitan area.
The company traces its roots to VoiceStream Wireless PCS, founded in 1994 as a subsidiary of Western Wireless Corporation. In May 2001, it was purchased by Deutsche Telekom for $35 billion, and in July 2002 it was renamed T-Mobile USA, Inc.

Becoming the “un-carrier”

T-Mobile US is a wireless network operator with 72.6 million customers as of early 2017. Operating under the T-Mobile and MetroPCS brands, it bills itself as the “un-carrier,” offering a range of continuously updated services and flexible rate plans.

In addition to reducing dependence on contracts and uncoupling device costs from servicing expenses, the initiative has provided:

  • Access to free unlimited music and video streaming
  • The ability to make calls and send messages over a Wi-Fi connection without additional apps, logins, or costs
  • Tools to help business customers go mobile (such as a free .com domain name, a website optimized for mobile devices, and an e-mail address)
  • A wide variety of exclusive promotions and bonuses from T-Mobile’s partners (e.g., Walmart, Domino’s Pizza, Lyft, and StubHub)

The phases of the “un-carrier” initiative (Source)

While giving customers more freedom in how they use wireless and mobile services, the “un-carrier” initiative required a lot of technical innovation internally.

For instance, an environment made up mainly of large Java-based monoliths was a major impediment to change. T-Mobile had trouble scaling up (and then back down) during the several big holiday promotion periods it faces each year. In addition, release cycles were long and complex: it took almost seven months and 72 steps to get code into production. With traditional development and operations teams working separately, differences between test and production servers made configuration difficult.

 

Investigating Cloud Foundry

Seeking ways to speed up deployment and improve scalability and configuration management, T-Mobile turned to Cloud Foundry—at first, to its open-source version. After some time, the company realized it needed expert assistance in onboarding and troubleshooting, as well as in adopting microservices. So, it decided to switch to Pivotal Cloud Foundry (PCF).

A recent presentation revealed how T-Mobile used PCF to rebuild one of its legacy Java-based systems, which ran on WebLogic and had more than 1,000 functions. According to Brendan Aye (Principal Architect for Cloud Foundry) and Melissa Chapman (Sr. Product Manager at T-Mobile), the platform was seen as the way to create a scalable environment.

More than one reason to choose Cloud Foundry (Source)

Microservices were meant to reverse the dependence on the existing large, monolithic infrastructure. Containers were seen as a way to ensure environment consistency, since every container is pushed in the same way.

The team began with one of the monolith’s functions—GetUsage, which provided T-Mobile customers with access to information about their data usage. Having quite a low impact on the whole system, it was a perfect choice to try out a new approach. Though representing a relatively small portion of the legacy app, the function handled 2.5 to 12 million calls per day.

 

Adoption and results

After onboarding began in May 2016, T-Mobile was able to launch its platform into production in early July of that year. By August 2016, 100% of production traffic had been moved to Pivotal Cloud Foundry.

The team completely rebuilt the monolith’s GetUsage function, scaling it up to handle about 40 million calls per day. Whereas a release could take seven months and 72 steps with the old system, Cloud Foundry made it possible to reduce the production cycle to just days. In addition, bugs are now fixed the same day they are detected, with zero impact on system performance.

“If we were to do this with our traditional infrastructure, it would probably take us seven months. With Cloud Foundry, it was a day or so.” —Melissa Chapman, T-Mobile

T-Mobile started gathering metrics on its application instances in August 2016. The team began the shift to PCF and microservices when the system was reaching 1,000 instances.

Autoscaling removed the need for human intervention. By May 2017, the team already had 3,000 application instances to manage, reaching a critical mass of adoption, where “users started helping users with problems, onboarding, etc.”

Application instances double in just two months after the microservices mandate (Source)

Apart from the dramatic reduction in production time, Cloud Foundry enabled T-Mobile to put the fail-fast and fix-fast concepts into practice. According to Brendan, the platform made it possible to take risks on new ideas “by allowing us to move more quickly and try things out that we’re not able to try otherwise, and we’re able to recover from that risk just as quickly.”

Furthermore, he says, “the principle of the cloud is rapid elasticity, being able to get code out the door more quickly and being able to scale those applications up and down to meet demand, even sometimes daily.”
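To make the idea of scaling up and down with demand concrete, here is a minimal sketch of a target-based scaling rule. It is not the logic of PCF’s autoscaler or T-Mobile’s configuration: the function name, the per-instance capacity, and the instance bounds are all assumptions chosen for illustration. The only figure taken from the article is the 40 million calls a day, which averages out to roughly 463 calls per second.

```python
import math


def desired_instances(calls_per_second: float,
                      capacity_per_instance: float,
                      min_instances: int = 2,
                      max_instances: int = 100) -> int:
    """Return the instance count needed for the current load.

    Every parameter here is a hypothetical value, not a T-Mobile setting.
    """
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    # Round up so that the load never exceeds the total capacity.
    needed = math.ceil(calls_per_second / capacity_per_instance)
    # Clamp to the configured floor and ceiling.
    return max(min_instances, min(max_instances, needed))


# 40 million calls a day, averaged over 86,400 seconds.
average_rate = 40_000_000 / 86_400

# Assumed capacity: 50 calls per second per instance.
print(desired_instances(average_rate, 50))
```

A real platform would evaluate a rule like this on a schedule against live metrics, which is why peaks during holiday promotions need no manual intervention.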

Whereas previously T-Mobile had separate teams for testing and production, adopting Cloud Foundry has aided the company in introducing an entirely new DevOps culture. Guiding a product through the development process to production, teams are now “wholly responsible for owning, developing, and operating these different services.”

Brendan Aye giving an overview of T-Mobile’s achievements with PCF

By shifting the responsibility to the teams, the company helped its employees realize the true value and meaning of the DevOps culture. “They are responsible for the code they develop, how they test it, and how they run it,” Brendan says.

 

The role of APIs

The PCF platform is complemented by Apigee’s API management framework, which exposes existing services (like GetUsage) for consumption. Via APIs, T-Mobile’s customers and partners can access a variety of data resources, which in turn helps the company diversify its offers with new services and additional bonuses (such as free pizzas).

“With our growth come exciting opportunities where our business partners, other innovative apps, and entrepreneurs want to integrate with us. APIs are our way to involve them.”
—Himanshu Kumar, T-Mobile

APIs enable both internal and external developers to roll out applications quickly without compromising security and stability.

A webinar from Himanshu Kumar, a Principal Developer at T-Mobile, and Paul Williams of Apigee provided a detailed overview of how these APIs work.

Apigee’s architecture for managing T-Mobile’s “experience” and “capability” APIs (Source)

According to the webinar, the “capability” APIs are “detached from experience” and “purely focused on the resource or the underlying representation of an entity in the purest form.” The “experience” APIs are designed to “achieve an optimal use from a user experience perspective.”
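The split between the two API types can be illustrated with a small sketch. It assumes a usage lookup loosely modeled on the GetUsage function mentioned earlier. The class, function names, fields, and sample data are hypothetical and do not come from T-Mobile’s or Apigee’s actual APIs.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class UsageRecord:
    """The resource in its plain form (hypothetical fields)."""
    subscriber_id: str
    bytes_used: int
    bytes_allowed: int


def get_usage_capability(subscriber_id: str) -> UsageRecord:
    """Capability API: returns the underlying entity, detached from any UI.

    A real service would query a back-end system; this stub returns
    fixed sample data.
    """
    return UsageRecord(subscriber_id, bytes_used=3 * GIB, bytes_allowed=10 * GIB)


def get_usage_experience(subscriber_id: str) -> dict:
    """Experience API: reshapes the capability response for a screen."""
    record = get_usage_capability(subscriber_id)
    percent = round(100 * record.bytes_used / record.bytes_allowed)
    return {
        "headline": f"{percent}% of your data used",
        "used_gb": round(record.bytes_used / GIB, 1),
        "remaining_gb": round((record.bytes_allowed - record.bytes_used) / GIB, 1),
    }


print(get_usage_experience("subscriber-123"))
```

In a layout like this, the experience layer can change its wording or rounding for a new app without touching the capability layer, which is the decoupling the webinar describes.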

As Paul explained, APIs are at the heart of creating a digital value chain. They are seen as “technical contracts between developers and the team that’s implementing functions and capabilities.”

“The goal is to rapidly innovate, as developers build apps for customers to do more business. They (the teams) can iterate on top of those APIs without having to affect ongoing development and maintenance of apps and their interactions with back-end systems.” —Paul Williams, Apigee

Himanshu also pointed out, “we are trying to imagine IT systems and solutions as things that we can break into capabilities, with an unambiguous assignment to teams that can own them, have a life cycle, and they feel empowered.”

The customer-centric APIs implemented by T-Mobile Nederland are an example of how the company employs APIs to exchange data with internal and external parties. (For more about using APIs for running microservices on Cloud Foundry, read our brief post on the topic, featuring another Apigee discussion.)

 

Technical challenges and lessons learned

According to Brendan and Melissa, T-Mobile’s initial infrastructure resources “were established to support teams developing monolithic apps on stateful servers.” So, migrating to Spring Cloud Services on PCF required certain workarounds:

Networking was “the biggest hurdle.” Initially, PCF didn’t support T-Mobile’s networking layout, which provides a number of separate networks for each of the company’s availability zones. Some of the PCF tiles (e.g., MySQL and RabbitMQ) did not support such multi-subnet topologies. So, while the Pivotal team was working to update PCF services to support multiple subnets, developers at T-Mobile had to perform manual BOSH deployments as a temporary measure to enable Spring Cloud Services.

“We had to actually crack open the tiles, use BOSH releases from them, and deploy them manually with BOSH.” —Brendan Aye, T-Mobile

“RabbitMQ and MySQL were a must,” but the tiles for these two applications are multi-tenant single clusters, which was not going to work well within T-Mobile’s large production environment. Brendan explained that the so-called “bad neighbors” could flood the app with messages and requests, bringing down the entire cluster. Although the RabbitMQ and MySQL tiles were successfully used for Spring Cloud Services (e.g., Hystrix), it was decided not to offer them for actual production workloads.

The challenges faced while moving away from monolithic development (Source)

Working with private clouds, the company needed on-premises S3 object storage. There was an initial option to use a built-in NFS server, but it ran as only a single instance and so did not meet the availability requirements. To provide highly available S3 object storage, the team created BOSH releases for the open-source LeoFS tool and got them up and running across all three availability zones.

Having multiple data centers, T-Mobile lacked global control over load balancing. So, cross-region load balancing became a customer responsibility.

After shifting to Cloud Foundry and microservices, the rapid increase in app instances caused compliance concerns. To address them, the company’s team set up automated provisioning of permissions to the platform. “We used a tool called CF Management that allows you to use a GitHub repo as a source of permissions for your orgs,” explained Brendan. “By doing this, we can leverage all the existing stuff we have in source control, such as pull requests and permissions.”

“It’s very easy to see who approves something, when they approved it, and what’s changed.”
—Brendan Aye, T-Mobile
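The approach of treating a repository as the source of permissions can be sketched as a simple reconciliation step: compare the desired state (as it would be read from files in source control) with the platform’s current state, and compute what to grant and revoke. This is an illustrative sketch only. It does not reproduce the configuration format or behavior of the tool Brendan mentions, and the org names, user names, and `plan_changes` function are hypothetical.

```python
from __future__ import annotations


def plan_changes(desired: dict[str, set[str]],
                 actual: dict[str, set[str]]) -> dict[str, dict[str, set[str]]]:
    """Compare desired org membership with the platform's current state.

    Returns, per org, the users to grant access to and the users to revoke.
    Orgs that already match the desired state are omitted from the plan.
    """
    plan: dict[str, dict[str, set[str]]] = {}
    for org in sorted(desired.keys() | actual.keys()):
        want = desired.get(org, set())
        have = actual.get(org, set())
        grant, revoke = want - have, have - want
        if grant or revoke:
            plan[org] = {"grant": grant, "revoke": revoke}
    return plan


# Desired state, as it might be parsed from files in the repository.
desired = {"billing": {"alice", "bob"}, "usage": {"carol"}}
# Current state, as it might be reported by the platform.
actual = {"billing": {"alice", "dave"}, "usage": {"carol"}}

plan = plan_changes(desired, actual)
print(plan)
```

Because the desired state lives in source control, every change to it passes through a pull request, which is what provides the audit trail of who approved what and when.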

Brendan and Melissa emphasized the need for strong, consistent executive sponsorship to break down existing walls between departments and make the new system a success. Furthermore, the speakers pointed to persistence as one of the key factors in achieving important goals.

“We also learned to figure out what truly matters, what you need to ensure success and put your foot down when you need to.” —Brendan Aye, T-Mobile

Keeping all of the critical platform components under control by monitoring them allows for detecting problems in a timely manner and predicting possible hazards in the future.

Melissa Chapman speaking about T-Mobile’s experience with PCF

 

What’s next?

With a new cloud-native development approach, a DevOps culture, and customer-centric APIs, T-Mobile has made substantial progress in unleashing its “un-carrier” initiative. The results for the second quarter of 2017 include another 1.3 million customers added to its network and service revenues reaching a record level, up 8% year-over-year.

According to Brendan Aye, the team plans to further expand its offerings, to improve working with RabbitMQ and MySQL tiles, and to update the infrastructure foundations to get multi-subnet support. Furthermore, T-Mobile aspires to resolve the load-balancing issue to enable customers to “build and push one time and have it run across all the foundations.”

Earlier this September, the company also announced another move within the “un-carrier” initiative—providing a free Netflix subscription.

 

Want details? Watch the videos!

Table of contents
  1. What were the issues that T-Mobile was trying to solve? (0:31)
  2. Why did T-Mobile choose Pivotal Cloud Foundry? (1:49)
  3. Were there technological challenges with adoption? (4:08)
  4. How were existing processes migrated? (7:13)
  5. In what way did the transformation affect people at T-Mobile? (10:04)
  6. How well did the first app launch with PCF perform? (14:30)
  7. How long did the adoption cycle last? (18:40)
  8. What were the lessons learned during adoption? (20:36)

In the video below, James Webb and Brendan Aye highlight how the bundle of Cloud Foundry and microservices helped T-Mobile to avoid downtime during the recent iPhone X launch.

 

About the experts

Melissa Chapman has extensive experience in leading teams to deliver exceptional 24×7 services, supporting and developing mission-critical distributed server and application infrastructures. She has a strong background in managing complex ongoing technology operations, IT and business process development, and change initiatives. Melissa has a track record of exceeding organizational objectives with exceptional ROI through skillful leadership.

 

Himanshu Kumar is a Principal Developer at T-Mobile. He has around seven years of experience in quality assurance (QA) testing and software design and development in wireless and mobile technology. He has domain knowledge of wireless voice and packet data networks, architectures, and core system protocols. Himanshu has experience leading five-member QA teams in testing next-generation wireless products from the ground up, training them on product domain knowledge and tools.

 

Paul Williams is a Customer Success Manager at Apigee, which was acquired by Google in November 2016. He started programming while he was still in high school. Paul has a knack for understanding the key requirements in a business process and a unique ability to encode them into software requirements. Over the years, he has leveraged that skill in an array of roles ranging from system integrations for legal management to aircraft maintenance, payment processing, and behavior change support.

 

Brendan Aye is the Principal Cloud Foundry architect at T-Mobile, where he has been working for four years. Brendan is an experienced and dedicated problem solver. At T-Mobile, he makes use of his background in telecommunications for disaster recovery and troubleshooting.

 

 


This post was written by Alesia Bulanok, Roger Strukhoff, and Alex Khizhniak, edited by Carlo Gutierrez.