Cloud Foundry Advisory Board Meeting, May 2020: CI and Logs at Scale

by Carlo Gutierrez, May 21, 2020

Engineers from T-Mobile shared their experience managing log loss on VMware Tanzu and the lessons learned from the company's shift to life-cycle automation.

This month’s Cloud Foundry Community Advisory Board (CAB) meeting focused on two presentations from T-Mobile. The presentations centered around management of log loss on VMware Tanzu (Pivotal Cloud Foundry) and T-Mobile’s shift to life-cycle automation.

The meeting also covered leadership changes for CF Extensions, the upcoming North American Summit, and updates on the ecosystem projects.

Addressing log loss

Eamon Maguire

Eamon Maguire of T-Mobile led a presentation on managing log loss at scale on VMware Tanzu. To put the scale in context, he noted that T-Mobile had more than 3,000 applications handling 700 million daily transactions. These applications spanned 70,000 containers across more than 20 foundations. (Previously, we’ve written about how T-Mobile handles 1M+ transactions daily thanks to Kubernetes, and how the company slashed production time from months to days with Cloud Foundry.)

According to Eamon, T-Mobile had frequent issues with severe log loss across applications on VMware Tanzu. The loss could happen at any point in the system. The organization experienced a sustained log loss of 15%, which ballooned to 75% during peak periods, causing delays in various components.

Finding the cause of the issue was challenging, since T-Mobile did not own all of the components. Therefore, they did not have a good view from the application perspective and did not have consistent alerting. The organization had to rely on manual and ad-hoc processes to pinpoint issues.

“At any step in the chain, we could see issues with CPU, memory, or queues filling up. What we found out is that more often than not, the cause of the problem was a ‘noisy neighbor,’ which is an application that is just logging excessively. Since it’s a shared platform, there’s typically a single or a few applications responsible for flooding everything and causing the log loss.” —Eamon Maguire, T-Mobile

After identifying the cause of the log losses, T-Mobile developed a solution.

Define service-level objectives. 90% of logs should pass through Splunk.
Create terms of service. Individual application instances should not exceed 100,000 logs per minute. They should also never exceed a burst rate of 1 million logs per minute.
Monitor applications. Identify which VMware Tanzu clients are logging more than 100,000 logs per minute for each application.
Notify teams. Automatically e-mail the team behind offending applications.
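The thresholds above lend themselves to a simple check. What follows is a minimal sketch in Python, not T-Mobile's actual tooling: the Sample record, the function names, and the application names are hypothetical, and the per-instance log counts are assumed to come from whatever metrics source the platform exposes. For simplicity, it judges each one-minute sample on its own rather than tracking a sustained rate over time.

```python
from dataclasses import dataclass

# Limits taken from the terms of service described above.
SUSTAINED_LIMIT = 100_000    # logs per minute per application instance
BURST_LIMIT = 1_000_000      # logs per minute that should never be exceeded
SLO_DELIVERY_RATIO = 0.90    # share of logs expected to reach Splunk


@dataclass(frozen=True)
class Sample:
    """One per-minute log count for one instance (hypothetical record shape)."""
    app: str
    instance: int
    logs_per_minute: int


def find_noisy_neighbors(samples):
    """Return {app: reason} for every app with an instance over a limit.

    "burst" takes precedence over "sustained" when both apply to one app.
    """
    offenders = {}
    for s in samples:
        if s.logs_per_minute > BURST_LIMIT:
            offenders[s.app] = "burst"
        elif s.logs_per_minute > SUSTAINED_LIMIT:
            # Keep an earlier "burst" verdict if one was already recorded.
            offenders.setdefault(s.app, "sustained")
    return offenders


def meets_slo(emitted, delivered):
    """True when the delivered share of logs satisfies the 90% objective."""
    if emitted == 0:
        return True  # nothing was emitted, so nothing was lost
    return delivered / emitted >= SLO_DELIVERY_RATIO


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    samples = [
        Sample("billing", 0, 40_000),
        Sample("checkout", 0, 150_000),
        Sample("checkout", 1, 2_000_000),
    ]
    for app, reason in find_noisy_neighbors(samples).items():
        # A real system would e-mail the owning team here.
        print(f"{app}: exceeded the {reason} limit")
```

The notification step is left as a print statement, since the article does not describe how the e-mails are sent.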

Identifying the noisy neighbors (Image credit)

Using this method, T-Mobile has improved retention on Splunk. Automated detection and notifications are now helping the support team to save up to 2 hours daily.

Life-cycle automation

Brandon Indrick of T-Mobile presented on the organization’s shift to life-cycle automation. Prior to automation, the company faced such challenges as:

No consistency. Each time a foundation was brought online, the parameters, tiles, or versions would vary. The more foundations that were brought online, the greater the variance became, making it difficult to update.
No change tracking. Documentation simply was not being updated when a change was made.
Lengthy and chaotic upgrades. A single foundation would take up to 2 weeks to upgrade.

To resolve these challenges, T-Mobile made use of Pivotal Platform Automation with Concourse-based pipelines. Using this solution, foundation configurations are now living documents stored in source control. Naturally, automation also meant no more manual deployments, which reduced errors.
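To illustrate why keeping configurations in source control helps with consistency, here is a minimal sketch of a drift check across foundations. It is only an illustration and not part of Platform Automation or Concourse: the find_drift function, the foundation names, and the setting names are all hypothetical, and each foundation's configuration is assumed to have already been loaded into a flat dictionary of string values.

```python
def find_drift(foundations):
    """Map each setting to its per-foundation values when they disagree.

    `foundations` maps a foundation name to a flat dict of settings.
    A setting that is missing from a foundation is reported as None.
    """
    # Collect every setting name seen in any foundation.
    all_keys = set()
    for settings in foundations.values():
        all_keys.update(settings)

    drift = {}
    for key in sorted(all_keys):
        values = {name: settings.get(key) for name, settings in foundations.items()}
        # More than one distinct value means the foundations have diverged.
        if len(set(values.values())) > 1:
            drift[key] = values
    return drift


if __name__ == "__main__":
    # Made-up foundations and settings for illustration only.
    foundations = {
        "foundation-a": {"example_tile_version": "1.2.0", "example_stemcell": "100.1"},
        "foundation-b": {"example_tile_version": "1.1.0", "example_stemcell": "100.1"},
    }
    for setting, values in find_drift(foundations).items():
        print(setting, values)
```

Once every foundation is described by a file in the same repository, a check like this could run in a pipeline and flag divergence before it accumulates.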

“Today, all of our tiles are backed up by configuration files. Foundation consistency, which was all over the board, is now at an all time high. From no change tracking to full change tracking, we have verification and accountability. From moving one foundation to a single-point upgrade every 2 weeks, we can now do over 3 multi-point foundation upgrades per week.”
—Brandon Indrick, T-Mobile

Automating deployments with Concourse (Image credit)

Runtime PMC

Eric Malm

Eric Malm of Pivotal reported the following developments:

The Release Integration team delivered cf-deployment v13 and cf-for-k8s v0.2.0.
The KubeCF team has released KubeCF v2.2.0, which integrates Eirini and EiriniX Helm charts.
The CLI team is preparing to release the initial general availability of CLI v7.
The CAPI team is improving kpack integration to enable such operations as updating buildpacks and rootfs.
Eirini now supports rolling deploys for applications. The team is also working on application tasks.
The Networking team is working with Route CRDs to translate routing information into the underlying Kubernetes cluster.
The Logging and Metrics team collaborated with the CAPI team to expose Cloud Controller metrics via Prometheus on Kubernetes. The work on injecting control-plane logs into the application log streams is in progress, as well.

Runtime PMC’s GitHub repo

CF Extensions

Troy Topnik

Troy Topnik of SUSE, who moderated the call, mentioned that nominations for a new CF Extensions PMC lead are still open. The position was previously held by Michael Maximilien of IBM (aka Dr. Max). After nearly four years as the head of CF Extensions, Dr. Max announced on May 1 that he would step down.

CF Extensions’ GitHub repo

CF Foundation updates

Chip Childers

Chip Childers noted that the schedule for the North American Cloud Foundry Summit is now available. The conference will be held virtually from 9 a.m. to 5 p.m. CDT on June 24 and from 9 a.m. to 4 p.m. CDT on June 25. Registrations are open, and a contributor code has been sent to the cf-dev mailing list.

Recently, the Cloud Foundry Foundation also published its bi-weekly technical round-up and release notes report as of Q1 2020.

The next CAB call is preliminarily scheduled for June 17, 2020, at 8 a.m. PDT. Anyone interested can join Cloud Foundry’s CAB Slack channel.