Cloud Foundry Deployment Metrics That Matter Most

by Anton Soroko, May 3, 2017
What Cloud Foundry metrics should we gather from VMs, BOSH, CF components, apps and the rest of the system? Here's a deep overview across the full stack.

Generating a variety of metrics

Having lots of metrics does not necessarily mean getting lots of value from monitoring. If you have worked with monitoring systems for a while, you may have noticed that the metrics you get can be roughly divided into the following three groups:

  1. Simple, clear, and easily understandable metrics (15–20%*).
  2. Metrics that are mostly useless from an operator’s perspective, but may come in handy for developers during application debugging (65–75%*).
  3. Metrics that appear to be useful, but are difficult to interpret (10–15%*). They can be really valuable if you fully understand how the system works and what exactly is being measured.

Note: *The figures are based on the author’s experience and represent his viewpoint.

Being a complex system comprising several distributed components, Cloud Foundry produces a great variety of metrics. Let's take a look at the most valuable ones, including those best suited for triggering alerts.

A Cloud Foundry deployment can emit metrics at the infrastructure, platform, and application levels.

Figure: A Cloud Foundry deployment: Layers of monitoring

In this blog post, we will start by evaluating the metrics emitted at the infrastructure level and then move up the abstraction scale to application metrics.

 

Virtual machines (VMs) and BOSH

BOSH has a monitoring component, the Health Monitor, that collects metrics from the BOSH Agents on all BOSH-deployed virtual machines.

Note: To retrieve metrics from a BOSH Agent, one can use the following tools:

Below are the metrics that can be collected while running virtual machines.

VM health

Metric | Description
system.healthy | The simplest VM health metric. As its name suggests, the metric indicates the health of a virtual machine from the BOSH perspective (a VM is up, and all the processes on it are running). This metric is an ideal candidate for setting up alert thresholds.

 

CPU

Metric | Description
system.cpu.user | The percentage of CPU utilization that occurred while executing at the user level.
system.cpu.sys | The percentage of CPU utilization that occurred while executing at the system (kernel) level.
system.cpu.wait | The percentage of time that the CPU(s) were idle while the system had an outstanding disk I/O request.
system.load.1m | The load average over the last minute.

Note: These CPU metrics are most useful for Diego cells.

Memory

Metric | Description
system.mem.percent | Memory usage in %
system.swap.percent | Swap usage in %
system.mem.kb | Memory usage in KB
system.swap.kb | Swap usage in KB

Note: These memory metrics are most useful for Diego cells.

Storage

Metric | Description
system.disk.<type>.percent | The percentage of disk space used
system.disk.<type>.inode_percent | The percentage of inodes used

Where <type> can be:

  • system: / (root partition)
  • persistent: partition for /var/vcap/store
  • ephemeral: partition for /var/vcap/data

These metrics are most useful for databases.

Figure: A dashboard with metrics from a specific VM (created with Heartbeat)

As you can see, the metrics monitored by BOSH are quite basic, so it makes sense to deploy a more advanced collector of system metrics to also monitor I/O (recommended for databases) or Network (recommended for Gorouter, HAProxy, and NGINX) metrics.

 

Cloud Foundry components

The metrics gathered from the Cloud Foundry system components are accepted by Loggregator and transported through a chain of its units, the last of which is Firehose. Firehose then streams the received metrics via nozzles to third-party systems for processing and persistence.

Let’s look at the most valuable metrics emitted by specific Cloud Foundry components.

Gorouter

Metric | Description
total_routes | The current number of the registered routes. The count on all the routers should be the same, so this metric is a good candidate for setting up alert thresholds.
total_requests | The lifetime** number of the received requests.
rejected_requests | The lifetime number of bad requests received by Gorouter.
bad_gateways | The lifetime number of bad gateways.
latency.<component> | The time (in milliseconds) it took the Gorouter to handle requests from each component (e.g., a Cloud Controller and UAA) to its endpoints.
requests.<component> | The lifetime number of the requests received for each component (e.g., a Cloud Controller and UAA).
responses | The lifetime number of the HTTP responses.
responses.XXX | The lifetime number of the HTTP response status codes of type XXX.

**It’s recommended to convert all the lifetime metrics to rate.
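The conversion from a lifetime counter to a rate can be sketched as follows. This is a minimal illustration rather than part of any Cloud Foundry tooling: the function name `counter_to_rates` and the sample layout (timestamp in seconds, counter value) are assumptions made for the example. It also assumes that a drop in the counter value means the emitting process restarted and began counting from zero again.

```python
from typing import List, Sequence, Tuple

# Hypothetical sample layout: (unix_time_seconds, lifetime_counter_value)
Sample = Tuple[float, float]


def counter_to_rates(samples: Sequence[Sample]) -> List[float]:
    """Return per-second rates between consecutive samples of a lifetime counter."""
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            continue  # skip duplicate or out-of-order samples
        # Assumption: a lower value than before means the counter was reset
        # by a restart, so only the value accumulated since then is counted.
        delta = v1 - v0 if v1 >= v0 else v1
        rates.append(delta / elapsed)
    return rates


if __name__ == "__main__":
    # total_requests sampled every 10 seconds, with a restart before the third sample
    samples = [(0, 100), (10, 150), (20, 30), (30, 90)]
    print(counter_to_rates(samples))  # [5.0, 3.0, 6.0]
```

Many monitoring back ends offer an equivalent built-in function, so a hand-written conversion like this is mostly useful inside a custom nozzle.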

Figure: A dashboard with metrics from Gorouter

In addition, you can derive some useful metrics from the HttpStartStop event inside a nozzle.

To do so, you need to use the Uri field inside the HttpStartStop struct to get a URI, StartTimestamp / StopTimestamp to get a time interval, and StatusCode to distinguish the successfully completed responses from the failed ones.

Metric | Description
request | The number of the successfully completed HTTP responses for a particular URL
error | The number of the failed HTTP responses for a particular URL
response times | Response time for a particular URL
responses.XXX | The number of the HTTP response status codes of type XXX for a particular URL
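As a rough sketch of how a nozzle could derive these per-URL metrics, the example below aggregates a batch of events. The `HttpEvent` class is a hypothetical, simplified stand-in for a decoded HttpStartStop event, not the real message type. The example assumes the timestamps are in nanoseconds and treats only 5XX status codes as failures, which is a choice you may want to change.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class HttpEvent:
    """Hypothetical, simplified stand-in for a decoded HttpStartStop event."""
    uri: str
    start_timestamp: int  # assumed to be UNIX time in nanoseconds
    stop_timestamp: int   # assumed to be UNIX time in nanoseconds
    status_code: int


def summarize_by_uri(events):
    """Aggregate request count, error count, and average response time per URI."""
    totals = defaultdict(lambda: {"request": 0, "error": 0, "total_ms": 0.0})
    for ev in events:
        entry = totals[ev.uri]
        # Assumption: only 5XX responses count as failed.
        if ev.status_code >= 500:
            entry["error"] += 1
        else:
            entry["request"] += 1
        # Convert the nanosecond interval to milliseconds.
        entry["total_ms"] += (ev.stop_timestamp - ev.start_timestamp) / 1e6

    summary = {}
    for uri, entry in totals.items():
        count = entry["request"] + entry["error"]
        summary[uri] = {
            "request": entry["request"],
            "error": entry["error"],
            "avg_response_ms": entry["total_ms"] / count,
        }
    return summary


if __name__ == "__main__":
    batch = [
        HttpEvent("/v2/info", 0, 2_000_000, 200),
        HttpEvent("/v2/info", 0, 4_000_000, 503),
    ]
    print(summarize_by_uri(batch))
```

Grouping by another field of the event instead of the URI follows the same pattern.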

Diego

Metric | Description
CrashedActualLRPs | The total number of the long-running process (LRP) instances that have crashed
LRPsMissing | The total number of the LRP instances that are desired, but have no record in the Bulletin Board System (BBS)
LRPsRunning | The total number of the LRP instances that are running on cells

The metrics below are good candidates for setting up alert thresholds.

Metric | Description
RoutesTotal | The number of the routes in the route-emitter's routing table.
ContainerCount | The number of containers hosted on a cell.
UnhealthyCell | Determines whether the cell has failed to pass its healthcheck against the Garden backend. "0" signifies healthy, and "1" signifies unhealthy.

The metrics that are useful for Capacity Planning are also worth mentioning.

Metric | Description
CapacityRemainingContainers | The remaining number of containers this cell can host.
CapacityRemainingDisk | The remaining amount (in MiB) of the disk space available for this cell to allocate to containers.
CapacityRemainingMemory | The remaining amount (in MiB) of the memory available for this cell to allocate to containers.
CapacityTotalContainers | The total number of containers this cell can host. Note that this value is set to 250 by default. To get viable data, set a value that is relevant for your infrastructure.
CapacityTotalDisk | The total amount (in MiB) of the disk available for this cell to allocate to containers.
CapacityTotalMemory | The total amount (in MiB) of the memory available for this cell to allocate to containers.
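For capacity planning, it is often easier to read these pairs as a utilization percentage per cell. The sketch below shows one way to compute it. The function names and the dictionary layout are assumptions for illustration, and the sample values are made up.

```python
def capacity_used_percent(total: float, remaining: float) -> float:
    """Return the share of capacity already allocated, in percent."""
    if total <= 0:
        raise ValueError("total capacity must be positive")
    return 100.0 * (total - remaining) / total


def cell_utilization(cell_metrics: dict) -> dict:
    """Compute utilization for containers, disk, and memory of one cell.

    cell_metrics is a hypothetical mapping from metric name to the latest value.
    """
    result = {}
    for resource in ("Containers", "Disk", "Memory"):
        total = cell_metrics["CapacityTotal" + resource]
        remaining = cell_metrics["CapacityRemaining" + resource]
        result[resource] = capacity_used_percent(total, remaining)
    return result


if __name__ == "__main__":
    # Made-up sample values for a single cell
    cell = {
        "CapacityTotalContainers": 250,
        "CapacityRemainingContainers": 200,
        "CapacityTotalDisk": 100000,
        "CapacityRemainingDisk": 60000,
        "CapacityTotalMemory": 32768,
        "CapacityRemainingMemory": 8192,
    }
    print(cell_utilization(cell))
    # {'Containers': 20.0, 'Disk': 40.0, 'Memory': 75.0}
```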

Figure: A dashboard with Diego metrics (capacity)

Etcd

Metric | Description
IsLeader | Determines whether the host is currently the Leader
Followers | The number of Followers the host currently has

The fluctuation of these metrics can signify network, configuration, or Cloud Foundry upgrade issues.
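One simple check built on IsLeader is to verify that exactly one host in the cluster reports itself as the Leader. The sketch below assumes the metric is emitted as 1 for the Leader and 0 otherwise. The function name and the host names are hypothetical.

```python
from typing import Dict, Optional


def leader_problem(is_leader_by_host: Dict[str, int]) -> Optional[str]:
    """Return a description of the problem, or None if exactly one leader exists.

    is_leader_by_host maps a host name to its latest IsLeader value
    (assumed to be 1 for the Leader and 0 for everyone else).
    """
    leaders = sorted(host for host, value in is_leader_by_host.items() if value == 1)
    if len(leaders) == 1:
        return None
    if not leaders:
        return "no leader elected"
    return "multiple leaders: " + ", ".join(leaders)


if __name__ == "__main__":
    print(leader_problem({"etcd-0": 1, "etcd-1": 0, "etcd-2": 0}))  # None
    print(leader_problem({"etcd-0": 0, "etcd-1": 0, "etcd-2": 0}))  # no leader elected
```

A brief window without a leader can be normal during an election, so in practice such a check is usually applied only when the condition persists over several samples.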

 

Consul

Firehose does not stream metrics from Consul, so you need a third-party agent to monitor this Cloud Foundry component.

Among useful metrics are the current leader and the number of peers. The fluctuation of these metrics can signify network, configuration, or Cloud Foundry upgrade issues.

Furthermore, Consul stores the results of the service health checks. You can visualize these results on a single dashboard and attach alert rules (based on the service status and the above described metrics) to it.

Figure: A Consul monitoring dashboard

Cloud Controller

Metric | Description
total_users | The total number of users ever created, including inactive users.
http_status.XXX | The number of the HTTP response status codes of type XXX. It makes sense to set up alert thresholds for 5XX status codes.
log_count.debug | The number of log messages of different severities. Pay attention to the fatal/error/warn levels. A good candidate for setting up alert thresholds.
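A threshold for 5XX status codes is often easier to tune as a share of all responses than as an absolute count. The sketch below illustrates the idea. The function names, the key naming ("2XX", "5XX"), and the 5% default threshold are all assumptions for the example, and the input is expected to hold counts for a single interval rather than lifetime totals.

```python
from typing import Dict


def server_error_ratio(status_counts: Dict[str, int]) -> float:
    """Return the share of 5XX responses among all responses in one interval."""
    total = sum(status_counts.values())
    if total == 0:
        return 0.0  # no traffic, so nothing to alert on
    return status_counts.get("5XX", 0) / total


def should_alert(status_counts: Dict[str, int], threshold: float = 0.05) -> bool:
    """Return True when the 5XX share exceeds the (assumed) threshold."""
    return server_error_ratio(status_counts) > threshold


if __name__ == "__main__":
    interval = {"2XX": 940, "4XX": 40, "5XX": 20}
    print(server_error_ratio(interval))  # 0.02
    print(should_alert(interval))        # False
```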

UAA

Metric | Description
user_authentication_failure_count | The number of failed user authentication attempts since the last start of the UAA process
user_not_found_count | The number of times a user was not found since the last start of the UAA process
user_password_changes | The number of successful password changes by a user since the last start of the UAA process
user_password_failures | The number of failed password changes by a user since the last start of the UAA process

These metrics are good candidates for setting up security alert thresholds.

 

Application metrics

There are several more components inside Cloud Foundry that do not stream metrics via Firehose, but also need monitoring to assure flawless operation of your Cloud Foundry deployment (e.g., internal MySQL/PostgreSQL, HAProxy, or NGINX).

See the Monitoring third-party services section to get an idea of how to retrieve metrics from such components.

The container metrics and HTTP events from applications are also streamed via Firehose and, hence, are available on Cloud Foundry out of the box.

System metrics

Metric | Description
CPU | CPU usage
Memory | Memory usage
Disk | Disk usage

HTTP metrics

Metric | Description
responses.XXX | The number of the HTTP response status codes of type XXX (it makes sense to convert them to rate)

In addition, you can derive some useful metrics from the HttpStartStop event inside a nozzle.

To do so, you need to use the ApplicationId field inside the HttpStartStop struct to map a request to a particular application, StartTimestamp / StopTimestamp to get a time interval, and StatusCode to distinguish good responses from bad ones.

Metric | Description
request | The number of good HTTP responses
error | The number of bad HTTP responses
response times | Application response time

You can retrieve more technical metrics by using certain buildpack internals (e.g., JMX integration, APM agents, etc.). To get business metrics, you can define them inside the application and then send to a metrics receiving system (e.g., a statsd daemon).
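To illustrate the business-metrics case, here is a minimal sketch that sends a metric to a statsd daemon over UDP using the plain-text statsd line format (`name:value|type`). The metric names, the helper function names, and the daemon address (127.0.0.1:8125) are assumptions. A real application would more likely use an existing statsd client library.

```python
import socket


def format_statsd(name: str, value, metric_type: str = "c") -> bytes:
    """Build one statsd line, e.g. b"orders.completed:1|c" for a counter."""
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_metric(name: str, value, metric_type: str = "c",
                host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send a single metric as a UDP datagram (fire and forget)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(format_statsd(name, value, metric_type), (host, port))


if __name__ == "__main__":
    # Hypothetical business metrics: one completed order and a checkout duration in ms
    send_metric("orders.completed", 1)
    send_metric("checkout.duration", 320, "ms")
```

Because UDP is connectionless, the application is not blocked when the daemon is unavailable, but lost datagrams go unnoticed.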

Figure: Application metrics visualized

 

Monitoring third-party services

Some services (e.g., RabbitMQ or Redis) stream metrics via Firehose, so all you need to do to start monitoring them is deploy a Firehose nozzle. With services that do not send metrics to Firehose, things get a bit more complicated: you need to gather the metrics on your own, know where to collect them from (e.g., IP and port), and, probably, know the credentials. Fortunately, many metric collectors come with predefined integrations for a great number of services. Thus, you need to install the agent on the service and create a proper auto-configuration (e.g., based on a BOSH manifest file).

Similarly, you can monitor the Cloud Foundry components that do not stream metrics via Firehose (e.g., internal MySQL/PostgreSQL, HAProxy, or NGINX).

 

All the dashboards in the blog post were taken from Heartbeat—a monitoring tool containing all the mentioned metrics and many others.

 
