Alert for Divergent Times on Linux – Using Chrony and AWS CloudWatch

Introduction

Accurate and precise timekeeping in Linux is of utmost importance in production servers for several critical reasons. First and foremost, many applications and services rely on precise time synchronization to function correctly, preventing failures and inconsistencies. Additionally, records and logs are often used for troubleshooting and performance analysis, and accurate time is essential for tracking events precisely.

In this scenario, observability plays a pivotal role. By embracing observability tools such as real-time monitoring and log analysis, system administrators can quickly identify any deviations in server time. Furthermore, alert systems can be configured to notify promptly about time-related issues.

Observability makes it possible to confirm that server time stays synchronized and, when it deviates, to respond quickly and correct the issue before it affects server operations and, consequently, the services provided to users.

I have explained the installation and operation of Chrony, which we will be using in this article, in a separate post.

Problem

Virtual machine instances that are part of Auto Scaling are typically created from pre-configured images. These images may include incorrect or outdated time settings. Additionally, the startup process can take some time, resulting in a time lag between the time configured in the image and the actual boot time.

When multiple instances are dynamically deployed and deactivated in response to system load, manually correcting the time on each instance can be a challenging and error-prone task. Time discrepancies among machines can lead to synchronization issues in distributed services, inconsistent logs, and difficulties in tracking events across the environment.

The dynamic nature of Auto Scaling machines presents significant challenges in terms of environment observability. As new instances are created and others are shut down, the infrastructure topology constantly changes. This can lead to issues in discovering and monitoring instances in real-time.

Another issue is the centralized collection of logs, metrics, and events from different instances that are continuously added and removed. Dealing with the variety of dynamically assigned IP addresses, identifiers, and other attributes to machines can make configuring an efficient monitoring system challenging.

Solution for EC2 in Auto Scaling Group (ASG)

Figure: Flow - Observability - Time alerts

For the machines operating within the Auto Scaling group that require accurate timekeeping with utmost certainty, it was necessary to configure Chrony and establish a structure for alerting in case of any time discrepancies, based on dynamically generated metrics.

The structure is explained in the above image and can be summarized as follows:

  • During the EC2 instance startup, an alarm is created based on the configuration existing at the path /etc/rc.local.
  • Configuration in the Crontab publishes the metric to AWS CloudWatch.
  • Before the EC2 instance terminates, the alarm is deleted by the aa-run-before-shutdown.service systemd unit.

This entire process is essential to ensure that alarms are created dynamically for each EC2 instance that spawns within the Auto Scaling group, preventing any clutter in CloudWatch.

Steps

Key Point

The configurations demonstrated in this article are performed on a Linux server that serves as the basis for Auto Scaling. At the end of the process, it is recommended to create an Amazon Machine Image (AMI) and make the necessary configurations in the Launch Template and Auto Scaling. I will not detail these steps to keep the article from becoming too lengthy.

Additionally, we need an AWS SNS topic for sending emails through AWS CloudWatch alerts.

Script Creates Alarm

We need to create the alarm in CloudWatch every time the EC2 instance is launched. To do this, we’ll set up the /etc/rc.local file.

The /etc/rc.local file is a script file that runs during the Linux system’s boot process. However, the use of rc.local to start services or execute scripts can vary depending on the Linux distribution you are using. Starting from Debian 9 (Stretch) and newer systems, rc.local is disabled by default and requires some additional configurations to be used.
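As a rough sketch of those additional configurations: on systemd-based distributions, the rc-local compatibility unit normally runs /etc/rc.local only when the file exists and is executable. The helper name ensure_rc_local below is hypothetical, and the behavior should be checked against your own distribution.

```shell
# Sketch: make sure an rc.local file exists and is executable.
# ensure_rc_local is a hypothetical helper, not part of any distribution.
ensure_rc_local() {
    local path="${1:-/etc/rc.local}"
    # Create a minimal file with a shebang if none exists yet.
    if [ ! -f "$path" ]; then
        printf '#!/bin/bash\n' > "$path"
    fi
    # The compatibility unit only runs the file when it is executable.
    chmod +x "$path"
}

# Typical use on the base server (as root):
#   ensure_rc_local /etc/rc.local
#   systemctl status rc-local.service   # after a reboot, check that the script ran
```

The path is a parameter so the helper can be tried on a scratch file before touching /etc/rc.local.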

Edit the /etc/rc.local file and add the following content:

#!/bin/bash
# Runs at boot: creates a per-instance CloudWatch alarm on the ClockErrorBound metric.

# Instance ID from the EC2 instance metadata service.
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)

# Alarm when the 5-minute average stays above 1 ms for 3 consecutive periods.
aws cloudwatch put-metric-alarm \
    --alarm-name "alerta-horario-ec2-$INSTANCE_ID" \
    --alarm-description "Alarm for EC2 $INSTANCE_ID time issues due to ClockErrorBound above 1ms" \
    --metric-name ClockErrorBound \
    --namespace TimeDrift \
    --statistic Average \
    --period 300 \
    --threshold 1 \
    --comparison-operator GreaterThanThreshold \
    --dimensions "Name=Instance,Value=$INSTANCE_ID" \
    --evaluation-periods 3 \
    --datapoints-to-alarm 3 \
    --region sa-east-1 \
    --alarm-actions arn:aws:sns:sa-east-1:123456789:alertas-devops-mind

This script is a Bash shell script that creates a metric alarm in Amazon CloudWatch to monitor time drift on a specific EC2 instance.

Let’s break down each part of the script:

  1. #!/bin/bash: Specifies the command interpreter to be used, in this case, Bash.
  2. INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id): Uses the curl command to retrieve the ID of the current EC2 instance where the script is being executed. The IP address 169.254.169.254 is a special metadata interface for AWS EC2 instances, and /latest/meta-data/instance-id is an endpoint that returns the instance ID.
  3. aws cloudwatch put-metric-alarm ...: Uses the AWS Command Line Interface (CLI) to create a metric alarm in Amazon CloudWatch with the following parameters:
  4. --alarm-name "alerta-horario-ec2-$INSTANCE_ID": Sets the alarm name, which includes the instance ID to make it unique.
  5. --alarm-description "Alarm for EC2 $INSTANCE_ID time issues due to ClockErrorBound above 1ms": Provides a description for the alarm, explaining its purpose.
  6. --metric-name ClockErrorBound: Specifies the name of the metric to be monitored (in this case, "ClockErrorBound").
  7. --namespace TimeDrift: Defines the namespace of the metric. The namespace acts as a "container" for related metrics.
  8. --statistic Average: Specifies the statistic to be used for alarm evaluation (in this case, the average).
  9. --period 300: Specifies the time period, in seconds, for which the metric is evaluated (in this case, 300 seconds or 5 minutes).
  10. --threshold 1: Sets the threshold value at which the alarm will trigger (in this case, 1, indicating time drift above 1 millisecond).
  11. --comparison-operator GreaterThanThreshold: Defines the comparison operator to be used for evaluating the metric against the specified threshold (in this case, "greater than").
  12. --dimensions "Name=Instance,Value=$INSTANCE_ID": Specifies the dimensions of the metric to identify the specific instance to be monitored (using the previously obtained instance ID).
  13. --evaluation-periods 3: Defines the number of consecutive evaluation periods that the metric must exceed the threshold for the alarm to trigger (in this case, 3 periods of 5 minutes each, totaling 15 minutes).
  14. --datapoints-to-alarm 3: Specifies the number of data points that must meet the alarm condition for it to trigger (in this case, 3 consecutive data points).
  15. --region sa-east-1: Specifies the AWS region where the alarm will be created (in this case, "sa-east-1").
  16. --alarm-actions arn:aws:sns:sa-east-1:123456789:alertas-devops-mind: Specifies the action to be taken when the alarm is triggered. In this case, the alarm will send a notification to the SNS (Simple Notification Service) topic "alertas-devops-mind" in the "sa-east-1" region with the AWS account number "123456789".

Therefore, this script creates a metric alarm in Amazon CloudWatch to monitor time drift (“ClockErrorBound”) on a specific EC2 instance and sends a notification to an SNS topic when the time drift exceeds 1 millisecond for three consecutive evaluation periods. This allows for the identification of time synchronization issues on the instance and the implementation of corrective actions in response to triggered alarms.
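One possible hardening, sketched here as an assumption and not as part of the original setup: if the instance enforces IMDSv2, the plain curl call to the metadata endpoint is rejected, and an empty or malformed instance ID would produce a wrongly named alarm. The helper names get_instance_id and is_instance_id are hypothetical; the token endpoint and headers are the ones IMDSv2 uses.

```shell
# Sketch: fetch the instance ID via IMDSv2 and validate it before use.
# get_instance_id and is_instance_id are hypothetical helper names.
IMDS="http://169.254.169.254/latest"

get_instance_id() {
    local token
    # IMDSv2: request a short-lived session token first.
    token=$(curl -s --max-time 2 -X PUT "$IMDS/api/token" \
        -H "X-aws-ec2-metadata-token-ttl-seconds: 60") || return 1
    # Then present the token when reading the instance ID.
    curl -s --max-time 2 -H "X-aws-ec2-metadata-token: $token" \
        "$IMDS/meta-data/instance-id"
}

is_instance_id() {
    # Instance IDs look like "i-" followed by 8 or 17 hexadecimal characters.
    printf '%s\n' "$1" | grep -Eq '^i-([0-9a-f]{8}|[0-9a-f]{17})$'
}

# Usage in /etc/rc.local, before calling put-metric-alarm:
#   INSTANCE_ID=$(get_instance_id)
#   is_instance_id "$INSTANCE_ID" || exit 1
```

Exiting early when the ID does not validate avoids creating an alarm that no instance will ever feed or delete.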

Script Creates Metric

We also need to create a Shell Script that generates the metric based on the Chrony output.

The EC2 instance’s instance-id is obtained via the meta-data endpoint.

The value generated from Chrony is sent to AWS CloudWatch using the aws-cli.

Save the script at the path /devops/scripts/alerta-horario/timepublisher.sh with the following content:

#!/bin/bash
# Publishes the ClockErrorBound metric (in milliseconds) to CloudWatch,
# computed from the output of "chronyc tracking".

SYSTEM_TIME=""
ROOT_DELAY=""
ROOT_DISPERSION=""
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)

output=$(chronyc tracking)

# Look for "System time", "Root delay", "Root dispersion" and keep the
# numeric value (in seconds) that follows the colon.
# $line is left unquoted on purpose so that echo collapses repeated spaces.
while read -r line; do
    if [[ $line == "System time"* ]]; then
        SYSTEM_TIME=$(echo $line | cut -f2 -d":" | cut -f2 -d" ")
    elif [[ $line == "Root delay"* ]]; then
        ROOT_DELAY=$(echo $line | cut -f2 -d":" | cut -f2 -d" ")
    elif [[ $line == "Root dispersion"* ]]; then
        ROOT_DISPERSION=$(echo $line | cut -f2 -d":" | cut -f2 -d" ")
    fi
done <<< "$output"

# Clock error bound in ms: (system time + half the root delay + root dispersion) * 1000.
CLOCK_ERROR_BOUND=$(echo "($SYSTEM_TIME + (.5 * $ROOT_DELAY) + $ROOT_DISPERSION) * 1000" | bc)

# Create or update a custom metric in CloudWatch.
aws cloudwatch put-metric-data \
    --metric-name ClockErrorBound \
    --dimensions "Instance=$INSTANCE_ID" \
    --namespace "TimeDrift" \
    --region sa-east-1 \
    --value "$CLOCK_ERROR_BOUND"

This Bash shell script aims to retrieve information about time drift from the EC2 instance on which it is executed and then send this value as a custom metric (ClockErrorBound) to Amazon CloudWatch.

Let’s break down step by step what the script does:

  1. #!/bin/bash: Specifies the command interpreter to be used, in this case, Bash.
  2. Variable Definitions: SYSTEM_TIME, ROOT_DELAY, ROOT_DISPERSION, INSTANCE_ID. These variables will be used to store information about time drift and the EC2 instance ID.
  3. INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id): Uses the curl command to obtain the current EC2 instance ID where the script is being executed. The IP address 169.254.169.254 is a special metadata interface for AWS EC2 instances, and /latest/meta-data/instance-id is an endpoint that returns the instance ID.
  4. output=$(chronyc tracking): Executes the chronyc tracking command to retrieve information about the synchronization status of the Chrony client, responsible for time adjustment on the instance.
  5. while read -r line; do: Initiates a loop to iterate over each line of output from the chronyc tracking command.
  6. if [[ $line == "System time"* ]]: Checks if the line starts with "System time", and if so, extracts the system time value and stores it in the SYSTEM_TIME variable.
  7. elif [[ $line == "Root delay"* ]]: Checks if the line starts with "Root delay", and if so, extracts the root delay value and stores it in the ROOT_DELAY variable.
  8. elif [[ $line == "Root dispersion"* ]]: Checks if the line starts with "Root dispersion", and if so, extracts the root dispersion value and stores it in the ROOT_DISPERSION variable.
  9. done <<< "$output": Ends the loop after processing all lines of output from the chronyc tracking command.
  10. CLOCK_ERROR_BOUND=...: Calculates the clock error bound as SYSTEM_TIME + 0.5 * ROOT_DELAY + ROOT_DISPERSION. Chrony reports these values in seconds, so the result is multiplied by 1000 to express it in milliseconds, the unit the alarm threshold uses.
  11. aws cloudwatch put-metric-data ...: Uses the AWS CLI (Command Line Interface) to create or update a custom metric in Amazon CloudWatch with the following parameters:
  12. --metric-name ClockErrorBound: Sets the name of the custom metric as "ClockErrorBound".
  13. --dimensions Instance=$INSTANCE_ID: Specifies the metric dimensions to identify the specific instance being monitored (using the previously obtained instance ID).
  14. --namespace "TimeDrift": Defines the namespace of the custom metric as "TimeDrift".
  15. --region sa-east-1: Specifies the AWS region where the metric will be created or updated (in this case, "sa-east-1").
  16. --value $CLOCK_ERROR_BOUND: Sets the value of the custom metric as the result of the time drift calculation performed earlier.

Therefore, this script gathers information about time drift from the EC2 instance using the Chrony client, calculates a clock error bound based on this information, and sends the resulting value as a custom metric to Amazon CloudWatch. This process can be useful for monitoring the accuracy of the instance’s time synchronization and taking corrective actions if the time drift exceeds the defined threshold.
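To make the calculation concrete, here is a small sketch that applies the same formula to a sample of chronyc tracking output. The function name clock_error_bound_ms and the sample values are made up for illustration, and awk is used here in place of the cut and bc pipeline from the script.

```shell
# Sketch: compute ClockErrorBound (ms) from "chronyc tracking" text on stdin.
# clock_error_bound_ms is a hypothetical helper; the sample values are invented.
clock_error_bound_ms() {
    awk -F':' '
        /^System time/     { split($2, f, " "); sys  = f[1] }
        /^Root delay/      { split($2, f, " "); dly  = f[1] }
        /^Root dispersion/ { split($2, f, " "); disp = f[1] }
        END { printf "%.3f\n", (sys + 0.5 * dly + disp) * 1000 }
    '
}

sample='System time     : 0.000020000 seconds fast of NTP time
Root delay      : 0.001000000 seconds
Root dispersion : 0.000500000 seconds'

# (0.00002 + 0.5 * 0.001 + 0.0005) * 1000 = 1.020 ms
printf '%s\n' "$sample" | clock_error_bound_ms
```

With these sample values the result is 1.020 ms, which is just above the alarm threshold of 1. On a real instance the function would be fed with chronyc tracking | clock_error_bound_ms.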

Adjusting the Crontab

To edit the crontab, you can use the crontab command. The crontab is the file that contains cron entries, and each user can have their own customized crontab. To edit the crontab for the current user, follow the steps below:

  1. Open a terminal in Linux.
  2. Type the following command to open the crontab for the current user in the system’s default text editor (usually nano, vim, or vi):
crontab -e
  3. If this is the first time you are editing the crontab for this user, the system may prompt you to choose the default editor. Select your desired editor if prompted.
  4. Once inside the editor, you will see existing cron entries or a blank file if there are no entries yet.
  5. Add, edit, or remove cron lines as needed. Each line represents a scheduled task, following the standard cron format.
  6. Save the changes and close the editor. For example, in the nano editor, you can save by pressing Ctrl + O and exit by pressing Ctrl + X.

To adjust the crontab to use the script that creates a metric at a specific period, add the following content:

*/5 * * * * /devops/scripts/alerta-horario/timepublisher.sh >/dev/null 2>&1

This line specifies that the script /devops/scripts/alerta-horario/timepublisher.sh will be executed every 5 minutes.

Let’s analyze the cron in detail:

*/5: The first field indicates the minutes when the script will be executed. */5 means “every 5 minutes.” In other words, the script will be executed when the minute is 0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, or 55.

*: The second field indicates the hours when the script will be executed. The asterisk (*) means that the script will be executed at all hours, regardless of the hour.

*: The third field indicates the day of the month when the script will be executed. The asterisk (*) means that the script will be executed on all days of the month, regardless of the day.

*: The fourth field indicates the month when the script will be executed. The asterisk (*) means that the script will be executed in all months, regardless of the month.

*: The fifth field indicates the day of the week when the script will be executed. The asterisk (*) means that the script will be executed on all days of the week (from Sunday to Saturday), regardless of the day of the week.

/devops/scripts/alerta-horario/timepublisher.sh: This is the path to the timepublisher.sh script that will be executed.

>/dev/null 2>&1: This part redirects the output (stdout and stderr) of the script to /dev/null, which is a special device in the Linux system that discards all data sent to it. This means that any output generated by the script will not be displayed on the terminal or stored in log files.

In summary, this cron is used to schedule the execution of the timepublisher.sh script located in /devops/scripts/alerta-horario/ every 5 minutes. Any output generated by the script will be discarded and will not be displayed or logged.
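Because the base server is turned into an AMI, it can be convenient to install the entry from a script instead of the interactive editor. The following is a minimal sketch under that assumption; add_cron_entry is a hypothetical helper that reads the existing crontab on stdin and prints a version containing the entry exactly once.

```shell
# Sketch: build a crontab that contains the entry exactly once.
# add_cron_entry is a hypothetical helper; it reads the current crontab on stdin.
ENTRY='*/5 * * * * /devops/scripts/alerta-horario/timepublisher.sh >/dev/null 2>&1'

add_cron_entry() {
    local entry="$1" marker="$2"
    # Keep every existing line that does not mention the marker...
    grep -vF -- "$marker" || true
    # ...then append the entry.
    printf '%s\n' "$entry"
}

# Usage for the current user:
#   crontab -l 2>/dev/null | add_cron_entry "$ENTRY" "timepublisher.sh" | crontab -
```

Filtering on the script name before appending keeps the operation idempotent, so running it again does not schedule the script twice.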

Script Deletes Alarm – Pre Terminate

We need to configure a systemd unit on the EC2 instance so that the alarm is deleted during termination. This avoids clutter in AWS CloudWatch from alarms that belong to dead EC2 instances from the ASG.

The path where the script should reside and run before termination:

/devops/scripts/alerta-horario/delete-alarm-alerta-horario.sh

Required Script:

#!/bin/bash

INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)

aws cloudwatch delete-alarms \
    --region sa-east-1 \
    --alarm-names "alerta-horario-ec2-$INSTANCE_ID"

Path where the pre-terminate script should be configured:

/etc/systemd/system/aa-run-before-shutdown.service

The script should be executed during shutdown (pre-terminate):

[Unit]
Description=Delete alarm
Requires=network-online.target
After=network.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStop=/devops/scripts/alerta-horario/delete-alarm-alerta-horario.sh


[Install]
WantedBy=network.target

Here is an explanation of what each section of the file does:

  1. [Unit]: This section defines information about the service unit, including its description, required dependencies, and startup or shutdown orders. The key directives used are:
    • Description: A description of the service, used for documentation purposes.
    • Requires=network-online.target: Indicates that the service requires network availability (online network connection) to start.
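    • After=network.target: Specifies that the service should start after the network service has been initialized. Because systemd stops units in the reverse of their start order, this also means the service is stopped, and its ExecStop script runs, before the network is shut down.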
  2. [Service]: In this section, the details of the service execution are specified. The key directives used are:
    • Type=oneshot: Indicates that it is a service that runs once and then exits. Typically used for tasks that are not continuous services but one-time tasks.
    • RemainAfterExit=yes: This directive tells systemd that, after the successful execution of the service, it should consider the service as "active" even after it has completed. This matters here because ExecStop only runs when an active service is stopped.
    • ExecStop=/devops/scripts/alerta-horario/delete-alarm-alerta-horario.sh: Specifies the command or script to be executed when the service is stopped (in this case, before shutdown). The path “/devops/scripts/alerta-horario/delete-alarm-alerta-horario.sh” points to the script that will be executed.
  3. [Install]: This section determines how the service will be installed and activated. The main directive used is:
    • WantedBy=network.target: Indicates that the service will be activated as part of the startup process when the network becomes available (network.target).

In summary, the systemd service file “/etc/systemd/system/aa-run-before-shutdown.service” defines a systemd service that will run before the system shutdown. This service is of type “oneshot” and, after execution, will remain “active” even after it completes. The execution of the service involves running the script “delete-alarm-alerta-horario.sh” located in “/devops/scripts/alerta-horario/.”
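The unit only takes effect once systemd knows about it and it has been started, since ExecStop runs when an active unit is stopped. A minimal sketch of that activation step on the base server follows; the function name activate_shutdown_hook and the SYSTEMCTL override (used only for a dry run) are assumptions for illustration.

```shell
# Sketch: register and start the shutdown hook unit (run as root).
# activate_shutdown_hook and the SYSTEMCTL override are hypothetical.
activate_shutdown_hook() {
    local systemctl_cmd="${SYSTEMCTL:-systemctl}"
    local unit="aa-run-before-shutdown.service"
    # Make systemd re-read unit files after the new file was created.
    "$systemctl_cmd" daemon-reload
    # Enable the unit for future boots and start it now so ExecStop can run at shutdown.
    "$systemctl_cmd" enable --now "$unit"
}

# Dry run that only prints the systemctl arguments:
#   SYSTEMCTL=echo activate_shutdown_hook
# Real run:
#   activate_shutdown_hook
```

Running this before creating the AMI means every instance launched from the image boots with the unit already enabled.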

Note

In the scripts I provided, the region used was São Paulo (sa-east-1). Please remember to adjust it according to your environment.

Solution for Standalone EC2

For machines that are not part of an Auto Scaling Group (ASG), the procedure is simpler.

Steps

  1. Create an alarm in AWS CloudWatch.
  2. Create a Shell Script that generates the metric based on chronyc.
  3. Adjust the crontab to use the metric script at the defined time.

Supporting Material

  • Chrony – FAQ
  • AWS CloudWatch – Documentation

Conclusion

I hope the information presented here has been helpful and provided valuable insights into the importance of observability and monitoring time discrepancies in Linux using Chrony and AWS CloudWatch.

Proactive monitoring of EC2 instances is essential to ensure the stability and efficiency of your services in cloud environments. By detecting and alerting on potential time synchronization issues, we can take prompt action to prevent negative impacts on the systems.

AWS offers a variety of powerful tools for monitoring and managing cloud resources, and the combination of Chrony with AWS CloudWatch provides a reliable solution for addressing time discrepancies.

If you have any questions about the solution, please don’t hesitate to ask!

Fernando Müller Junior

I am Fernando Müller, a Tech Lead SRE with 16 years of experience in IT. I currently work at Appmax, a fintech located in Brazil. I am passionate about Cloud Native architectures and applications, Open Source tools, and everything in the SRE world, always looking to develop and learn (lifelong learning) while working on innovative projects!
