Operation and Maintenance
Bitify's monitoring infrastructure is central to Bitify's own operations. It includes the source code repository, server performance logging, continuous checks of service availability, and similar tooling.
However, these tools run on Bitify's own servers and cannot be extracted, because they also cover other servers and projects that are not related to SPOT.
As such, we outline suggestions here as a guide for GASSB to set up their own monitoring infrastructure.
Monitoring
Prometheus:
- Purpose: Prometheus is an open-source monitoring and alerting system designed to collect metrics from servers, applications, and devices in real time.
- How it Works: It periodically scrapes metrics from configured endpoints (e.g., servers or applications) and stores them in a time-series database for analysis.
- Key Features:
- Metrics collection with minimal performance overhead.
- Custom queries using its powerful PromQL (Prometheus Query Language).
- Rule-based alerts for potential issues.
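To make the scrape step more concrete, the following minimal Python sketch shows roughly what a scraper does with the text an endpoint returns: it skips the `# HELP` and `# TYPE` comment lines and reads each `name{labels} value` sample. The sample payload and the name `parse_metrics` are illustrative assumptions, and the parser deliberately ignores timestamps and escaped label values, which Prometheus itself handles.

```python
import re
from typing import Dict, List, Tuple

# One sample line looks like:  metric_name{label="value",...} 12.5
_SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')

Sample = Tuple[str, Dict[str, str], float]

# Illustrative payload, shaped like the output of a /metrics endpoint.
SAMPLE_PAYLOAD = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67
"""


def parse_metrics(text: str) -> List[Sample]:
    """Parse metric samples from text-format output (simplified)."""
    samples: List[Sample] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or HELP/TYPE comment
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue  # not a sample line this sketch understands
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


if __name__ == "__main__":
    for name, labels, value in parse_metrics(SAMPLE_PAYLOAD):
        print(name, labels, value)
```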
Grafana:
- Purpose: Grafana is an open-source visualization and dashboard tool that integrates with Prometheus (and other data sources) to create interactive, real-time graphs and reports.
- How it Works: It queries Prometheus for stored metrics and displays them in user-friendly charts, gauges, or logs.
- Key Features:
- Highly customizable dashboards.
- Ability to combine data from multiple sources (e.g., Prometheus, MySQL, Elasticsearch).
- Alerts and notifications for critical conditions.
Using Prometheus and Grafana to Monitor Servers
What Can Be Monitored?
- CPU Usage:
  - Percentage of CPU cores being utilized.
  - Alerts if usage consistently exceeds 80% on any server.
- Memory Usage:
  - Total and available RAM.
  - Alerts if free memory drops below 20%.
- Disk Usage:
  - Free and used disk space for all partitions.
  - Alerts if disk usage exceeds 90%.
- Network Activity:
  - Bandwidth usage, packet loss, and errors.
  - Alerts for sustained high traffic or packet loss.
- System Uptime:
  - Duration the servers have been running without downtime.
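The thresholds listed above can be expressed as a small check. The Python sketch below is only an illustration of the logic, not how Prometheus evaluates rules: the `ServerSnapshot` structure and the `evaluate` function are hypothetical names, and the check looks at a single reading, whereas "consistently exceeds" would normally be expressed in a Prometheus alerting rule with a duration (the `for:` clause).

```python
from dataclasses import dataclass
from typing import List

# Thresholds taken from the list above.
CPU_USED_MAX_PCT = 80.0
MEM_FREE_MIN_PCT = 20.0
DISK_USED_MAX_PCT = 90.0


@dataclass
class ServerSnapshot:
    """One point-in-time reading for a server (hypothetical structure)."""
    host: str
    cpu_used_pct: float
    mem_free_pct: float
    disk_used_pct: float


def evaluate(snapshot: ServerSnapshot) -> List[str]:
    """Return one alert message per threshold that the snapshot violates."""
    alerts: List[str] = []
    if snapshot.cpu_used_pct > CPU_USED_MAX_PCT:
        alerts.append(f"{snapshot.host}: CPU usage above {CPU_USED_MAX_PCT:.0f}%")
    if snapshot.mem_free_pct < MEM_FREE_MIN_PCT:
        alerts.append(f"{snapshot.host}: free memory below {MEM_FREE_MIN_PCT:.0f}%")
    if snapshot.disk_used_pct > DISK_USED_MAX_PCT:
        alerts.append(f"{snapshot.host}: disk usage above {DISK_USED_MAX_PCT:.0f}%")
    return alerts


if __name__ == "__main__":
    reading = ServerSnapshot("app-1", cpu_used_pct=85, mem_free_pct=15, disk_used_pct=50)
    for message in evaluate(reading):
        print(message)
```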
Setup Process:
- Install Prometheus:
  - Install Prometheus on one server (or a dedicated monitoring server) to act as the central monitoring system.
  - Configure Prometheus to scrape metrics from the 5 target servers.
- Install Node Exporter on Each Server:
  - Node Exporter is a lightweight agent that collects server-level metrics such as CPU, memory, and disk usage.
  - Configure each server to expose metrics on a specific endpoint for Prometheus to scrape (e.g., http://server-ip:9100/metrics).
- Install Grafana:
  - Set up Grafana on the same monitoring server or another system.
  - Add Prometheus as a data source in Grafana.
  - Use pre-built or custom dashboards to visualize metrics.
- Configure Dashboards and Alerts:
  - Create dashboards in Grafana to show CPU, memory, disk, and network metrics for all 5 servers.
  - Define thresholds and set up email or Slack notifications for alerts.
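As an illustration of the "configure Prometheus to scrape the target servers" step, the Python sketch below renders a minimal `prometheus.yml` with one scrape job that lists each server's Node Exporter endpoint. The host names `server-1` to `server-5`, the job name `node`, and the 15-second interval are assumptions made for the example; substitute the real addresses and review the result before using it.

```python
from typing import Iterable


def render_prometheus_config(
    hosts: Iterable[str], port: int = 9100, scrape_interval: str = "15s"
) -> str:
    """Render a minimal prometheus.yml that scrapes Node Exporter on each host."""
    targets = ", ".join(f'"{host}:{port}"' for host in hosts)
    return (
        "global:\n"
        f"  scrape_interval: {scrape_interval}\n"
        "\n"
        "scrape_configs:\n"
        '  - job_name: "node"\n'
        "    static_configs:\n"
        f"      - targets: [{targets}]\n"
    )


if __name__ == "__main__":
    # Hypothetical host names for the 5 target servers.
    servers = [f"server-{n}" for n in range(1, 6)]
    print(render_prometheus_config(servers))
```

The printed text can be saved as the Prometheus configuration file on the monitoring server; Grafana then only needs the Prometheus address as its data source.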
Hardware Suitability:
- Prometheus and Grafana have minimal performance impact when monitoring systems.
- With 4GB RAM and 4 CPUs per server, the setup will comfortably monitor server-level metrics without affecting their primary workloads.
Benefits:
- Centralized Monitoring: View all server health metrics in a single dashboard.
- Proactive Alerts: Get notified of potential issues before they affect operations.
- Historical Analysis: Identify trends in resource usage for better capacity planning.
- Custom Reporting: Visualize critical metrics in ways tailored to your needs.
Example Use Case:
- If one of the servers starts running low on memory, Prometheus will detect this via Node Exporter. Grafana will display this trend visually, and an alert can notify you immediately, allowing you to add more memory or troubleshoot the issue before the server crashes.
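A low-memory condition like the one in this use case can also be checked by hand against the Prometheus HTTP API. The Python sketch below only builds the request URL for the instant-query endpoint (`/api/v1/query`); the expression uses the Node Exporter metrics `node_memory_MemAvailable_bytes` and `node_memory_MemTotal_bytes` as exposed on Linux hosts, and the monitoring server address is a placeholder assumption.

```python
from urllib.parse import urlencode

# PromQL: servers whose available memory is below 20% of total.
LOW_MEMORY_EXPR = (
    "node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 20"
)


def build_query_url(base_url: str, expr: str) -> str:
    """Build a URL for the Prometheus instant-query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


if __name__ == "__main__":
    # Placeholder address of the monitoring server.
    print(build_query_url("http://monitor.example:9090", LOW_MEMORY_EXPR))
```

Opening the printed URL in a browser or with curl returns a JSON document listing the servers that currently match the expression, if any.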
Statping
How to Set Up and Use Statping
Statping is an open-source status page and monitoring tool that helps track the health of websites, applications, or services. It’s lightweight and simple to deploy.
Step 1: Prerequisites
Before you begin:
- Server Requirements:
- A server with a minimum of 1GB RAM and 1 CPU.
- Operating System: Ubuntu, CentOS, or similar.
- Software Requirements:
- Docker (recommended for containerized setup) or Golang (for manual build).
- Network:
- Ensure your server is accessible over the internet if the status page will be public (optional for internal-only status pages).
Step 2: Installation
Option 1: Using Docker
- Install Docker:
sudo apt update
sudo apt install docker.io -y
- Pull the Statping Docker image:
docker pull statping/statping
- Run the Docker container:
docker run -d \
    -p 8080:8080 \
    --name=statping \
    statping/statping
- Access Statping:
- Open a browser and navigate to http://<your-server-ip>:8080.
Option 2: Manual Installation
- Install Golang:
sudo apt install golang-go -y
- Clone the Statping repository:
git clone https://github.com/statping/statping.git
cd statping
- Build Statping:
go build
- Run Statping:
./statping
- Access Statping:
- Open a browser and navigate to http://<your-server-ip>:8080.
Step 3: Initial Setup
- Log In:
  - On the first visit, you’ll be prompted to create an admin account.
  - Enter a username, email, and password.
- Configure Settings:
  - Under Settings, customize:
    - Status page title.
    - Notification preferences (email, Slack, etc.).
    - Language and timezone.
Step 4: Adding Services
- Navigate to Services in the menu.
- Click Add Service.
- Fill in the following details:
  - Name: A descriptive name for the service.
  - Domain: URL or IP of the service to monitor.
  - Expected Status: Expected HTTP status code (e.g., 200 for success).
  - Check Interval: How often to check the service (in seconds).
  - Type: Select HTTP, TCP, PING, or another protocol.
- Save the service.
Step 5: Monitoring
- Statping will:
- Periodically check the configured services.
- Display their status on the dashboard (e.g., Up/Down).
- Log historical data for uptime, response time, etc.
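The kind of check described above can be sketched in a few lines of Python. This is not Statping's implementation, only an illustration of the idea: request the URL, compare the status code with the expected one, and record the response time. The names `CheckResult` and `check_http` are hypothetical, and the `opener` parameter exists only so the function can be exercised without a network.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CheckResult:
    url: str
    up: bool
    status: Optional[int]  # None when no HTTP response was received
    latency_s: float


def check_http(
    url: str,
    expected_status: int = 200,
    timeout: float = 5.0,
    opener: Callable = urllib.request.urlopen,
) -> CheckResult:
    """Perform one HTTP check and report whether the service looks up."""
    started = time.monotonic()
    status: Optional[int] = None
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # the server answered, but with an error status
    except OSError:
        status = None  # DNS failure, refused connection, timeout, ...
    latency = time.monotonic() - started
    return CheckResult(url, status == expected_status, status, latency)
```

Calling `check_http("http://<your-server-ip>:8080")` in a loop with a sleep equal to the check interval approximates what a status monitor does for each configured service.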
Step 6: Notifications (Optional)
- Set up notifications to get alerts when a service goes down.
- Go to Notifications > Add Notification.
- Choose a method (e.g., email, Slack, Telegram).
- Enter the required details (e.g., webhook URL for Slack or email server settings).
- Test and save the notification.
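Before entering a Slack webhook URL in the notification settings, it can help to confirm that the webhook itself works. Slack incoming webhooks accept an HTTP POST with a JSON body containing a `text` field; the Python sketch below builds such a request. The function name, message wording, and the webhook URL shown are placeholder assumptions, and the request is only sent when the script is run directly with a real URL.

```python
import json
import urllib.request


def build_slack_request(webhook_url: str, service: str, status: str) -> urllib.request.Request:
    """Build a POST request carrying a simple text message for a Slack incoming webhook."""
    payload = {"text": f"Service {service} is {status}"}
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder URL: replace with the webhook URL issued by Slack.
    request = build_slack_request("https://hooks.slack.com/services/REPLACE/ME", "SPOT web", "DOWN")
    with urllib.request.urlopen(request, timeout=10) as response:
        print(response.status)
```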
Step 7: Customizing the Status Page
- Navigate to Settings > Appearance.
- Customize:
- Logo and favicon.
- Theme colors.
- Public/private visibility.
- Save changes.
Step 8: Advanced Usage
- Custom Plugins:
- Statping supports custom plugins to extend functionality (e.g., custom alerts or integrations).
- Exporting Data:
- Export logs and metrics for analysis using tools like Grafana or Prometheus.
- Load Balancing:
- Deploy multiple instances of Statping for high availability.
Usage Example
- Imagine you’re managing 5 websites. Add each website as a service in Statping with its respective domain.
- Set up alerts to notify you on Slack if any website becomes unreachable.
- View the status of all websites on a single dashboard to identify and address issues proactively.
Statping simplifies monitoring and keeps you informed about the health of your systems and services.
Troubleshooting
MariaDB Synchronization Drop
If MariaDB replication (sync) drops or fails between a master and its replicas, it’s important to diagnose and resolve the issue promptly to avoid data inconsistencies. Here’s how you can rectify the issue step by step:
1. Identify the Problem
Check the replication status on the replica using the following command:
SHOW SLAVE STATUS\G
Look for:
- Slave_IO_Running: Should be Yes.
- Slave_SQL_Running: Should be Yes.
- Last_IO_Error and Last_SQL_Error: These fields will provide specific error messages.
Common issues:
- Missing binary logs on the master.
- Duplicate primary keys.
- Network connectivity issues.
- Incorrect or missing privileges.
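When this check needs to be repeated or automated, the vertical output of the status command can be parsed with a short script. The Python sketch below assumes the output has been captured as text (for example by running the command through the `mysql` client and redirecting it to a file); the function names and the sample output are illustrative, not part of MariaDB.

```python
from typing import Dict, List

# Illustrative excerpt of vertical (\G) status output.
SAMPLE_STATUS = """\
*************************** 1. row ***************************
             Slave_IO_Running: Connecting
            Slave_SQL_Running: Yes
                Last_IO_Error: error connecting to master 'replica_user@10.0.0.1:3306'
               Last_SQL_Error:
"""


def parse_vertical_output(text: str) -> Dict[str, str]:
    """Turn 'Field: value' lines into a dictionary, skipping row separators."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        if line.lstrip().startswith("*") or ":" not in line:
            continue
        key, _, value = line.partition(":")  # split at the first colon only
        fields[key.strip()] = value.strip()
    return fields


def replication_problems(fields: Dict[str, str]) -> List[str]:
    """List what is wrong according to the fields named in this section."""
    problems: List[str] = []
    for thread in ("Slave_IO_Running", "Slave_SQL_Running"):
        if fields.get(thread) != "Yes":
            problems.append(f"{thread} is {fields.get(thread, 'missing')}")
    for error_field in ("Last_IO_Error", "Last_SQL_Error"):
        if fields.get(error_field):
            problems.append(f"{error_field}: {fields[error_field]}")
    return problems


if __name__ == "__main__":
    for problem in replication_problems(parse_vertical_output(SAMPLE_STATUS)):
        print(problem)
```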
2. Common Problems and Fixes
A. MariaDB Replication Lag or Stop
- Symptoms:
  - Slave_IO_Running or Slave_SQL_Running shows No.
  - Errors like "Could not connect to the master".
- Solution:
  - Ensure the master server is reachable:
    ping <master-server-ip>
  - Check the master’s binary logs:
    SHOW MASTER STATUS;
  - Restart the replication process:
    STOP SLAVE;
    START SLAVE;
  - Verify the replication:
    SHOW SLAVE STATUS\G
D. Permissions Issue
- Symptoms:
  - Error: Access denied for user 'replica_user'.
- Solution:
  - Ensure the replica user has proper privileges on the master:
    GRANT REPLICATION SLAVE ON *.* TO 'replica_user'@'<replica-ip>' IDENTIFIED BY 'password';
    FLUSH PRIVILEGES;
  - Restart the replication process.
E. Network or Configuration Issue
- Symptoms:
  - Replication stops intermittently.
- Solution:
  - Check for firewall issues between master and replica.
  - Ensure the server_id in my.cnf is unique for each server.
    [mysqld]
    server_id=2
    log_bin=mysql-bin
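One way to confirm that every server has a distinct server_id is to collect the relevant my.cnf contents and compare them. The Python sketch below does this with the standard `configparser` module; the host names and the function names are hypothetical, and the parser only copes with simple files (a file that uses `!includedir` directives before the first section would need those lines removed first).

```python
import configparser
from typing import Dict, List, Optional


def read_server_id(cnf_text: str) -> Optional[str]:
    """Return the server_id from the [mysqld] section, if present."""
    parser = configparser.ConfigParser(
        allow_no_value=True, strict=False, interpolation=None
    )
    parser.read_string(cnf_text)
    if not parser.has_section("mysqld"):
        return None
    section = parser["mysqld"]
    # Both spellings are accepted in option files.
    return section.get("server_id") or section.get("server-id")


def find_duplicate_server_ids(configs: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each server_id used more than once to the hosts that use it."""
    hosts_by_id: Dict[str, List[str]] = {}
    for host, cnf_text in configs.items():
        server_id = read_server_id(cnf_text)
        if server_id is not None:
            hosts_by_id.setdefault(server_id, []).append(host)
    return {sid: hosts for sid, hosts in hosts_by_id.items() if len(hosts) > 1}
```

An empty result from `find_duplicate_server_ids` means no two of the supplied configurations share a server_id.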
Test and Validate
- Verify replication is running without errors:
  SHOW SLAVE STATUS\G
- Monitor replication lag:
  SHOW SLAVE STATUS\G
  Look for the Seconds_Behind_Master value. It should be 0 or close to 0. A value of NULL means replication is not running.
Full Reinitialization (If Necessary)
If the replication setup is severely broken, you may need to reinitialize the replica:
- On the Master:
  - Create a fresh dump:
    mysqldump --all-databases --master-data --single-transaction -u root -p > master-dump.sql
- Copy the Dump:
  - Transfer the dump to the replica server using scp or similar.
- On the Replica:
  - Stop replication if it is still running:
    STOP SLAVE;
  - Import the dump:
    mysql -u root -p < master-dump.sql
  - Configure replication:
    CHANGE MASTER TO
      MASTER_HOST='<master-ip>',
      MASTER_USER='replica_user',
      MASTER_PASSWORD='password',
      MASTER_LOG_FILE='<log_file>',
      MASTER_LOG_POS=<log_position>;
    START SLAVE;
- Validate:
  - Check the replication status:
    SHOW SLAVE STATUS\G
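The `<log_file>` and `<log_position>` values correspond to the binary log coordinates at the moment the dump was taken. A dump created with `--master-data` records them near the top of the file as a `CHANGE MASTER TO MASTER_LOG_FILE=..., MASTER_LOG_POS=...` statement. The Python sketch below looks for that statement; the function name is hypothetical, and the file name and position used in the example are made up.

```python
import re
from typing import Iterable, Optional, Tuple

_COORDS_RE = re.compile(
    r"CHANGE MASTER TO\s+MASTER_LOG_FILE='([^']+)',\s*MASTER_LOG_POS=(\d+)"
)


def find_binlog_coordinates(
    lines: Iterable[str], max_lines: int = 500
) -> Optional[Tuple[str, int]]:
    """Return (log_file, log_position) from the dump header, or None if absent."""
    for index, line in enumerate(lines):
        if index >= max_lines:
            break  # the statement sits in the header, so stop scanning early
        match = _COORDS_RE.search(line)
        if match:
            return match.group(1), int(match.group(2))
    return None


if __name__ == "__main__":
    with open("master-dump.sql", "r", encoding="utf-8", errors="replace") as dump:
        print(find_binlog_coordinates(dump))
```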
Ansible MySQL Error
If you encounter the following error:
ERROR! We were unable to read either as JSON nor YAML, these are the errors we got from each:
JSON: Expecting value: line 1 column 1 (char 0)
Syntax Error while loading YAML.
did not find expected '-' indicator
The error appears to be in 'trekker-client.git/01-install_mariadb.yml': line 11, column 5, but may
be elsewhere in the file depending on the exact syntax problem.
The offending line appears to be:
- name: create database for client {{ client }}
mysql_db:
^ here
The community.mysql collection, which provides the mysql_db module, may not be installed in your Ansible Galaxy collections.
It is not included in ansible-core. To check whether it is installed, run ansible-galaxy collection list.
To install it, use:
ansible-galaxy collection install community.mysql
Refer to Ansible Community Documentation
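The installed-collection check can be scripted if it needs to run on several control machines. The Python sketch below parses the table printed by `ansible-galaxy collection list` and reports whether `community.mysql` is present; the function names are hypothetical, the sample output (including the path and version) is made up, and the parser assumes the usual two-column "name version" layout.

```python
import subprocess
from typing import Dict

# Made-up example of the command's output.
SAMPLE_OUTPUT = """\
# /home/deploy/.ansible/collections/ansible_collections
Collection        Version
----------------- -------
community.docker  3.4.0
community.mysql   3.5.1
"""


def parse_collection_list(output: str) -> Dict[str, str]:
    """Map collection name to version, skipping path, header and ruler lines."""
    collections: Dict[str, str] = {}
    for line in output.splitlines():
        parts = line.split()
        # Data rows have two columns and a dotted 'namespace.name' in the first.
        if len(parts) == 2 and "." in parts[0] and not parts[0].startswith(("#", "-")):
            collections[parts[0]] = parts[1]
    return collections


def installed_collections() -> Dict[str, str]:
    """Run ansible-galaxy on this machine and parse what it prints."""
    result = subprocess.run(
        ["ansible-galaxy", "collection", "list"],
        capture_output=True, text=True, check=True,
    )
    return parse_collection_list(result.stdout)


if __name__ == "__main__":
    print("community.mysql" in installed_collections())
```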
Ansible Docker Error
Error with Docker
After a while, Docker, Python, or Ansible may have been upgraded silently. A common error that may then arise is shown below.
{"changed": false, "msg": "Error connecting: Error while fetching server API version: Not supported URL scheme http+docker"}
Solution:
ansible-galaxy collection install --force community.docker