Prometheus: unexpected end of JSON input
In this blog post I will show you how I solved the Prometheus "unexpected end of JSON input" error.
The service status shows that Prometheus cannot start:
# systemctl status prometheus.service
● prometheus.service - Monitoring system and time series database
Loaded: loaded (/lib/systemd/system/prometheus.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Wed 2022-04-27 11:31:59 CEST; 9min ago
Docs: https://prometheus.io/docs/introduction/overview/
Process: 3428 ExecStart=/usr/bin/prometheus $ARGS (code=exited, status=1/FAILURE)
Main PID: 3428 (code=exited, status=1/FAILURE)
Apr 27 11:31:59 prometheus systemd[1]: prometheus.service: Main process exited, code=exited, status=1/FAILURE
Apr 27 11:31:59 prometheus systemd[1]: prometheus.service: Failed with result 'exit-code'.
Apr 27 11:31:59 prometheus systemd[1]: prometheus.service: Service RestartSec=100ms expired, scheduling restart.
Apr 27 11:31:59 prometheus systemd[1]: prometheus.service: Scheduled restart job, restart counter is at 5.
Apr 27 11:31:59 prometheus systemd[1]: Stopped Monitoring system and time series database.
Apr 27 11:31:59 prometheus systemd[1]: prometheus.service: Start request repeated too quickly.
Apr 27 11:31:59 prometheus systemd[1]: prometheus.service: Failed with result 'exit-code'.
Apr 27 11:31:59 prometheus systemd[1]: Failed to start Monitoring system and time series database.
The error suggests there's a broken JSON file to find.
# journalctl -u prometheus.service -f
-- Logs begin at Thu 2022-04-21 14:20:44 CEST. --
Apr 27 11:31:59 prometheus prometheus[3428]: level=info ts=2022-04-27T09:31:59.438273038Z caller=main.go:517 msg="Scrape manager stopped"
Apr 27 11:31:59 prometheus prometheus[3428]: level=error ts=2022-04-27T09:31:59.438550231Z caller=main.go:686 err="opening storage failed: block dir: \"/var/lib/prometheus/metrics2/01FYVDSN8KYK9SEFT6EAD7KRA6\": unexpected end of JSON input"
Apr 27 11:31:59 prometheus systemd[1]: prometheus.service: Main process exited, code=exited, status=1/FAILURE
Apr 27 11:31:59 prometheus systemd[1]: prometheus.service: Failed with result 'exit-code'.
Apr 27 11:31:59 prometheus systemd[1]: prometheus.service: Service RestartSec=100ms expired, scheduling restart.
Apr 27 11:31:59 prometheus systemd[1]: prometheus.service: Scheduled restart job, restart counter is at 5.
Apr 27 11:31:59 prometheus systemd[1]: Stopped Monitoring system and time series database.
Apr 27 11:31:59 prometheus systemd[1]: prometheus.service: Start request repeated too quickly.
Apr 27 11:31:59 prometheus systemd[1]: prometheus.service: Failed with result 'exit-code'.
Apr 27 11:31:59 prometheus systemd[1]: Failed to start Monitoring system and time series database.
A Google search for the error turned up:
- Looks like the fix will be to remove the dodgy directory and restart prometheus. (prometheus issue 4058)
- Looks like there’s an issue due to be fixed in 2.9.0; this occurred with the bundled 2.8.1. (prometheus issue 5457, referring to 4058)
- If the file (meta.json) wasn't empty, there's a python incantation to validate it; a rough equivalent is sketched just after this list. (prometheus issue 4297)
- It seems to be (or have been) a thing on NFS, which is not recommended.
- Also seen on (I think) OpenShift 4.1 running Prometheus 2.3.2 on Gluster (which is probably against recommendations, like NFS)
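That validation doesn't need any special tooling. A rough sketch of the kind of check I mean, assuming python3 is installed and the data lives under /var/lib/prometheus/metrics2 as it does here:

# Report every block whose meta.json is empty or not valid JSON.
# python3 -m json.tool exits non-zero when it cannot parse its input.
for f in /var/lib/prometheus/metrics2/*/meta.json; do
    python3 -m json.tool "$f" > /dev/null 2>&1 || echo "invalid: $f"
done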
Hunt for root cause
- There's enough free space to write files in this location (see the checks after this list).
- I don't use NFS; the data is on XFS.
- The previous (/var/log/prometheus/) log contains a clean startup
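For reference, the first two points can be confirmed with a couple of one-liners; df -hT reports free space along with the filesystem type, and findmnt shows what the path is actually mounted on:

# Free space and filesystem type for the Prometheus data directory.
df -hT /var/lib/prometheus

# What the data directory is mounted on (rules out NFS).
findmnt -T /var/lib/prometheus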
Fix
meta.json is empty for the most recent block directory (the one named in the error), so I'll delete it.
ls -alh /var/lib/prometheus/metrics2/*/meta.json | awk '{print $5" - "$9}'
930 - /var/lib/prometheus/metrics2/01FXNNACW65B4XZ1P86W5C6MKT/meta.json
930 - /var/lib/prometheus/metrics2/01FXQK464WAPXAMHDE3RWXGCB0/meta.json
930 - /var/lib/prometheus/metrics2/01FXSGXCRGP0DE8ZQ4ZRND552C/meta.json
930 - /var/lib/prometheus/metrics2/01FXVEQ4AV8EH9EDMEKV1TRKZ3/meta.json
930 - /var/lib/prometheus/metrics2/01FXXCGMJZERBJCKVAQSYM4WCG/meta.json
930 - /var/lib/prometheus/metrics2/01FXZAA41ZAZYRHN415FE3CX56/meta.json
930 - /var/lib/prometheus/metrics2/01FY183K7PSVG2ZACXVTG6DJXM/meta.json
930 - /var/lib/prometheus/metrics2/01FY35XCQTFN1T931BT98KK99G/meta.json
930 - /var/lib/prometheus/metrics2/01FY53PVTA9NGW1YQQ9868NSTQ/meta.json
930 - /var/lib/prometheus/metrics2/01FY71GCRJF9202770248J335E/meta.json
930 - /var/lib/prometheus/metrics2/01FY8Z9TXDP194BXKEH3FAPN8P/meta.json
930 - /var/lib/prometheus/metrics2/01FYAX39CBRAX0VS7E93XZQ2FZ/meta.json
930 - /var/lib/prometheus/metrics2/01FYCTWT4M5QV6A0W1P34A02E1/meta.json
930 - /var/lib/prometheus/metrics2/01FYERPMJ38ZDN3XB4DQTNDTB9/meta.json
930 - /var/lib/prometheus/metrics2/01FYGPG00KFZRSVBV4JM9NFRRH/meta.json
930 - /var/lib/prometheus/metrics2/01FYJM9TK6TDTNQ64C399A46KZ/meta.json
930 - /var/lib/prometheus/metrics2/01FYMJ36ZTYFV09RWRYY3BN7PV/meta.json
930 - /var/lib/prometheus/metrics2/01FYPFWJRB5NAQACM4H2D8AK01/meta.json
930 - /var/lib/prometheus/metrics2/01FYRDP2N3EJ4XRPHKPXS6H49C/meta.json
930 - /var/lib/prometheus/metrics2/01FYTBFV9Q5XQ38WYF26HHKE5H/meta.json
308 - /var/lib/prometheus/metrics2/01FYV026RKCB7XRKNAGDG4ZW0D/meta.json
732 - /var/lib/prometheus/metrics2/01FYV02K8P30TVW73Y4RYZ3DH5/meta.json
308 - /var/lib/prometheus/metrics2/01FYV6XY0J1RW12NXC4FTMRDN0/meta.json
0 - /var/lib/prometheus/metrics2/01FYVDSN8KYK9SEFT6EAD7KRA6/meta.json
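The same culprit can be found more directly by asking find for zero-byte files only:

# List only the empty meta.json files under the data directory.
find /var/lib/prometheus/metrics2 -name meta.json -size 0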
rm -rf /var/lib/prometheus/metrics2/01FYVDSN8KYK9SEFT6EAD7KRA6
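With the broken block gone, the service should come back up. Something along these lines works; reset-failed is only needed because systemd rate-limited the earlier restart attempts:

# Clear the "start request repeated too quickly" state, then start again.
systemctl reset-failed prometheus.service
systemctl start prometheus.service

# Confirm it stays up and watch the log for a clean startup
# ("Server is ready to receive web requests").
systemctl status prometheus.service
journalctl -u prometheus.service -f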
Conclusion
I don't think my guests always go down cleanly at shutdown; Prometheus was compacting its data when the KVM guest went away.
Looking at issue 4058, it's noted that the code doesn't do a sync before closing the file, although it does so everywhere else, and I guess that's the fix that'll land in 2.9.0. That might have helped here, though it's a bit like putting on a tourniquet before shooting oneself in the foot.