cmd/alertmanager: add --config.auto-reload-interval flag by mihir-dixit2k27 · Pull Request #5222 · prometheus/alertmanager

mihir-dixit2k27 · 2026-05-01T19:28:10Z

Description

Adds a --config.auto-reload-interval flag. When set to a non-zero duration, a background goroutine polls the SHA256 checksum of the config file at the given interval and writes to the existing webReload channel on a detected change — the same path taken by SIGHUP and POST /-/reload — so no new reload logic is introduced.

Why polling instead of fsnotify?
Kubernetes kubelet uses AtomicWriter to update ConfigMap and Secret volume mounts via symlink swaps, which causes fsnotify to stop receiving events after the first update since the original inode is replaced. SHA256 polling works correctly regardless of how the underlying file is updated, which is also why Prometheus uses polling for its equivalent feature.

Behaviour

Default is 0s (disabled). Existing deployments are completely unaffected.
On a detected change, the reload follows the same code path as SIGHUP/POST /-/reload: invalid configs are rejected and logged without disrupting the running instance.
If a reload fails, lastChecksum is not updated, so the watcher retries on every subsequent tick until the configuration is valid again.
The watcher goroutine exits cleanly on SIGTERM.

Changes

File	Description
`cmd/alertmanager/main.go`	New flag, `configFileChecksum()` helper, `runConfigWatcher()` goroutine, watcher start block in `run()`
`cmd/alertmanager/main_test.go`	8 unit tests covering all edge cases
`docs/configuration.md`	New auto-reload section
`docs/management_api.md`	Cross-reference added under the reload section

fixes #5197

coderabbitai · 2026-05-01T19:28:21Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds a polling-based auto-reload controlled by a new --config.auto-reload-interval duration flag. When non-zero, a background watcher computes SHA256 checksums of the config file, polls it at the interval, and triggers reloads via the existing reload coordination channel when changes are detected.

Changes

Config Auto-reload Feature

Layer / File(s)	Summary
CLI flag and dependencies `cmd/alertmanager/main.go`	Introduces `--config.auto-reload-interval` flag (default `0s`, off by default) and adds imports for SHA256 hashing (`crypto/sha256`), hex encoding (`encoding/hex`), and ticker polling (`time`).
Checksum computation and watcher implementation `cmd/alertmanager/main.go`	Adds `configFileChecksum()` to return SHA256 hex digest of the config file; adds `runConfigWatcher()` goroutine that polls the file at the configured interval, maintains `lastChecksum` baseline (seeding on initial startup failure without triggering reload), detects changes via digest comparison, and emits reload signals via non-blocking send on the existing `reloadCh`; updates `lastChecksum` only after successful reload and skips redundant sends if a reload is already queued.
Integration into main orchestration `cmd/alertmanager/main.go`	In `run()`, conditionally starts the watcher when `configAutoReloadInterval > 0`, passing context, interval, config path, reload channel, and logger; logs a warning if cluster mode is enabled alongside auto-reload.
Unit tests `cmd/alertmanager/main_test.go`	Tests `configFileChecksum()` for consistent hashing, differing digests on content change, and error on missing file; tests `runConfigWatcher()` for no reload when unchanged, reload signal on change, no retrigger after successful reload, graceful handling of unreadable/removed files, recovery from initial startup failure, and exit on context cancellation.
User documentation `docs/configuration.md`, `docs/management_api.md`	Documents `--config.auto-reload-interval` with polling semantics, default-off behavior, an example using `30s`, and notes that reload uses the same validation and error-handling path as `SIGHUP` / `POST /-/reload`; updates management API doc to link to the configuration guide.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~28 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 16.67% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Out of Scope Changes check	❓ Inconclusive	All code changes directly support the auto-reload feature, but deployment considerations (cluster mode safety) discussed in PR comments remain unresolved, potentially affecting production suitability.	Address cluster-mode coordination concern raised by Spaceman1701: implement safeguard (warning or flag refusal) when operating as cluster node, or explicitly document single-node deployment requirement.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'cmd/alertmanager: add --config.auto-reload-interval flag' clearly and specifically describes the main change: adding a new CLI flag for automatic configuration reloading.
Description check	✅ Passed	The PR description comprehensively covers the feature rationale, behavior, implementation details, and file changes; it addresses the template checklist items and links to issue `#5197`.
Linked Issues check	✅ Passed	The PR implementation directly addresses all primary coding objectives from issue `#5197`: opt-in auto-reload mechanism, Kubernetes support via polling, existing reload semantics reuse, backwards compatibility, and transparent retry logic.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (1)

cmd/alertmanager/main.go (1)
746-752: 💤 Low value

Minor: potential goroutine leak on shutdown edge case.

If context cancellation occurs while the watcher is blocked sending to reloadCh (line 747), the goroutine won't exit cleanly because there's no select around the channel send. This is an edge case that only matters during shutdown when a config change is detected simultaneously with SIGTERM.

Since the process is exiting anyway and this matches the pattern used by the HTTP reload handler, this is acceptable. Consider wrapping in a select with ctx.Done() if clean shutdown becomes a requirement.
♻️ Optional: Add context-aware channel send
 			// Trigger reload via the same channel that SIGHUP and POST /-/reload use.
 			errCh := make(chan error)
-			reloadCh <- errCh
+			select {
+			case reloadCh <- errCh:
+			case <-ctx.Done():
+				logger.Info("Auto-reload: watcher stopped during reload attempt", "file", configFile)
+				return
+			}
 			if err := <-errCh; err != nil {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/alertmanager/main.go` around lines 746 - 752, The send to reloadCh in the
watcher can block and leak a goroutine if context is cancelled while sending; to
fix, make the send context-aware by replacing the direct send of errCh (reloadCh
<- errCh) with a select that attempts to send errCh or returns/continues on
ctx.Done(), e.g., in the watcher goroutine surrounding the reloadCh send use
select { case reloadCh <- errCh: ... case <-ctx.Done(): cleanup/continue },
ensuring you reference reloadCh, errCh and ctx.Done() so the goroutine exits
cleanly on shutdown.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@cmd/alertmanager/main.go`:
- Around line 746-752: The send to reloadCh in the watcher can block and leak a
goroutine if context is cancelled while sending; to fix, make the send
context-aware by replacing the direct send of errCh (reloadCh <- errCh) with a
select that attempts to send errCh or returns/continues on ctx.Done(), e.g., in
the watcher goroutine surrounding the reloadCh send use select { case reloadCh
<- errCh: ... case <-ctx.Done(): cleanup/continue }, ensuring you reference
reloadCh, errCh and ctx.Done() so the goroutine exits cleanly on shutdown.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1a804a2c-a6d9-4d5b-ad2a-7d1f7ad8c88d

📥 Commits

Reviewing files that changed from the base of the PR and between a5290a7 and 0501c5541d9e526f93a173f2169c60354a094719.

📒 Files selected for processing (4)

cmd/alertmanager/main.go
cmd/alertmanager/main_test.go
docs/configuration.md
docs/management_api.md

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/alertmanager/main.go`:
- Around line 717-722: The initial config checksum read can fail leaving
lastChecksum empty and causing the first subsequent successful read to always
look like a change; modify the auto-reload logic that calls configFileChecksum
so that when lastChecksum is empty (uninitialized) and a subsequent
configFileChecksum succeeds, you set lastChecksum = newChecksum and do NOT
trigger a reload; apply the same "seed baseline if uninitialized" check where
you compare newChecksum vs lastChecksum (the places using lastChecksum,
newChecksum, and configFileChecksum and the reload invocation) so only genuine
checksum changes trigger reloadConfig.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8dcd956a-d8ff-4066-bc98-4bc4095dc3ce

📥 Commits

Reviewing files that changed from the base of the PR and between 161829c78c84a928f138814f3d94832166d14407 and a2ceeb7a950c9b6c6c6d581e6b14ba02ab960f7f.

📒 Files selected for processing (4)

cmd/alertmanager/main.go
cmd/alertmanager/main_test.go
docs/configuration.md
docs/management_api.md

✅ Files skipped from review due to trivial changes (3)

docs/management_api.md
cmd/alertmanager/main_test.go
docs/configuration.md

TheMeier · 2026-05-11T04:41:36Z

Had a quick look, looks ok so far. But why is this better then using prometheus-config-reloader, which is re-usable and in fact used by many projects.

mihir-dixit2k27 · 2026-05-11T12:33:34Z

@TheMeier
The main motivation is avoiding an extra sidecar just for this prometheus-config-reloader needs its own image, RBAC, resource requests etc. which is a lot of overhead for simply watching one file.

Also Prometheus also added --enable-feature=auto-reload-config natively even thoughprometheus-config-reloaderexisted, so there's precedent for this in the ecosystem.

Happy to add a docs note pointing to prometheus-config-reloader for more advanced use cases if that helps!

Spaceman1701

Thanks for the contribution! I've looked briefly at the code and it looks ok to me, but unfortunately I'm a little concerned about this concept.

In cluster mode, Alertmanager nodes need to agree on their config or they will not reach consensus. Decoupling the replacement of the config file and the reload helps ensure that users can coordinate properly - it's not often possible to atomically update a file across many hosts, but it is easy enough to concurrently issue several reload requests.

I worry that this flag would end up being a footgun for clustered Alertmanager. If enabled, coordinating config reloading becomes very difficult.

mihir-dixit2k27 · 2026-05-11T17:52:47Z

@Spaceman1701 That's a really good point, I hadn't thought about it from the cluster coordination angle.
Would it make sense to either add a warning when cluster mode is detected, or explicitly document that this flag is intended for single-node deployments? Happy to go whichever direction you think is right . or if you feel it's a fundamental blocker I'm open to that too.

ultrotter · 2026-06-19T07:35:29Z

We could definitely go with a warning, or even refusing the flag for cluster nodes. What do you think @Spaceman1701 ?

coderabbitai

🧹 Nitpick comments (1)

cmd/alertmanager/main.go (1)
288-291: 💤 Low value

Consider adding a debug log when reload is skipped due to pending reload.

When the non-blocking send falls through to the default case, operators have no visibility into why a detected change didn't trigger an immediate reload. A debug-level log here would aid troubleshooting in scenarios where SIGHUP and auto-reload overlap.
💡 Optional improvement
 		select {
 		case reloadCh <- struct{}{}:
 			lastChecksum = checksum
 		default:
 			// A reload is already pending; the file change will be picked
 			// up after that reload completes.
+			logger.Debug("Auto-reload: skipping, reload already pending", "file", configFile)
 		}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/alertmanager/main.go` around lines 288 - 291, Add a debug-level log
statement in the default case of the select statement (where the reload is
skipped) to inform operators that a file change was detected but not acted upon
because a reload is already pending. This will provide visibility when SIGHUP
and auto-reload operations overlap, helping with troubleshooting scenarios where
file changes are detected but don't immediately trigger a reload.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@cmd/alertmanager/main.go`:
- Around line 288-291: Add a debug-level log statement in the default case of
the select statement (where the reload is skipped) to inform operators that a
file change was detected but not acted upon because a reload is already pending.
This will provide visibility when SIGHUP and auto-reload operations overlap,
helping with troubleshooting scenarios where file changes are detected but don't
immediately trigger a reload.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: eaa91377-4b82-4eb4-bf7a-7855be6ca47c

📥 Commits

Reviewing files that changed from the base of the PR and between f7ba5da and bf00b6b.

📒 Files selected for processing (4)

cmd/alertmanager/main.go
cmd/alertmanager/main_test.go
docs/configuration.md
docs/management_api.md

✅ Files skipped from review due to trivial changes (2)

docs/management_api.md
docs/configuration.md

🚧 Files skipped from review as they are similar to previous changes (1)

cmd/alertmanager/main_test.go

Signed-off-by: Mihir Dixit <dixitmihir1@gmail.com>

mihir-dixit2k27 requested a review from a team as a code owner May 1, 2026 19:28

coderabbitai Bot reviewed May 1, 2026

View reviewed changes

mihir-dixit2k27 force-pushed the feat/config-auto-reload-interval branch 2 times, most recently from 161829c to a2ceeb7 Compare May 2, 2026 17:18

coderabbitai Bot reviewed May 2, 2026

View reviewed changes

Comment thread cmd/alertmanager/main.go

mihir-dixit2k27 force-pushed the feat/config-auto-reload-interval branch from a2ceeb7 to f7ba5da Compare May 2, 2026 17:27

Spaceman1701 reviewed May 11, 2026

View reviewed changes

mihir-dixit2k27 force-pushed the feat/config-auto-reload-interval branch from f7ba5da to bf00b6b Compare June 19, 2026 11:57

coderabbitai Bot reviewed Jun 19, 2026

View reviewed changes

cmd/alertmanager: add --config.auto-reload-interval flag

b45d342

Signed-off-by: Mihir Dixit <dixitmihir1@gmail.com>

mihir-dixit2k27 force-pushed the feat/config-auto-reload-interval branch from bf00b6b to b45d342 Compare June 19, 2026 12:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

cmd/alertmanager: add --config.auto-reload-interval flag#5222

cmd/alertmanager: add --config.auto-reload-interval flag#5222
mihir-dixit2k27 wants to merge 1 commit into
prometheus:mainfrom
mihir-dixit2k27:feat/config-auto-reload-interval

mihir-dixit2k27 commented May 1, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 1, 2026 •

edited

Loading

Reviews paused

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning, 1 inconclusive)

coderabbitai Bot left a comment

coderabbitai Bot left a comment

Uh oh!

TheMeier commented May 11, 2026

mihir-dixit2k27 commented May 11, 2026

Spaceman1701 left a comment

mihir-dixit2k27 commented May 11, 2026

ultrotter commented Jun 19, 2026

coderabbitai Bot left a comment

Labels

4 participants

Uh oh!

Conversation

mihir-dixit2k27 commented May 1, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Behaviour

Changes

coderabbitai Bot commented May 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning, 1 inconclusive)

coderabbitai Bot left a comment

Choose a reason for hiding this comment

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

TheMeier commented May 11, 2026

mihir-dixit2k27 commented May 11, 2026

Spaceman1701 left a comment

Choose a reason for hiding this comment

mihir-dixit2k27 commented May 11, 2026

ultrotter commented Jun 19, 2026

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Labels

4 participants

mihir-dixit2k27 commented May 1, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 1, 2026 •

edited

Loading