Skip to content

cmd/alertmanager: add --config.auto-reload-interval flag#5222

Open
mihir-dixit2k27 wants to merge 1 commit into
prometheus:mainfrom
mihir-dixit2k27:feat/config-auto-reload-interval
Open

cmd/alertmanager: add --config.auto-reload-interval flag#5222
mihir-dixit2k27 wants to merge 1 commit into
prometheus:mainfrom
mihir-dixit2k27:feat/config-auto-reload-interval

Conversation

@mihir-dixit2k27

@mihir-dixit2k27 mihir-dixit2k27 commented May 1, 2026

Copy link
Copy Markdown
Contributor

Description

Adds a --config.auto-reload-interval flag. When set to a non-zero duration, a background goroutine polls the SHA256 checksum of the config file at the given interval and writes to the existing webReload channel on a detected change — the same path taken by SIGHUP and POST /-/reload — so no new reload logic is introduced.

Why polling instead of fsnotify?
Kubernetes kubelet uses AtomicWriter to update ConfigMap and Secret volume mounts via symlink swaps, which causes fsnotify to stop receiving events after the first update since the original inode is replaced. SHA256 polling works correctly regardless of how the underlying file is updated, which is also why Prometheus uses polling for its equivalent feature.

Behaviour

  • Default is 0s (disabled). Existing deployments are completely unaffected.
  • On a detected change, the reload follows the same code path as SIGHUP/POST /-/reload: invalid configs are rejected and logged without disrupting the running instance.
  • If a reload fails, lastChecksum is not updated, so the watcher retries on every subsequent tick until the configuration is valid again.
  • The watcher goroutine exits cleanly on SIGTERM.

Changes

File Description
cmd/alertmanager/main.go New flag, configFileChecksum() helper, runConfigWatcher() goroutine, watcher start block in run()
cmd/alertmanager/main_test.go 8 unit tests covering all edge cases
docs/configuration.md New auto-reload section
docs/management_api.md Cross-reference added under the reload section

fixes #5197

@mihir-dixit2k27 mihir-dixit2k27 requested a review from a team as a code owner May 1, 2026 19:28
@coderabbitai

coderabbitai Bot commented May 1, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a polling-based auto-reload controlled by a new --config.auto-reload-interval duration flag. When non-zero, a background watcher computes SHA256 checksums of the config file, polls it at the interval, and triggers reloads via the existing reload coordination channel when changes are detected.

Changes

Config Auto-reload Feature

Layer / File(s) Summary
CLI flag and dependencies
cmd/alertmanager/main.go
Introduces --config.auto-reload-interval flag (default 0s, off by default) and adds imports for SHA256 hashing (crypto/sha256), hex encoding (encoding/hex), and ticker polling (time).
Checksum computation and watcher implementation
cmd/alertmanager/main.go
Adds configFileChecksum() to return SHA256 hex digest of the config file; adds runConfigWatcher() goroutine that polls the file at the configured interval, maintains lastChecksum baseline (seeding on initial startup failure without triggering reload), detects changes via digest comparison, and emits reload signals via non-blocking send on the existing reloadCh; updates lastChecksum only after successful reload and skips redundant sends if a reload is already queued.
Integration into main orchestration
cmd/alertmanager/main.go
In run(), conditionally starts the watcher when configAutoReloadInterval > 0, passing context, interval, config path, reload channel, and logger; logs a warning if cluster mode is enabled alongside auto-reload.
Unit tests
cmd/alertmanager/main_test.go
Tests configFileChecksum() for consistent hashing, differing digests on content change, and error on missing file; tests runConfigWatcher() for no reload when unchanged, reload signal on change, no retrigger after successful reload, graceful handling of unreadable/removed files, recovery from initial startup failure, and exit on context cancellation.
User documentation
docs/configuration.md, docs/management_api.md
Documents --config.auto-reload-interval with polling semantics, default-off behavior, an example using 30s, and notes that reload uses the same validation and error-handling path as SIGHUP / POST /-/reload; updates management API doc to link to the configuration guide.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~28 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 16.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Out of Scope Changes check ❓ Inconclusive All code changes directly support the auto-reload feature, but deployment considerations (cluster mode safety) discussed in PR comments remain unresolved, potentially affecting production suitability. Address cluster-mode coordination concern raised by Spaceman1701: implement safeguard (warning or flag refusal) when operating as cluster node, or explicitly document single-node deployment requirement.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title 'cmd/alertmanager: add --config.auto-reload-interval flag' clearly and specifically describes the main change: adding a new CLI flag for automatic configuration reloading.
Description check ✅ Passed The PR description comprehensively covers the feature rationale, behavior, implementation details, and file changes; it addresses the template checklist items and links to issue #5197.
Linked Issues check ✅ Passed The PR implementation directly addresses all primary coding objectives from issue #5197: opt-in auto-reload mechanism, Kubernetes support via polling, existing reload semantics reuse, backwards compatibility, and transparent retry logic.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
cmd/alertmanager/main.go (1)

746-752: 💤 Low value

Minor: potential goroutine leak on shutdown edge case.

If context cancellation occurs while the watcher is blocked sending to reloadCh (line 747), the goroutine won't exit cleanly because there's no select around the channel send. This is an edge case that only matters during shutdown when a config change is detected simultaneously with SIGTERM.

Since the process is exiting anyway and this matches the pattern used by the HTTP reload handler, this is acceptable. Consider wrapping in a select with ctx.Done() if clean shutdown becomes a requirement.

♻️ Optional: Add context-aware channel send
 			// Trigger reload via the same channel that SIGHUP and POST /-/reload use.
 			errCh := make(chan error)
-			reloadCh <- errCh
+			select {
+			case reloadCh <- errCh:
+			case <-ctx.Done():
+				logger.Info("Auto-reload: watcher stopped during reload attempt", "file", configFile)
+				return
+			}
 			if err := <-errCh; err != nil {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/alertmanager/main.go` around lines 746 - 752, The send to reloadCh in the
watcher can block and leak a goroutine if context is cancelled while sending; to
fix, make the send context-aware by replacing the direct send of errCh (reloadCh
<- errCh) with a select that attempts to send errCh or returns/continues on
ctx.Done(), e.g., in the watcher goroutine surrounding the reloadCh send use
select { case reloadCh <- errCh: ... case <-ctx.Done(): cleanup/continue },
ensuring you reference reloadCh, errCh and ctx.Done() so the goroutine exits
cleanly on shutdown.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@cmd/alertmanager/main.go`:
- Around line 746-752: The send to reloadCh in the watcher can block and leak a
goroutine if context is cancelled while sending; to fix, make the send
context-aware by replacing the direct send of errCh (reloadCh <- errCh) with a
select that attempts to send errCh or returns/continues on ctx.Done(), e.g., in
the watcher goroutine surrounding the reloadCh send use select { case reloadCh
<- errCh: ... case <-ctx.Done(): cleanup/continue }, ensuring you reference
reloadCh, errCh and ctx.Done() so the goroutine exits cleanly on shutdown.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1a804a2c-a6d9-4d5b-ad2a-7d1f7ad8c88d

📥 Commits

Reviewing files that changed from the base of the PR and between a5290a7 and 0501c5541d9e526f93a173f2169c60354a094719.

📒 Files selected for processing (4)
  • cmd/alertmanager/main.go
  • cmd/alertmanager/main_test.go
  • docs/configuration.md
  • docs/management_api.md
@mihir-dixit2k27 mihir-dixit2k27 force-pushed the feat/config-auto-reload-interval branch 2 times, most recently from 161829c to a2ceeb7 Compare May 2, 2026 17:18

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/alertmanager/main.go`:
- Around line 717-722: The initial config checksum read can fail leaving
lastChecksum empty and causing the first subsequent successful read to always
look like a change; modify the auto-reload logic that calls configFileChecksum
so that when lastChecksum is empty (uninitialized) and a subsequent
configFileChecksum succeeds, you set lastChecksum = newChecksum and do NOT
trigger a reload; apply the same "seed baseline if uninitialized" check where
you compare newChecksum vs lastChecksum (the places using lastChecksum,
newChecksum, and configFileChecksum and the reload invocation) so only genuine
checksum changes trigger reloadConfig.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8dcd956a-d8ff-4066-bc98-4bc4095dc3ce

📥 Commits

Reviewing files that changed from the base of the PR and between 161829c78c84a928f138814f3d94832166d14407 and a2ceeb7a950c9b6c6c6d581e6b14ba02ab960f7f.

📒 Files selected for processing (4)
  • cmd/alertmanager/main.go
  • cmd/alertmanager/main_test.go
  • docs/configuration.md
  • docs/management_api.md
✅ Files skipped from review due to trivial changes (3)
  • docs/management_api.md
  • cmd/alertmanager/main_test.go
  • docs/configuration.md
Comment thread cmd/alertmanager/main.go
@mihir-dixit2k27 mihir-dixit2k27 force-pushed the feat/config-auto-reload-interval branch from a2ceeb7 to f7ba5da Compare May 2, 2026 17:27
@TheMeier

Copy link
Copy Markdown
Contributor

Had a quick look, looks ok so far. But why is this better then using prometheus-config-reloader, which is re-usable and in fact used by many projects.

@mihir-dixit2k27

Copy link
Copy Markdown
Contributor Author

@TheMeier
The main motivation is avoiding an extra sidecar just for this prometheus-config-reloader needs its own image, RBAC, resource requests etc. which is a lot of overhead for simply watching one file.

Also Prometheus also added --enable-feature=auto-reload-config natively even thoughprometheus-config-reloaderexisted, so there's precedent for this in the ecosystem.

Happy to add a docs note pointing to prometheus-config-reloader for more advanced use cases if that helps!

@Spaceman1701 Spaceman1701 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution! I've looked briefly at the code and it looks ok to me, but unfortunately I'm a little concerned about this concept.

In cluster mode, Alertmanager nodes need to agree on their config or they will not reach consensus. Decoupling the replacement of the config file and the reload helps ensure that users can coordinate properly - it's not often possible to atomically update a file across many hosts, but it is easy enough to concurrently issue several reload requests.

I worry that this flag would end up being a footgun for clustered Alertmanager. If enabled, coordinating config reloading becomes very difficult.

@mihir-dixit2k27

Copy link
Copy Markdown
Contributor Author

@Spaceman1701 That's a really good point, I hadn't thought about it from the cluster coordination angle.
Would it make sense to either add a warning when cluster mode is detected, or explicitly document that this flag is intended for single-node deployments? Happy to go whichever direction you think is right . or if you feel it's a fundamental blocker I'm open to that too.

@ultrotter

Copy link
Copy Markdown
Contributor

We could definitely go with a warning, or even refusing the flag for cluster nodes. What do you think @Spaceman1701 ?

@mihir-dixit2k27 mihir-dixit2k27 force-pushed the feat/config-auto-reload-interval branch from f7ba5da to bf00b6b Compare June 19, 2026 11:57

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
cmd/alertmanager/main.go (1)

288-291: 💤 Low value

Consider adding a debug log when reload is skipped due to pending reload.

When the non-blocking send falls through to the default case, operators have no visibility into why a detected change didn't trigger an immediate reload. A debug-level log here would aid troubleshooting in scenarios where SIGHUP and auto-reload overlap.

💡 Optional improvement
 		select {
 		case reloadCh <- struct{}{}:
 			lastChecksum = checksum
 		default:
 			// A reload is already pending; the file change will be picked
 			// up after that reload completes.
+			logger.Debug("Auto-reload: skipping, reload already pending", "file", configFile)
 		}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/alertmanager/main.go` around lines 288 - 291, Add a debug-level log
statement in the default case of the select statement (where the reload is
skipped) to inform operators that a file change was detected but not acted upon
because a reload is already pending. This will provide visibility when SIGHUP
and auto-reload operations overlap, helping with troubleshooting scenarios where
file changes are detected but don't immediately trigger a reload.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@cmd/alertmanager/main.go`:
- Around line 288-291: Add a debug-level log statement in the default case of
the select statement (where the reload is skipped) to inform operators that a
file change was detected but not acted upon because a reload is already pending.
This will provide visibility when SIGHUP and auto-reload operations overlap,
helping with troubleshooting scenarios where file changes are detected but don't
immediately trigger a reload.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: eaa91377-4b82-4eb4-bf7a-7855be6ca47c

📥 Commits

Reviewing files that changed from the base of the PR and between f7ba5da and bf00b6b.

📒 Files selected for processing (4)
  • cmd/alertmanager/main.go
  • cmd/alertmanager/main_test.go
  • docs/configuration.md
  • docs/management_api.md
✅ Files skipped from review due to trivial changes (2)
  • docs/management_api.md
  • docs/configuration.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • cmd/alertmanager/main_test.go
Signed-off-by: Mihir Dixit <dixitmihir1@gmail.com>
@mihir-dixit2k27 mihir-dixit2k27 force-pushed the feat/config-auto-reload-interval branch from bf00b6b to b45d342 Compare June 19, 2026 12:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

4 participants