cmd/alertmanager: add --config.auto-reload-interval flag#5222
cmd/alertmanager: add --config.auto-reload-interval flag#5222mihir-dixit2k27 wants to merge 1 commit into
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a polling-based auto-reload controlled by a new ChangesConfig Auto-reload Feature
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~28 minutes 🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
cmd/alertmanager/main.go (1)
746-752: 💤 Low valueMinor: potential goroutine leak on shutdown edge case.
If context cancellation occurs while the watcher is blocked sending to
reloadCh(line 747), the goroutine won't exit cleanly because there's no select around the channel send. This is an edge case that only matters during shutdown when a config change is detected simultaneously with SIGTERM.Since the process is exiting anyway and this matches the pattern used by the HTTP reload handler, this is acceptable. Consider wrapping in a select with
ctx.Done()if clean shutdown becomes a requirement.♻️ Optional: Add context-aware channel send
// Trigger reload via the same channel that SIGHUP and POST /-/reload use. errCh := make(chan error) - reloadCh <- errCh + select { + case reloadCh <- errCh: + case <-ctx.Done(): + logger.Info("Auto-reload: watcher stopped during reload attempt", "file", configFile) + return + } if err := <-errCh; err != nil {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/alertmanager/main.go` around lines 746 - 752, The send to reloadCh in the watcher can block and leak a goroutine if context is cancelled while sending; to fix, make the send context-aware by replacing the direct send of errCh (reloadCh <- errCh) with a select that attempts to send errCh or returns/continues on ctx.Done(), e.g., in the watcher goroutine surrounding the reloadCh send use select { case reloadCh <- errCh: ... case <-ctx.Done(): cleanup/continue }, ensuring you reference reloadCh, errCh and ctx.Done() so the goroutine exits cleanly on shutdown.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@cmd/alertmanager/main.go`:
- Around line 746-752: The send to reloadCh in the watcher can block and leak a
goroutine if context is cancelled while sending; to fix, make the send
context-aware by replacing the direct send of errCh (reloadCh <- errCh) with a
select that attempts to send errCh or returns/continues on ctx.Done(), e.g., in
the watcher goroutine surrounding the reloadCh send use select { case reloadCh
<- errCh: ... case <-ctx.Done(): cleanup/continue }, ensuring you reference
reloadCh, errCh and ctx.Done() so the goroutine exits cleanly on shutdown.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 1a804a2c-a6d9-4d5b-ad2a-7d1f7ad8c88d
📥 Commits
Reviewing files that changed from the base of the PR and between a5290a7 and 0501c5541d9e526f93a173f2169c60354a094719.
📒 Files selected for processing (4)
cmd/alertmanager/main.gocmd/alertmanager/main_test.godocs/configuration.mddocs/management_api.md
161829c to
a2ceeb7
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cmd/alertmanager/main.go`:
- Around line 717-722: The initial config checksum read can fail leaving
lastChecksum empty and causing the first subsequent successful read to always
look like a change; modify the auto-reload logic that calls configFileChecksum
so that when lastChecksum is empty (uninitialized) and a subsequent
configFileChecksum succeeds, you set lastChecksum = newChecksum and do NOT
trigger a reload; apply the same "seed baseline if uninitialized" check where
you compare newChecksum vs lastChecksum (the places using lastChecksum,
newChecksum, and configFileChecksum and the reload invocation) so only genuine
checksum changes trigger reloadConfig.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 8dcd956a-d8ff-4066-bc98-4bc4095dc3ce
📥 Commits
Reviewing files that changed from the base of the PR and between 161829c78c84a928f138814f3d94832166d14407 and a2ceeb7a950c9b6c6c6d581e6b14ba02ab960f7f.
📒 Files selected for processing (4)
cmd/alertmanager/main.gocmd/alertmanager/main_test.godocs/configuration.mddocs/management_api.md
✅ Files skipped from review due to trivial changes (3)
- docs/management_api.md
- cmd/alertmanager/main_test.go
- docs/configuration.md
a2ceeb7 to
f7ba5da
Compare
|
Had a quick look, looks ok so far. But why is this better then using prometheus-config-reloader, which is re-usable and in fact used by many projects. |
|
@TheMeier Also Prometheus also added Happy to add a docs note pointing to |
Spaceman1701
left a comment
There was a problem hiding this comment.
Thanks for the contribution! I've looked briefly at the code and it looks ok to me, but unfortunately I'm a little concerned about this concept.
In cluster mode, Alertmanager nodes need to agree on their config or they will not reach consensus. Decoupling the replacement of the config file and the reload helps ensure that users can coordinate properly - it's not often possible to atomically update a file across many hosts, but it is easy enough to concurrently issue several reload requests.
I worry that this flag would end up being a footgun for clustered Alertmanager. If enabled, coordinating config reloading becomes very difficult.
|
@Spaceman1701 That's a really good point, I hadn't thought about it from the cluster coordination angle. |
|
We could definitely go with a warning, or even refusing the flag for cluster nodes. What do you think @Spaceman1701 ? |
f7ba5da to
bf00b6b
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
cmd/alertmanager/main.go (1)
288-291: 💤 Low valueConsider adding a debug log when reload is skipped due to pending reload.
When the non-blocking send falls through to the
defaultcase, operators have no visibility into why a detected change didn't trigger an immediate reload. A debug-level log here would aid troubleshooting in scenarios where SIGHUP and auto-reload overlap.💡 Optional improvement
select { case reloadCh <- struct{}{}: lastChecksum = checksum default: // A reload is already pending; the file change will be picked // up after that reload completes. + logger.Debug("Auto-reload: skipping, reload already pending", "file", configFile) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/alertmanager/main.go` around lines 288 - 291, Add a debug-level log statement in the default case of the select statement (where the reload is skipped) to inform operators that a file change was detected but not acted upon because a reload is already pending. This will provide visibility when SIGHUP and auto-reload operations overlap, helping with troubleshooting scenarios where file changes are detected but don't immediately trigger a reload.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@cmd/alertmanager/main.go`:
- Around line 288-291: Add a debug-level log statement in the default case of
the select statement (where the reload is skipped) to inform operators that a
file change was detected but not acted upon because a reload is already pending.
This will provide visibility when SIGHUP and auto-reload operations overlap,
helping with troubleshooting scenarios where file changes are detected but don't
immediately trigger a reload.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: eaa91377-4b82-4eb4-bf7a-7855be6ca47c
📒 Files selected for processing (4)
cmd/alertmanager/main.gocmd/alertmanager/main_test.godocs/configuration.mddocs/management_api.md
✅ Files skipped from review due to trivial changes (2)
- docs/management_api.md
- docs/configuration.md
🚧 Files skipped from review as they are similar to previous changes (1)
- cmd/alertmanager/main_test.go
Signed-off-by: Mihir Dixit <dixitmihir1@gmail.com>
bf00b6b to
b45d342
Compare
Description
Adds a
--config.auto-reload-intervalflag. When set to a non-zero duration, a background goroutine polls the SHA256 checksum of the config file at the given interval and writes to the existingwebReloadchannel on a detected change — the same path taken bySIGHUPandPOST /-/reload— so no new reload logic is introduced.Why polling instead of fsnotify?
Kubernetes
kubeletusesAtomicWriterto update ConfigMap and Secret volume mounts via symlink swaps, which causes fsnotify to stop receiving events after the first update since the original inode is replaced. SHA256 polling works correctly regardless of how the underlying file is updated, which is also why Prometheus uses polling for its equivalent feature.Behaviour
0s(disabled). Existing deployments are completely unaffected.SIGHUP/POST /-/reload: invalid configs are rejected and logged without disrupting the running instance.lastChecksumis not updated, so the watcher retries on every subsequent tick until the configuration is valid again.SIGTERM.Changes
cmd/alertmanager/main.goconfigFileChecksum()helper,runConfigWatcher()goroutine, watcher start block inrun()cmd/alertmanager/main_test.godocs/configuration.mddocs/management_api.mdfixes #5197