Conversation

@AlanCoding (Member) commented Jan 19, 2026

SUMMARY

Still WIP but this has the basics.

Next up: validate that /api/v2/metrics/ can return the new keys; for what those are, see:

https://github.com/ansible/dispatcherd/blob/main/dispatcherd/service/metrics.py

ISSUE TYPE
  • Bug, Docs Fix or other nominal change
COMPONENT NAME
  • API

Note

Medium Risk
Adds a runtime HTTP call to the metrics endpoint and changes dispatcher service configuration, which could affect metrics availability/latency if misconfigured or if the dispatcherd endpoint is unreachable.

Overview
/api/v2/metrics/ now appends Prometheus output scraped directly from the local dispatcherd metrics HTTP endpoint (via new _get_dispatcherd_metrics), while keeping the existing Redis-backed subsystem metrics generation.

Dispatcher-related metric keys are removed from DispatcherMetrics.METRICSLIST and filtering is optimized so the HTTP scrape is skipped when node excludes the local host or when the requested metric set doesn’t overlap dispatcher metrics; failures/timeouts are swallowed with debug logging.

dispatcherd service config is updated to pass metrics_kwargs (host/port) from METRICS_SUBSYSTEM_CONFIG, test-mode broker naming is updated, functional tests cover the new node/metric filter behavior, and the dispatcherd dependency is bumped to 2026.01.27.

Written by Cursor Bugbot for commit 23af6f0. This will update automatically on new commits.
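
To make the overview above concrete, here is a minimal sketch of what the scrape helper might look like. Only the _get_dispatcherd_metrics name and the exception handling (quoted later in this conversation) come from the PR; the host/port values, timeout, and URL shape are illustrative assumptions.

import http.client
import logging
import socket
import urllib.error
import urllib.request

logger = logging.getLogger(__name__)


def _get_dispatcherd_metrics():
    # Assumption: the host/port would ultimately come from METRICS_SUBSYSTEM_CONFIG;
    # they are hardcoded here only for the sketch.
    host = 'localhost'
    port = 8014  # illustrative port, not confirmed by the PR
    url = f'http://{host}:{port}/metrics'
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            payload = response.read()
        return payload.decode('utf-8')
    except (urllib.error.URLError, UnicodeError, socket.timeout, TimeoutError, http.client.HTTPException) as exc:
        # Failures/timeouts are swallowed with debug logging so the Redis-backed metrics still render.
        logger.debug(f"Failed to collect dispatcherd metrics from {url}: {exc}")
        return ''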

github-actions bot added the component:api and dependencies labels Jan 19, 2026
AlanCoding changed the title Enable new fancy asyncio metrics for dispatcherd Jan 19, 2026
AlanCoding requested a review from fosterseth January 19, 2026 19:47
@AlanCoding (Member Author)
From /api/v2/metrics/:

dispatcher_pool_scale_up_events{node="awx-1"} 0
# HELP dispatcher_pool_active_task_count Number of active tasks in the worker pool when last task was submitted
# TYPE dispatcher_pool_active_task_count gauge
dispatcher_pool_active_task_count{node="awx-1"} 0
# HELP dispatcher_pool_max_worker_count Highest number of workers in worker pool in last collection interval, about 20s
# TYPE dispatcher_pool_max_worker_count gauge
dispatcher_pool_max_worker_count{node="awx-1"} 0
# HELP dispatcher_availability Fraction of time (in last collection interval) dispatcher was able to receive messages
# TYPE dispatcher_availability gauge
dispatcher_availability{node="awx-1"} 0.0
# HELP subsystem_metrics_pipe_execute_seconds Time spent saving metrics to redis
# TYPE subsystem_metrics_pipe_execute_seconds gauge

So this looks wrong; I will look into what the deal is.

@AlanCoding (Member Author)

With the latest change, those old numbers are going away and I am seeing:

# HELP dispatcher_messages_received_total Number of messages received by dispatchermain
# TYPE dispatcher_messages_received_total counter
dispatcher_messages_received_total 2.0
# HELP dispatcher_control_messages_count_total Number of control messages received.
# TYPE dispatcher_control_messages_count_total counter
dispatcher_control_messages_count_total 0.0
# HELP dispatcher_worker_count_total Number of workers running.
# TYPE dispatcher_worker_count_total counter
dispatcher_worker_count_total 4.0
AlanCoding requested a review from kdelee January 19, 2026 20:34
@AlanCoding (Member Author)

With the last push, the tests are passing now. The security notice would apply generally to the whole idea of serving metrics locally, which I was fairly sure was already happening.

@AlanCoding (Member Author)

Currently, I think that this posts dispatcher metrics through redis:

https://github.com/ansible/awx/blob/devel/awx/main/dispatch/worker/base.py#L192-L202

For metrics that aren't actually dispatcher related, like task manager stuff, this will still be the case. But that specific method will be removed in #16209, and this new stuff is its replacement.
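
For context, the general "post metrics through redis" pattern that the linked method performs looks roughly like the conceptual sketch below. This is not the AWX code at that link; the key name, connection details, and helper function are all illustrative.

import json
import time

import redis  # assumes the redis-py client is installed

conn = redis.Redis(host='localhost', port=6379, db=0)  # illustrative connection


def post_dispatcher_metrics(node, values):
    # One Redis hash per node preserves the per-node labels seen in the scrape output above.
    key = f'awx_metrics_sketch:{node}'
    conn.hset(key, mapping={name: json.dumps(value) for name, value in values.items()})


post_dispatcher_metrics('awx-1', {'dispatcher_availability': 1.0, 'recorded_timestamp': time.time()})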

AlanCoding force-pushed the dispatcherd_metrics branch 2 times, most recently from e2b45d0 to 958249a, January 26, 2026 15:05
super().__init__(settings.METRICS_SERVICE_CALLBACK_RECEIVER, *args, **kwargs)


def _get_dispatcherd_metrics():
Member (review comment):
Is there any reason we can't get metrics via dispatcherctl? It seems a unix socket might be more reliable/robust than an HTTP server.

@AlanCoding (Member Author):
dispatcherctl is just the client. The service is dispatcherd. All dispatcherctl does (now) is send pg_notify messages to get data. So, to the question of whether metrics can go over pg_notify: I think that's going to be a "no". Prometheus-style collectors expect data served over a local port, and without adding this stuff, there is no port dispatcherd is serving from. So based on OS first principles, yeah, we have to add a port dispatcherd listens to, because that's just how metrics collection works.
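
To make the port discussion concrete, below is a hedged sketch of how the metrics_kwargs wiring from METRICS_SUBSYSTEM_CONFIG (mentioned in the summary above) could look. The dispatcherd config schema is not shown in this PR, so everything except the metrics_kwargs and METRICS_SUBSYSTEM_CONFIG names is illustrative.

def build_dispatcherd_service_config(metrics_subsystem_config):
    # metrics_subsystem_config stands in for settings.METRICS_SUBSYSTEM_CONFIG
    cfg = metrics_subsystem_config or {}
    return {
        'service': {
            # dispatcherd would start a small local HTTP server for Prometheus scrapes
            'metrics_kwargs': {
                'host': cfg.get('host', 'localhost'),
                'port': cfg.get('port', 8014),  # illustrative default
            },
        },
    }


print(build_dispatcherd_service_config({'host': 'localhost', 'port': 8014}))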

@AlanCoding (Member Author)

I see task manager metrics are still being gathered.

# HELP task_manager_get_tasks_seconds Time spent in loading tasks from db
# TYPE task_manager_get_tasks_seconds gauge
task_manager_get_tasks_seconds{node="awx-1"} 0.013124978635460138
# HELP task_manager_start_task_seconds Time spent starting task
# TYPE task_manager_start_task_seconds gauge
task_manager_start_task_seconds{node="awx-1"} 0.010372716002166271
# HELP task_manager_process_running_tasks_seconds Time spent processing running tasks
# TYPE task_manager_process_running_tasks_seconds gauge
task_manager_process_running_tasks_seconds{node="awx-1"} 6.309710443019867e-07
# HELP task_manager_process_pending_tasks_seconds Time spent processing pending tasks
# TYPE task_manager_process_pending_tasks_seconds gauge
task_manager_process_pending_tasks_seconds{node="awx-1"} 0.01111389696598053
# HELP task_manager__schedule_seconds Time spent in running the entire _schedule
# TYPE task_manager__schedule_seconds gauge
task_manager__schedule_seconds{node="awx-1"} 0.03407588880509138
# HELP task_manager__schedule_calls Number of calls to _schedule, after lock is acquired
# TYPE task_manager__schedule_calls gauge
task_manager__schedule_calls{node="awx-1"} 34
# HELP task_manager_recorded_timestamp Unix timestamp when metrics were last recorded
# TYPE task_manager_recorded_timestamp gauge
task_manager_recorded_timestamp{node="awx-1"} 1769529661.0170753

However, I should disclose that I don't understand how.

@AlanCoding (Member Author)

OK, multiple times I have unambiguously confirmed that the task manager metrics are still updating. I admit that I do not know how this is working, but somehow it appears to be working.

This is still a bit of a separate subject from this PR.

I'm still iffy on whether people are okay with me adding this data without formalizing the schema, but I'm marking this as ready for review. I don't have another approach; even if this feels hacky, it does what's needed.

AlanCoding marked this pull request as ready for review January 27, 2026 21:57
cursor bot left a comment
Cursor Bugbot has reviewed your changes and found 1 potential issue.

return payload.decode('utf-8')
except (urllib.error.URLError, UnicodeError, socket.timeout, TimeoutError, http.client.HTTPException) as exc:
logger.debug(f"Failed to collect dispatcherd metrics from {url}: {exc}")
return ''

Dispatcherd metrics ignores metric query parameter filter

Low Severity

The _get_dispatcherd_metrics function respects the node query parameter filter but doesn't filter by the metric query parameter. The existing generate_metrics method filters metrics by both node and metric parameters. When users request specific metrics via ?metric=<name>, they receive filtered Redis-based metrics but ALL dispatcherd metrics, creating inconsistent API behavior. The API documentation explicitly shows metric= as a supported filter parameter.
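
One possible way to make the dispatcherd scrape honor ?metric= would be to filter the Prometheus exposition text by metric name before appending it. The sketch below is illustrative only and is not the fix adopted in the PR.

def filter_prometheus_text(text, allowed_names):
    # Keep only HELP/TYPE/sample lines whose metric name is in allowed_names.
    kept = []
    for line in text.splitlines():
        if line.startswith('#'):
            # "# HELP <name> ..." / "# TYPE <name> ..." -- the name is the third token
            parts = line.split()
            name = parts[2] if len(parts) > 2 else ''
        else:
            # Sample lines look like: name{labels} value  or  name value
            name = line.split('{')[0].split(' ')[0]
        if not allowed_names or name in allowed_names:
            kept.append(line)
    return '\n'.join(kept) + ('\n' if kept else '')


sample = (
    '# HELP dispatcher_worker_count_total Number of workers running.\n'
    '# TYPE dispatcher_worker_count_total counter\n'
    'dispatcher_worker_count_total 4.0\n'
)
print(filter_prometheus_text(sample, {'dispatcher_worker_count_total'}))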


@sonarqubecloud

Quality Gate failed

Failed conditions
2 Security Hotspots
67.2% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube Cloud


Labels

component:api, dependencies (Pull requests that update a dependency file)

2 participants