
Conversation

@kcons kcons (Member) commented Dec 4, 2025

The existing DetectorGroup backfill job is impractically slow.
This adds a function (intended to be triggered by a job) that produces roughly equal ranges of IDs in the Projects table; those ranges will then be used to trigger a new task that backfills the projects in each range.

This distributes all of the slow bits into chunks we can control the size of, and the processing pool used to execute them can be gradually dialed up as we gain confidence in correctness and capacity cost.
The expectation is that this should allow backfill to finish completely in a day or so without blocking any jobs or hand-holding.
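The range-producing function described above can be sketched in pure Python. This is only an illustration of the chunking idea, not the PR's actual implementation (which queried the Projects table directly, and was later planned to chunk in Python anyway); the function name and signature here are hypothetical.

```python
def get_project_id_ranges(min_id: int, max_id: int, chunk_size: int):
    """Yield inclusive (start, end) ID ranges of roughly equal size
    covering [min_id, max_id]. Each range becomes one backfill task."""
    start = min_id
    while start <= max_id:
        end = min(start + chunk_size - 1, max_id)
        yield (start, end)
        start = end + 1
```

Each yielded range maps to one task, so the chunk size directly controls how many tasks land on the processing pool.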

@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Dec 4, 2025
@codecov codecov bot commented Dec 4, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.

Additional details and impacted files

```
@@             Coverage Diff             @@
##           master   #104377      +/-   ##
===========================================
- Coverage   80.57%    80.40%   -0.17%
===========================================
  Files        9345      9345
  Lines      399518    401299    +1781
  Branches    25600     25600
===========================================
+ Hits       321894    322664     +770
- Misses      77171     78182    +1011
  Partials      453       453
```

@kcons kcons marked this pull request as ready for review December 5, 2025 01:26
@kcons kcons requested review from a team as code owners December 5, 2025 01:26
```python
namespace=bulk_backfill_tasks,
processing_deadline_duration=300,
silo_mode=SiloMode.REGION,
retry=Retry(times=3, delay=6),
```
Member:
Are there any exception types that you want to retry on other than TimeoutError?

Member Author (kcons):
None that I know about, but I probably want to retry Exception and deadline timeouts. I think that's what the @retry decorator is doing.

Member:
I think it will only retry TimeoutError unless you explicitly add Exception here
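The semantics of Sentry's `Retry` helper aren't shown in this thread, so as a generic illustration of the point being made: a retry wrapper typically only catches the exception types it is explicitly given, and anything outside that tuple propagates immediately. A minimal sketch (not Sentry's implementation):

```python
import time


def retry(times: int = 3, delay: float = 0, on=(TimeoutError,)):
    """Retry the wrapped function up to `times` extra attempts, but only
    for the exception types listed in `on`; others propagate at once."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            for attempt in range(times + 1):
                try:
                    return fn(*args, **kwargs)
                except on:
                    if attempt == times:
                        raise  # retries exhausted
                    time.sleep(delay)
        return wrapper
    return decorator
```

With `on=(TimeoutError,)`, a `ValueError` would not be retried; widening to `on=(Exception,)` is what makes "retry everything" explicit.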

@wedamija wedamija left a comment:
I don't know too much about this task, but is there any reason we can't use RangeQuerySetWrapper to iterate all projects and fire a task per project, or chunk of projects?

Similar to

```python
for until_escalating_groups in chunked(
    RangeQuerySetWrapper(
        Group.objects.filter(
            status=GroupStatus.IGNORED,
            substatus=GroupSubStatus.UNTIL_ESCALATING,
            project_id__in=project_ids,
            last_seen__gte=datetime.now(UTC) - timedelta(days=7),
        ),
        step=ITERATOR_CHUNK,
    ),
    ITERATOR_CHUNK,
):
    generate_and_save_forecasts(groups=until_escalating_groups)
```

@kcons kcons commented Dec 5, 2025

> I don't know too much about this task, but is there any reason we can't use RangeQuerySetWrapper to iterate all projects and fire a task per project, or chunk of projects?

Unless I'm misunderstanding the question, that is roughly what we're trying to set up here.
get_project_id_ranges_for_backfill is intended to be run from a Job to pick project ranges to trigger backfill_error_detector_groups with, and that task processes the detectors for this chunk of projects.
I was initially doing a task per project, but it's not too much harder to chunk, and chunking should let us schedule and process an order of magnitude fewer tasks.
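The order-of-magnitude argument is easy to make concrete. A pure-Python sketch (the `chunked` helper here is a stand-in for whatever utility the codebase uses; the numbers are illustrative):

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of up to `size` items from `iterable`."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk


# One task per chunk of 1000 projects instead of one per project cuts
# the number of scheduled tasks by three orders of magnitude.
project_ids = range(1, 10_001)  # stand-in for a Projects ID scan
tasks = [(c[0], c[-1]) for c in chunked(project_ids, 1000)]
```

Ten tasks instead of ten thousand, with each task's workload bounded by the chunk size.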

@wedamija wedamija commented Dec 5, 2025

> I don't know too much about this task, but is there any reason we can't use RangeQuerySetWrapper to iterate all projects and fire a task per project, or chunk of projects?

> Unless I'm misunderstanding the question, that is roughly what we're trying to set up here. get_project_id_ranges_for_backfill is intended to be run from a Job to pick project ranges to trigger backfill_error_detector_groups with, and that task processes the detectors for this chunk of projects. I was initially doing a task per project, but it's not too much harder to chunk, and chunking should let us schedule and process an order of magnitude fewer tasks.

Right, I was mostly wondering whether we need the custom SQL we have there, or whether we can follow the existing patterns we use elsewhere in the codebase. Generally, when I see raw SQL I want to avoid it if possible.

I don't mind too much whether we chunk or do individual tasks. We should be able to control the concurrency of the queue so it shouldn't be too much of a problem either way

@kcons kcons commented Dec 5, 2025

> Right, I was mostly wondering if we needed the custom sql that we have there, or can we follow the existing patterns we use elsewhere in the codebase? Just generally when I see raw sql I want to avoid it if possible.
>
> I don't mind too much whether we chunk or do individual tasks. We should be able to control the concurrency of the queue so it shouldn't be too much of a problem either way

Ah, I gotcha. Yeah, it's not really necessary; it just seemed like an efficient and easy way to chunk the ID space. I can drop the function and plan on having the job chunk in Python; I don't expect the perf difference to be meaningful.

@getsantry getsantry bot (Contributor) commented Dec 27, 2025

This pull request has gone three weeks without activity. In another week, I will close it.

But! If you comment or otherwise update it, I will reset the clock, and if you add the label WIP, I will leave it alone unless WIP is removed ... forever!


"A weed is but an unloved flower." ― Ella Wheeler Wilcox 🥀

@getsantry getsantry bot added the Stale label Dec 27, 2025

```python
existing_detector_groups_subquery = DetectorGroup.objects.filter(
    detector_id=detector_id, group_id=OuterRef("id")
)
```
Contributor:
Subquery filter mismatches unique constraint causing potential failures

The existing_detector_groups_subquery filters by both detector_id and group_id, but the DetectorGroup model has a unique constraint only on group. If a group has a DetectorGroup associated with a different detector (due to prior data inconsistency), this subquery won't exclude it. The subsequent get_or_create would then fail with an IntegrityError because the unique constraint prevents creating a second DetectorGroup for the same group. Removing the detector_id filter from the subquery would make the exclusion match the actual constraint.
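The mismatch described above can be shown without the ORM. In this toy model (hypothetical data; the real models are Django ORM classes), the database enforces uniqueness on the group alone, while the buggy exclusion checks the (detector, group) pair:

```python
# A prior, inconsistent row: group-1 is already linked to detector-B.
existing_rows = {("detector-B", "group-1")}

# The actual unique constraint is on group_id alone.
groups_with_a_row = {group for _, group in existing_rows}


def needs_backfill_buggy(detector: str, group: str) -> bool:
    """Mirrors the subquery: excludes only exact (detector, group) pairs,
    so group-1 slips through for detector-A and the insert would collide."""
    return (detector, group) not in existing_rows


def needs_backfill_fixed(detector: str, group: str) -> bool:
    """Mirrors the suggested fix: exclude any group that already has a
    DetectorGroup row, matching the real constraint."""
    return group not in groups_with_a_row
```

The buggy check would attempt a second row for group-1 and hit an IntegrityError; the fixed check skips it.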


@getsantry getsantry bot removed the Stale label Dec 31, 2025

```python
created_count = 0

for group in RangeQuerySetWrapper(groups_needing_detector_groups):
```
Member:
I have a feeling that this query might time out... It might be a good idea to get the sql (including the pagination and sorting that RangeQuerySetWrapper uses) and test it out on a project with a lot of groups to see how well it runs.

If it's really slow, then it might end up being better to skip the sub query and just iterate over all_unresolved_groups. I think you might also need to sort the RangeQuerySetWrapper by last_seen so that it uses the appropriate index.

```python
models.Index(fields=("project", "status", "type", "last_seen", "id")),
```

Comment on lines +53 to +56
```python
if created:
    detector_group.date_added = group.first_seen
    detector_group.save(update_fields=["date_added"])
    created_count += 1
```
Member:

These auto_now_add fields are kind of annoying for backfills...

@getsantry getsantry bot (Contributor) commented Jan 29, 2026

This pull request has gone three weeks without activity. In another week, I will close it.

But! If you comment or otherwise update it, I will reset the clock, and if you add the label WIP, I will leave it alone unless WIP is removed ... forever!


"A weed is but an unloved flower." ― Ella Wheeler Wilcox 🥀

@getsantry getsantry bot added the Stale label Jan 29, 2026