
Conversation

@kcons kcons (Member) commented Dec 4, 2025

The existing DetectorGroup backfill job is impractically slow.
This adds a function (intended to be triggered by a job) that produces roughly equal ranges of IDs in the Projects table; those ranges will then be used to trigger a new task that backfills the projects in each range.

This distributes all of the slow bits into chunks we can control the size of, and the processing pool used to execute them can be gradually dialed up as we gain confidence in correctness and capacity cost.
The expectation is that this should allow backfill to finish completely in a day or so without blocking any jobs or hand-holding.
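The range-producing function described above can be sketched in pure Python. This is only an illustration of the chunking idea, not the PR's actual implementation (which queried the Projects table directly, and was later planned to chunk in Python anyway); the function name and signature here are hypothetical.

```python
def get_project_id_ranges(min_id: int, max_id: int, chunk_size: int):
    """Yield inclusive (start, end) ID ranges of roughly equal size
    covering [min_id, max_id]. Each range becomes one backfill task."""
    start = min_id
    while start <= max_id:
        end = min(start + chunk_size - 1, max_id)
        yield (start, end)
        start = end + 1
```

Each yielded range maps to one task, so the chunk size directly controls how many tasks land on the processing pool.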

@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Dec 4, 2025
@codecov codecov bot commented Dec 4, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.

Additional details and impacted files

```
@@             Coverage Diff             @@
##           master   #104377      +/-   ##
===========================================
- Coverage   80.57%    80.40%   -0.17%
===========================================
  Files        9345      9345
  Lines      399518    401299    +1781
  Branches    25600     25600
===========================================
+ Hits       321894    322664     +770
- Misses      77171     78182    +1011
  Partials      453       453
```

@kcons kcons marked this pull request as ready for review December 5, 2025 01:26
@kcons kcons requested review from a team as code owners December 5, 2025 01:26
```python
namespace=bulk_backfill_tasks,
processing_deadline_duration=300,
silo_mode=SiloMode.REGION,
retry=Retry(times=3, delay=6),
```
Member:
Are there any exception types that you want to retry on other than TimeoutError?

Member Author (kcons):
None that I know about, but I probably want to retry Exception and deadline timeouts. I think that's what the @retry decorator is doing.

Member:
I think it will only retry TimeoutError unless you explicitly add Exception here
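The semantics of Sentry's `Retry` helper aren't shown in this thread, so as a generic illustration of the point being made: a retry wrapper typically only catches the exception types it is explicitly given, and anything outside that tuple propagates immediately. A minimal sketch (not Sentry's implementation):

```python
import time


def retry(times: int = 3, delay: float = 0, on=(TimeoutError,)):
    """Retry the wrapped function up to `times` extra attempts, but only
    for the exception types listed in `on`; others propagate at once."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            for attempt in range(times + 1):
                try:
                    return fn(*args, **kwargs)
                except on:
                    if attempt == times:
                        raise  # retries exhausted
                    time.sleep(delay)
        return wrapper
    return decorator
```

With `on=(TimeoutError,)`, a `ValueError` would not be retried; widening to `on=(Exception,)` is what makes "retry everything" explicit.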

@wedamija wedamija left a comment:
I don't know too much about this task, but is there any reason we can't use RangeQuerySetWrapper to iterate all projects and fire a task per project, or chunk of projects?

Similar to

```python
for until_escalating_groups in chunked(
    RangeQuerySetWrapper(
        Group.objects.filter(
            status=GroupStatus.IGNORED,
            substatus=GroupSubStatus.UNTIL_ESCALATING,
            project_id__in=project_ids,
            last_seen__gte=datetime.now(UTC) - timedelta(days=7),
        ),
        step=ITERATOR_CHUNK,
    ),
    ITERATOR_CHUNK,
):
    generate_and_save_forecasts(groups=until_escalating_groups)
```

@kcons kcons commented Dec 5, 2025

> I don't know too much about this task, but is there any reason we can't use RangeQuerySetWrapper to iterate all projects and fire a task per project, or chunk of projects?

Unless I'm misunderstanding the question, that is roughly what we're trying to set up here.
get_project_id_ranges_for_backfill is intended to be run from a Job to pick project ranges to trigger backfill_error_detector_groups with, and that task processes the detectors for this chunk of projects.
I was initially doing a task per project, but it's not too much harder to chunk, and chunking should let us schedule and process an order of magnitude fewer tasks.
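The order-of-magnitude argument is easy to make concrete. A pure-Python sketch (the `chunked` helper here is a stand-in for whatever utility the codebase uses; the numbers are illustrative):

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of up to `size` items from `iterable`."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk


# One task per chunk of 1000 projects instead of one per project cuts
# the number of scheduled tasks by three orders of magnitude.
project_ids = range(1, 10_001)  # stand-in for a Projects ID scan
tasks = [(c[0], c[-1]) for c in chunked(project_ids, 1000)]
```

Ten tasks instead of ten thousand, with each task's workload bounded by the chunk size.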

@wedamija wedamija commented Dec 5, 2025

> I don't know too much about this task, but is there any reason we can't use RangeQuerySetWrapper to iterate all projects and fire a task per project, or chunk of projects?

> Unless I'm misunderstanding the question, that is roughly what we're trying to set up here. get_project_id_ranges_for_backfill is intended to be run from a Job to pick project ranges to trigger backfill_error_detector_groups with, and that task processes the detectors for this chunk of projects. I was initially doing a task per project, but it's not too much harder to chunk, and chunking should let us schedule and process an order of magnitude fewer tasks.

Right, I was mostly wondering whether we need the custom SQL we have there, or whether we can follow the existing patterns we use elsewhere in the codebase. Generally, when I see raw SQL I want to avoid it if possible.

I don't mind too much whether we chunk or do individual tasks. We should be able to control the concurrency of the queue so it shouldn't be too much of a problem either way

@kcons kcons commented Dec 5, 2025

> Right, I was mostly wondering if we needed the custom sql that we have there, or can we follow the existing patterns we use elsewhere in the codebase? Just generally when I see raw sql I want to avoid it if possible.
>
> I don't mind too much whether we chunk or do individual tasks. We should be able to control the concurrency of the queue so it shouldn't be too much of a problem either way

Ah, I gotcha. Yeah, it's not really necessary; it just seemed like an efficient and easy way to chunk the ID space. I can drop the function and plan on having the job chunk in Python; I don't expect the perf difference to be meaningful.

@getsantry getsantry bot (Contributor) commented Dec 27, 2025

This pull request has gone three weeks without activity. In another week, I will close it.

But! If you comment or otherwise update it, I will reset the clock, and if you add the label WIP, I will leave it alone unless WIP is removed ... forever!


"A weed is but an unloved flower." ― Ella Wheeler Wilcox 🥀

@getsantry getsantry bot added the Stale label Dec 27, 2025

```python
existing_detector_groups_subquery = DetectorGroup.objects.filter(
    detector_id=detector_id, group_id=OuterRef("id")
)
```
Contributor:
Subquery filter mismatches unique constraint causing potential failures

The existing_detector_groups_subquery filters by both detector_id and group_id, but the DetectorGroup model has a unique constraint only on group. If a group has a DetectorGroup associated with a different detector (due to prior data inconsistency), this subquery won't exclude it. The subsequent get_or_create would then fail with an IntegrityError because the unique constraint prevents creating a second DetectorGroup for the same group. Removing the detector_id filter from the subquery would make the exclusion match the actual constraint.
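The mismatch described above can be shown without the ORM. In this toy model (hypothetical data; the real models are Django ORM classes), the database enforces uniqueness on the group alone, while the buggy exclusion checks the (detector, group) pair:

```python
# A prior, inconsistent row: group-1 is already linked to detector-B.
existing_rows = {("detector-B", "group-1")}

# The actual unique constraint is on group_id alone.
groups_with_a_row = {group for _, group in existing_rows}


def needs_backfill_buggy(detector: str, group: str) -> bool:
    """Mirrors the subquery: excludes only exact (detector, group) pairs,
    so group-1 slips through for detector-A and the insert would collide."""
    return (detector, group) not in existing_rows


def needs_backfill_fixed(detector: str, group: str) -> bool:
    """Mirrors the suggested fix: exclude any group that already has a
    DetectorGroup row, matching the real constraint."""
    return group not in groups_with_a_row
```

The buggy check would attempt a second row for group-1 and hit an IntegrityError; the fixed check skips it.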


@getsantry getsantry bot removed the Stale label Dec 31, 2025

```python
created_count = 0

for group in RangeQuerySetWrapper(groups_needing_detector_groups):
```
Member:
I have a feeling that this query might time out... It might be a good idea to get the sql (including the pagination and sorting that RangeQuerySetWrapper uses) and test it out on a project with a lot of groups to see how well it runs.

If it's really slow, then it might end up being better to skip the sub query and just iterate over all_unresolved_groups. I think you might also need to sort the RangeQuerySetWrapper by last_seen so that it uses the appropriate index.

```python
models.Index(fields=("project", "status", "type", "last_seen", "id")),
```

Comment on lines +53 to +56
```python
if created:
    detector_group.date_added = group.first_seen
    detector_group.save(update_fields=["date_added"])
    created_count += 1
```
Member:

These auto_now_add fields are kind of annoying for backfills...

@getsantry getsantry bot (Contributor) commented Jan 29, 2026

This pull request has gone three weeks without activity. In another week, I will close it.

But! If you comment or otherwise update it, I will reset the clock, and if you add the label WIP, I will leave it alone unless WIP is removed ... forever!


"A weed is but an unloved flower." ― Ella Wheeler Wilcox 🥀

@getsantry getsantry bot added the Stale label Jan 29, 2026