feat(aci): Add error DetectorGroup chunked backfill task and method #104377
base: master
Conversation
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##           master   #104377     +/-   ##
===========================================
- Coverage   80.57%    80.40%   -0.17%
===========================================
  Files        9345      9345
  Lines      399518    401299    +1781
  Branches    25600     25600
===========================================
+ Hits       321894    322664     +770
- Misses      77171     78182    +1011
  Partials      453       453
    namespace=bulk_backfill_tasks,
    processing_deadline_duration=300,
    silo_mode=SiloMode.REGION,
    retry=Retry(times=3, delay=6),
Are there any exception types that you want to retry on other than TimeoutError?
None that I know about, but I probably want to retry Exception and deadline timeouts. I think that's what the @retry decorator is doing.
I think it will only retry TimeoutError unless you explicitly add Exception here
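For reference, making that explicit might look like the sketch below. This assumes the taskworker Retry accepts an `on` tuple of exception types; it is an illustration, not the PR's code.

from sentry.taskworker.retry import Retry

# Sketch: retry on any exception, not only processing-deadline timeouts.
# Assumes Retry accepts an `on` tuple of exception types.
retry = Retry(
    times=3,
    delay=6,
    on=(Exception,),
)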
wedamija left a comment:
I don't know too much about this task, but is there any reason we can't use RangeQuerySetWrapper to iterate all projects and fire a task per project, or chunk of projects?
Similar to sentry/src/sentry/tasks/weekly_escalating_forecast.py, lines 62 to 74 in f8a9b05:
for until_escalating_groups in chunked(
    RangeQuerySetWrapper(
        Group.objects.filter(
            status=GroupStatus.IGNORED,
            substatus=GroupSubStatus.UNTIL_ESCALATING,
            project_id__in=project_ids,
            last_seen__gte=datetime.now(UTC) - timedelta(days=7),
        ),
        step=ITERATOR_CHUNK,
    ),
    ITERATOR_CHUNK,
):
    generate_and_save_forecasts(groups=until_escalating_groups)
Unless I'm misunderstanding the question, that is roughly what we're trying to set up here.
Right, I was mostly wondering whether we need the custom SQL we have there, or whether we can follow the existing patterns we use elsewhere in the codebase. Generally, when I see raw SQL I want to avoid it if possible. I don't mind too much whether we chunk or do individual tasks; we should be able to control the concurrency of the queue, so it shouldn't be much of a problem either way.
Ah, I gotcha. Yeah, it's not really necessary; it just seemed like an efficient and easy way to chunk the id space. I can drop the function and plan on having the job chunk in Python; I don't expect the perf difference to be meaningful.
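For illustration, chunking in Python along the lines suggested above might look something like this sketch. The task name backfill_detector_groups_for_projects and the chunk size are hypothetical; the pattern mirrors the weekly_escalating_forecast snippet quoted earlier.

from sentry.models.project import Project
from sentry.utils.iterators import chunked
from sentry.utils.query import RangeQuerySetWrapper

PROJECT_CHUNK = 1_000  # hypothetical chunk size

def dispatch_backfill() -> None:
    # Walk every project id in pk order and fire one backfill task per chunk.
    for project_ids in chunked(
        RangeQuerySetWrapper(
            Project.objects.values_list("id", flat=True),
            step=PROJECT_CHUNK,
            result_value_getter=lambda item: item,
        ),
        PROJECT_CHUNK,
    ):
        # Hypothetical task; stands in for whatever the PR registers.
        backfill_detector_groups_for_projects.delay(project_ids=project_ids)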
existing_detector_groups_subquery = DetectorGroup.objects.filter(
    detector_id=detector_id, group_id=OuterRef("id")
)
Subquery filter mismatches the unique constraint, causing potential failures
The existing_detector_groups_subquery filters by both detector_id and group_id, but the DetectorGroup model has a unique constraint only on group. If a group has a DetectorGroup associated with a different detector (due to prior data inconsistency), this subquery won't exclude it. The subsequent get_or_create would then fail with an IntegrityError because the unique constraint prevents creating a second DetectorGroup for the same group. Removing the detector_id filter from the subquery would make the exclusion match the actual constraint.
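A sketch of that suggested fix, reusing the names from the snippet above — exclude on the group alone so the subquery mirrors the unique constraint:

from django.db.models import Exists, OuterRef

# Sketch: the unique constraint is on group alone, so exclude any group that
# already has a DetectorGroup row, regardless of which detector it points at.
existing_detector_groups_subquery = DetectorGroup.objects.filter(
    group_id=OuterRef("id")
)
groups_needing_detector_groups = all_unresolved_groups.exclude(
    Exists(existing_detector_groups_subquery)
)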
created_count = 0

for group in RangeQuerySetWrapper(groups_needing_detector_groups):
I have a feeling that this query might time out... It might be a good idea to get the SQL (including the pagination and sorting that RangeQuerySetWrapper uses) and test it out on a project with a lot of groups to see how well it runs.
If it's really slow, then it might be better to skip the subquery and just iterate over all_unresolved_groups. I think you might also need to sort the RangeQuerySetWrapper by last_seen so that it uses the appropriate index:
sentry/src/sentry/models/group.py, line 687 in c291f44:

models.Index(fields=("project", "status", "type", "last_seen", "id")),
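For example — assuming RangeQuerySetWrapper's order_by and override_unique_safety_check parameters behave as they do elsewhere in the codebase, a sketch of that ordering:

# Sketch: iterate in last_seen order so Postgres can walk the
# (project, status, type, last_seen, id) index referenced above.
for group in RangeQuerySetWrapper(
    groups_needing_detector_groups,
    order_by="last_seen",
    override_unique_safety_check=True,  # last_seen is not unique
):
    ...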
if created:
    detector_group.date_added = group.first_seen
    detector_group.save(update_fields=["date_added"])
    created_count += 1
These auto_now_add fields are kind of annoying for backfills...
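To spell out the annoyance: auto_now_add stamps the insert time and silently discards any value supplied at create time, which is why the two-step save above is needed. A minimal illustration, using the names from the snippets above:

# auto_now_add overrides the value supplied on insert, so this does NOT
# backdate the row; date_added still gets the current time:
DetectorGroup.objects.create(
    detector_id=detector_id, group=group, date_added=group.first_seen
)

# Hence the pattern above: create first, then overwrite the stamp in place.
detector_group.date_added = group.first_seen
detector_group.save(update_fields=["date_added"])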
The existing DetectorGroup backfill job is impractically slow.
This adds a function (intended to be triggered by a job) that produces roughly equal ranges of IDs in the Projects table; each range is then used to trigger a new task that backfills the projects within it.
This distributes all of the slow work into chunks whose size we control, and the processing pool that executes them can be dialed up gradually as we gain confidence in correctness and capacity cost.
The expectation is that this lets the backfill finish completely in a day or so, without blocking other jobs or requiring hand-holding.
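As a sketch of the intended shape (function and task names here are hypothetical): compute the Project table's id bounds, slice them into fixed-width ranges, and fire one task per range.

from collections.abc import Iterator

from django.db.models import Max, Min

from sentry.models.project import Project

def iter_project_id_ranges(width: int) -> Iterator[tuple[int, int]]:
    # Yield (lo, hi) id ranges of roughly equal width covering the Project
    # table. Ranges may be sparse where ids have gaps; each task simply
    # filters project ids within its range, so sparsity is harmless.
    bounds = Project.objects.aggregate(lo=Min("id"), hi=Max("id"))
    if bounds["lo"] is None:
        return
    for start in range(bounds["lo"], bounds["hi"] + 1, width):
        yield start, min(start + width - 1, bounds["hi"])

def schedule_backfill(width: int = 10_000) -> None:
    for lo, hi in iter_project_id_ranges(width):
        # Hypothetical task name standing in for the one the PR registers.
        backfill_detector_groups_for_project_range.delay(lo, hi)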