fix(workflow_engine): Add a cache for Workflows to reduce DB load #106925
base: master
Conversation
…che if a relationship to the detector or environment changes.
```python
if workflows is None:
    environment_filter = (
        (Q(environment_id=None) | Q(environment_id=environment.id))
```
🤔 -- this line from the original query will likely make our cache a lot larger than it needs to be. We could normalize it by caching the "environment = None" results as one entry and the per-environment results as another.
I'd like to see the impact / size of the cache before diving into changing the functionality of this query, though.
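A rough sketch of that normalization idea, for discussion only (the key format, TTL, and filter fields are assumptions, not the PR's actual helpers):

```python
from django.core.cache import cache

from sentry.workflow_engine.models import Workflow

CACHE_TTL = 60  # assumed TTL, in seconds


def _cached_workflows(detector_id: int, environment_id: int | None) -> set[Workflow]:
    # One cache entry for the environment-agnostic workflows...
    base_key = f"workflow_engine:processing_workflows:{detector_id}:none"
    base = cache.get(base_key)
    if base is None:
        base = set(
            Workflow.objects.filter(
                detectorworkflow__detector_id=detector_id, environment_id=None
            )
        )
        cache.set(base_key, base, timeout=CACHE_TTL)

    if environment_id is None:
        return base

    # ...and a separate entry per environment, so the env=None results are not
    # duplicated into every per-environment cache value.
    env_key = f"workflow_engine:processing_workflows:{detector_id}:{environment_id}"
    env_specific = cache.get(env_key)
    if env_specific is None:
        env_specific = set(
            Workflow.objects.filter(
                detectorworkflow__detector_id=detector_id,
                environment_id=environment_id,
            )
        )
        cache.set(env_key, env_specific, timeout=CACHE_TTL)

    return base | env_specific
```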
… processing on changes
```python
    .distinct()
)

cache.set(cache_key, workflows, timeout=CACHE_TTL)
```
I'm curious what our observability of caching is here.
I know in traces one type of cache (django? is this django cache or only sometimes?) doesn't show up, and that's been a bit of a pain for debugging.
Also, it'd be nice if we could have counters for hit/miss so we can brag about how many queries we're avoiding.
yeah, i kinda purposefully was avoiding obs / counters thus far. 😅
did you have any specific obs in mind? i'm thinking a metric for cache hit / miss / invalidation.
🤔 maybe debug logs for cache miss and when we invalidate? (thinking a stack trace might be handy with signals. could at least see which models are causing invalidations etc)
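Something along these lines could cover hit / miss / invalidation (a sketch only; it assumes the usual `sentry.utils.metrics` helper and a module logger, and the metric and log key names are made up):

```python
import logging

from sentry.utils import metrics

logger = logging.getLogger(__name__)


def _record_cache_result(cache_key: str, hit: bool) -> None:
    # Counter for hit/miss so we can quantify how many queries the cache avoids.
    metrics.incr(
        "workflow_engine.processing_workflows.cache",
        tags={"result": "hit" if hit else "miss"},
    )
    if not hit:
        logger.debug(
            "workflow_engine.processing_workflows.cache_miss",
            extra={"cache_key": cache_key},
        )


def _record_cache_invalidation(cache_key: str, model_name: str) -> None:
    # Tagging by model makes it visible which signal senders drive invalidations.
    metrics.incr(
        "workflow_engine.processing_workflows.cache_invalidation",
        tags={"model": model_name},
    )
    logger.debug(
        "workflow_engine.processing_workflows.cache_invalidated",
        extra={"cache_key": cache_key, "model": model_name},
    )
```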
```python
    This method uses a read-through cache, and returns which workflows to evaluate.
    """
    env_id = environment.id if environment is not None else None
    cache_key = processing_workflow_cache_key(detector.id, env_id)
```
If you like barely justified abstractions, we have a CacheAccess[T] thing.
The idea is that you define a subclass like

```python
class _ProcessingWorkflowCacheAccess(CacheAccess[set[Workflow]]):
    def __init__(self, ..., ttl=DEFAULT_TTL) -> None:
        # verify params, save key

    def key(self) -> str:
        return self._key

...
cache_access = _ProcessingWorkflowCacheAccess(detector, environment)
workflows = cache_access.get()
..
cache_access.set(workflows)
```

Not game-changing, but this came after we used the wrong key in one place and had some wrong type assumptions about cached values, so it seemed appropriate to try an abstraction that ensures consistent key use and type safety.
(it doesn't have delete, but it should).
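For illustration only, a generic base along those lines (including the missing delete) might look something like the following; this is a hypothetical sketch, not Sentry's actual CacheAccess implementation:

```python
from typing import Generic, TypeVar

from django.core.cache import cache

T = TypeVar("T")


class CacheAccess(Generic[T]):
    """Pairs a single cache key with a single value type, so callers can't mix up
    keys or make wrong type assumptions about cached values."""

    def __init__(self, key: str, ttl: int) -> None:
        self._key = key
        self._ttl = ttl

    def key(self) -> str:
        return self._key

    def get(self) -> T | None:
        # A miss is represented as None; a sentinel would be needed if None
        # were ever a legitimate cached value.
        return cache.get(self._key)

    def set(self, value: T) -> None:
        cache.set(self._key, value, timeout=self._ttl)

    def delete(self) -> None:
        cache.delete(self._key)
```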
👍 -- i like it. i was thinking of something similar tbh 🤣 i always fear text-based keys.
```python
from sentry.workflow_engine.models import Detector


@receiver(post_save, sender=Detector)
```
Q: why did we end up going with post_save signals on Detector? Is it because of the lack of SOPA?
```python
from sentry.workflow_engine.models import DetectorWorkflow


@receiver(post_migrate, sender=DetectorWorkflow)
```
Q: why are these signals on post_migrate and pre_save for invalidation?
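For reference (not the PR's code), signal-based invalidation is more commonly hung off post_save / post_delete on the relationship model, roughly as sketched below; the cache-key helper, key format, and the `detector_id` field access are assumptions:

```python
from django.core.cache import cache
from django.db.models.signals import post_delete, post_save
from django.dispatch import receiver

from sentry.workflow_engine.models import DetectorWorkflow


def processing_workflow_cache_key(detector_id: int, environment_id: int | None) -> str:
    # Hypothetical key format; the PR's actual helper may differ.
    return f"workflow_engine:processing_workflows:{detector_id}:{environment_id}"


@receiver(post_save, sender=DetectorWorkflow)
@receiver(post_delete, sender=DetectorWorkflow)
def invalidate_processing_workflow_cache(sender, instance, **kwargs):
    # Any change to the Detector <-> Workflow relationship invalidates the cached,
    # environment-agnostic workflow set for that detector. Per-environment keys
    # would need to be enumerated (or matched via a wildcard scheme, as the PR
    # does) to be cleared as well.
    cache.delete(processing_workflow_cache_key(instance.detector_id, None))
```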
```python
if detector_id is None:
    detector_id = "*"
```
Q: This part is probably still in progress, but I'm confused about why we're putting a wildcard here, since it doesn't look like we ever set one during the initial cache population. Also, we should probably add a comment, or separate out the function, so that both detector_id and env_id being None means "clear everything".
I guess, is it possible for only one of the params to be None, and if so, when would that happen?
Description
We select workflows from the DB very frequently. This has added substantial load to our DB, even though the query itself is fast and efficient.
This PR introduces a caching layer for this high-frequency DB query.
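As a rough illustration of the read-through pattern described above (the function and key names, TTL, and query details are assumptions based on the visible diff, not the PR's exact code):

```python
from django.core.cache import cache
from django.db.models import Q

from sentry.workflow_engine.models import Workflow

CACHE_TTL = 60  # assumed TTL, in seconds


def processing_workflow_cache_key(detector_id: int, environment_id: int | None) -> str:
    return f"workflow_engine:processing_workflows:{detector_id}:{environment_id}"


def get_processing_workflows(detector, environment) -> set[Workflow]:
    """Read-through cache: return the cached workflow set, or query and cache it."""
    env_id = environment.id if environment is not None else None
    cache_key = processing_workflow_cache_key(detector.id, env_id)

    workflows = cache.get(cache_key)
    if workflows is None:
        workflows = set(
            Workflow.objects.filter(
                Q(environment_id=None) | Q(environment_id=env_id),
                detectorworkflow__detector_id=detector.id,
            ).distinct()
        )
        cache.set(cache_key, workflows, timeout=CACHE_TTL)

    return workflows
```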