
Conversation

@saponifi3d
Contributor

Description

We select workflows from the DB very frequently. Even though the query itself is fast and efficient, the sheer volume has added substantial load to our DB.

This PR introduces a caching layer for this high-frequency DB query.
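A rough sketch of the read-through pattern described here (not the PR's actual implementation; the key format, TTL, and query filter below are assumptions, while processing_workflow_cache_key and CACHE_TTL mirror names that appear in the diff further down):

from django.core.cache import cache
from django.db.models import Q

from sentry.workflow_engine.models import Workflow

CACHE_TTL = 60  # assumed TTL in seconds; the PR defines its own value


def processing_workflow_cache_key(detector_id: int, env_id: int | None) -> str:
    # Hypothetical key format; the PR's helper may build keys differently.
    return f"workflow_engine:processing_workflows:{detector_id}:{env_id}"


def get_processing_workflows(detector, environment) -> set[Workflow]:
    # Read-through cache: check the cache first, fall back to the DB query,
    # then populate the cache so subsequent lookups skip the query entirely.
    env_id = environment.id if environment is not None else None
    cache_key = processing_workflow_cache_key(detector.id, env_id)

    workflows = cache.get(cache_key)
    if workflows is None:
        # Illustrative filter only; the PR's real query (shown in the review
        # comments below) may join through additional tables.
        environment_filter = Q(environment_id=None) | Q(environment_id=env_id)
        workflows = set(
            Workflow.objects.filter(
                environment_filter,
                detectorworkflow__detector_id=detector.id,
            ).distinct()
        )
        cache.set(cache_key, workflows, timeout=CACHE_TTL)
    return workflows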

…che if a relationship to the detector or environment changes.
@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Jan 23, 2026

if workflows is None:
environment_filter = (
(Q(environment_id=None) | Q(environment_id=environment.id))
Contributor Author


🤔 -- this line from the original query will likely make our cache size a lot larger than needed. we could normalize it by caching the "env filter = None" results as one entry and the per-environment results as another.

Would like to see the impact of this / the size of the cache before diving into changing functionality of this query though.
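A sketch of what that normalization could look like, reusing processing_workflow_cache_key from the PR; _query_workflows is a hypothetical stand-in for the existing query, split by a single environment value:

from django.core.cache import cache

from sentry.workflow_engine.models import Workflow

CACHE_TTL = 60  # assumed


def _query_workflows(detector, environment_id):
    # Stand-in for the existing query, restricted to one environment value
    # (None meaning "the workflow has no environment filter").
    return set(
        Workflow.objects.filter(
            detectorworkflow__detector_id=detector.id,
            environment_id=environment_id,
        ).distinct()
    )


def get_processing_workflows(detector, environment):
    # The env-less workflows are cached once per detector; the
    # environment-specific ones once per (detector, environment) pair.
    base_key = processing_workflow_cache_key(detector.id, None)
    workflows = cache.get(base_key)
    if workflows is None:
        workflows = _query_workflows(detector, environment_id=None)
        cache.set(base_key, workflows, timeout=CACHE_TTL)

    if environment is not None:
        env_key = processing_workflow_cache_key(detector.id, environment.id)
        env_workflows = cache.get(env_key)
        if env_workflows is None:
            env_workflows = _query_workflows(detector, environment_id=environment.id)
            cache.set(env_key, env_workflows, timeout=CACHE_TTL)
        workflows = workflows | env_workflows

    return workflows

The trade-off is two cache round trips per lookup for environment-scoped calls, in exchange for the env-less result being shared across environments.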

.distinct()
)

cache.set(cache_key, workflows, timeout=CACHE_TTL)
Member


I'm curious what our observability of caching is here.
I know that in traces one type of cache doesn't show up (is this the Django cache, or only sometimes?), and that's been a bit of a pain for debugging.

Also, it'd be nice if we could have counters for hit/miss so we can brag about how many queries we're avoiding.
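For the hit/miss idea, a minimal sketch assuming sentry.utils.metrics is used for counters; the metric name is made up:

from django.core.cache import cache

from sentry.utils import metrics


def get_cached_workflows(cache_key: str):
    # One counter tagged by outcome, so hit rate (and the number of queries
    # the cache is absorbing) can be graphed directly.
    workflows = cache.get(cache_key)
    metrics.incr(
        "workflow_engine.processing_workflow_cache",
        tags={"result": "hit" if workflows is not None else "miss"},
    )
    return workflows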

Contributor Author

@saponifi3d Jan 29, 2026


yeah, i kinda purposefully was avoiding obs / counters thus far. 😅

did you have any specific obs in mind? i'm thinking a metric for cache hit / miss / invalidation.

🤔 maybe debug logs for cache miss and when we invalidate? (thinking a stack trace might be handy with signals. could at least see which models are causing invalidations etc)
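One possible shape for the invalidation side of that, assuming a helper like this gets called from the signal receivers; the metric key and logger name are hypothetical, and stack_info=True is what would surface which code path fired the signal:

import logging

from sentry.utils import metrics

logger = logging.getLogger("sentry.workflow_engine")


def record_cache_invalidation(sender, instance, **kwargs):
    # Counter plus a debug log that records which model triggered the
    # invalidation; stack_info=True attaches the call stack to the log entry.
    metrics.incr(
        "workflow_engine.processing_workflow_cache",
        tags={"result": "invalidate", "model": sender.__name__},
    )
    logger.debug(
        "processing_workflow_cache.invalidated",
        extra={"model": sender.__name__, "instance_id": instance.pk},
        stack_info=True,
    )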

This method uses a read-through cache, and returns which workflows to evaluate.
"""
env_id = environment.id if environment is not None else None
cache_key = processing_workflow_cache_key(detector.id, env_id)
Member


if you like barely justified abstractions, we have a CacheAccess[T] thing.
The idea is that you define a subclass like

class _ProcessingWorkflowCacheAccess(CacheAccess[set[Workflow]]):
    def __init__(self, ..., ttl=DEFAULT_TTL) -> None:
        # verify params, save key

    def key(self) -> str:
        return self._key

...
cache_access = _ProcessingWorkflowCacheAccess(detector, environment)

workflows = cache_access.get()
...

cache_access.set(workflows)

Not game changing, but this was after we were using the wrong key in one place and had some wrong type assumptions about cached values, so it seemed appropriate to try for an abstraction that ensures consistent key use and type safety.

(it doesn't have delete, but it should).
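For reference, a fuller (still hypothetical) version of that sketch; the base class is written out here only to make the example self-contained, since the real CacheAccess[T] lives elsewhere in the codebase and may differ, and processing_workflow_cache_key is the PR's key helper:

from typing import Generic, TypeVar

from django.core.cache import cache

from sentry.workflow_engine.models import Workflow

T = TypeVar("T")

DEFAULT_TTL = 60  # assumed


class CacheAccess(Generic[T]):
    # Stand-in for the existing base class mentioned above.
    def __init__(self, ttl: int = DEFAULT_TTL) -> None:
        self._ttl = ttl

    def key(self) -> str:
        raise NotImplementedError

    def get(self) -> T | None:
        return cache.get(self.key())

    def set(self, value: T) -> None:
        cache.set(self.key(), value, timeout=self._ttl)

    def delete(self) -> None:
        # the missing operation noted above
        cache.delete(self.key())


class _ProcessingWorkflowCacheAccess(CacheAccess[set[Workflow]]):
    def __init__(self, detector, environment, ttl: int = DEFAULT_TTL) -> None:
        super().__init__(ttl=ttl)
        env_id = environment.id if environment is not None else None
        self._key = processing_workflow_cache_key(detector.id, env_id)

    def key(self) -> str:
        return self._key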

Contributor Author

@saponifi3d Jan 29, 2026


👍 -- i like it. i was thinking of something similar tbh 🤣 i always fear text-based keys.

from sentry.workflow_engine.models import Detector


@receiver(post_save, sender=Detector)
Contributor


Q: why did we end up going with post_save signals on Detector? Is it because of the lack of SOPA?
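For context, the diff only shows the decorator line; a minimal sketch of what a receiver like this typically does, assuming it just drops the detector's cached entries (the key handling here is a guess, not a copy of the PR's handler, and processing_workflow_cache_key is the PR's key helper):

from django.core.cache import cache
from django.db.models.signals import post_save
from django.dispatch import receiver

from sentry.workflow_engine.models import Detector


@receiver(post_save, sender=Detector)
def invalidate_on_detector_save(sender, instance, **kwargs):
    # Any change to the detector invalidates its cached workflow sets so the
    # next read repopulates from the DB. Only the env-less key is shown;
    # environment-scoped keys need enumeration or pattern deletion.
    cache.delete(processing_workflow_cache_key(instance.id, None))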

from sentry.workflow_engine.models import DetectorWorkflow


@receiver(post_migrate, sender=DetectorWorkflow)
Contributor


Q: why are these signals on post_migrate and pre_save for invalidation?

Comment on lines +26 to +27
if detector_id is None:
detector_id = "*"
Contributor


Q: This part is prob still in progress, but I'm confused about why we're putting a wildcard here, since it doesn't look like we ever set one during the initial cache population. Also, we should prob add a comment or separate out the function so that both detector_id and env_id being None means "clear everything".

I guess, is it possible for only one of the params to be None, and if so, when would that happen?
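One way to make the "both None means clear everything" case explicit instead of leaning on a "*" sentinel key -- a hypothetical helper, assuming a redis-backed cache where pattern deletion is available (e.g. django-redis's delete_pattern); the key format is made up:

from django.core.cache import cache

KEY_PREFIX = "workflow_engine:processing_workflows"  # hypothetical prefix


def processing_workflow_cache_key(detector_id, env_id) -> str:
    return f"{KEY_PREFIX}:{detector_id}:{env_id}"


def invalidate_processing_workflow_cache(detector_id=None, env_id=None) -> None:
    if detector_id is None and env_id is None:
        # Explicit "clear everything" path.
        cache.delete_pattern(f"{KEY_PREFIX}:*")
    elif detector_id is not None and env_id is None:
        # All environments for a single detector.
        cache.delete_pattern(f"{KEY_PREFIX}:{detector_id}:*")
    elif detector_id is not None:
        cache.delete(processing_workflow_cache_key(detector_id, env_id))
    else:
        # env_id without a detector_id has no obvious meaning; fail loudly.
        raise ValueError("env_id requires a detector_id")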

