Stop long-running scheduled procedures from starving scheduled reducers by Leonardo-Rocha · Pull Request #5224 · clockworklabs/SpacetimeDB

Leonardo-Rocha · 2026-06-04T14:43:38Z

[Assisted by Claude Opus 4.8 (1M)] Disclaimer: I'm not well versed in the codebase and just wanted to give it a spin to figure out whether the problem was easy to solve. Feel free to disregard the PR if the claims don't make sense.

Description of Changes

Stop long-running scheduled procedures from starving scheduled reducers

The scheduler actor (SchedulerActor::handle_queued) awaited every scheduled function to completion before pulling the next due item from its DelayQueue. A scheduled #[procedure] that runs for a long time (e.g. one that calls ctx.sleep_until in a loop) therefore parked the actor and prevented every other due scheduled function -- reducers and procedures alike -- from being dispatched for as long as the procedure was alive. No error was logged; the reducer's schedule row simply never fired.

Procedures already execute on their own pooled instances (call_pooled), separate from the main reducer executor, so awaiting them inline in the scheduler bought nothing but head-of-line blocking. Dispatch scheduled procedures on their own tokio::spawned task and let the actor loop keep draining the queue. Interval-scheduled procedures route their reschedule back to the actor through a new SchedulerMessage::Reschedule, since the spawned task cannot touch the actor-owned queue/key_map.
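To illustrate the reschedule routing, here is a minimal std-only sketch, not the actual scheduler code. Threads stand in for tokio::spawned tasks and an mpsc channel stands in for the actor's mailbox. `Msg`, `run_actor`, and the `HashMap` used as the queue are hypothetical names for illustration only.

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Messages the actor accepts (a reduced, hypothetical stand-in for `SchedulerMessage`).
enum Msg {
    Reschedule { id: u64, at_ms: u64 },
}

/// Runs one "procedure" per id on its own thread and collects the reschedule
/// requests they send back. Only this function ever touches `queue`.
fn run_actor(ids: &[u64], interval_ms: u64) -> HashMap<u64, u64> {
    let (tx, rx) = mpsc::channel::<Msg>();
    let mut queue: HashMap<u64, u64> = HashMap::new(); // schedule id -> next due time

    for &id in ids {
        let tx = tx.clone();
        thread::spawn(move || {
            thread::sleep(Duration::from_millis(10)); // stand-in for procedure work
            // The worker cannot reach `queue`; it asks the actor to reschedule instead.
            let _ = tx.send(Msg::Reschedule { id, at_ms: interval_ms });
        });
    }
    drop(tx); // the receive loop ends once every worker has dropped its sender

    for msg in rx {
        match msg {
            Msg::Reschedule { id, at_ms } => {
                queue.insert(id, at_ms);
            }
        }
    }
    queue
}
```

The point of the design is ownership: the queue stays with a single owner, and the workers only hold a sender, so no lock around the queue is needed.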

Reducers keep their inline-await path (they cannot yield and run on the main executor, so this preserves their dispatch ordering).
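The difference between the two dispatch paths can be shown with a small virtual-time model. This is a sketch under simplifying assumptions (jobs arrive sorted by due time, dispatch itself takes no time); `Job` and `dispatch_times` are illustrative names, not part of the codebase.

```rust
/// One due entry in a simplified scheduler model.
#[derive(Clone, Copy)]
struct Job {
    due_ms: u64,        // when the entry expires in the delay queue
    run_ms: u64,        // how long the function runs once dispatched
    is_procedure: bool, // procedures may be handed off; reducers never are
}

/// Returns the time at which each job is dispatched.
/// `spawn_procedures = false` models awaiting every function inline;
/// `spawn_procedures = true` models handing procedures to a separate task.
fn dispatch_times(jobs: &[Job], spawn_procedures: bool) -> Vec<u64> {
    let mut actor_free_at = 0u64;
    let mut out = Vec::with_capacity(jobs.len());
    for job in jobs {
        // The actor can only pull the next due item once it is free again.
        let start = job.due_ms.max(actor_free_at);
        out.push(start);
        let blocks_actor = !(spawn_procedures && job.is_procedure);
        actor_free_at = if blocks_actor { start + job.run_ms } else { start };
    }
    out
}
```

With a procedure due at 0 ms that runs for 10,000 ms and a reducer due at 200 ms, the inline model dispatches the reducer at 10,000 ms, while the spawned model dispatches it at 200 ms.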

Schedule and the new Reschedule now share enqueue_scheduled, which removes any existing queued entry for the id before inserting -- without this, a row update or reschedule racing an already-queued entry would leak an orphaned DelayQueue entry and fire a duplicate dispatch.
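A toy model of the remove-before-insert rule, for illustration only. Here a `HashMap` plays the role of the `DelayQueue`, and `Scheduler`, `enqueue_naive`, and `pending_for` are hypothetical names. The `enqueue_scheduled` below mirrors the idea described above, not the real function body.

```rust
use std::collections::HashMap;

/// Toy stand-in for the actor state: `queue` plays the role of the delay queue,
/// `key_map` remembers which queue key belongs to which schedule id.
#[derive(Default)]
struct Scheduler {
    queue: HashMap<u64, (u64, u64)>, // key -> (schedule id, due time in ms)
    key_map: HashMap<u64, u64>,      // schedule id -> key
    next_key: u64,
}

impl Scheduler {
    /// Insert without looking for an existing entry (the leak-prone variant).
    fn enqueue_naive(&mut self, id: u64, due_ms: u64) {
        let key = self.next_key;
        self.next_key += 1;
        self.queue.insert(key, (id, due_ms));
        self.key_map.insert(id, key); // any previous key for `id` is overwritten and lost
    }

    /// Remove any queued entry for `id` first, then insert.
    fn enqueue_scheduled(&mut self, id: u64, due_ms: u64) {
        if let Some(old_key) = self.key_map.remove(&id) {
            self.queue.remove(&old_key);
        }
        self.enqueue_naive(id, due_ms);
    }

    /// Number of queued entries that would eventually fire for `id`.
    fn pending_for(&self, id: u64) -> usize {
        self.queue.values().filter(|entry| entry.0 == id).count()
    }
}
```

Enqueuing the same id twice through `enqueue_naive` leaves two entries pending (the orphan plus the new one), whereas `enqueue_scheduled` leaves exactly one.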

Consequence: scheduled procedures now run concurrently with scheduled reducers and with each other rather than strictly one-at-a-time. Transactional correctness is still enforced by the datastore's serializable isolation; only dispatch ordering relaxes. Concurrent execution remains bounded by the procedure instance pool.
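The "bounded by the pool" claim can be pictured with a counting semaphore. The sketch below is a std-only model with threads in place of async tasks; `Permits` and `max_concurrency` are hypothetical names, and the real pool's mechanism may differ in detail.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Minimal counting semaphore standing in for the pool's permit mechanism.
struct Permits {
    free: Mutex<usize>,
    available: Condvar,
}

impl Permits {
    fn acquire(&self) {
        let mut free = self.free.lock().unwrap();
        while *free == 0 {
            free = self.available.wait(free).unwrap();
        }
        *free -= 1;
    }

    fn release(&self) {
        *self.free.lock().unwrap() += 1;
        self.available.notify_one();
    }
}

/// Starts `jobs` worker threads at once and returns the highest number that
/// were ever inside the permit-guarded section at the same time.
fn max_concurrency(jobs: usize, pool_size: usize) -> usize {
    let permits = Arc::new(Permits {
        free: Mutex::new(pool_size),
        available: Condvar::new(),
    });
    let running = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));

    let handles: Vec<_> = (0..jobs)
        .map(|_| {
            let (permits, running, peak) = (permits.clone(), running.clone(), peak.clone());
            thread::spawn(move || {
                permits.acquire();
                let now = running.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                thread::sleep(Duration::from_millis(5)); // stand-in for procedure work
                running.fetch_sub(1, Ordering::SeqCst);
                permits.release();
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    peak.load(Ordering::SeqCst)
}
```

However many jobs are started, the returned peak never exceeds `pool_size` (which must be greater than zero here, otherwise every worker waits forever).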

API and ABI breaking changes

None.

Expected complexity level and risk

3 — localized to the scheduler actor, but it changes scheduled-execution concurrency semantics.

Testing

Automated (SDK procedure-concurrency suite)

#4955 has since merged, so this branch is updated from master and now flips the SDK test that encoded the starvation as expected behavior. In sdks/rust/tests (mod rust_procedure_concurrency):

Renamed scheduled_procedure_scheduled_reducer_not_interleaved → scheduled_procedure_scheduled_reducer_interleaves (test fn, make_test run selector, and the client handler).
Flipped the assertion from before < after < scheduled_reducer (procedure runs to completion first) to before < scheduled_reducer < after (the reducer interleaves between the procedure's two inserts), and updated the docstrings to match.

The scenario: a scheduled procedure inserts scheduled_procedure_before, sleeps, then inserts scheduled_procedure_after; a scheduled reducer comes due during the sleep and inserts scheduled_reducer. The insertion order pins down whether the reducer was starved.
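The ordering check itself is simple. A standalone sketch of it follows; this is not the SDK test code, and `position` and `reducer_interleaved` are illustrative helpers operating on the observed insertion order.

```rust
/// Position of `name` in the observed insertion order, if present.
fn position(order: &[&str], name: &str) -> Option<usize> {
    order.iter().position(|row| *row == name)
}

/// True when the reducer's row landed between the procedure's two rows,
/// i.e. before < scheduled_reducer < after.
fn reducer_interleaved(order: &[&str]) -> bool {
    match (
        position(order, "scheduled_procedure_before"),
        position(order, "scheduled_reducer"),
        position(order, "scheduled_procedure_after"),
    ) {
        (Some(before), Some(reducer), Some(after)) => before < reducer && reducer < after,
        _ => false, // a missing row means the scenario did not complete
    }
}
```

The order before, reducer, after passes the check. The order before, after, reducer, which is what a starved reducer produces, fails it.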

Verified both directions (TDD), built against a spacetimedb-standalone from this branch:

With the fix: cargo test -p spacetimedb-sdk --test test rust_procedure_concurrency::scheduled_procedure_scheduled_reducer_interleaves → ok. The full mod passes: test result: ok. 4 passed; 0 failed.
With the fix disabled (procedure dispatched via inline await like a reducer instead of tokio::spawn): the same test fails with got 1 < 3 < 2 — i.e. scheduled_procedure_before(1) < scheduled_procedure_after(2) < scheduled_reducer(3), the reducer starved to last. Confirms the renamed test actually gates the fix rather than passing vacuously.

Manual reproduction

Reproduction: https://github.com/Qilvo-Tech/spacetimedb-scheduler-starvation-repro

That repo is a minimal module with two scheduled tables: a #[procedure] that loops on ctx.sleep_until at a 500 ms cadence, and a #[reducer] on a 200 ms interval. The procedure is deliberately the slower ticker, so the reducer's deadline is always sooner — ruling out earliest-deadline or CPU-saturation explanations.

Before this fix (stock 2.4.x): over a 15 s run, procedure_loop iter= logs ~30 times as expected, while reducer_tick fired logs 0 times — the reducer is completely starved for as long as the procedure is alive, with no error emitted.
After this fix (built spacetimedb-standalone from this branch, published the repro module to it): procedure_loop keeps its 500 ms cadence and reducer_tick fired now logs at the full 5 Hz, interleaved with the procedure — e.g. 46 procedure ticks and 110 reducer fires over the same window. Log tail shows them interleaving cleanly.
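As a sanity check on the counts, the expected number of firings is the window length divided by the cadence. The helper below is plain arithmetic and not part of the repro repo; it ignores startup offset and jitter.

```rust
/// Expected number of firings of a fixed-cadence schedule within a window,
/// ignoring startup offset and jitter.
fn expected_fires(window_ms: u64, cadence_ms: u64) -> u64 {
    window_ms / cadence_ms
}
```

For a 15 s window this gives 30 firings at 500 ms and 75 at 200 ms. Counts of 46 and 110 are what those cadences produce over roughly 22 to 23 s, so the absolute numbers depend on how long each capture actually ran; the ratio between the two counters is the meaningful signal.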

See #4954.

CLAassistant · 2026-06-04T14:44:12Z

All committers have signed the CLA.

Leonardo-Rocha · 2026-06-04T14:58:34Z

                effective_at,
                real_at,
            } => {
-                // Incase of row update, remove the existing entry from queue first


This logic was just moved into enqueue_scheduled so it can be reused.

Leonardo-Rocha · 2026-06-04T15:06:19Z

+            // than one-at-a-time; the datastore's serializable isolation still applies.)
+            ScheduledFunctionKind::Procedure => {
+                let tx = self.tx.clone();
+                tokio::spawn(async move {


This tokio::spawn runs on the host/control runtime, not the per-database executor — but it's only the await-coordinator. The actual wasm work is dispatched to the database's SingleThreadedExecutor inside call_scheduled_procedure → call_pooled → run_async_job. So procedure execution stays on the per-DB pool (bounded by the procedure-instance semaphore); this task just waits for the result and forwards the interval reschedule.

The scheduler now dispatches scheduled procedures concurrently, so a long-running scheduled procedure no longer starves a scheduled reducer whose deadline falls during the procedure's sleep. Flip the assertion to expect interleaving (before < scheduled_reducer < after) and rename the test scheduled_procedure_scheduled_reducer_not_interleaved -> scheduled_procedure_scheduled_reducer_interleaves (plus its run selector and client handler) to match. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Leonardo-Rocha force-pushed the fix/scheduler-procedure-starvation branch from ce369bb to 27e7bcf Compare June 4, 2026 14:51

Leonardo-Rocha force-pushed the fix/scheduler-procedure-starvation branch from 27e7bcf to 5521e64 Compare June 4, 2026 14:52

joshua-spacetime self-requested a review June 4, 2026 14:54

Leonardo-Rocha changed the title ~~fix(core): allow long running procedures~~ Jun 4, 2026

Leonardo-Rocha marked this pull request as ready for review June 4, 2026 15:13

Leonardo-Rocha and others added 3 commits June 8, 2026 18:31

Merge branch 'master' into fix/scheduler-procedure-starvation

414423d

Merge remote-tracking branch 'origin/master' into fix/scheduler-proce…

08f4b20

…dure-starvation # Conflicts: # crates/core/src/host/scheduler.rs

Leonardo-Rocha wants to merge 4 commits into clockworklabs:master from Qilvo-Tech:fix/scheduler-procedure-starvation
