Stop long-running scheduled procedures from starving scheduled reducers #5224

Open
Leonardo-Rocha wants to merge 4 commits into clockworklabs:master from Qilvo-Tech:fix/scheduler-procedure-starvation

Leonardo-Rocha commented Jun 4, 2026

[Assisted by Claude Opus 4.8 (1M)] Disclaimer: I'm not well versed in the codebase and just wanted to give it a spin to figure out whether the problem was easy to solve. Feel free to disregard the PR if the claims don't make sense.

Description of Changes

Stop long-running scheduled procedures from starving scheduled reducers

The scheduler actor (SchedulerActor::handle_queued) awaited every scheduled function to completion before pulling the next due item from its DelayQueue. A scheduled #[procedure] that runs for a long time (e.g. one that calls ctx.sleep_until in a loop) therefore parked the actor and prevented every other due scheduled function -- reducers and procedures alike -- from being dispatched for as long as the procedure was alive. No error was logged; the reducer's schedule row simply never fired.

Procedures already execute on their own pooled instances (call_pooled), separate from the main reducer executor, so awaiting them inline in the scheduler bought nothing but head-of-line blocking. Dispatch scheduled procedures on their own tokio::spawned task and let the actor loop keep draining the queue. Interval-scheduled procedures route their reschedule back to the actor through a new SchedulerMessage::Reschedule, since the spawned task cannot touch the actor-owned queue/key_map.
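The dispatch change described above can be sketched in miniature. This is not the code from this PR: it swaps tokio tasks for `std::thread` and a tokio channel for `std::sync::mpsc`, and the names `Kind`, `Msg`, `run_procedure`, and `drain_due` are hypothetical stand-ins. It only illustrates the shape of the idea: reducers run inline, procedures are handed to a worker, and the worker asks the actor to reschedule through a message instead of touching actor-owned state.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for the kind of scheduled function.
#[derive(Clone, Copy)]
pub enum Kind {
    Reducer,
    Procedure,
}

/// Hypothetical stand-in for the message a finished procedure sends back.
pub enum Msg {
    Reschedule { id: u32 },
}

/// Simulates a long-running procedure body on a worker thread.
fn run_procedure(id: u32, tx: Sender<Msg>) {
    thread::sleep(Duration::from_millis(50));
    // The worker does not own the queue, so it messages the actor instead.
    let _ = tx.send(Msg::Reschedule { id });
}

/// Drains the due items in order and returns the actor's event log.
pub fn drain_due(due: &[(u32, Kind)]) -> Vec<String> {
    let (tx, rx) = channel();
    let mut log = Vec::new();
    let mut workers = Vec::new();
    for &(id, kind) in due {
        match kind {
            // Reducers stay on the inline path, preserving their order.
            Kind::Reducer => log.push(format!("reducer {} ran inline", id)),
            // Procedures are dispatched and the loop moves on immediately.
            Kind::Procedure => {
                let tx = tx.clone();
                workers.push(thread::spawn(move || run_procedure(id, tx)));
                log.push(format!("procedure {} dispatched", id));
            }
        }
    }
    // Drop the actor's sender so the receive loop ends once workers finish.
    drop(tx);
    for Msg::Reschedule { id } in rx {
        log.push(format!("procedure {} rescheduled", id));
    }
    for w in workers {
        w.join().unwrap();
    }
    log
}
```

Calling `drain_due(&[(1, Kind::Procedure), (2, Kind::Reducer)])` logs the reducer before the procedure's reschedule arrives, which is the behavior the inline-await path could not provide.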

Reducers keep their inline-await path (they cannot yield and run on the main executor, so this preserves their dispatch ordering).

Schedule and the new Reschedule now share enqueue_scheduled, which removes any existing queued entry for the id before inserting -- without this, a row update or reschedule racing an already-queued entry would leak an orphaned DelayQueue entry and fire a duplicate dispatch.
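The remove-before-insert rule can be illustrated with a small model. This is a sketch under assumptions, not the PR's implementation: a `BTreeMap` stands in for the `DelayQueue`, and `MiniScheduler`, `pop_due`, and `queued` are hypothetical names. Only `enqueue_scheduled` and `key_map` are borrowed from the description above, and their bodies here are guesses at the intent.

```rust
use std::collections::{BTreeMap, HashMap};

/// Hypothetical miniature of the actor-owned state: a deadline-ordered
/// queue plus a map from schedule id to the key of its queued entry.
#[derive(Default)]
pub struct MiniScheduler {
    /// (deadline_ms, sequence) -> schedule id
    queue: BTreeMap<(u64, u64), u32>,
    /// schedule id -> key of its single queued entry
    key_map: HashMap<u32, (u64, u64)>,
    next_seq: u64,
}

impl MiniScheduler {
    /// Queues `id` at `deadline_ms`, first removing any entry already
    /// queued for the same id so that it cannot fire twice.
    pub fn enqueue_scheduled(&mut self, id: u32, deadline_ms: u64) {
        if let Some(old_key) = self.key_map.remove(&id) {
            self.queue.remove(&old_key);
        }
        let key = (deadline_ms, self.next_seq);
        self.next_seq += 1;
        self.queue.insert(key, id);
        self.key_map.insert(id, key);
    }

    /// Pops the earliest entry if its deadline has passed.
    pub fn pop_due(&mut self, now_ms: u64) -> Option<u32> {
        let (&key, &id) = self.queue.iter().next()?;
        if key.0 > now_ms {
            return None;
        }
        self.queue.remove(&key);
        self.key_map.remove(&id);
        Some(id)
    }

    /// Number of entries currently queued.
    pub fn queued(&self) -> usize {
        self.queue.len()
    }
}
```

Enqueuing id 7 at 100 ms and then again at 300 ms leaves one queued entry, and nothing fires at 200 ms. Without the removal step the first entry would remain in the queue and fire at 100 ms as a duplicate dispatch.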

Consequence: scheduled procedures now run concurrently with scheduled reducers and with each other rather than strictly one-at-a-time. Transactional correctness is still enforced by the datastore's serializable isolation; only dispatch ordering relaxes. Concurrent execution remains bounded by the procedure instance pool.

API and ABI breaking changes

None.

Expected complexity level and risk

3 — localized to the scheduler actor, but it changes scheduled-execution concurrency semantics.

Testing

Automated (SDK procedure-concurrency suite)

#4955 has since merged, so this branch is updated from master and now flips the SDK test that encoded the starvation as expected behavior. In sdks/rust/tests (mod rust_procedure_concurrency):

  • Renamed scheduled_procedure_scheduled_reducer_not_interleaved → scheduled_procedure_scheduled_reducer_interleaves (test fn, make_test run selector, and the client handler).
  • Flipped the assertion from before < after < scheduled_reducer (procedure runs to completion first) to before < scheduled_reducer < after (the reducer interleaves between the procedure's two inserts), and updated the docstrings to match.

The scenario: a scheduled procedure inserts scheduled_procedure_before, sleeps, then inserts scheduled_procedure_after; a scheduled reducer comes due during the sleep and inserts scheduled_reducer. The insertion order pins down whether the reducer was starved.
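The two orderings the test distinguishes can be written as a pair of predicates. This is only an illustration of the assertion logic, not the SDK test code; `Order`, `interleaves`, and `starved` are hypothetical names, and each field holds the row's position in insertion order.

```rust
/// Insertion positions of the three rows, as a client might observe them.
/// Hypothetical type for illustration only.
pub struct Order {
    pub before: u32,
    pub after: u32,
    pub scheduled_reducer: u32,
}

/// New expectation: the reducer lands between the procedure's two inserts.
pub fn interleaves(o: &Order) -> bool {
    o.before < o.scheduled_reducer && o.scheduled_reducer < o.after
}

/// Old behavior: the procedure runs to completion before the reducer fires.
pub fn starved(o: &Order) -> bool {
    o.before < o.after && o.after < o.scheduled_reducer
}
```

For example, positions `before = 1, scheduled_reducer = 2, after = 3` satisfy `interleaves`, while `before = 1, after = 2, scheduled_reducer = 3` satisfy `starved`.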

Verified both directions (TDD), built against a spacetimedb-standalone from this branch:

  • With the fix: cargo test -p spacetimedb-sdk --test test rust_procedure_concurrency::scheduled_procedure_scheduled_reducer_interleaves → ok. The full mod passes: test result: ok. 4 passed; 0 failed.
  • With the fix disabled (procedure dispatched via inline await like a reducer instead of tokio::spawn): the same test fails with got 1 < 3 < 2 — i.e. scheduled_procedure_before(1) < scheduled_procedure_after(2) < scheduled_reducer(3), the reducer starved to last. Confirms the renamed test actually gates the fix rather than passing vacuously.

Manual reproduction

Reproduction: https://github.com/Qilvo-Tech/spacetimedb-scheduler-starvation-repro

That repo is a minimal module with two scheduled tables: a #[procedure] that loops on ctx.sleep_until at a 500 ms cadence, and a #[reducer] on a 200 ms interval. The procedure is deliberately the slower ticker, so the reducer's deadline is always sooner — ruling out earliest-deadline or CPU-saturation explanations.

  • Before this fix (stock 2.4.x): over a 15 s run, procedure_loop iter= logs ~30 times as expected, while reducer_tick fired logs 0 times — the reducer is completely starved for as long as the procedure is alive, with no error emitted.
  • After this fix (built spacetimedb-standalone from this branch, published the repro module to it): procedure_loop keeps its 500 ms cadence and reducer_tick fired now logs at the full 5 Hz, interleaved with the procedure — e.g. 46 procedure ticks and 110 reducer fires over the same window. Log tail shows them interleaving cleanly.
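The cadence figures above can be sanity-checked with integer arithmetic. The helpers below are hypothetical and assume perfectly regular ticks with no startup delay, which a real run will not match exactly.

```rust
/// Ticks expected in `window_ms` at a fixed `interval_ms`.
pub fn expected_fires(window_ms: u64, interval_ms: u64) -> u64 {
    window_ms / interval_ms
}

/// Window length implied by `fires` ticks at a fixed `interval_ms`.
pub fn implied_window_ms(fires: u64, interval_ms: u64) -> u64 {
    fires * interval_ms
}
```

With a 15 s window, `expected_fires(15_000, 500)` gives 30 procedure ticks, matching the "before" run, and `expected_fires(15_000, 200)` gives 75 reducer fires. The 46 and 110 counts quoted for the "after" run imply about 23 s and 22 s respectively, so that run may have covered a longer window than 15 s. That is an inference from the arithmetic, not something stated in the PR.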

See #4954.

CLAassistant commented Jun 4, 2026

CLA assistant check
All committers have signed the CLA.

Leonardo-Rocha force-pushed the fix/scheduler-procedure-starvation branch from ce369bb to 27e7bcf on June 4, 2026 14:51
Leonardo-Rocha force-pushed the fix/scheduler-procedure-starvation branch from 27e7bcf to 5521e64 on June 4, 2026 14:52
@joshua-spacetime joshua-spacetime self-requested a review June 4, 2026 14:54
effective_at,
real_at,
} => {
// Incase of row update, remove the existing entry from queue first

Author

this function was just encapsulated in enqueue_scheduled to be reused

@Leonardo-Rocha Leonardo-Rocha changed the title fix(core): allow long running procedures Jun 4, 2026
// than one-at-a-time; the datastore's serializable isolation still applies.)
ScheduledFunctionKind::Procedure => {
let tx = self.tx.clone();
tokio::spawn(async move {

Author

This tokio::spawn runs on the host/control runtime, not the per-database executor — but it's only the await-coordinator. The actual wasm work is dispatched to the database's SingleThreadedExecutor inside call_scheduled_procedure → call_pooled → run_async_job. So procedure execution stays on the per-DB pool (bounded by the procedure-instance semaphore); this task just waits for the result and forwards the interval reschedule.
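The bound described in this comment can be modeled with a counting semaphore. This is a sketch under assumptions rather than the host's code: plain threads stand in for the spawned coordinator tasks, a `Mutex` plus `Condvar` stands in for the procedure-instance semaphore, and `Permits` and `peak_concurrency` are hypothetical names.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Minimal counting semaphore (hypothetical stand-in for the pool limit).
pub struct Permits {
    free: Mutex<usize>,
    cv: Condvar,
}

impl Permits {
    pub fn new(n: usize) -> Self {
        Permits { free: Mutex::new(n), cv: Condvar::new() }
    }

    /// Blocks until a permit is free, then takes it.
    pub fn acquire(&self) {
        let mut free = self.free.lock().unwrap();
        while *free == 0 {
            free = self.cv.wait(free).unwrap();
        }
        *free -= 1;
    }

    /// Returns a permit and wakes one waiter.
    pub fn release(&self) {
        *self.free.lock().unwrap() += 1;
        self.cv.notify_one();
    }
}

/// Starts `tasks` coordinators over `pool_size` permits and returns the
/// highest number of bodies that were executing at the same time.
pub fn peak_concurrency(tasks: usize, pool_size: usize) -> usize {
    let permits = Arc::new(Permits::new(pool_size));
    let running = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..tasks)
        .map(|_| {
            let (permits, running, peak) =
                (permits.clone(), running.clone(), peak.clone());
            thread::spawn(move || {
                // Any number of coordinators may wait here at once.
                permits.acquire();
                let now = running.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                thread::sleep(Duration::from_millis(10)); // simulated work
                running.fetch_sub(1, Ordering::SeqCst);
                permits.release();
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    peak.load(Ordering::SeqCst)
}
```

With eight coordinators and a pool of two, `peak_concurrency(8, 2)` never exceeds 2: the number of waiting tasks is unbounded, but the number executing is capped by the pool.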

@Leonardo-Rocha Leonardo-Rocha marked this pull request as ready for review June 4, 2026 15:13
Leonardo-Rocha and others added 3 commits June 8, 2026 18:31
…dure-starvation

# Conflicts:
#	crates/core/src/host/scheduler.rs
The scheduler now dispatches scheduled procedures concurrently, so a long-running scheduled procedure no longer starves a scheduled reducer whose deadline falls during the procedure's sleep. Flip the assertion to expect interleaving (before < scheduled_reducer < after) and rename the test scheduled_procedure_scheduled_reducer_not_interleaved -> scheduled_procedure_scheduled_reducer_interleaves (plus its run selector and client handler) to match.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>