
[Feature] Implement Dynamic gRPC Heap Size Control and Load Shedding #13485

@hanahmily

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

Motivation

A recent performance test led by @mrproliu showed that the BanyanDB liaison gRPC server is vulnerable to OOM errors under high-throughput write traffic. If a client sends data faster than the server can process and persist it, the gRPC library's internal buffers grow without bound, consuming all available heap memory. Under heavy load, profiles frequently show this growth concentrated in measure.Recv().

To ensure server stability and prevent crashes, we need to introduce mechanisms that:

  1. Actively shed load when the system is under high memory pressure.
  2. Intelligently configure gRPC's network buffers to provide backpressure before the heap is exhausted.

Proposed Solution

This proposal outlines a two-pronged approach to control heap usage by integrating the existing protector service with the gRPC server's lifecycle and configuration.

1. Load Shedding via Protector State

We will implement a gRPC Stream Server Interceptor that queries the protector's state before allowing a new stream to be handled.

  • Dependency Injection: The liaison/grpc server will need a reference to the protector service, which should be passed in during initialization.
  • Interceptor Logic:
    • For each new incoming stream, the interceptor will check the current system state by calling protector.State().
    • If protector.State() returns StateHigh, it indicates that system memory usage has crossed the configured high-water mark.
    • In this StateHigh condition, the interceptor will immediately reject the new stream with a codes.ResourceExhausted gRPC status. This provides clear, immediate backpressure to the client, signaling that the server is temporarily unable to accept new workloads.
    • If the state is StateLow, the stream is handled normally.
// Pseudocode for the interceptor, filled out with the standard
// grpc.StreamServerInterceptor signature.
func (s *server) protectorLoadSheddingInterceptor(
    srv interface{},
    ss grpc.ServerStream,
    info *grpc.StreamServerInfo,
    handler grpc.StreamHandler,
) error {
    if s.protector.State() == protector.StateHigh {
        s.log.Warn().Msg("rejecting new stream due to high memory pressure")
        return status.Errorf(codes.ResourceExhausted, "server is busy, please retry later")
    }
    return handler(srv, ss)
}
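
For completeness, here is a minimal sketch of how the interceptor could be installed when the server is constructed. grpc.ChainStreamInterceptor is the real grpc-go server option; the surrounding server value s is an illustrative assumption, not the actual liaison/grpc code.

// Sketch: installing the load-shedding interceptor at construction time.
// Only the grpc-go calls are real API; s is an assumed server receiver.
srv := grpc.NewServer(
    grpc.ChainStreamInterceptor(s.protectorLoadSheddingInterceptor),
)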

2. Dynamic gRPC Buffer Sizing Based on Available Memory

Instead of using fixed, static buffer sizes, we will dynamically calculate the gRPC HTTP/2 flow control windows at server startup based on the available system memory reported by the protector.

  • Startup Logic: During the gRPC server's Serve() phase, it will query the system's available memory by calling the protector.
  • Configuration: Introduce a new configuration flag, e.g., grpc.buffer.memory-ratio (defaulting to 0.10 for 10%). This will determine what fraction of the available system memory should be allocated to gRPC's connection-level buffers.
  • Heuristic for Window Calculation:
    • totalBufferSize = availableMemory * memoryRatio
    • InitialConnWindowSize = totalBufferSize * 2 / 3
    • InitialWindowSize = totalBufferSize * 1 / 3
    • This 2:1 ratio ensures the connection-level buffer is larger than any single stream's buffer, which is a common and effective practice.
  • Applying the Options: The calculated values will be passed to grpc.NewServer() using the grpc.InitialWindowSize() and grpc.InitialConnWindowSize() server options.
  • Override Mechanism: The existing static configuration flags for window sizes (grpc.InitialWindowSize, etc.) take precedence. If a user sets a specific value, the dynamic calculation is skipped. This allows for expert manual tuning. A sketch of the calculation and this override check follows below.
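
A minimal sketch of this startup logic, assuming a protector.AvailableMemory() accessor and memoryRatio/static-window fields on the server; those names are hypothetical, while grpc.InitialWindowSize and grpc.InitialConnWindowSize are the real grpc-go server options.

import (
    "math"

    "google.golang.org/grpc"
)

// windowOptions computes flow-control window options at startup.
// Only the grpc-go options are real API; protector.AvailableMemory(),
// s.memoryRatio, and the static-override fields are assumed names.
func (s *server) windowOptions() []grpc.ServerOption {
    // Static flags, when set, take precedence over the heuristic.
    if s.initialWindowSize > 0 || s.initialConnWindowSize > 0 {
        return []grpc.ServerOption{
            grpc.InitialWindowSize(s.initialWindowSize),
            grpc.InitialConnWindowSize(s.initialConnWindowSize),
        }
    }
    available := s.protector.AvailableMemory() // available memory in bytes
    total := int64(float64(available) * s.memoryRatio)
    if total > math.MaxInt32 {
        total = math.MaxInt32 // HTTP/2 flow-control windows are 32-bit
    }
    return []grpc.ServerOption{
        grpc.InitialWindowSize(int32(total / 3)),         // per-stream window
        grpc.InitialConnWindowSize(int32(total * 2 / 3)), // per-connection window
    }
}

Note that grpc-go documents a 64 KB lower bound for these window options and ignores smaller values, so the computed sizes effectively have a floor.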

Use case

No response

Related issues

No response

Are you willing to submit a pull request to implement this on your own?

  • Yes I am willing to submit a pull request on my own!

Code of Conduct

Labels

database (BanyanDB - SkyWalking native database), feature (New feature)
