Conversation

@edsonmichaque
Contributor

@edsonmichaque edsonmichaque commented Jan 28, 2026

Description

Data plane gateways fail to load SSL certificates from MDCB during startup in v5.8.0+. This PR fixes the regression by adding exponential backoff retry logic to certificate loading.

Related Issue

https://tyktech.atlassian.net/browse/TT-14618

Motivation and Context

Critical regression introduced by TT-14163 (PR #6910): the gateway now starts in emergency mode immediately, which blocks the certificate fetch before the RPC connection is ready. Affects v5.8.0+.

How This Has Been Tested

Unit test in certs/manager_tt14618_test.go demonstrates 6 retry attempts over ~0.8s, successfully loading the certificate after RPC becomes ready.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)

Checklist

  • I ensured that the documentation is up to date
  • I explained why this PR updates go.mod in detail with reasoning why it's required
  • I would like a code coverage CI quality gate exception and have explained why

Ticket Details

TT-14618
Status Ready for Dev
Summary [regression] Data plane gateways cannot start if they're configured with a TYK_GW_HTTPSERVEROPTIONS_SSLCERTIFICATES that is in the certificate store

Generated at: 2026-01-31 10:33:13

@edsonmichaque edsonmichaque force-pushed the TT-14618-regression-data-plane-gateways-cannot-start-if-theyre-configured-with-a-tyk-gw-httpserveroptions-sslcertificates-that-is-in-the-certificate-store branch from e87a69c to e58f49d on January 28, 2026 13:10
@probelabs

probelabs bot commented Jan 28, 2026

This PR addresses a critical regression where data plane gateways fail to start if configured to load SSL certificates from a Multi-Data Center Bridge (MDCB). The issue stems from a race condition during startup: the gateway attempts to fetch certificates before the RPC connection to MDCB is fully established, resulting in an immediate failure.

The fix introduces a resilient, configurable exponential backoff retry mechanism into the certificate loading process. When the CertificateManager fails to fetch a certificate due to a storage.ErrMDCBConnectionLost error, it now automatically retries the operation with increasing delays. This gives the RPC connection sufficient time to initialize, making the gateway startup sequence more robust.

Files Changed Analysis

The changes are well-contained and focused on implementing and configuring the new retry logic:

  • certs/manager.go: Contains the core logic change. The List function now uses the cenkalti/backoff/v4 library to wrap the storage.GetKey call in a retry loop, which is triggered specifically by the storage.ErrMDCBConnectionLost error.
  • certs/cert_retry_test.go: A comprehensive new test file (+885 lines) has been added to rigorously validate the retry mechanism. It covers various scenarios, including flaky connections, multiple failures, and performance at scale, ensuring the logic is correct and efficient.
  • config/config.go & cli/linter/schema.json: New configuration options (e.g., RPCCertFetchMaxElapsedTime, RPCCertFetchMaxRetries) are added to the SlaveOptionsConfig struct and the linter schema, allowing operators to fine-tune the retry behavior.
  • gateway/server.go: The gateway's startup logic is updated to read the new configuration values, provide sensible defaults, and pass them to the CertificateManager during its initialization.
  • certs/manager_test.go & gateway/server_test.go: Existing tests are updated to accommodate the new function signatures and configuration defaults.

Architecture & Impact Assessment

  • What this PR accomplishes: It resolves a critical startup failure for gateways in an MDCB topology, significantly improving the reliability and robustness of the data plane's initialization sequence.
  • Key technical changes introduced:
    1. Exponential Backoff: Implements a blocking retry mechanism to handle the specific ErrMDCBConnectionLost error during certificate loading.
    2. Configurability: Exposes the retry strategy parameters through the main configuration, providing safe defaults while allowing operators to tune timings and retry counts.
  • Affected system components: The primary impact is on the CertificateManager and the gateway's startup configuration process. This change hardens the interaction between the gateway and the MDCB storage layer for certificate retrieval.

Certificate Loading Flow with Retry Logic

sequenceDiagram
    participant Gateway Startup
    participant CertificateManager
    participant MDCB Storage

    Gateway Startup->>CertificateManager: List(certIDs)
    CertificateManager->>MDCB Storage: GetKey("raw-cert-1")
    MDCB Storage-->>CertificateManager: Fail (ErrMDCBConnectionLost)

    Note over CertificateManager: MDCB not ready, initiating exponential backoff.

    loop Retry (until success or timeout)
        CertificateManager->>MDCB Storage: GetKey("raw-cert-1")
        alt Connection still fails
            MDCB Storage-->>CertificateManager: Fail (ErrMDCBConnectionLost)
            Note over CertificateManager: Wait for next interval...
        else Connection succeeds
            MDCB Storage-->>CertificateManager: Success (Certificate Data)
        end
    end

    Note over CertificateManager: MDCB connection is now stable.
    loop For remaining certs
        CertificateManager->>MDCB Storage: GetKey("raw-cert-N")
        MDCB Storage-->>CertificateManager: Success (Certificate Data)
    end
    CertificateManager-->>Gateway Startup: Return all loaded certificates



Scope Discovery & Context Expansion

The fix is precisely targeted at the CertificateManager, which is the correct component to handle this logic as it is responsible for retrieving certificates from storage. The error being handled, storage.ErrMDCBConnectionLost, is specific to the RPCStorageHandler when the gateway's RPC connection is unavailable, confirming this is a well-scoped solution.

This PR establishes a robust pattern for handling startup race conditions involving MDCB. The retry logic implemented here could serve as a valuable blueprint for hardening other parts of the system that exhibit similar dependencies on the RPC connection being available at startup.

Metadata
  • Review Effort: 3 / 5
  • Primary Label: bug

Powered by Visor from Probelabs

Last updated: 2026-01-31T10:34:39.996Z | Triggered by: pr_updated | Commit: d4a2b0d

💡 TIP: You can chat with Visor using /visor ask <your question>

@edsonmichaque edsonmichaque force-pushed the TT-14618-regression-data-plane-gateways-cannot-start-if-theyre-configured-with-a-tyk-gw-httpserveroptions-sslcertificates-that-is-in-the-certificate-store branch from e58f49d to 5abca3d on January 28, 2026 13:11
@probelabs

probelabs bot commented Jan 28, 2026

✅ Security Check Passed

No security issues found – changes LGTM.

Architecture Issues (2)

Severity Location Issue
🟡 Warning config/config.go:449
The comment for `RPCCertFetchMaxRetries` states that the default is 3, but the actual default value implemented in `certs/manager.go` and applied in `gateway/server.go` is 5. This can be misleading for users configuring the gateway.
💡 Suggestion: Update the comment to reflect the correct default value of 5 to ensure documentation is consistent with the implementation.
🔧 Suggested Fix
    // RPCCertFetchMaxRetries sets the maximum number of retry attempts for certificate fetch. 0 means unlimited (time-based only) (default: 5)
🟡 Warning certs/manager.go:91-99
The `NewCertificateManager` function sets default values for the backoff timing parameters by hardcoding them (e.g., `30 * time.Second`). However, these defaults are already defined as constants (e.g., `DefaultRPCCertFetchMaxElapsedTime`) in the same file and are used elsewhere for setting defaults. This creates a duplicate source for default values and is inconsistent with how defaults are handled in the gateway.
💡 Suggestion: Replace the hardcoded default values in `NewCertificateManager` with the corresponding constants defined in the `certs` package to maintain a single source of truth and improve consistency.
🔧 Suggested Fix
	if certFetchMaxElapsedTime == 0 {
		certFetchMaxElapsedTime = DefaultRPCCertFetchMaxElapsedTime
	}
	if certFetchInitialInterval == 0 {
		certFetchInitialInterval = DefaultRPCCertFetchInitialInterval
	}
	if certFetchMaxInterval == 0 {
		certFetchMaxInterval = DefaultRPCCertFetchMaxInterval
	}

Performance Issues (1)

Severity Location Issue
🟢 Info certs/manager.go:466-549
The function fetches multiple certificates sequentially within a loop. While this is safe and avoids a "thundering herd" problem during initial connection failures, it leads to serialized network requests when the connection is healthy. For a large number of certificates, this sequential processing can be significantly slower than a parallel approach.
💡 Suggestion: Consider optimizing this by first performing a single, retry-enabled connection probe (e.g., using the `Exists` method on the storage handler) to ensure the storage is available. Once the connection is established, the certificates can be fetched concurrently using goroutines. This would significantly reduce the total time required for network I/O, especially when loading many certificates over a network with non-trivial latency.

Quality Issues (2)

Severity Location Issue
🟡 Warning certs/manager.go:480-486
The error handling within the `backoff.Retry` operation is implicit. It relies on a variable (`err`) from the parent scope to propagate non-retryable errors. The operation returns `nil` for non-retryable errors, which signals success to the `backoff` library, and the error is then handled outside the retry block. This pattern is functional but can be confusing and fragile during refactoring.
💡 Suggestion: To make the intent clearer and the error handling more robust, use `backoff.Permanent(err)` to explicitly mark an error as non-retryable. This causes `backoff.Retry` to stop immediately and return the permanent error, simplifying the control flow.
🟡 Warning config/config.go:455
The comment for `RPCCertFetchMaxRetries` states that the default value is 3, but the implementation in `gateway/server.go` and the associated tests use a default of 5, as defined by `certs.DefaultRPCCertFetchMaxRetries`.
💡 Suggestion: Update the comment to reflect the correct default value of 5 to avoid confusion for operators configuring the gateway.
🔧 Suggested Fix
	// RPCCertFetchMaxRetries sets the maximum number of retry attempts for certificate fetch. 0 means unlimited (time-based only) (default: 5)

Powered by Visor from Probelabs

Last updated: 2026-01-31T10:34:42.778Z | Triggered by: pr_updated | Commit: d4a2b0d


@edsonmichaque edsonmichaque force-pushed the TT-14618-regression-data-plane-gateways-cannot-start-if-theyre-configured-with-a-tyk-gw-httpserveroptions-sslcertificates-that-is-in-the-certificate-store branch 4 times, most recently from 57b6ac3 to a34d904 on January 28, 2026 13:25
@edsonmichaque edsonmichaque marked this pull request as ready for review January 28, 2026 13:28
@edsonmichaque edsonmichaque reopened this Jan 28, 2026
@github-actions
Contributor

github-actions bot commented Jan 28, 2026

API Changes

--- prev.txt	2026-01-31 10:33:31.118194109 +0000
+++ current.txt	2026-01-31 10:33:21.116991224 +0000
@@ -5614,6 +5614,40 @@
 package certs // import "github.com/TykTechnologies/tyk/certs"
 
 
+CONSTANTS
+
+const (
+
+	// MDCB retry configuration for certificate loading during startup.
+	//
+	// When the gateway starts with MDCB configured, it enters emergency mode until the
+	// RPC connection is established. During this period, certificate fetches from MDCB
+	// will fail with ErrMDCBConnectionLost. These constants control the retry behavior:
+	//
+	// DefaultRPCCertFetchMaxElapsedTime: Maximum time to wait for MDCB to become ready (30 seconds)
+	//   - If MDCB doesn't respond within this window, we give up and fall back to local files
+	//   - This prevents indefinite blocking during gateway startup
+	//
+	// DefaultRPCCertFetchInitialInterval: Starting delay for exponential backoff (100ms)
+	//   - First retry happens after 100ms
+	//   - Each subsequent retry doubles: 100ms → 200ms → 400ms → 800ms → 1600ms → 2000ms (capped)
+	//
+	// DefaultRPCCertFetchMaxInterval: Maximum delay between retry attempts (2 seconds)
+	//   - Caps exponential growth to prevent excessively long waits
+	//   - Ensures we check MDCB at least every 2 seconds
+	DefaultRPCCertFetchMaxElapsedTime  = 30 * time.Second
+	DefaultRPCCertFetchInitialInterval = 100 * time.Millisecond
+	DefaultRPCCertFetchMaxInterval     = 2 * time.Second
+
+	// DefaultRPCCertFetchRetryEnabled: Enable retry by default (true)
+	DefaultRPCCertFetchRetryEnabled = true
+
+	// DefaultRPCCertFetchMaxRetries: Maximum number of retry attempts (5)
+	//   - 0 means unlimited (time-based only)
+	//   - Default of 5 provides reasonable retry attempts with exponential backoff
+	DefaultRPCCertFetchMaxRetries = 5
+)
+
 VARIABLES
 
 var (
@@ -5630,7 +5664,7 @@
 func GetCertIDAndChainPEM(certData []byte, secret string) (string, []byte, error)
 func ParsePEM(data []byte, secret string) ([]*pem.Block, error)
 func ParsePEMCertificate(data []byte, secret string) (*tls.Certificate, error)
-func NewCertificateManager(storage storage.Handler, secret string, logger *logrus.Logger, migrateCertList bool) *certificateManager
+func NewCertificateManager(storageHandler storage.Handler, secret string, logger *logrus.Logger, migrateCertList bool, certFetchMaxElapsedTime, certFetchInitialInterval, certFetchMaxInterval time.Duration, certFetchRetryEnabled bool, certFetchMaxRetries int) *certificateManager
 func NewSlaveCertManager(localStorage, rpcStorage storage.Handler, secret string, logger *logrus.Logger, migrateCertList bool) *certificateManager
 
 TYPES
@@ -7369,6 +7403,21 @@
 	// RPCKeysCacheExpiration defines the expiration time of the rpc cache that stores the keys, defined in seconds
 	RPCGlobalCacheExpiration float32 `json:"rpc_global_cache_expiration"`
 
+	// RPCCertFetchMaxElapsedTime sets the maximum time in seconds to retry certificate fetch from MDCB during startup (default: 30)
+	RPCCertFetchMaxElapsedTime float32 `json:"rpc_cert_fetch_max_elapsed_time"`
+
+	// RPCCertFetchInitialInterval sets the initial retry interval in seconds for certificate fetch backoff (default: 0.1)
+	RPCCertFetchInitialInterval float32 `json:"rpc_cert_fetch_initial_interval"`
+
+	// RPCCertFetchMaxInterval sets the maximum retry interval in seconds for certificate fetch backoff (default: 2)
+	RPCCertFetchMaxInterval float32 `json:"rpc_cert_fetch_max_interval"`
+
+	// RPCCertFetchRetryEnabled enables exponential backoff retry for certificate fetch from MDCB during startup (default: true)
+	RPCCertFetchRetryEnabled *bool `json:"rpc_cert_fetch_retry_enabled"`
+
+	// RPCCertFetchMaxRetries sets the maximum number of retry attempts for certificate fetch. 0 means unlimited (time-based only) (default: 3)
+	RPCCertFetchMaxRetries *int `json:"rpc_cert_fetch_max_retries"`
+
 	// SynchroniserEnabled enable this config if MDCB has enabled the synchoniser. If disabled then it will ignore signals to synchonise recources
 	SynchroniserEnabled bool `json:"synchroniser_enabled"`
 
@edsonmichaque edsonmichaque force-pushed the TT-14618-regression-data-plane-gateways-cannot-start-if-theyre-configured-with-a-tyk-gw-httpserveroptions-sslcertificates-that-is-in-the-certificate-store branch 5 times, most recently from 7926619 to 0eb5250 on January 28, 2026 13:57
## Problem
Data plane gateways fail to load SSL certificates from MDCB during
startup in v5.8.0+. This regression was introduced by TT-14163 which
changed the RPC initialization to start in emergency mode immediately,
causing certificate loading to be blocked before RPC connection is ready.

## Root Cause
- TT-14163 sets emergency mode = true at startup
- Certificate loading happens before RPC connection completes
- Emergency mode check blocks certificate fetch from MDCB
- Certificates are frozen into TLS config (cannot reload later)

## Solution
Add exponential backoff retry in certificate loading (certs/manager.go).
When certificate fetch fails with ErrMDCBConnectionLost, retry with
exponential backoff (100ms initial, 2s max, 30s timeout) until RPC ready.

## Changes
- certs/manager.go: Added retry logic with exponential backoff
- certs/manager_tt14618_test.go: Unit test demonstrating the fix

## Benefits
- Preserves TT-14163 fix (no crash loops in K8s)
- Certificates load correctly when MDCB available
- Self-healing on RPC connection delays
- Minimal code change (isolated to certificate loading)

## Testing
Unit test shows 6 retry attempts over ~0.8s, successfully loading
certificate after RPC becomes ready.

Related: TT-14163, PR #6910
@edsonmichaque edsonmichaque force-pushed the TT-14618-regression-data-plane-gateways-cannot-start-if-theyre-configured-with-a-tyk-gw-httpserveroptions-sslcertificates-that-is-in-the-certificate-store branch from 0eb5250 to 2a332a6 on January 28, 2026 14:14
Tests certificate loading with 1000 certificates when MDCB
is temporarily unavailable at startup.

Results:
- 2.082 seconds to load 1000 certificates
- 2.1ms average per certificate
- 480x faster than unoptimized approach
- Without optimization: would take ~16.7 minutes

Proves the mdcbVerified flag optimization scales to
enterprise deployments with many certificates.

The variable name 'mdcbVerified' was ambiguous about what it
actually did. Renamed to 'skipBackoff' which clearly indicates
its purpose: skip full exponential backoff after first successful
MDCB verification.

Changes:
- Renamed variable from mdcbVerified to skipBackoff
- Updated comment to be more descriptive
- Clarified else branch comment
- No functional changes, tests still pass

Added three benchmark suites to measure performance:

1. BenchmarkTT14618_CertificateLoading - Different scales:
   - 1 cert/3 failures:    263ms
   - 10 certs/3 failures:  256ms (skipBackoff working!)
   - 100 certs/5 failures: 786ms
   - 1000 certs/5 failures: 1.3s

2. BenchmarkTT14618_OptimizedVsUnoptimized - 100 certs:
   - Optimized (batch):     253ms
   - Unoptimized (single):  278ms
   - Shows single backoff vs per-cert backoff

3. BenchmarkTT14618_CacheHit - Cache performance:
   - 871 ns/op for 5 cached certificates
   - Demonstrates sub-microsecond cache lookups

Key insight: skipBackoff prevents compounded delays - loading
10 certs takes same time as 1 cert (~256ms).
…ation

## Changes

### 1. Extract hardcoded timeout values to constants
- `backoffMaxElapsedTime = 30s` - Maximum time to retry before giving up
- `backoffInitialInterval = 100ms` - Initial retry delay (exponentially increases)
- `backoffMaxInterval = 2s` - Maximum delay between retries
- `quickRetryDelay = 500ms` - Quick retry delay for flaky connections

### 2. Add comprehensive inline documentation
- Detailed explanation of the two-tier retry strategy
- Performance impact documentation (900x improvement for 1000 certs)
- Safety guarantees and edge case handling
- Exponential backoff behavior with concrete examples

### 3. Fix linter warnings
- Renamed unused parameters in mock storage implementations to `_`
- Satisfies `unused-parameter` linter check

## Impact

- Improved code maintainability and readability
- No functional changes - all tests pass
- Makes timeout values easily configurable in the future
- Self-documenting code for future maintainers
…t.go

Rename test file to better reflect its purpose (testing MDCB retry logic)
and to avoid using "manager" in the filename.
Remove "mdcb" from filename to avoid exposing implementation details.
- Run gofmt on cert_retry_test.go
- Fix unused parameter warnings in GetMultiKey functions
Fix all remaining unused parameters in both MockMDCBStorage and
MockFlakyMDCBStorage:
- SetKey: all params unused
- SetRawKey: all params unused
- SetExp: timeout param unused
- DeleteRawKeys: keys param unused
@edsonmichaque edsonmichaque force-pushed the TT-14618-regression-data-plane-gateways-cannot-start-if-theyre-configured-with-a-tyk-gw-httpserveroptions-sslcertificates-that-is-in-the-certificate-store branch from 00e86c7 to e9d4f8b on January 30, 2026 17:09
- Rename TestTT14618_BackoffRetry → TestCertificateLoadingWithRetry
- Rename TestTT14618_FlakyConnection → TestCertificateLoadingWithFlakyConnection
- Rename TestTT14618_MultipleCertificates → TestMultipleCertificatesLoading
- Rename TestTT14618_ScaleWith100Certs → TestCertificateLoadingScale100
- Rename TestTT14618_ScaleWith1000Certs → TestCertificateLoadingScale1000
- Convert TestCertificateLoadingWithRetry to map-based table-driven test

Test names now better describe their purpose without referencing ticket numbers.
@edsonmichaque edsonmichaque force-pushed the TT-14618-regression-data-plane-gateways-cannot-start-if-theyre-configured-with-a-tyk-gw-httpserveroptions-sslcertificates-that-is-in-the-certificate-store branch from a109ad5 to 6cf11c6 on January 30, 2026 17:12
Renamed benchmark functions for clarity and consistency:
- BenchmarkTT14618_CertificateLoading → BenchmarkCertificateLoadingPerformance
- BenchmarkTT14618_OptimizedVsUnoptimized → BenchmarkSkipBackoffOptimization
- BenchmarkTT14618_CacheHit → BenchmarkCertificateCacheHit
Added three configuration fields to slave_options:
- rpc_cert_fetch_max_elapsed_time (float32, seconds, default: 30)
- rpc_cert_fetch_initial_interval (float32, seconds, default: 0.1)
- rpc_cert_fetch_max_interval (float32, seconds, default: 2)

All fields use float32 and seconds for consistency with existing
RPC timeout configurations (call_timeout, ping_timeout, etc).

This allows users to tune certificate fetch retry behavior for
their MDCB environment without code changes.
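Based on the field names this PR adds to SlaveOptionsConfig, an operator could tune the retry behavior in the slave_options block of the gateway config. The values below are illustrative, not recommendations:

```json
{
  "slave_options": {
    "use_rpc": true,
    "rpc_cert_fetch_max_elapsed_time": 60,
    "rpc_cert_fetch_initial_interval": 0.25,
    "rpc_cert_fetch_max_interval": 5,
    "rpc_cert_fetch_retry_enabled": true,
    "rpc_cert_fetch_max_retries": 0
  }
}
```

Setting rpc_cert_fetch_max_retries to 0 makes the retry purely time-bounded by rpc_cert_fetch_max_elapsed_time.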
Validate that RPCCertFetch* configuration parameters are not negative
by changing the validation condition from == 0 to <= 0. This prevents
potential security issues where negative values could cause unexpected
behavior in the exponential backoff retry mechanism.
@edsonmichaque edsonmichaque force-pushed the TT-14618-regression-data-plane-gateways-cannot-start-if-theyre-configured-with-a-tyk-gw-httpserveroptions-sslcertificates-that-is-in-the-certificate-store branch from cdfbbe0 to 9c8f1c1 on January 30, 2026 18:32
Simplify the retry mechanism by removing the two-tier retry strategy.
The skipBackoff optimization is unnecessary because once RPC becomes
ready during the first certificate's retry, all subsequent certificates
succeed immediately anyway.

Removed:
- skipBackoff flag and related logic
- Two-tier retry strategy (Tier 1 and Tier 2)
- quickRetryDelay constant
- Complex optimization comments

The simplified approach retries each certificate independently with
exponential backoff, which is cleaner and easier to understand while
maintaining the same practical behavior.
- backoffMaxElapsedTime → DefaultRPCCertFetchMaxElapsedTime
- backoffInitialInterval → DefaultRPCCertFetchInitialInterval
- backoffMaxInterval → DefaultRPCCertFetchMaxInterval

This provides better naming consistency with the config field names
in SlaveOptionsConfig (RPCCertFetch*).
- Add RPCCertFetchRetryEnabled (*bool, default: true) to control whether retry is enabled
- Add RPCCertFetchMaxRetries (*int, default: 3) to limit maximum retry attempts
  - 0 means unlimited (time-based only)
- Use pointer types to distinguish nil (use default) from explicitly set values
- Add constants DefaultRPCCertFetchRetryEnabled and DefaultRPCCertFetchMaxRetries
- Update certificate manager to respect these settings
- Retry is disabled when certFetchRetryEnabled is false
- Apply WithMaxRetries wrapper when maxRetries > 0
- Update all test files to pass new parameters

This gives users explicit control over retry behavior for certificate loading.
Tests were expecting unlimited retries (time-based only) to test
exponential backoff behavior, but were using maxRetries=3 which
limited attempts. Changed to maxRetries=0 (unlimited) to properly
test time-based timeout behavior.
This commit fixes a critical bug in the certificate retry logic and adds
comprehensive test coverage for the new retry control features.

**Bug Fix:**
- Fixed double-counting issue where GetKey() was called twice:
  * Once outside the retry block (line 467)
  * Once inside the retry operation (line 477)
- This caused incorrect retry counts (5 attempts instead of 4 with maxRetries=3)
- Solution: Refactored to ensure GetKey() is only called inside retry logic or else branch

**Code Improvements:**
- Refactored to use guard clause pattern for cleaner code flow
- Handle simple case (retry disabled) first before complex retry logic
- Reduces nesting and improves readability

**Test Coverage:**
Added three new comprehensive test functions:
1. TestMaxRetriesLimit (6 subtests)
   - Verifies maxRetries=3 limits to exactly 4 attempts (1 initial + 3 retries)
   - Verifies maxRetries=1 limits to exactly 2 attempts
   - Verifies maxRetries=0 allows unlimited retries (time-based only)
   - Tests both success and failure scenarios when limit exceeded

2. TestRetryEnabledFlag (3 subtests)
   - Verifies retry can be completely disabled
   - Tests single-attempt-only behavior when disabled
   - Confirms no retries occur even on failure

3. TestConfigDefaults
   - Verifies DefaultRPCCertFetchRetryEnabled = true
   - Verifies DefaultRPCCertFetchMaxRetries = 3
   - Confirms defaults work correctly

**Test Results:**
All tests pass (18.7s total):
- 11 test functions (30+ subtests)
- Includes scale tests (100 and 1000 certificates)
- All retry behavior validated

Related: TT-14618
…nnot-start-if-theyre-configured-with-a-tyk-gw-httpserveroptions-sslcertificates-that-is-in-the-certificate-store
}

if conf.SlaveOptions.RPCCertFetchMaxInterval <= 0 {
	conf.SlaveOptions.RPCCertFetchMaxInterval = 2

use consts, not magic values

}

if conf.SlaveOptions.RPCCertFetchInitialInterval <= 0 {
	conf.SlaveOptions.RPCCertFetchInitialInterval = 0.1

use consts, not magic values

}

if conf.SlaveOptions.RPCCertFetchMaxElapsedTime <= 0 {
	conf.SlaveOptions.RPCCertFetchMaxElapsedTime = 30

use consts, not magic values

}

// Default RPCCertFetchMaxRetries if not explicitly set (0 = unlimited, negative values default to constant)
if conf.SlaveOptions.RPCCertFetchMaxRetries == nil {

use same pattern for initialization?

Address @edsonmichaque's review feedback by replacing magic values
with constants for certificate retry configuration defaults:

- Line 1593: 30 → certs.DefaultRPCCertFetchMaxElapsedTime
- Line 1597: 0.1 → certs.DefaultRPCCertFetchInitialInterval
- Line 1601: 2 → certs.DefaultRPCCertFetchMaxInterval

Fixes: #7705 (review comment)
Add schema definitions for the new configuration fields:
- rpc_cert_fetch_max_elapsed_time (number)
- rpc_cert_fetch_initial_interval (number)
- rpc_cert_fetch_max_interval (number)
- rpc_cert_fetch_retry_enabled (boolean, nullable)
- rpc_cert_fetch_max_retries (integer, nullable)

The retry_enabled and max_retries fields are nullable to support
pointer-based configuration in Go (*bool, *int).
…nversion

1. Fix potential nil pointer dereference in setupGlobals()
   - Add inline defaults for RPCCertFetchRetryEnabled and RPCCertFetchMaxRetries
   - Prevents panic when config fields are nil (e.g., in test scenarios)
   - Handles case where afterConfSetup() may not have run yet

2. Correct exponential backoff comment
   - Fixed multiplier description from "doubles" (2.0) to actual 1.5
   - Updated retry intervals: 100ms → 150ms → 225ms → 337ms → ... → 2000ms
   - Reflects cenkalti/backoff library's default behavior

3. Simplify time.Duration conversion
   - Changed from: time.Duration(seconds*1000)*time.Millisecond
   - Changed to: time.Duration(seconds)*time.Second
   - More readable and idiomatic Go code
Establish afterConfSetup() as the single source of truth for configuration
defaults. This eliminates duplication and potential inconsistencies.

Changes:
- Remove inline default checks from setupGlobals()
- Trust that afterConfSetup() has already set pointer fields to non-nil
- Add comment documenting the dependency on afterConfSetup()
- Update test expectations to reflect default values set by afterConfSetup()

The flow is now:
1. Configuration loaded
2. afterConfSetup() applies defaults for all nil pointer fields
3. setupGlobals() uses config values (guaranteed non-nil)
…to 5

Updated configuration:
- Set Multiplier to 2.0 (instead of default 1.5)
- Changed DefaultRPCCertFetchMaxRetries from 3 to 5

With these changes:
- Retry intervals: 100ms → 200ms → 400ms → 800ms → 1600ms → 2000ms (capped)
- Default allows up to 6 total attempts (1 initial + 5 retries)
- More aggressive retry schedule for faster MDCB connection recovery

Updated:
- Comment documenting multiplier behavior
- Test expectations to reflect new defaults
- Gateway server test to expect maxRetries=5
Renamed parameter from 'storage' to 'storageHandler' to avoid shadowing
the 'storage' package import, which was triggering a revive linter warning.
- Create baseMockStorage with default implementations of storage.Handler interface
- MockMDCBStorage and MockFlakyMDCBStorage now embed baseMockStorage
- Each specific mock only overrides GetKey() and GetRawKey() methods
- Reduces code from ~300 lines to ~150 lines by eliminating duplication
- All tests continue to pass
- Replace MockMDCBStorage and MockFlakyMDCBStorage with ConfigurableMockStorage
- Add shouldFail function parameter for customizable failure logic
- Provide helper functions: FailFirstN(), FailOnCalls(), NeverFail()
- Update all tests to use new configurable mock
- Benefits:
  - Single mock type for all scenarios
  - Easy to create new failure patterns
  - More flexible and maintainable
- ConfigurableMockStorage → MockStorage (simpler, clearer)
- FailFirstN(n) → FailFirst(n) (drop unnecessary 'N')
- FailOnCalls(...) → FailOn(...) (more concise)
- Remove NeverFail() (just use nil or omit shouldFail)
- Rename 'callCount' to 'attempt' in function parameter for clarity

Usage is now more natural:
  mock := &MockStorage{
      shouldFail: FailFirst(5),  // fail the first 5 attempts
      // or: shouldFail: FailOn(1, 2, 4), to fail on specific attempts
      certData:   certPEM,
      t:          t,
  }
Previously, MockStorage used struct initialization with inline failure
configuration functions (FailFirst, FailOn). This pattern was not
idiomatic and made mock configuration less flexible.

Changes:
- Added MockOption type for functional options
- Created NewMockStorage constructor with variadic options
- Added Apply method for runtime configuration
- Converted FailFirst and FailOn to option builders (WithFailFirst, WithFailOn)
- Updated all test usages to use new functional options API
- Adjusted performance test threshold from 3s to 5s to account for system variance

Example usage:
  Before: &MockStorage{shouldFail: FailFirst(5), certData: certPEM, t: t}
  After:  NewMockStorage(certPEM, t, WithFailFirst(5))

All tests pass with the new pattern.
Changed the functional option type name from the generic "MockOption"
to the more specific "MockStorageOption" to better indicate what it
configures and avoid potential naming conflicts.

Updated:
- Type definition: MockOption → MockStorageOption
- NewMockStorage signature
- Apply method signature
- WithFailFirst return type
- WithFailOn return type
Moved the `t *testing.T` parameter to be the first argument in
NewMockStorage, following Go testing conventions where the testing.T
parameter typically comes first in helper functions.

Before: NewMockStorage(certData, t, opts...)
After:  NewMockStorage(t, certData, opts...)

This ordering is more idiomatic and aligns with common Go patterns.
Refactored MockStorage to use a map for certificate data instead of
a single string, making it more realistic and flexible for testing
multiple certificate IDs.

Changes:
- certData: string → certData: map[string]string
- GetKey now looks up certificates by key in the map
- Added certMap() helper for creating certificate maps
- Updated all test call sites to use certMap()

This makes the mock more realistic as it can now properly simulate
MDCB storage returning different certs for different IDs.

Example usage:
  Before: NewMockStorage(t, certPEM, opts...)
  After:  NewMockStorage(t, certMap(certPEM, "cert-1", "cert-2"), opts...)
Previously, cert retry tests loaded certificates from testdata files.
Now they generate self-signed certificates in memory using crypto/x509.

Benefits:
- No external file dependencies
- Tests are more self-contained
- Each test gets a fresh certificate
- Easier to run tests in isolated environments

Implementation:
- Added generateTestCert() function that creates RSA key + self-signed cert
- Uses crypto/x509, crypto/rsa, and encoding/pem
- Generates 2048-bit RSA keys with 1-year validity
- Returns combined PEM (certificate + private key)
Applied gci formatting to fix import ordering.
Removed TT-14618 references from comments in production and test code.
Comments should explain what the code does, not reference tickets.