High-performance regex matching for R using Rust via extendr. Compares multiple implementations including base R, parallel R (mirai/furrr), and Rust (regex and fancy-regex engines).
- 10-80x faster than stringr: High-performance Rust implementation
- Auto-optimization: Engine selection, parallel strategy, and memory chunking happen automatically
- Simple API: Single `string_detect()` function with smart defaults
- Fixed matching mode: Fast literal string matching with `fixed = TRUE`
- Ecosystem compatible: Drop-in replacement for `stringr::str_detect()`
# Install Rust first: https://rustup.rs/
# Then install the package
devtools::install()

library(stringrs)
# Simple detection with smart defaults
strings <- c("apple pie", "banana bread", "cherry tart")
pattern <- "a[aeiou]"
string_detect(strings, pattern)
# Multiple patterns - returns wide tibble
patterns <- c("a[aeiou]", "[bt]read", "^ch")
result <- string_detect(strings, patterns)
# Returns: tibble with columns: string, a.aeiou., X.bt.read., X.ch
# Fixed (literal) matching - bypasses regex for speed
strings <- c("Hello world", "hello there", "HELLO")
string_detect(strings, "hello", fixed = TRUE) # case-sensitive literal match

string_detect(
  strings,      # Character vector of strings to search
  pattern,      # Character vector of patterns (single or multiple)
  fixed = FALSE # Use literal matching (bypasses regex)
)

Parameters:
- `strings`: Character vector of strings to search
- `pattern`: Character vector of patterns (single pattern or multiple patterns)
- `fixed`: Logical. If `TRUE`, use literal string matching (bypasses regex engine for speed). Default is `FALSE` (regex matching).
Returns:
- Single pattern: Logical vector (same length as `strings`)
- Multiple patterns: Wide tibble with one column per pattern plus a `string` column
Auto-optimization (all transparent):
- Engine selection: Auto-detects fancy regex features (backrefs, lookaheads) and chooses appropriate engine
- Parallel strategy: Auto-selects optimal parallelization based on data dimensions
- Memory chunking: Auto-enabled when result would exceed memory threshold
# Quick test (reduced test cases, single iteration)
Rscript benchmark/run.R --quick
# Full benchmark suite (19 scenarios)
Rscript benchmark/run.R
# Custom workers and iterations
Rscript benchmark/run.R --workers 8 --iterations 5
# Include comparison to previous runs
Rscript benchmark/run.R --compare

The benchmark suite tests various scales:
| Test | Strings | Patterns | Description |
|---|---|---|---|
| 1Kx1 | 1,000 | 1 | Small single pattern |
| 10Kx1 | 10,000 | 1 | Medium single pattern |
| 100Kx1 | 100,000 | 1 | Large single pattern |
| 1Kx10 | 1,000 | 10 | Small multi-pattern |
| 10Kx10 | 10,000 | 10 | Medium multi-pattern |
| 100Kx10 | 100,000 | 10 | Large multi-pattern |
| 10Kx50 | 10,000 | 50 | Many patterns |
| 10Kx100 | 10,000 | 100 | Very many patterns |
- R Base: `grepl()`, `stringr::str_detect()`, `stringi::stri_detect_regex()`
- R Parallel (mirai): Parallel over strings or patterns
- R Parallel (furrr): `future_map()` over strings or patterns
- Rust (regex): Standard regex crate, single and parallel
- Rust (fancy-regex): Supports backrefs/lookaheads, single and parallel
Set fixed = TRUE when:
- You're searching for literal strings (not regex patterns)
- You need exact substring matching
- You want maximum speed (bypasses regex engine entirely)
Example: string_detect(text, "ERROR", fixed = TRUE)
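As a rough illustration of why literal matching can skip the regex engine, here is a minimal Rust sketch. It assumes fixed mode amounts to a plain substring search; `detect_fixed` is a hypothetical name, and the package's actual Rust code is not shown in this README.

```rust
/// Literal (non-regex) detection: one bool per input string, true where
/// `needle` occurs as a substring. No pattern is compiled, so regex
/// metacharacters in `needle` are treated as ordinary characters.
/// NOTE: hypothetical helper for illustration, not the package's API.
fn detect_fixed(strings: &[&str], needle: &str) -> Vec<bool> {
    strings.iter().map(|s| s.contains(needle)).collect()
}
```

Because the search is case-sensitive, only strings containing the exact bytes of `needle` match, and a needle such as `.` matches a literal dot rather than any character.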
The package automatically optimizes based on your data:
- Engine selection: Patterns containing backreferences such as `(.)\1` or lookaheads such as `(?=...)` are detected and routed to the fancy-regex engine
- Parallel strategy: Large datasets (>10K strings or >50 patterns) auto-enable parallel processing
- Memory chunking: Results >100MB automatically use chunked processing
Advanced users can tune via R options:
# Adjust chunking threshold (in bytes)
options(stringrs.chunk_threshold = 50 * 1024 * 1024) # 50MB
# Reset to default
options(stringrs.chunk_threshold = NULL)

All performance decisions happen transparently:
Engine Selection
- Byte-level pattern analysis detects fancy regex features
- Standard regex: Fast DFA-based engine, used for the roughly 90% of patterns that need no backreferences or lookaheads
- Fancy-regex: Backtracking engine for backrefs/lookaheads only when needed
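The byte-level detection described above might look roughly like the following sketch. It is an assumption about the approach, not the package's code: `Engine` and `select_engine` are hypothetical names, and the scan also flags lookbehinds, which the README does not mention but which the standard `regex` crate likewise does not support.

```rust
/// Engine a pattern should be compiled with (illustrative names).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Engine {
    Standard, // the `regex` crate
    Fancy,    // the `fancy-regex` crate
}

/// Scan the pattern's bytes once and pick an engine.
/// Deliberately conservative: it does not track character classes, so a
/// pattern like `[(?=]` is sent to the fancy engine even though it does
/// not strictly need it. That costs speed, never correctness.
fn select_engine(pattern: &str) -> Engine {
    let bytes = pattern.as_bytes();
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            b'\\' => {
                // Escape sequence: inspect the escaped byte, then skip it
                // so that `\\1` (escaped backslash + "1") is not misread.
                if let Some(&next) = bytes.get(i + 1) {
                    if (b'1'..=b'9').contains(&next) {
                        return Engine::Fancy; // numbered backreference
                    }
                }
                i += 2;
            }
            b'(' => {
                let rest = &bytes[i + 1..];
                let lookaround = rest.starts_with(b"?=")
                    || rest.starts_with(b"?!")
                    || rest.starts_with(b"?<=")
                    || rest.starts_with(b"?<!");
                if lookaround {
                    return Engine::Fancy;
                }
                i += 1;
            }
            _ => i += 1,
        }
    }
    Engine::Standard
}
```

Scanning bytes rather than characters is safe here because `\` and `(` are ASCII and never appear inside a multi-byte UTF-8 sequence. Named backreferences such as `\k<name>` are not handled in this sketch.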
Parallel Strategy
- Data-driven heuristics based on string count × pattern count
- String-parallel: Best for large string vectors with few patterns
- Pattern-parallel: Best for many patterns with smaller string vectors
- Sequential: Small datasets where overhead isn't worth it
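A heuristic of this kind can be sketched in a few lines of Rust. The thresholds (more than 10K strings, more than 50 patterns) come from this README; how the two conditions are combined, and the names `Strategy` and `choose_strategy`, are assumptions made for illustration.

```rust
/// Parallelization strategies named in the README (enum is illustrative).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Strategy {
    Sequential,
    StringParallel,
    PatternParallel,
}

// Thresholds quoted in the README; treated here as exclusive bounds.
const STRING_THRESHOLD: usize = 10_000;
const PATTERN_THRESHOLD: usize = 50;

/// Pick a strategy from the data dimensions alone.
/// ASSUMPTION: a large string vector takes priority over a large pattern
/// set; the real package may weigh the two differently.
fn choose_strategy(n_strings: usize, n_patterns: usize) -> Strategy {
    if n_strings > STRING_THRESHOLD {
        Strategy::StringParallel
    } else if n_patterns > PATTERN_THRESHOLD {
        Strategy::PatternParallel
    } else {
        Strategy::Sequential
    }
}
```

Under these assumptions, the 100Kx10 benchmark scenario would run string-parallel, 10Kx100 would run pattern-parallel, and 1Kx1 would stay sequential.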
Memory Chunking
- Result size estimation triggers automatic chunking
- Default threshold: 100MB (configurable via options)
- Balances overhead vs cache locality for optimal throughput
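The size estimate behind chunking can be illustrated with the sketch below. It assumes the result is a strings × patterns grid of R logicals at 4 bytes per cell (R stores logical values as 32-bit integers) and that chunks are split by rows. The function names are hypothetical, and the package's real estimator may account for more than this.

```rust
/// Default threshold stated in the README: 100MB.
const DEFAULT_CHUNK_THRESHOLD: usize = 100 * 1024 * 1024;
/// R logical vectors use 4 bytes per element.
const BYTES_PER_CELL: usize = 4;

/// Estimated size of the full result grid, saturating instead of overflowing.
fn estimated_result_bytes(n_strings: usize, n_patterns: usize) -> usize {
    n_strings
        .saturating_mul(n_patterns)
        .saturating_mul(BYTES_PER_CELL)
}

/// Returns `None` when the whole result fits under `threshold`, otherwise
/// the number of strings (rows) to process per chunk so that each chunk's
/// result stays within the threshold. Always at least one row per chunk.
fn rows_per_chunk(n_strings: usize, n_patterns: usize, threshold: usize) -> Option<usize> {
    if estimated_result_bytes(n_strings, n_patterns) <= threshold {
        return None;
    }
    // Reaching this point implies n_patterns >= 1, so this is never zero.
    let bytes_per_row = n_patterns.saturating_mul(BYTES_PER_CELL);
    Some((threshold / bytes_per_row).max(1))
}
```

With these assumptions, 10,000 strings against 100 patterns is estimated at about 4MB and is not chunked, while 1,000,000 strings against 100 patterns (about 400MB) would be split into chunks of 262,144 strings at the default threshold.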
stringrs/
├── R/
│ ├── extendr-wrappers.R # Auto-generated Rust bindings
│ └── string_detect.R # Main R API (single function)
├── src/
│ └── rust/
│ ├── src/lib.rs # Rust implementation
│ └── Cargo.toml # Rust dependencies
├── benchmark/
│ ├── run.R # Benchmark runner
│ ├── engines.R # Engine implementations
│ ├── scenarios.R # Test scenarios
│ ├── metrics.R # Performance metrics
│ ├── visualize.R # Result visualization
│ └── compare.R # Baseline comparison
└── tests/
└── testthat/
└── test-detect.R # Unit tests
To add a new regex implementation:
- Add function to `src/rust/src/lib.rs`
- Export via `extendr_module!`
- Add R wrapper in `R/extendr-wrappers.R`
- Add benchmark function in `inst/benchmark_functions.R`
- Include in benchmark runner
MIT + file LICENSE