stringrs: Fast Regex Matching with Rust

High-performance regex matching for R using Rust via extendr. The bundled benchmark suite compares multiple implementations, including base R, parallel R (mirai/furrr), and Rust (regex and fancy-regex engines).

Features

10-80x faster than stringr: High-performance Rust implementation
Auto-optimization: Engine selection, parallel strategy, and memory chunking happen automatically
Simple API: Single string_detect() function with smart defaults
Fixed matching mode: Fast literal string matching with fixed = TRUE
Ecosystem compatible: Drop-in replacement for stringr::str_detect() in the single-pattern case, where it returns a logical vector

Installation

# Install Rust first: https://rustup.rs/
# Then install the package (run from the root of a local clone of this repository)
devtools::install()

Quick Start

library(stringrs)

# Simple detection with smart defaults
strings <- c("apple pie", "banana bread", "cherry tart")
pattern <- "a[aeiou]"
string_detect(strings, pattern)

# Multiple patterns - returns wide tibble
patterns <- c("a[aeiou]", "[bt]read", "^ch")
result <- string_detect(strings, patterns)
# Returns: tibble with columns: string, a.aeiou., X.bt.read., X.ch

# Fixed (literal) matching - bypasses regex for speed
strings <- c("Hello world", "hello there", "HELLO")
string_detect(strings, "hello", fixed = TRUE)  # case-sensitive literal match

API Reference

string_detect()

string_detect(
    strings,           # Character vector of strings to search
    pattern,           # Character vector of patterns (single or multiple)
    fixed = FALSE      # Use literal matching (bypasses regex)
)

Parameters:

strings: Character vector of strings to search
pattern: Character vector of patterns (single pattern or multiple patterns)
fixed: Logical. If TRUE, use literal string matching (bypasses regex engine for speed). Default is FALSE (regex matching).

Returns:

Single pattern: Logical vector (same length as strings)
Multiple patterns: Wide tibble with a string column plus one logical column per pattern

Auto-optimization (all transparent):

Engine selection: Auto-detects fancy regex features (backrefs, lookaheads) and chooses appropriate engine
Parallel strategy: Auto-selects optimal parallelization based on data dimensions
Memory chunking: Auto-enabled when result would exceed memory threshold

Benchmarking

Run Benchmarks

# Quick test (reduced test cases, single iteration)
Rscript benchmark/run.R --quick

# Full benchmark suite (19 scenarios)
Rscript benchmark/run.R

# Custom workers and iterations
Rscript benchmark/run.R --workers 8 --iterations 5

Compare to Historical Baseline

# Include comparison to previous runs
Rscript benchmark/run.R --compare

Test Configurations

The benchmark suite covers a range of data scales, including:

Test	Strings	Patterns	Description
1Kx1	1,000	1	Small single pattern
10Kx1	10,000	1	Medium single pattern
100Kx1	100,000	1	Large single pattern
1Kx10	1,000	10	Small multi-pattern
10Kx10	10,000	10	Medium multi-pattern
100Kx10	100,000	10	Large multi-pattern
10Kx50	10,000	50	Many patterns
10Kx100	10,000	100	Very many patterns

Implementations Tested

R (single-threaded): base grepl(), stringr::str_detect(), stringi::stri_detect_regex()
R Parallel (mirai): Parallel over strings or patterns
R Parallel (furrr): future_map() over strings or patterns
Rust (regex): Standard regex crate, single and parallel
Rust (fancy-regex): Supports backrefs/lookaheads, single and parallel

Performance Tips

When to Use Fixed Matching

Set fixed = TRUE when:

You're searching for literal strings (not regex patterns)
You need exact substring matching
You want maximum speed (bypasses regex engine entirely)

Example: string_detect(text, "ERROR", fixed = TRUE)
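
To make the difference from regex matching concrete, here is a minimal Rust sketch of literal detection. It assumes fixed matching reduces to a plain substring search per input; detect_fixed is a hypothetical name used for illustration, not the package's actual Rust function.

```rust
/// Literal (fixed) detection: one substring search per input string,
/// with no regex compilation and no special meaning for metacharacters.
/// `detect_fixed` is a hypothetical name, not the package's real export.
pub fn detect_fixed(strings: &[&str], needle: &str) -> Vec<bool> {
    strings.iter().map(|s| s.contains(needle)).collect()
}
```

Because the needle is never parsed as a pattern, characters such as . or [ match only themselves, so a search for "a.b" finds "a.b" but not "axb".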

When Auto-Optimization Helps

The package automatically optimizes based on your data:

Engine selection: Patterns containing backreferences such as (.)\1 or lookaheads such as (?=...) are detected and routed to the fancy-regex engine
Parallel strategy: Large inputs (>10K strings or >50 patterns) automatically enable parallel processing
Memory chunking: Results larger than 100MB are automatically processed in chunks

Power User Configuration

Advanced users can tune via R options:

# Adjust chunking threshold (in bytes)
options(stringrs.chunk_threshold = 50 * 1024 * 1024)  # 50MB

# Reset to default
options(stringrs.chunk_threshold = NULL)

Architecture

Auto-Optimization Strategy

All performance decisions happen transparently:

Engine Selection

Byte-level pattern analysis detects fancy regex features
Standard regex: Fast automata-based engine with linear-time matching, used for the roughly 90% of patterns that need no backrefs or lookaheads
Fancy-regex: Backtracking engine for backrefs/lookaheads only when needed
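
The byte-level check described above could look roughly like the following sketch. It is an illustration under assumptions, not the package's actual detection code: needs_fancy_engine is a hypothetical name, and the scan only looks for numbered backreferences (\1 to \9) and the four lookaround openers, which are features the standard regex crate does not support.

```rust
/// Returns true when `pattern` appears to use a backreference (`\1`..`\9`)
/// or a lookaround (`(?=`, `(?!`, `(?<=`, `(?<!`).
/// Hypothetical helper; the package's real detection logic may differ.
pub fn needs_fancy_engine(pattern: &str) -> bool {
    let bytes = pattern.as_bytes();
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            b'\\' => {
                // An escape consumes the following byte; a digit 1-9 there
                // means a numbered backreference.
                if let Some(&next) = bytes.get(i + 1) {
                    if (b'1'..=b'9').contains(&next) {
                        return true;
                    }
                }
                i += 2;
            }
            b'(' => {
                // Unescaped group opener: check for a lookaround prefix.
                let rest = &bytes[i + 1..];
                if rest.starts_with(b"?=")
                    || rest.starts_with(b"?!")
                    || rest.starts_with(b"?<=")
                    || rest.starts_with(b"?<!")
                {
                    return true;
                }
                i += 1;
            }
            _ => i += 1,
        }
    }
    false
}
```

The sketch is deliberately conservative: it does not track character classes, so a pattern like [(?=] is flagged even though it needs no lookahead. A false positive only routes the pattern to the slower engine; it does not change the match result. Named backreferences are not covered here.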

Parallel Strategy

Data-driven heuristics based on string count × pattern count
String-parallel: Best for large string vectors with few patterns
Pattern-parallel: Best for many patterns with smaller string vectors
Sequential: Small datasets where overhead isn't worth it
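
As a rough illustration of such a heuristic, the sketch below picks a strategy from the two input dimensions. The thresholds are assumptions borrowed from the figures quoted under "When Auto-Optimization Helps" (10K strings, 50 patterns); the names Strategy and choose_strategy, and the tie-break when both dimensions are large, are hypothetical.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Strategy {
    Sequential,
    StringParallel,
    PatternParallel,
}

// Assumed cut-offs; the package's real values and rules may differ.
const STRING_THRESHOLD: usize = 10_000;
const PATTERN_THRESHOLD: usize = 50;

/// Chooses how to split the work for `n_strings` x `n_patterns` matches.
pub fn choose_strategy(n_strings: usize, n_patterns: usize) -> Strategy {
    let many_strings = n_strings > STRING_THRESHOLD;
    let many_patterns = n_patterns > PATTERN_THRESHOLD;
    match (many_strings, many_patterns) {
        // Small input: thread start-up cost outweighs any gain.
        (false, false) => Strategy::Sequential,
        // Many patterns over a modest string vector: split by pattern.
        (false, true) => Strategy::PatternParallel,
        // Large string vector (assumed tie-break: strings win when both are large).
        (true, _) => Strategy::StringParallel,
    }
}
```

Under these assumed cut-offs, the 100Kx10 benchmark case would run string-parallel, 10Kx100 pattern-parallel, and 1Kx1 sequentially.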

Memory Chunking

Result size estimation triggers automatic chunking
Default threshold: 100MB (configurable via options)
Chunk size balances per-chunk overhead against cache locality to keep throughput high
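
The size estimate and chunk split can be sketched as follows. This is a simplified model under stated assumptions: each result cell is counted as 4 bytes (R stores a logical value as a 32-bit integer), chunks are taken over rows of strings, and plan_chunks and the constants are hypothetical names rather than the package's internals.

```rust
/// Assumed cost of one result cell: an R logical occupies 4 bytes.
const BYTES_PER_CELL: usize = 4;

/// Default threshold from the text above: 100MB.
pub const DEFAULT_THRESHOLD: usize = 100 * 1024 * 1024;

/// Splits `n_strings` rows into half-open `(start, end)` ranges so that each
/// chunk's estimated result stays within `threshold_bytes`.
/// Returns a single range when the whole result already fits.
pub fn plan_chunks(
    n_strings: usize,
    n_patterns: usize,
    threshold_bytes: usize,
) -> Vec<(usize, usize)> {
    if n_strings == 0 {
        return Vec::new();
    }
    // Estimated bytes for one row of the result (one cell per pattern).
    let row_bytes = n_patterns.max(1).saturating_mul(BYTES_PER_CELL);
    let total_bytes = n_strings.saturating_mul(row_bytes);
    if total_bytes <= threshold_bytes {
        return vec![(0, n_strings)];
    }
    // Always make progress, even if one row alone exceeds the threshold.
    let rows_per_chunk = (threshold_bytes / row_bytes).max(1);
    (0..n_strings)
        .step_by(rows_per_chunk)
        .map(|start| (start, (start + rows_per_chunk).min(n_strings)))
        .collect()
}
```

With a 16-byte threshold, 10 strings against 1 pattern (40 bytes estimated) would be split into rows 0-4, 4-8, and 8-10. Lowering the threshold yields more, smaller chunks, which is the trade-off the stringrs.chunk_threshold option exposes.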

Development

Project Structure

stringrs/
├── R/
│   ├── extendr-wrappers.R    # Auto-generated Rust bindings
│   └── string_detect.R        # Main R API (single function)
├── src/
│   └── rust/
│       ├── src/lib.rs          # Rust implementation
│       └── Cargo.toml          # Rust dependencies
├── benchmark/
│   ├── run.R                   # Benchmark runner
│   ├── engines.R               # Engine implementations
│   ├── scenarios.R             # Test scenarios
│   ├── metrics.R               # Performance metrics
│   ├── visualize.R             # Result visualization
│   └── compare.R               # Baseline comparison
└── tests/
    └── testthat/
        └── test-detect.R       # Unit tests

Adding New Implementations

To add a new regex implementation:

Add function to src/rust/src/lib.rs
Export via extendr_module!
Regenerate the R wrappers in R/extendr-wrappers.R (the file is auto-generated; rextendr::document() rebuilds it)
Add benchmark function in inst/benchmark_functions.R
Include in benchmark runner

License

MIT + file LICENSE
