Comparing changes

base repository: swiftlang/swift-experimental-string-processing
base: swift-5.9.2-RELEASE
head repository: swiftlang/swift-experimental-string-processing
compare: swift-5.10-RELEASE
  • 5 commits
  • 50 files changed
  • 8 contributors

Commits on Apr 5, 2023

  1. Update swift/main with recent changes (#651)

    * Atomically load the lowered program (#610)
    
    Since we're atomically initializing the compiled program in
    `Regex.Program`, we need to pair that with an atomic load.
    
    Resolves #609.
    
    * Add tests for line start/end word boundary diffs (#616)
    
    The `default` and `simple` word boundaries have different behaviors
    at the start and end of strings/lines. These tests validate that we
    have the correct behavior implemented. Related to issue #613.
    
    * Add tweaks for Android
    
    * Fix documentation typo (#615)
    
    * Fix abstract for Regex.dotMatchesNewlines(_:). (#614)
    
    The old version looks like it was accidentally duplicated from
    anchorsMatchLineEndings(_:) just below it.
    
    * Remove `RegexConsumer` and fix its dependencies (#617)
    
    * Remove `RegexConsumer` and fix its dependencies
    
    This eliminates the RegexConsumer type and rewrites its users to call
    through to other, existing functionality on Regex or in the Algorithms
    implementations. RegexConsumer doesn't take account of the dual
    subranges required for matching, so it can produce results that are
    inconsistent with matches(of:) and ranges(of:), which were rewritten
    earlier.
    
    rdar://102841216
    
    * Remove remaining from-end algorithm methods
    
    This removes methods that are left over from when we were considering
    from-end algorithms. These aren't tested and may not have the correct
    semantics, so it's safer to remove them entirely.
    
    * Improve StringProcessing and RegexBuilder documentation (#611)
    
    This includes documentation improvements for core types/methods,
    RegexBuilder types along with their generated variadic initializers,
    and adds some curation. It also includes tests of the documentation
    code samples.
    
    * Set availability for inverted character class test (#621)
    
    This feature depends on running with a Swift 5.7 stdlib, and fails
    when that isn't available.
    
    * Add type annotations in RegexBuilder tests
    
    These changes work around a change to the way result builders are
    compiled that removes the ability for result builder closure outputs
    to affect the overload resolution elsewhere in an expression.
    
    Workarounds for rdar://104881395 and rdar://104645543
    
    * Workaround for fileprivate array issue
    
    A recent compiler change results in fileprivate arrays sometimes
    not keeping their buffers around long enough. This change avoids that
    issue by removing the fileprivate annotations from the affected type.
    
    * Fix an issue where named character classes weren't getting converted in the result builder. <rdar://104480703>
    
    * Stop at end of search string in TwoWaySearcher (#631)
    
    When searching for a substring that doesn't exist, it was possible
    for TwoWaySearcher to advance beyond the end of the search string,
    causing a crash. This change adds a `limitedBy:` parameter to that
    index movement, avoiding the invalid movement.
    
    Fixes rdar://105154010
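
As a minimal sketch of the bounded index movement described above (not the library's actual code), the standard `index(_:offsetBy:limitedBy:)` method returns `nil` instead of trapping when an offset would pass the limit. The `advance` helper and its parameter names are hypothetical and exist only for illustration.

```swift
// Hypothetical helper (not from TwoWaySearcher): move `index` forward by
// `distance` characters, but never past the end of `text`.
func advance(_ index: String.Index, by distance: Int, in text: String) -> String.Index? {
    // Returns nil rather than crashing when the offset would pass `endIndex`.
    text.index(index, offsetBy: distance, limitedBy: text.endIndex)
}

let text = "abc"
let inBounds = advance(text.startIndex, by: 2, in: text)     // index of "c"
let atEnd = advance(text.startIndex, by: 3, in: text)        // exactly endIndex
let outOfBounds = advance(text.startIndex, by: 5, in: text)  // nil, no crash
```

A caller can treat the `nil` result as "no further match is possible" and stop searching.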
    
    * Correct misspelling in DSL renderer (#627)
    
    vertial -> vertical
    
    rdar://104602317
    
    * Fix output type mismatch with RegexBuilder (#626)
    
    Some regex literals (and presumably other `Regex` instances) lose
    their output type information when used in a RegexBuilder closure due
    to the way the concatenating builder calls are overloaded. In
    particular, any output type with labeled tuples or where the sum of
    tuple components in the accumulated and new output types is greater
    than 10 will be ignored.
    
    Regex internals don't make this distinction, however, so there ends up
    being a mismatch between what a `Regex.Match` instance tries to
    produce and the output type of the outermost regex. For example, this
    code results in a crash, because `regex` is a `Regex<Substring>`
    but the match tries to produce a `(Substring, number: Substring)`:
    
        let regex = Regex {
            ZeroOrMore(.whitespace)
            /:(?<number>\d+):/
            ZeroOrMore(.whitespace)
        }
        let match = try regex.wholeMatch(in: " :21: ")
        print(match!.output)
    
    To fix this, we add a new `ignoreCapturesInTypedOutput` DSLTree node
    to mark situations where the output type is discarded. This status
    is propagated through the capture list into the match's storage,
    which lets us produce the correct output type. Note that we can't just
    drop the capture groups when building the compiled program because
    (1) different parts of the regex might reference the capture group
    and (2) all capture groups are available if a developer converts the
    output to `AnyRegexOutput`.
    
        let anyOutput = AnyRegexOutput(match!)
        // anyOutput[1].substring == "21"
        // anyOutput["number"]?.substring == "21"
    
    Fixes #625. rdar://104823356
    
    Note: Linux seems to crash on different tests when the two customTest
    overloads have `internal` visibility or are called. Switching one of the
    functions to be generic over a RegexComponent works around the issue.
    
    * Revert "Merge pull request #628 from apple/result_builder_changes_workaround"
    
    This reverts commit 7e059b7, reversing
    changes made to 3ca8b13.
    
    * Use `some` syntax in variadics
    
    This supports a type checker fix after the change in how result
    builder closure parameters are type-checked.
    
    * Type checker workaround: adjust test
    
    * Further refactor to work around type checker regression
    
    * Align availability macro with OS versions (#641)
    
    * Speed up general character class matching (#642)
    
    Short-circuit Character.isASCII checks inside built-in character class matching.
    
    Also, make benchmark try a few more times before giving up.
    
    * Test for \s matching CRLF when scalar matching (#648)
    
    * General ascii fast paths for character classes (#644)
    
    General ASCII fast-paths for builtin character classes
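
The idea behind an ASCII fast path can be sketched roughly as follows. This is an illustration under stated assumptions, not the engine's implementation: `matchDigit` is a hypothetical helper, and the slow path approximates a digit class with `Character.isWholeNumber`. The fast path relies on the fact that an ASCII byte followed by another ASCII byte (or the end of the string) forms a complete `Character` on its own, with CR-LF as the only exception, and CR is not a digit.

```swift
/// Hypothetical sketch: match one digit at `index` (assumed to lie on a
/// Character boundary) and return the index just past it, or nil.
func matchDigit(in input: String, at index: String.Index) -> String.Index? {
    guard index < input.endIndex else { return nil }
    let utf8 = input.utf8
    let byte = utf8[index]

    // Fast path: an ASCII byte whose successor is also ASCII (or the end)
    // cannot be extended by a combining scalar, so a byte check suffices.
    if byte < 0x80 {
        let next = utf8.index(after: index)
        if next == utf8.endIndex || utf8[next] < 0x80 {
            return (0x30...0x39).contains(byte) ? next : nil
        }
    }

    // Slow path: fall back to full grapheme-cluster handling.
    let character = input[index]
    return character.isWholeNumber ? input.index(after: index) : nil
}
```

The fast path skips grapheme breaking entirely, which is typically where the savings would come from for mostly-ASCII input.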
    
    * Remove the unsupported `anyScalar` case (#650)
    
    We decided not to support the `anyScalar` character class, which would
    match a single Unicode scalar regardless of matching mode. However,
    its representation was still included in the various character class
    types in the regex engine, leading to unreachable code and unclear
    requirements when changing or adding new code. This change removes
    that representation where possible.
    
    The `DSLTree.Atom.CharacterClass` enum is left unchanged, since it
    is marked `@_spi(RegexBuilder) public`. Any use of that enum case
    is handled with a `fatalError("Unsupported")`, and it isn't produced
    on any code path.
    
    ---------
    
    Co-authored-by: Nate Cook <natecook@apple.com>
    Co-authored-by: Butta <repo@butta.fastem.com>
    Co-authored-by: Ole Begemann <ole@oleb.net>
    Co-authored-by: Alex Martini <amartini@apple.com>
    Co-authored-by: Alejandro Alonso <alejandro_alonso@apple.com>
    Co-authored-by: David Ewing <dewing@apple.com>
    Co-authored-by: Dave Ewing <96321608+DaveEwing@users.noreply.github.com>
    8 people authored Apr 5, 2023
    836f327

Commits on May 25, 2023

  1. Update swift main (#674)

    * Atomically load the lowered program (#610)
    
    Since we're atomically initializing the compiled program in
    `Regex.Program`, we need to pair that with an atomic load.
    
    Resolves #609.
    
    * Add tests for line start/end word boundary diffs (#616)
    
    The `default` and `simple` word boundaries have different behaviors
    at the start and end of strings/lines. These tests validate that we
    have the correct behavior implemented. Related to issue #613.
    
    * Add tweaks for Android
    
    * Fix documentation typo (#615)
    
    * Fix abstract for Regex.dotMatchesNewlines(_:). (#614)
    
    The old version looks like it was accidentally duplicated from
    anchorsMatchLineEndings(_:) just below it.
    
    * Remove `RegexConsumer` and fix its dependencies (#617)
    
    * Remove `RegexConsumer` and fix its dependencies
    
    This eliminates the RegexConsumer type and rewrites its users to call
    through to other, existing functionality on Regex or in the Algorithms
    implementations. RegexConsumer doesn't take account of the dual
    subranges required for matching, so it can produce results that are
    inconsistent with matches(of:) and ranges(of:), which were rewritten
    earlier.
    
    rdar://102841216
    
    * Remove remaining from-end algorithm methods
    
    This removes methods that are left over from when we were considering
    from-end algorithms. These aren't tested and may not have the correct
    semantics, so it's safer to remove them entirely.
    
    * Improve StringProcessing and RegexBuilder documentation (#611)
    
    This includes documentation improvements for core types/methods,
    RegexBuilder types along with their generated variadic initializers,
    and adds some curation. It also includes tests of the documentation
    code samples.
    
    * Set availability for inverted character class test (#621)
    
    This feature depends on running with a Swift 5.7 stdlib, and fails
    when that isn't available.
    
    * Add type annotations in RegexBuilder tests
    
    These changes work around a change to the way result builders are
    compiled that removes the ability for result builder closure outputs
    to affect the overload resolution elsewhere in an expression.
    
    Workarounds for rdar://104881395 and rdar://104645543
    
    * Workaround for fileprivate array issue
    
    A recent compiler change results in fileprivate arrays sometimes
    not keeping their buffers around long enough. This change avoids that
    issue by removing the fileprivate annotations from the affected type.
    
    * Fix an issue where named character classes weren't getting converted in the result builder. <rdar://104480703>
    
    * Stop at end of search string in TwoWaySearcher (#631)
    
    When searching for a substring that doesn't exist, it was possible
    for TwoWaySearcher to advance beyond the end of the search string,
    causing a crash. This change adds a `limitedBy:` parameter to that
    index movement, avoiding the invalid movement.
    
    Fixes rdar://105154010
    
    * Correct misspelling in DSL renderer (#627)
    
    vertial -> vertical
    
    rdar://104602317
    
    * Fix output type mismatch with RegexBuilder (#626)
    
    Some regex literals (and presumably other `Regex` instances) lose
    their output type information when used in a RegexBuilder closure due
    to the way the concatenating builder calls are overloaded. In
    particular, any output type with labeled tuples or where the sum of
    tuple components in the accumulated and new output types is greater
    than 10 will be ignored.
    
    Regex internals don't make this distinction, however, so there ends up
    being a mismatch between what a `Regex.Match` instance tries to
    produce and the output type of the outermost regex. For example, this
    code results in a crash, because `regex` is a `Regex<Substring>`
    but the match tries to produce a `(Substring, number: Substring)`:
    
        let regex = Regex {
            ZeroOrMore(.whitespace)
            /:(?<number>\d+):/
            ZeroOrMore(.whitespace)
        }
        let match = try regex.wholeMatch(in: " :21: ")
        print(match!.output)
    
    To fix this, we add a new `ignoreCapturesInTypedOutput` DSLTree node
    to mark situations where the output type is discarded. This status
    is propagated through the capture list into the match's storage,
    which lets us produce the correct output type. Note that we can't just
    drop the capture groups when building the compiled program because
    (1) different parts of the regex might reference the capture group
    and (2) all capture groups are available if a developer converts the
    output to `AnyRegexOutput`.
    
        let anyOutput = AnyRegexOutput(match!)
        // anyOutput[1].substring == "21"
        // anyOutput["number"]?.substring == "21"
    
    Fixes #625. rdar://104823356
    
    Note: Linux seems to crash on different tests when the two customTest
    overloads have `internal` visibility or are called. Switching one of the
    functions to be generic over a RegexComponent works around the issue.
    
    * Revert "Merge pull request #628 from apple/result_builder_changes_workaround"
    
    This reverts commit 7e059b7, reversing
    changes made to 3ca8b13.
    
    * Use `some` syntax in variadics
    
    This supports a type checker fix after the change in how result
    builder closure parameters are type-checked.
    
    * Type checker workaround: adjust test
    
    * Further refactor to work around type checker regression
    
    * Align availability macro with OS versions (#641)
    
    * Speed up general character class matching (#642)
    
    Short-circuit Character.isASCII checks inside built-in character class matching.
    
    Also, make benchmark try a few more times before giving up.
    
    * Test for \s matching CRLF when scalar matching (#648)
    
    * General ascii fast paths for character classes (#644)
    
    General ASCII fast-paths for builtin character classes
    
    * Remove the unsupported `anyScalar` case (#650)
    
    We decided not to support the `anyScalar` character class, which would
    match a single Unicode scalar regardless of matching mode. However,
    its representation was still included in the various character class
    types in the regex engine, leading to unreachable code and unclear
    requirements when changing or adding new code. This change removes
    that representation where possible.
    
    The `DSLTree.Atom.CharacterClass` enum is left unchanged, since it
    is marked `@_spi(RegexBuilder) public`. Any use of that enum case
    is handled with a `fatalError("Unsupported")`, and it isn't produced
    on any code path.
    
    * Fix range-based quantification fast path (#653)
    
    The fast path for quantification incorrectly discards the last save
    position when the quantification used up all possible trips, which is
    only possible with range-based quantifications (e.g. `{0,3}`). This
    bug shows up when a range-based quantifier matches the maximum - 1
    repetitions of the preceding pattern.
    
    For example, the regex `/a{0,2}a/` should succeed as a full match on any
    of the strings "a", "aa", or "aaa". However, the pattern fails
    to match "aa", since the save point allowing a single "a" to match
    the first `a{0,2}` part of the regex is discarded.
    
    This change only discards the last save position when advancing the
    quantifier fails due to a failure to match, not maxing out the number
    of trips.
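
A small check along these lines can exercise the case described above. It is a sketch, assuming a toolchain whose regex runtime includes this fix. The pattern accepts between one and three `a` characters in total, so only the four-character input should be rejected. The string-based `Regex` initializer is used here so the snippet does not depend on bare-slash literal support.

```swift
// Build the pattern from a string at run time.
let regex = try! Regex("a{0,2}a")

let inputs = ["a", "aa", "aaa", "aaaa"]
// true where the whole input matches the pattern
let wholeMatches = inputs.map { $0.wholeMatch(of: regex) != nil }

for (input, matched) in zip(inputs, wholeMatches) {
    // With the fix applied, "aa" is expected to match: the quantifier gives
    // back one "a" so that the trailing `a` can consume it.
    print(input, matched)
}
```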
    
    * Add in ASCII fast-path for anyNonNewline (#654)
    
    * Avoid long expression type checks (#657)
    
    These changes remove several seconds of type-checking time from the
    RegexBuilder test cases, bringing all expressions under 150ms (on
    the tested computer).
    
    * Processor cleanup (#655)
    
    Clean up and refactor the processor
    
    * Simplify instruction fetching
    
    * Refactor metrics out, and void their storage in release builds
    
    * Put operations onto String
    
    * Fix `firstRange(of:)` search (#656)
    
    Calls to `ranges(of:)` and `firstRange(of:)` with a string parameter
    actually use two different string searching algorithms. `ranges(of:)`
    uses the "z-searcher" algorithm, while `firstRange(of:)` uses a
    two-way search. Since it's better to align on a single path for these
    searches, the z-searcher has lower requirements, and the two-way
    search implementation has a correctness bug, this change removes
    the two-way search algorithm and uses z-search for `firstRange(of:)`.
    
    The correctness bug in `firstRange(of:)` appears only when searching
    for the second (or later) occurrence of a substring, which you have
    to be fairly deliberate about. In the example below, the substring
    at offsets `7..<12` is missed:
    
        let text = "ADACBADADACBADACB"
        //          =====  -----=====
        let pattern = "ADACB"
        let firstRange = text.firstRange(of: pattern)!
        // firstRange ~= 0..<5
        let secondRange = text[firstRange.upperBound...].firstRange(of: pattern)!
        // secondRange ~= 12..<17
    
    This change also removes some unrelated, unused code in Split.swift,
    in addition to removing an (unused) usage of `TwoWaySearcher`.
    
    rdar://92794248
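
For readers unfamiliar with Z-based searching, here is a textbook Z-array sketch over UTF-8 bytes. It is not the library's z-searcher: `zArray` and `byteOffsets` are hypothetical names, and the results are byte offsets that include overlapping occurrences and are not guaranteed to fall on `Character` boundaries.

```swift
/// z[i] is the length of the longest common prefix of `s` and `s[i...]`.
func zArray(_ s: [UInt8]) -> [Int] {
    let n = s.count
    var z = [Int](repeating: 0, count: n)
    guard n > 0 else { return z }
    z[0] = n
    // s[left..<right] is the rightmost window known to match a prefix of s.
    var left = 0, right = 0
    for i in stride(from: 1, to: n, by: 1) {
        if i < right { z[i] = min(right - i, z[i - left]) }
        while i + z[i] < n && s[z[i]] == s[i + z[i]] { z[i] += 1 }
        if i + z[i] > right { left = i; right = i + z[i] }
    }
    return z
}

/// Byte offsets in `text` at which `pattern` occurs (overlaps included).
func byteOffsets(of pattern: String, in text: String) -> [Int] {
    let p = Array(pattern.utf8)
    let t = Array(text.utf8)
    guard !p.isEmpty, p.count <= t.count else { return [] }
    // Run the Z-array over pattern + text; a value of at least p.count at
    // a text position means the whole pattern occurs there.
    let z = zArray(p + t)
    return (0...(t.count - p.count)).filter { z[p.count + $0] >= p.count }
}

let offsets = byteOffsets(of: "ADACB", in: "ADACBADADACBADACB")
// offsets == [0, 7, 12], including the occurrence at 7..<12
```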
    
    * Bug fix and hot path for quantified `.` (#658)
    
    Bug fix in newline hot path, and apply hot path to quantified dot
    
    * Run scalar-semantic benchmark variants (#659)
    
    Run scalar semantic benchmarks
    
    * Refactor operations to be on String (#664)
    
    Finish refactoring logic onto String
    
    * Provide unique generic method parameter names (#669)
    
    The 5.9 compiler warns about this, and it will be an error
    starting in Swift 6.
    
    * Enable quantification optimizations for scalar semantics (#671)
    
    * Quantified scalar semantic matching
    
    * Remove redundant test
    
    ---------
    
    Co-authored-by: Nate Cook <natecook@apple.com>
    Co-authored-by: Butta <repo@butta.fastem.com>
    Co-authored-by: Ole Begemann <ole@oleb.net>
    Co-authored-by: Alex Martini <amartini@apple.com>
    Co-authored-by: Alejandro Alonso <alejandro_alonso@apple.com>
    Co-authored-by: David Ewing <dewing@apple.com>
    Co-authored-by: Dave Ewing <96321608+DaveEwing@users.noreply.github.com>
    8 people authored May 25, 2023
    2aababb

Commits on Jul 20, 2023

  1. Optimize search for start-anchored regexes (#683)

    When a regex is anchored to the start of a subject, there's no need
    to search throughout a string for the pattern when searching for the
    first match: a prefix match is sufficient.
    
    This adds a regex compilation-time check about whether a match can
    only be found at the start of a subject, and then uses that to
    choose whether to defer to `prefixMatch` from within `firstMatch`.
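
The equivalence this optimization relies on can be seen from the public API alone. The sketch below does not show the compile-time check itself, only that for a start-anchored pattern, `firstMatch` and `prefixMatch` should agree. The pattern and inputs are made up for illustration.

```swift
// `^` without the multiline option only matches at the start of the subject.
let anchored = try! Regex(#"^\d+"#)

let text = "123abc"
let first = text.firstMatch(of: anchored)    // searches the string
let prefix = text.prefixMatch(of: anchored)  // tries only the start

// When the digits are not at the start, neither call can succeed, so
// scanning the rest of the string would be wasted work.
let none = "abc123".firstMatch(of: anchored)
```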
    natecook1000 authored Jul 20, 2023
    cc96bb5

Commits on Oct 3, 2023

  1. [swift/main] Substring boundaries during matching (#697)

    * Handle boundaries when matching in substrings (#675)
    
    * Handle boundaries when matching in substrings
    
    Some of our existing matching routines use the start/endIndex
    of the input, which is basically never the right thing to do.
    
    This change revises those checks to use the search bounds, by
    either moving the boundary check out of the matching method, or
    if the boundary is a part of what needs to be matched (e.g.
    word boundaries have different behavior at the start/end than
    in the middle of a string) the search bounds are passed into
    the matching method.
    
    Testing is currently handled by piggy-backing on the existing
    match tests; we should add more tests to handle substring-
    specific edge cases.
    
    * Handle sub-character substring boundaries
    
    This change passes the end boundary down into matching methods, and
    uses it to find the actual character that is part of the input
    substring, even if the substring's end boundary is in the middle of
    a grapheme cluster.
    
    Substrings cannot have sub-Unicode scalar boundaries as of Swift
    5.7; we can remove a check for this when matching an individual
    scalar.
    
    * Add test for substring replacement
    natecook1000 authored Oct 3, 2023
    355027f

Commits on Dec 6, 2023

  1. Adds SPI for an NSRE compatibility mode option (#698) (#702)

    NSRegularExpression matches at the Unicode scalar level, but also
    matches `\r\n` sequences with a single `.` when single-line mode is
    enabled. This adds a `_nsreCompatibility` property that enables both
    of those behaviors, and implements support for the special case
    handling of `.`.
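
The `_nsreCompatibility` property is SPI and is not used here. As a rough illustration of the gap it fills, this sketch uses only the public `dotMatchesNewlines()` and `matchingSemantics(_:)` options: under the default grapheme-cluster semantics `.` consumes `\r\n` as one `Character`, while under Unicode scalar semantics it consumes a single scalar.

```swift
let crlf = "\r\n"

// Single-line mode: `.` may match newlines.
let dot = try! Regex(".").dotMatchesNewlines()

// Grapheme-cluster semantics (the default): "\r\n" is one Character.
let graphemeWhole = crlf.wholeMatch(of: dot)

// Unicode scalar semantics: `.` matches one scalar, so a whole match of the
// two-scalar string fails and the first match covers only "\r".
let scalarDot = dot.matchingSemantics(.unicodeScalar)
let scalarWhole = crlf.wholeMatch(of: scalarDot)
let scalarFirst = crlf.firstMatch(of: scalarDot)
```

According to the commit message, the compatibility mode combines scalar-level matching with the single-`.` handling of `\r\n`, which neither public configuration above provides by itself.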
    natecook1000 authored Dec 6, 2023
    d56f16a