Skip to content

Add the Uri\Rfc3986\Uri class to ext/uri without wither support #18836

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 13 commits into
base: master
Choose a base branch
from

Conversation

kocsismate
Copy link
Member

@kocsismate kocsismate commented Jun 11, 2025

There are a few WIP parts but the PR can already be reviewed.

Relates to #14461 and https://wiki.php.net/rfc/url_parsing_api

@kocsismate kocsismate requested a review from dstogov as a code owner June 11, 2025 17:57
@kocsismate kocsismate requested review from TimWolla and nielsdos and removed request for dstogov June 11, 2025 17:58
@kocsismate kocsismate marked this pull request as draft June 11, 2025 17:58
@kocsismate kocsismate force-pushed the ext-url6 branch 2 times, most recently from 43c1d9f to 155e070 Compare June 11, 2025 18:07
@kocsismate kocsismate changed the title Add Uri\Rfc3986\Uri class to ext/uri Jun 11, 2025
Copy link
Member

@nielsdos nielsdos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursory glance, I see there's some todos wrt memory management as well so maybe I'll wait a bit on that

Copy link
Member

@TimWolla TimWolla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some first remarks from just looking at the diff on GitHub.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something I'm only noticing now: I think this should rather be called ext/uri/parser_rfc3986.c and php_lexbor.c would be ext/uri/parser_whatwg.c. The php_uriparser.c name is confusing to me, because uriparser is an extremely generic term.

To give another comparison with ext/random, since it is architecturally similar: Each engine has its own engine_enginename.c file, e.g. engine_xoshiro256starstar.c.

efree(uriparser_uris);
}

const uri_handler_t uriparser_uri_handler = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And similarly the public symbols in the file should also be prefixed with php_uri_. Just uriparser_ can conflict with the uriparser library itself.

Here I would suggest php_uri_rfc3986_handler.

But I'm also happy to leave this to a follow-up to not introduce too much churn.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this and the file renaming is something that I'm happy to do after most parts are merged.

@TimWolla

This comment was marked as resolved.

@kocsismate
Copy link
Member Author

The following patch fixes the leaks / memory management for me:

@TimWolla Thanks for the patch, it's awesome! TBH I didn't even think about using the UriMakeOwnerA() function, because I wanted to avoid its performance penalty. But you are right, this is the right call, it simplifies the code so much. I love it!

@TimWolla
Copy link
Member

@kocsismate When stack allocating the struct the code should become even simpler (and possibly faster, due to fewer pointer indirection).

@kocsismate kocsismate marked this pull request as ready for review June 16, 2025 07:59
@kocsismate kocsismate force-pushed the ext-url6 branch 2 times, most recently from 0b3cfd8 to 062dc4d Compare June 20, 2025 06:04
@kocsismate kocsismate force-pushed the ext-url6 branch 2 times, most recently from 9b63691 to e0b40a7 Compare June 26, 2025 12:03
@@ -153,7 +153,10 @@ int URI_FUNC(CopyUriMm)(URI_TYPE(Uri) * destUri,
*(destUri->hostData.ip6) = *(sourceUri->hostData.ip6);
}

if (URI_FUNC(CopyRangeAsNeeded)(&destUri->hostData.ipFuture, &sourceUri->hostData.ipFuture, URI_FALSE, memory) == URI_FALSE) {
if (sourceUri->hostData.ipFuture.first != NULL && sourceUri->hostText.first == sourceUri->hostData.ipFuture.first) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DanielEScherzer
Copy link
Member

I have no further suggestions, but don't feel qualified to approve this

Copy link
Member

@TimWolla TimWolla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks mostly okay for an initial merge now. The additional clean-up can happen in a follow-up to move forward. The leak should be fixed, though.

Comment on lines -107 to -166
PHP_METHOD(Uri_WhatWg_InvalidUrlException, __construct)
{
zend_string *message = NULL;
zval *errors = NULL;
zend_long code = 0;
zval *previous = NULL;

ZEND_PARSE_PARAMETERS_START(0, 4)
Z_PARAM_OPTIONAL
Z_PARAM_STR(message)
Z_PARAM_ARRAY(errors)
Z_PARAM_LONG(code)
Z_PARAM_OBJECT_OF_CLASS_OR_NULL(previous, zend_ce_throwable)
ZEND_PARSE_PARAMETERS_END();

if (zend_update_exception_properties(INTERNAL_FUNCTION_PARAM_PASSTHRU, message, code, previous) == FAILURE) {
RETURN_THROWS();
}

if (errors == NULL) {
zval tmp;
ZVAL_EMPTY_ARRAY(&tmp);
zend_update_property(uri_whatwg_invalid_url_exception_ce, Z_OBJ_P(ZEND_THIS), ZEND_STRL("errors"), &tmp);
} else {
zend_update_property(uri_whatwg_invalid_url_exception_ce, Z_OBJ_P(ZEND_THIS), ZEND_STRL("errors"), errors);
}
if (EG(exception)) {
RETURN_THROWS();
}
}

PHP_METHOD(Uri_WhatWg_UrlValidationError, __construct)
{
zend_string *context;
zval *type;
bool failure;

ZEND_PARSE_PARAMETERS_START(3, 3)
Z_PARAM_STR(context)
Z_PARAM_OBJECT_OF_CLASS(type, uri_whatwg_url_validation_error_type_ce)
Z_PARAM_BOOL(failure)
ZEND_PARSE_PARAMETERS_END();

zend_update_property_str(uri_whatwg_url_validation_error_ce, Z_OBJ_P(ZEND_THIS), ZEND_STRL("context"), context);
if (EG(exception)) {
RETURN_THROWS();
}

zend_update_property_ex(uri_whatwg_url_validation_error_ce, Z_OBJ_P(ZEND_THIS), ZSTR_KNOWN(ZEND_STR_TYPE), type);
if (EG(exception)) {
RETURN_THROWS();
}

zval failure_zv;
ZVAL_BOOL(&failure_zv, failure);
zend_update_property(uri_whatwg_url_validation_error_ce, Z_OBJ_P(ZEND_THIS), ZEND_STRL("failure"), &failure_zv);
if (EG(exception)) {
RETURN_THROWS();
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't particularly like that you reordered these. It makes the diff hard to read.

Comment on lines +443 to +455
PHP_METHOD(Uri_Rfc3986_Uri, equals)
{
zend_object *that_object;
zend_object *comparison_mode = NULL;

ZEND_PARSE_PARAMETERS_START(1, 2)
Z_PARAM_OBJ_OF_CLASS(that_object, uri_rfc3986_uri_ce)
Z_PARAM_OPTIONAL
Z_PARAM_OBJ_OF_CLASS(comparison_mode, uri_comparison_mode_ce)
ZEND_PARSE_PARAMETERS_END();

uri_equals(INTERNAL_FUNCTION_PARAM_PASSTHRU, that_object, comparison_mode);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly to the other remark about file naming and organization, it might make sense to move the PHP_METHOD implementations into the matching source file to keep php_uri.c slim.

Comment on lines +47 to +51
if (uriparser_uris->normalized_uri_initialized == false) {
uriparser_copy_uri(&uriparser_uris->normalized_uri, &uriparser_uris->uri);
uriparser_normalize_uri(&uriparser_uris->normalized_uri);
uriparser_uris->normalized_uri_initialized = true;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (uriparser_uris->normalized_uri_initialized == false) {
uriparser_copy_uri(&uriparser_uris->normalized_uri, &uriparser_uris->uri);
uriparser_normalize_uri(&uriparser_uris->normalized_uri);
uriparser_uris->normalized_uri_initialized = true;
}
if (uriparser_uris->normalized_uri_initialized) {
return;
}
uriparser_copy_uri(&uriparser_uris->normalized_uri, &uriparser_uris->uri);
uriparser_normalize_uri(&uriparser_uris->normalized_uri);
uriparser_uris->normalized_uri_initialized = true;

Perhaps?

ZEND_ASSERT(uriparser_uri != NULL);

if (uriparser_uri->userInfo.first != NULL && uriparser_uri->userInfo.afterLast != NULL) {
const char *c = memchr(uriparser_uri->userInfo.first, ':', get_text_range_length(&uriparser_uri->userInfo));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation is inconsistent with the implementation in uriparser_read_username.


static uriparser_uris_t *uriparser_create_uris(void)
{
uriparser_uris_t *uriparser_uris = emalloc(sizeof(*uriparser_uris));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
uriparser_uris_t *uriparser_uris = emalloc(sizeof(*uriparser_uris));
uriparser_uris_t *uriparser_uris = ecalloc(1, sizeof(*uriparser_uris));

This would cleanly zero-initialize everything. As far as I can see, the uriparser library guarantees that calling uriFreeUriMembersA() on a zero-initialized URI is also safe, thus this improves the overall safety somewhat.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Asked if this guarantee will continue to hold: uriparser/uriparser#238

Comment on lines +327 to +332
efree(uriparser_uris);
if (!silent) {
throw_invalid_uri_exception();
}

return NULL;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will leak uri. Test case:

<?php

var_dump(
    new Uri\Rfc3986\Uri('foo', new Uri\Rfc3986\Uri('bar'))
);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch!

* is discarded altogether. */
static void *uriparser_clone_uri(void *uri)
{
uriparser_uris_t *uriparser_uris = (uriparser_uris_t *) uri;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
uriparser_uris_t *uriparser_uris = (uriparser_uris_t *) uri;
uriparser_uris_t *uriparser_uris = uri;

Same below.

Comment on lines +402 to +410
uriFreeUriMembersA(&uriparser_uris->uri);

if (uriparser_uris->normalized_uri_initialized) {
ZEND_ASSERT(uriparser_uris->normalized_uri.owner);

uriFreeUriMembersA(&uriparser_uris->normalized_uri);
}

efree(uriparser_uris);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When switching to calloc(), you can do:

Suggested change
uriFreeUriMembersA(&uriparser_uris->uri);
if (uriparser_uris->normalized_uri_initialized) {
ZEND_ASSERT(uriparser_uris->normalized_uri.owner);
uriFreeUriMembersA(&uriparser_uris->normalized_uri);
}
efree(uriparser_uris);
uriFreeUriMembersA(&uriparser_uris->uri);
uriFreeUriMembersA(&uriparser_uris->normalized_uri);
efree(uriparser_uris);

Upstream confirmed that this is safe.

ZEND_ASSERT(result == URI_SUCCESS);
}

static void normalize_uri_if_needed(uriparser_uris_t *uriparser_uris)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be a little nicer to make this:

static UriUriA *get_normalized_uri(uriparser_uris_t *uriparser_uris) {
    if (!uriparser_uris->normalized_uri_initialized) {
        // normalize
    }

    return &uriparser_uris->normalized_uri;
}

Then you can just is this as a getter for the normalized URI and it will to the right thing.

Copy link
Member

@nielsdos nielsdos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides the things that Tim said, here some small remarks

ZEND_ASSERT(uriparser_uri != NULL);

if (uriparser_uri->hostText.first != NULL && uriparser_uri->hostText.afterLast != NULL && get_text_range_length(&uriparser_uri->hostText) > 0) {
if (uriparser_uri->hostData.ip6 != NULL || uriparser_uri->hostData.ipFuture.first != NULL) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this needs a comment to explain that the hostText overlaps with ip6 / ipFuture

return SUCCESS;
}

static int str_to_int(const char *str, int len)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess len should actually be a size_t to be strictly correct, and i too.

}

ZVAL_NEW_STR(retval, smart_str_extract(&str));
} else if (uriparser_uri->absolutePath) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be:

Suggested change
} else if (uriparser_uri->absolutePath) {
} else if (uriparser_uri->absolutePath || uriHasHostA(uriparser_uri)) {

?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, this would make a few tests fail:

$uri = new Uri\Rfc3986\Uri("https://example.com");
var_dump($uri->getPath());

Currently, this outputs string(0) "", but it would then output string(1) "/". Handling of the leading slash is a bit convoluted with uriparser. I filed a ticket with my question (uriparser/uriparser#215) and the uriparser/uriparser#8 (comment) table should clear all the questions.

charsRequired++;

zend_string *uri_string = zend_string_alloc(charsRequired - 1, false);
result = uriToStringA(ZSTR_VAL(uri_string), uriparser_uri, charsRequired, NULL);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the +1/-1 for the NUL byte?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! This is highlighted in the uriparser documentation: https://uriparser.github.io/doc/api/latest/index.html#recomposition

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment