
feat: add discover_valid_sitemaps utility #1777

Merged
Pijukatel merged 6 commits into apify:master from Mantisus:discover-valid-sitemaps on Mar 6, 2026

Conversation

@Mantisus Mantisus commented Mar 4, 2026

Collaborator

Description

  • Add the discover_valid_sitemaps utility, which searches for the sitemaps of the websites behind the provided URLs.

Issues

Testing

  • Add new unit tests

Copilot AI left a comment

Contributor

Pull request overview

This PR ports/introduces a Python discover_valid_sitemaps helper to discover sitemap URLs for a set of input URLs (robots.txt sitemaps, direct sitemap URLs, and common sitemap paths), aligning with issue #1740.

Changes:

  • Add discover_valid_sitemaps() (plus internal helpers/constants) to orchestrate sitemap discovery per-hostname and deduplicate results.
  • Extend common sitemap probing to include /sitemap_index.xml and add is_status_code_successful() for status evaluation.
  • Add unit tests covering robots.txt discovery, common-path probing, input URL detection, deduplication, and multi-domain behavior.
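To make the per-hostname grouping, common-path probing, and status check described above concrete, here is a minimal sketch. It is not the code from this PR: the helper names `group_by_hostname` and `common_sitemap_candidates` are hypothetical, the signature of `is_status_code_successful` is assumed, and the path list is a guess apart from `/sitemap_index.xml`, which the overview confirms.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Assumed list of paths; the PR overview only confirms /sitemap_index.xml.
COMMON_SITEMAP_PATHS = ('/sitemap.xml', '/sitemap.txt', '/sitemap_index.xml')


def is_status_code_successful(status_code: int) -> bool:
    """Treat 2xx and 3xx responses as successful (signature is an assumption)."""
    return 200 <= status_code < 400


def group_by_hostname(urls: list[str]) -> dict[str, list[str]]:
    """Group input URLs by hostname so that each site is handled once (hypothetical helper)."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for url in urls:
        hostname = urlparse(url).hostname
        if hostname:  # skip inputs that have no hostname, e.g. relative paths
            grouped[hostname].append(url)
    return dict(grouped)


def common_sitemap_candidates(url: str) -> list[str]:
    """Build the common sitemap URLs for the site that `url` belongs to (hypothetical helper)."""
    parsed = urlparse(url)
    return [f'{parsed.scheme}://{parsed.netloc}{path}' for path in COMMON_SITEMAP_PATHS]
```

In a real implementation each candidate would be requested over HTTP and kept only when the response status passes `is_status_code_successful`; that network step is left out here so the sketch stays self-contained.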

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File | Description
src/crawlee/_utils/sitemap.py | Implements sitemap discovery orchestration, common-path probing, and async generator merging.
src/crawlee/_utils/web.py | Adds a helper to classify 2xx/3xx responses as “successful”.
tests/unit/_utils/test_sitemap.py | Adds unit tests for the new sitemap discovery utility with mocked HTTP behavior.
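For readers unfamiliar with "async generator merging", the following is one possible way to interleave several async generators while deduplicating results. It is a sketch under assumptions, not the implementation in `sitemap.py`: the name `merge_async_generators` is hypothetical, and the queue-based approach is simply a common asyncio pattern.

```python
import asyncio
from collections.abc import AsyncIterator


async def merge_async_generators(*generators: AsyncIterator[str]) -> AsyncIterator[str]:
    """Yield items from all generators as they arrive, skipping duplicates (hypothetical helper)."""
    queue: asyncio.Queue = asyncio.Queue()
    done = object()  # sentinel that marks one generator as exhausted

    async def drain(generator: AsyncIterator[str]) -> None:
        try:
            async for item in generator:
                await queue.put(item)
        finally:
            # Always signal completion, even if the generator raised or was cancelled.
            await queue.put(done)

    tasks = [asyncio.create_task(drain(generator)) for generator in generators]
    seen: set[str] = set()
    remaining = len(tasks)
    try:
        while remaining:
            item = await queue.get()
            if item is done:
                remaining -= 1
            elif item not in seen:
                seen.add(item)
                yield item
    finally:
        # Stop producers that are still running; errors they raised are dropped in this sketch.
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
```

The order of yielded items depends on task scheduling, so callers should not rely on it. A production version would likely also propagate or log errors raised by the individual generators instead of discarding them.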


Comment thread src/crawlee/_utils/sitemap.py Outdated
Comment thread src/crawlee/_utils/sitemap.py Outdated
Comment thread src/crawlee/_utils/sitemap.py
Comment thread src/crawlee/_utils/sitemap.py
Comment thread src/crawlee/_utils/sitemap.py
Mantisus and others added 4 commits March 4, 2026 23:54
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Comment thread src/crawlee/_utils/sitemap.py

@vdusek vdusek left a comment

Collaborator

A few comments

Comment thread src/crawlee/_utils/web.py Outdated
Comment thread src/crawlee/_utils/sitemap.py
Comment thread src/crawlee/_utils/sitemap.py
Comment thread src/crawlee/_utils/sitemap.py Outdated
@Mantisus Mantisus requested a review from vdusek March 6, 2026 12:30

@vdusek vdusek left a comment

Collaborator

LGTM

@Pijukatel Pijukatel merged commit 872447b into apify:master Mar 6, 2026
30 checks passed

5 participants