9,635 questions
0 votes · 0 answers · 77 views
URL-targeted web crawler [closed]
I have a bit of code I'm building that takes a specific tumblr blog and then iterates through post numbers sequentially, checking whether each page exists. If it does, it prints that full URL to ...
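A minimal sketch of the sequential-check idea described above, assuming tumblr's `/post/<numeric id>` permalink scheme (blog and function names are illustrative):

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError


def build_post_url(blog: str, post_id: int) -> str:
    # Tumblr post permalinks follow the /post/<numeric id> pattern.
    return f"https://{blog}.tumblr.com/post/{post_id}"


def post_exists(url: str) -> bool:
    # HEAD is cheaper than GET when only the status code matters.
    req = Request(url, method="HEAD")
    try:
        with urlopen(req, timeout=10) as resp:
            return resp.status == 200
    except (HTTPError, URLError):
        return False


def scan_posts(blog: str, start: int, count: int) -> list[str]:
    """Check post IDs sequentially and return the URLs that exist."""
    found = []
    for post_id in range(start, start + count):
        url = build_post_url(blog, post_id)
        if post_exists(url):
            print(url)
            found.append(url)
    return found

# Live usage (makes real requests): scan_posts("staff", start=1, count=5)
```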
0 votes · 0 answers · 47 views
How to Get Reddit Crawler to Use my Video Preview?
I have a React page which gets re-routed for crawlers to an SEO backend page in Node.js + Express. I want to make it work with Reddit's crawler so videos get embedded, which currently doesn't happen.
When I post ...
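Link-preview crawlers like Reddit's generally read Open Graph tags from the server-rendered HTML. A sketch of the video tags the SEO route could emit (URLs and dimensions are placeholders):

```html
<!-- Open Graph video tags served to crawlers; values are illustrative. -->
<meta property="og:video" content="https://example.com/clip.mp4" />
<meta property="og:video:secure_url" content="https://example.com/clip.mp4" />
<meta property="og:video:type" content="video/mp4" />
<meta property="og:video:width" content="1280" />
<meta property="og:video:height" content="720" />
```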
0 votes · 1 answer · 216 views
How can I use Firecrawl to crawl and take a screenshot of a webpage instead of using Playwright in Node.js?
I'm currently using Playwright in Node.js to capture screenshots of webpages, but I'm exploring Firecrawl and wondering if it can handle screenshots directly.
Here is what my Firecrawl code looks like with ...
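A sketch of requesting a screenshot through Firecrawl's hosted HTTP API instead of Playwright; the endpoint path and the `"screenshot"` / `"screenshot@fullPage"` format names are my assumption from Firecrawl's v1 API and should be checked against the current docs:

```python
import json
from urllib.request import Request, urlopen

# Assumption: Firecrawl v1 scrape endpoint.
API_URL = "https://api.firecrawl.dev/v1/scrape"


def build_screenshot_payload(url: str, full_page: bool = False) -> dict:
    # "screenshot" vs "screenshot@fullPage" selects viewport vs full-page capture.
    fmt = "screenshot@fullPage" if full_page else "screenshot"
    return {"url": url, "formats": [fmt]}


def request_screenshot(api_key: str, target_url: str) -> dict:
    """POST the scrape request and return the parsed JSON response."""
    req = Request(
        API_URL,
        data=json.dumps(build_screenshot_payload(target_url)).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urlopen(req, timeout=60) as resp:
        return json.load(resp)
```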
-2 votes · 1 answer · 117 views
Webscrape links to download files based on word in page HTML
I am webscraping WHO pages using the following code:
pacman::p_load(rvest, httr, stringr, purrr)
download_first_pdf_from_handle <- function(handle_id) {
...
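The question's code is R/rvest, but the "find the first PDF link on the page" step can be sketched with Python's standard library (class and function names are illustrative):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkFinder(HTMLParser):
    """Collect href attributes that point at .pdf files."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.lower().endswith(".pdf"):
                    self.links.append(value)


def first_pdf_link(html: str, base_url: str):
    """Return the first .pdf link on the page as an absolute URL, or None."""
    finder = PdfLinkFinder()
    finder.feed(html)
    return urljoin(base_url, finder.links[0]) if finder.links else None
```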
1 vote · 1 answer · 267 views
Firecrawl self-hosted crawler throws a "Connection violated security rules" error
I set up a self-hosted Firecrawl instance and I want to crawl my internal intranet site (e.g. https://intranet.xxx.gov.tr/).
I can access the site directly both from the host machine and from inside ...
0 votes · 1 answer · 65 views
Intermittent 406 Errors on Post Pages, Detected by Site Analyzers, Not Direct Browser Access
My WordPress site's post pages return intermittent HTTP 406 "Not Acceptable" errors, but only for site analysis/SEO tools (e.g., SEMrush, Ahrefs, GTmetrix). When accessed directly by ...
0 votes · 2 answers · 75 views
SitemapLoader(sitemap_url).load() hangs
from langchain_community.document_loaders import SitemapLoader

def crawl(self):
    print("Starting crawler...")
    sitemap_url = "https://gringo.co.il/sitemap.xml"
    ...
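One way to isolate a hang like this is to fetch and parse the sitemap directly with standard-library tools, bypassing the loader entirely; a diagnostic sketch (not the LangChain API):

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen

# Sitemaps use a fixed XML namespace.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(xml_text: str) -> list[str]:
    """Return the <loc> URLs from a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


def fetch_sitemap(url: str) -> str:
    # An explicit timeout avoids waiting forever on an unresponsive server.
    with urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8")

# Live usage: parse_sitemap(fetch_sitemap("https://gringo.co.il/sitemap.xml"))
```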
0 votes · 0 answers · 183 views
Crawl4AI token threshold not applied to raw html in arun
Here’s a brief overview of what I want to achieve:
Extract the raw HTML pages and save them
Use Crawl4AI to produce a ‘cleaner’ and smaller HTML that keeps a lot of information, including what I will eventually ...
0 votes · 0 answers · 87 views
Facebook Crawler Floods CPU usage to 100%
I'm running Facebook ads and today I woke up to see my server CPU at 100%.
I couldn't even use my website.
I did some research and found out it was a Facebook crawler sending excessive requests.
I tried ...
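One common mitigation is rate-limiting the crawler at the web server by user agent; this nginx sketch assumes the requests identify as `facebookexternalhit` (zone name and rate are illustrative, and `map`/`limit_req_zone` belong in the `http` context):

```nginx
# Map Facebook's crawler UA to a limiting key; an empty key is not limited.
map $http_user_agent $fb_crawler {
    default              "";
    "~*facebookexternalhit"  $binary_remote_addr;
}

limit_req_zone $fb_crawler zone=fbcrawl:10m rate=1r/s;

server {
    location / {
        limit_req zone=fbcrawl burst=5 nodelay;
        # ... existing configuration ...
    }
}
```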
0 votes · 0 answers · 84 views
Adding a user agent in Chrome options in Selenium
I'm performing data crawling on a webpage using Selenium. This is my code:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options ...
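Chrome accepts a user-agent override as a single command-line switch passed through `Options.add_argument`. A sketch (the UA string and function names are illustrative):

```python
def user_agent_argument(ua: str) -> str:
    # Chrome takes the override as a single "user-agent=<value>" switch.
    return f"user-agent={ua}"


def make_driver(ua: str):
    """Build a Chrome driver that reports the given user agent."""
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument(user_agent_argument(ua))
    return webdriver.Chrome(options=options)

# Live usage (launches a browser):
# driver = make_driver("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
#                      "AppleWebKit/537.36 (KHTML, like Gecko) "
#                      "Chrome/120.0.0.0 Safari/537.36")
```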
0 votes · 1 answer · 117 views
Substitute host name with its IP address in HTTPS requests
I'm working on a web crawler and I'm trying to understand how IP substitution works.
From what I have read, the DNS hostname should be resolved to its IP address (one of many) and used instead of the ...
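With HTTPS there is a subtlety: you can open the TCP connection to the resolved IP, but the original hostname must still be supplied for the TLS SNI extension (and the `Host` header), or the handshake and certificate validation fail. A stdlib sketch of that split (function names are illustrative):

```python
import socket
import ssl


def resolve_all(hostname: str) -> list[str]:
    """Return the distinct IP addresses DNS gives for a hostname."""
    infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def fetch_via_ip(hostname: str, ip: str, path: str = "/") -> bytes:
    """Connect to the raw IP, but keep the hostname for SNI and Host."""
    ctx = ssl.create_default_context()
    raw = socket.create_connection((ip, 443), timeout=10)
    # server_hostname drives both SNI and certificate validation.
    tls = ctx.wrap_socket(raw, server_hostname=hostname)
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {hostname}\r\n"
        "Connection: close\r\n\r\n"
    )
    tls.sendall(request.encode())
    chunks = []
    while chunk := tls.recv(4096):
        chunks.append(chunk)
    tls.close()
    return b"".join(chunks)
```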
2 votes · 1 answer · 929 views
How to use LLMConfig in crawl4ai?
I was testing this code from https://leonardo467.gumroad.com/l/cstsu, which uses crawl4ai, but it seems the library has been updated or something, because if you run it (with an API, so I use free ...
0 votes · 0 answers · 34 views
Transfermarkt scraper cannot get club name
I want to use the data from a Transfermarkt scraper in my code for my own purposes. I get all the desired data except the current club, but I can't get the club name. I tried all the ...
0 votes · 0 answers · 158 views
How to Extract Code Blocks from Different Tabs in a Code Documentation Using Crawl4AI (or any other tool)?
I'm trying to scrape code blocks from multiple tabs in a documentation page using Crawl4AI. While I'm able to extract Markdown content, the code blocks inside tabbed sections are not being captured.
...
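Tabbed documentation pages usually keep every tab panel in the DOM (hidden panels included), so the code blocks are often recoverable from the raw HTML even when a markdown conversion drops them. A stdlib sketch that collects the text inside `<code>` elements regardless of tab visibility (class and function names are illustrative):

```python
from html.parser import HTMLParser


class CodeBlockExtractor(HTMLParser):
    """Collect the text inside <code> elements, including hidden tab panels."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._depth = 0      # nesting level inside <code>
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "code":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "code" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.blocks.append("".join(self._buffer))
                self._buffer = []

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def extract_code_blocks(html: str) -> list[str]:
    parser = CodeBlockExtractor()
    parser.feed(html)
    return parser.blocks
```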
0 votes · 0 answers · 20 views
How to Determine Twitter API Token Lifetime and Rate Limits for Crawling with a Dummy Account?
How can I determine the token lifetime used for crawling Twitter with a dummy account, and what are the limitations per token?
What are the request rate limits per token for free and paid tiers?
How ...