9,635 questions
0 votes · 0 answers · 77 views
URL-targeted web crawler [closed]
I have a bit of code I'm building that takes a specific tumblr blog and then iterates through post numbers sequentially, checking whether each page exists. If it does, it prints that full URL to ...
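A minimal sketch of the sequential-check idea described above, assuming tumblr's `/post/<numeric id>` permalink scheme (blog and function names are illustrative):

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError


def build_post_url(blog: str, post_id: int) -> str:
    # Tumblr post permalinks follow the /post/<numeric id> pattern.
    return f"https://{blog}.tumblr.com/post/{post_id}"


def post_exists(url: str) -> bool:
    # HEAD is cheaper than GET when only the status code matters.
    req = Request(url, method="HEAD")
    try:
        with urlopen(req, timeout=10) as resp:
            return resp.status == 200
    except (HTTPError, URLError):
        return False


def scan_posts(blog: str, start: int, count: int) -> list[str]:
    """Check post IDs sequentially and return the URLs that exist."""
    found = []
    for post_id in range(start, start + count):
        url = build_post_url(blog, post_id)
        if post_exists(url):
            print(url)
            found.append(url)
    return found

# Live usage (makes real requests): scan_posts("staff", start=1, count=5)
```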
0 votes · 0 answers · 47 views
How to Get Reddit Crawler to Use my Video Preview?
I have a React page which gets re-routed for crawlers to an SEO backend page in Node.js + Express. I want to make it work with Reddit's crawler so videos get embedded, which currently doesn't happen.
When I post ...
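Link-preview crawlers like Reddit's generally read Open Graph tags from the server-rendered HTML. A sketch of the video tags the SEO route could emit (URLs and dimensions are placeholders):

```html
<!-- Open Graph video tags served to crawlers; values are illustrative. -->
<meta property="og:video" content="https://example.com/clip.mp4" />
<meta property="og:video:secure_url" content="https://example.com/clip.mp4" />
<meta property="og:video:type" content="video/mp4" />
<meta property="og:video:width" content="1280" />
<meta property="og:video:height" content="720" />
```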
0 votes · 1 answer · 216 views
How can I use Firecrawl to crawl and take a screenshot of a webpage instead of using Playwright in Node.js?
I'm currently using Playwright in Node.js to capture screenshots of webpages, but I'm exploring Firecrawl and wondering if it can handle screenshots directly.
Here is what my Firecrawl code looks like with ...
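A sketch of requesting a screenshot through Firecrawl's hosted HTTP API instead of Playwright; the endpoint path and the `"screenshot"` / `"screenshot@fullPage"` format names are my assumption from Firecrawl's v1 API and should be checked against the current docs:

```python
import json
from urllib.request import Request, urlopen

# Assumption: Firecrawl v1 scrape endpoint.
API_URL = "https://api.firecrawl.dev/v1/scrape"


def build_screenshot_payload(url: str, full_page: bool = False) -> dict:
    # "screenshot" vs "screenshot@fullPage" selects viewport vs full-page capture.
    fmt = "screenshot@fullPage" if full_page else "screenshot"
    return {"url": url, "formats": [fmt]}


def request_screenshot(api_key: str, target_url: str) -> dict:
    """POST the scrape request and return the parsed JSON response."""
    req = Request(
        API_URL,
        data=json.dumps(build_screenshot_payload(target_url)).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urlopen(req, timeout=60) as resp:
        return json.load(resp)
```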
-2 votes · 1 answer · 117 views
Webscrape links to download files based on word in page HTML
I am webscraping WHO pages using the following code:
pacman::p_load(rvest, httr, stringr, purrr)
download_first_pdf_from_handle <- function(handle_id) {
...
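The question's code is R/rvest, but the "find the first PDF link on the page" step can be sketched with Python's standard library (class and function names are illustrative):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkFinder(HTMLParser):
    """Collect href attributes that point at .pdf files."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.lower().endswith(".pdf"):
                    self.links.append(value)


def first_pdf_link(html: str, base_url: str):
    """Return the first .pdf link on the page as an absolute URL, or None."""
    finder = PdfLinkFinder()
    finder.feed(html)
    return urljoin(base_url, finder.links[0]) if finder.links else None
```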
1 vote · 1 answer · 267 views
Firecrawl self-hosted crawler throws a "Connection violated security rules" error
I set up a self-hosted Firecrawl instance and I want to crawl my internal intranet site (e.g. https://intranet.xxx.gov.tr/).
I can access the site directly both from the host machine and from inside ...
0 votes · 1 answer · 65 views
Intermittent 406 Errors on Post Pages, Detected by Site Analyzers, Not Direct Browser Access
My WordPress site's post pages return intermittent HTTP 406 "Not Acceptable" errors, but only for site analysis/SEO tools (e.g., SEMrush, Ahrefs, GTmetrix). When accessed directly by ...
0 votes · 2 answers · 75 views
SitemapLoader(sitemap_url).load() hangs
from langchain_community.document_loaders import SitemapLoader

def crawl(self):
    print("Starting crawler...")
    sitemap_url = "https://gringo.co.il/sitemap.xml"
    ...
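One way to isolate a hang like this is to fetch and parse the sitemap directly with standard-library tools, bypassing the loader entirely; a diagnostic sketch (not the LangChain API):

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen

# Sitemaps use a fixed XML namespace.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(xml_text: str) -> list[str]:
    """Return the <loc> URLs from a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


def fetch_sitemap(url: str) -> str:
    # An explicit timeout avoids waiting forever on an unresponsive server.
    with urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8")

# Live usage: parse_sitemap(fetch_sitemap("https://gringo.co.il/sitemap.xml"))
```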
0 votes · 0 answers · 183 views
Crawl4AI token threshold not applied to raw html in arun
Here’s a brief overview of what I want to achieve:
Extract the raw HTML pages and save them
Use Crawl4AI to produce a ‘cleaner’ and smaller HTML that keeps a lot of information, including what I will eventually ...
0 votes · 0 answers · 87 views
Facebook Crawler Floods CPU usage to 100%
I'm running Facebook ads and today I woke up to see my server CPU at 100%.
I couldn't even use my website.
I did some research and found out it was a Facebook crawler sending excessive requests.
I tried ...
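One common mitigation is rate-limiting the crawler at the web server by user agent; this nginx sketch assumes the requests identify as `facebookexternalhit` (zone name and rate are illustrative, and `map`/`limit_req_zone` belong in the `http` context):

```nginx
# Map Facebook's crawler UA to a limiting key; an empty key is not limited.
map $http_user_agent $fb_crawler {
    default              "";
    "~*facebookexternalhit"  $binary_remote_addr;
}

limit_req_zone $fb_crawler zone=fbcrawl:10m rate=1r/s;

server {
    location / {
        limit_req zone=fbcrawl burst=5 nodelay;
        # ... existing configuration ...
    }
}
```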
0 votes · 0 answers · 84 views
Adding a user agent in Chrome options in Selenium
I'm performing data crawling on a webpage using Selenium. This is my code:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options ...
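Chrome accepts a user-agent override as a single command-line switch passed through `Options.add_argument`. A sketch (the UA string and function names are illustrative):

```python
def user_agent_argument(ua: str) -> str:
    # Chrome takes the override as a single "user-agent=<value>" switch.
    return f"user-agent={ua}"


def make_driver(ua: str):
    """Build a Chrome driver that reports the given user agent."""
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument(user_agent_argument(ua))
    return webdriver.Chrome(options=options)

# Live usage (launches a browser):
# driver = make_driver("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
#                      "AppleWebKit/537.36 (KHTML, like Gecko) "
#                      "Chrome/120.0.0.0 Safari/537.36")
```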
0 votes · 1 answer · 117 views
Substitute host name with its IP address in HTTPS requests
I'm working on a web crawler and I'm trying to understand how IP substitution works.
From what I have read, the DNS hostname should be resolved to its IP address (one of many) and used instead of the ...
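With HTTPS there is a subtlety: you can open the TCP connection to the resolved IP, but the original hostname must still be supplied for the TLS SNI extension (and the `Host` header), or the handshake and certificate validation fail. A stdlib sketch of that split (function names are illustrative):

```python
import socket
import ssl


def resolve_all(hostname: str) -> list[str]:
    """Return the distinct IP addresses DNS gives for a hostname."""
    infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def fetch_via_ip(hostname: str, ip: str, path: str = "/") -> bytes:
    """Connect to the raw IP, but keep the hostname for SNI and Host."""
    ctx = ssl.create_default_context()
    raw = socket.create_connection((ip, 443), timeout=10)
    # server_hostname drives both SNI and certificate validation.
    tls = ctx.wrap_socket(raw, server_hostname=hostname)
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {hostname}\r\n"
        "Connection: close\r\n\r\n"
    )
    tls.sendall(request.encode())
    chunks = []
    while chunk := tls.recv(4096):
        chunks.append(chunk)
    tls.close()
    return b"".join(chunks)
```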
2 votes · 1 answer · 929 views
How to use LLMConfig in crawl4ai?
I was testing this code from https://leonardo467.gumroad.com/l/cstsu, which uses crawl4ai, but it seems the library has been updated or something, because if you run it (with an API, so I use free ...
0 votes · 0 answers · 34 views
Transfermarkt scraper cannot get club name
I want to use the data from a Transfermarkt scraper in my code for my own purposes. I get all the desired data except the current club, but I can't get the club name. I tried all the ...
0 votes · 0 answers · 158 views
How to Extract Code Blocks from Different Tabs in a Code Documentation Using Crawl4AI (or any other tool)?
I'm trying to scrape code blocks from multiple tabs in a documentation page using Crawl4AI. While I'm able to extract Markdown content, the code blocks inside tabbed sections are not being captured.
...
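Tabbed documentation pages usually keep every tab panel in the DOM (hidden panels included), so the code blocks are often recoverable from the raw HTML even when a markdown conversion drops them. A stdlib sketch that collects the text inside `<code>` elements regardless of tab visibility (class and function names are illustrative):

```python
from html.parser import HTMLParser


class CodeBlockExtractor(HTMLParser):
    """Collect the text inside <code> elements, including hidden tab panels."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._depth = 0      # nesting level inside <code>
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "code":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "code" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.blocks.append("".join(self._buffer))
                self._buffer = []

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def extract_code_blocks(html: str) -> list[str]:
    parser = CodeBlockExtractor()
    parser.feed(html)
    return parser.blocks
```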
0 votes · 0 answers · 20 views
How to Determine Twitter API Token Lifetime and Rate Limits for Crawling with a Dummy Account?
How can I determine the token lifetime used for crawling Twitter with a dummy account, and what are the limitations per token?
What are the request rate limits per token for free and paid tiers?
How ...