How to speed up YaCy crawls?

Having much bigger hardware now, and much more network power, I would like to make my boxes sweat, but:
The crawler does not reach significantly more than 1000 PPM; the perceived average is 50.
Starting more crawls in parallel does not help. What limits the performance? What are the tricks to speed it up?
Should I use a DNS jumper? (Currently I use 1.1.1.1 or 8.8.8.8.) My ISP (Swisscom) constantly suspects me of being a virus… :wink: …when I have a lot of dead links in my start URLs.

Best regards

Markus

More RAM, more crawls, more bandwidth.

I don’t think DNS has anything to do with that, and I can recommend OpenNIC as DNS instead of Google and Cloudflare :slight_smile:

Also keep in mind that you must not crawl too fast, so you do not overload the websites; it would be like a DDoS, and everyone would start to block the yacybot crawler after that.


YaCy limits the crawl rate per domain to 120 pages per minute. We try to do domain balancing, so crawling several domains at the same time speeds up the crawling. That means with 10 different domains in the crawl queues, you can only reach 1200 PPM.
To speed up, you just need more different domains.
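As a rule of thumb following from that per-domain cap: theoretical max PPM ≈ 120 × number of distinct domains currently being fetched. For example, to reach 6000 PPM you need at least 50 domains active in the queues at the same time (50 × 120 = 6000).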


I have a list of 9000 universities worldwide, all different domains. It started with up to 3000 PPM, but broke down to 5-20 after 2 hours. Crawls started afterwards look like they have to wait / are enqueued.
Lots of URLs hit my blacklist, which has all the commercial crap on it like Tumblr, Facebook, Twitter and Amazon. I got the impression that many Asian URLs answer fairly slowly, so IMHO a simple increase of parallel threads should help, but where is the parameter?

CPU (12 cores) is at 15%, network at 5% and disk at 3%; 30 GB RAM allowed, but only 20 GB used.
3,200,000 URLs in the Local Crawler queue.
DNS definitely is a suspect: too many unknown-domain lookups lead to lockouts after a few hours. OpenNIC is a good hint, thanks, I will try it. Have a look at DNS Jumper.

I am thinking about a parallel setup: 10 YaCy instances on one machine, with their Solrs connected. If it works: increase the number of machines. Unfortunately, the index browser does not work anymore if several Solrs are connected. Even worse: the results depend on which YaCy instance you use for searching; they are totally different. It looks like the search almost always uses the local Solr.
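In case it helps anyone trying the same: a minimal sketch of how such a multi-instance setup could be started with the YaCy Docker image. The instance count, ports and volume paths are illustrative assumptions, not a tested recipe:

```
# sketch: run 10 YaCy instances on one host, each on its own port
# and with its own DATA volume (names and paths are illustrative)
for i in $(seq 0 9); do
  docker run -d --name "yacy$i" \
    -p $((8090 + i)):8090 \
    -v "$HOME/yacy$i/DATA:/opt/yacy_search_server/DATA" \
    yacy/yacy_search_server:latest
done
```

Connecting the instances to a shared or external Solr would still be a separate step in each instance's admin interface.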


Also, maybe configure a good firewall and change your IP?
A server scraping the whole web gets attacked very quickly, or its IP can be blacklisted.

Hello Zooom,

have you found the reason for the performance limit in the meantime?
Thank you in advance,

Urs

After a reboot, PPM starts high (up to ~300-500) but falls to ~10 after a few minutes.
I crawl a list of 1 million different domains. I guess, since many of them time out, I should have a lot more crawler queues. How can I increase the number?

P.S.: I use OpenNIC now, but w/o a change in performance.

I guess DNS is the problem. Lookups for several million different domain names in the queue will surely be throttled by your DNS provider.

Has anyone thought about this issue?

I am afraid the only solution is to run at least one DNS (cache) server by yourself.

Any experience or recommendations here, e.g. what software / setup to use?

The DNS limit is easy to fix: just run unbound locally.
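For reference, a minimal sketch of what such a local caching resolver could look like in /etc/unbound/unbound.conf; the cache sizes and TTLs here are illustrative assumptions, tune them to your load:

```
server:
    interface: 127.0.0.1
    access-control: 127.0.0.0/8 allow
    num-threads: 4            # roughly one per core you can spare
    msg-cache-size: 64m
    rrset-cache-size: 128m    # unbound docs suggest ~2x msg-cache-size
    cache-min-ttl: 300        # keep answers at least 5 minutes
    prefetch: yes             # refresh popular entries before they expire
```

Then point the crawler host at it, e.g. by putting `nameserver 127.0.0.1` first in /etc/resolv.conf.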

Thanks a lot. I was thinking about running a local BIND DNS, but now I am reading the docs for unbound.
I guess I have to load the domain/IP address data as “domain overrides” somehow first…
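If by “domain overrides” you mean pinning known host/IP pairs locally, unbound can serve static records via local-data entries; a sketch with a made-up host and a TEST-NET address:

```
server:
    # hypothetical example: answer this host locally instead of recursing
    local-data: "www.example-university.edu. IN A 192.0.2.10"
```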

OK. Looks promising.

I run a local unbound now with a list of public DNS servers in round-robin mode. 250K different domains crawl at 1500 PPM now. Important: restart YaCy first, otherwise you run out of memory or get stuck in some other way.
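For anyone replicating this: distributing queries over several upstream resolvers can be done with a forward-zone. A sketch, where the server choice is illustrative (unbound spreads queries over the forward-addr entries, preferring fast ones, which approximates the round-robin pool described above):

```
forward-zone:
    name: "."                 # forward everything not answered locally
    forward-addr: 1.1.1.1
    forward-addr: 8.8.8.8
    forward-addr: 9.9.9.9
    # add OpenNIC servers from their current public list as well
```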

Very helpful for me was this unbound tutorial:

https://calomel.org/unbound_dns.html

unbound is really CPU-consuming. Make sure your box is powerful enough.
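A hedged tuning sketch for that case, following the general threading advice in the unbound documentation (the values for a 12-core box are illustrative):

```
server:
    num-threads: 12           # roughly one thread per core
    so-reuseport: yes         # one socket per thread, kernel load-balances
    # slabs: a power of 2 close to num-threads reduces lock contention
    msg-cache-slabs: 16
    rrset-cache-slabs: 16
    infra-cache-slabs: 16
    key-cache-slabs: 16
```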