How to speed up YaCy crawls?

Now that I have much bigger hardware and much more network capacity, I would like to make my boxes sweat, but:
The crawler does not get significantly above 1000 PPM (pages per minute); the perceived average is 50.
Starting more crawls in parallel does not help. What limits the performance? What are the tricks to speed it up?
Should I use a DNS Jumper? I currently use 1.1.1.1 or 8.8.8.8, because my ISP (Swisscom) constantly suspects me of being a virus… :wink: …whenever I have a lot of dead links in my start URLs.

Best regards

Markus

More RAM, more crawls, more bandwidth.

I don’t think DNS has anything to do with that, but I can recommend OpenNIC as a DNS provider instead of Google and Cloudflare :slight_smile:

Also keep in mind that you must not crawl too fast, so as not to overload the websites. It would be like a DDoS, and everyone would start blocking the yacybot crawler after that.

YaCy limits the crawl rate per domain to 120 pages per minute. We try to balance the load across domains, so crawling several domains at the same time speeds up the crawl. But that means that with 10 different domains in the crawl queues, you can reach at most 1200 ppm.
To speed up, you just need more different domains.
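
To make the arithmetic concrete, here is a minimal sketch of that ceiling. It only multiplies the 120 pages per minute figure quoted above by the number of domains; the function names are made up for illustration and are not part of YaCy.

```python
# Illustrative only: 120 pages/minute per domain is the figure quoted in this thread.
PER_DOMAIN_PPM = 120


def max_crawl_ppm(active_domains: int, per_domain_ppm: int = PER_DOMAIN_PPM) -> int:
    """Upper bound on pages per minute when every domain is throttled equally."""
    return active_domains * per_domain_ppm


def domains_needed(target_ppm: int, per_domain_ppm: int = PER_DOMAIN_PPM) -> int:
    """Smallest number of distinct domains whose combined limit reaches target_ppm."""
    return -(-target_ppm // per_domain_ppm)  # ceiling division without floats


print(max_crawl_ppm(10))     # 1200
print(domains_needed(3000))  # 25
```

This is only an upper bound: slow servers, blacklist hits and an uneven spread of URLs over domains will all keep the real rate below it.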

I have a list of 9000 universities worldwide, all on different domains. The crawl started at up to 3000 ppm, but dropped to 5-20 ppm after 2 hours. Crawls started afterwards look as if they have to wait or are enqueued.
Lots of URLs hit my blacklist, which has all the commercial crap on it, like tumblr, facebook, twitter and amazon. I got the impression that many Asian URLs answer fairly slowly, so IMHO a simple increase in parallel threads should help, but where is the parameter?
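
A rough sketch of why slow answers would matter, assuming each crawler thread fetches one page at a time and then immediately starts the next. The thread counts and response times below are invented numbers, not measured YaCy values, and the function name is hypothetical.

```python
def latency_bound_ppm(threads: int, avg_response_seconds: float) -> float:
    """Pages per minute if every thread fetches pages back to back."""
    if avg_response_seconds <= 0:
        raise ValueError("avg_response_seconds must be positive")
    # Each thread completes 60 / avg_response_seconds pages per minute.
    return threads * 60.0 / avg_response_seconds


# Hypothetical figures, for illustration only.
print(latency_bound_ppm(200, 2.0))   # 6000.0
print(latency_bound_ppm(200, 30.0))  # 400.0
```

Under that assumption, the same number of threads delivers far fewer pages once most of the remaining hosts are slow or time out, which would fit a crawl that starts fast and then collapses while CPU, network and disk stay idle.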

CPU (12 cores) is at 15%, network at 5% and disk at 3%; 30 GB RAM allowed, but only 20 GB used.
3,200,000 URLs are in the Local Crawler queue.
DNS definitely is a suspect. Too many lookups of unknown domains lead to lockouts after a few hours. OpenNIC is a good hint, thx. I will try it, and also have a look at DNSJumper.
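
One thing I could try before feeding the list to the crawler is to weed out the dead hosts once, up front. This is only a sketch that runs outside YaCy: `filter_resolvable` and `default_resolves` are names I made up, and whether this actually avoids the lockouts is an assumption, since the lookups still happen (just once per host).

```python
from __future__ import annotations

import socket
from typing import Callable, Iterable
from urllib.parse import urlsplit


def default_resolves(host: str) -> bool:
    """Return True if the system resolver finds an address for host."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False


def filter_resolvable(
    urls: Iterable[str],
    resolves: Callable[[str], bool] = default_resolves,
) -> list[str]:
    """Keep only URLs whose host resolves; each host is looked up once."""
    cache: dict[str, bool] = {}
    kept: list[str] = []
    for url in urls:
        host = urlsplit(url).hostname
        if not host:
            continue  # skip entries that are not absolute URLs
        if host not in cache:
            cache[host] = resolves(host)
        if cache[host]:
            kept.append(url)
    return kept
```

Calling `filter_resolvable(start_urls)` returns the start URLs whose hosts still exist. The `resolves` argument can be swapped for a function that queries a different resolver or sleeps between lookups to stay under a rate limit.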

I am thinking about a parallel setup: 10 YaCy instances on one machine, with their Solrs connected. If it works, I will increase the number of machines. Unfortunately, the index browser no longer works if several Solrs are connected. Even worse, the results depend on which YaCy instance you use for searching: they are totally different. It looks like the search almost always uses the local Solr.

Also, maybe configure a good firewall and change your IP?
A server scraping the whole web gets attacked very quickly, or its IP can be blacklisted.

Hello Zooom,

Did you find the reason for the performance limit in the meantime?
Thank you in advance,

Urs