How to speed up YaCy crawls?

Having much bigger hardware now, and much more network power, I would like to make my boxes sweat, but:
The crawler does not reach significantly more than 1000 PPM; the perceived average is around 50.
Starting more crawls in parallel does not help. What limits the performance? What are the tricks to speed it up?
Should I use a DNS jumper? (Currently I use 1.1.1.1 or 8.8.8.8.) My ISP (Swisscom) constantly suspects me of being a virus… :wink: …when I have a lot of dead links in my start URLs.

Best regards

Markus

More RAM, more crawls, more bandwidth.

I don’t think DNS has anything to do with that, and I would advise OpenNIC as DNS instead of Google and Cloudflare :slight_smile:

Also keep in mind that you must not crawl too fast, so that you don't overload the websites; it would be like a DDoS, and everyone would start to block the yacybot crawler after that.


YaCy limits crawling per domain to 120 pages per minute. We do domain balancing, so crawling several domains at the same time speeds up the crawl. But that means that with 10 different domains in the crawl queues, you can only reach 1200 PPM.
To speed up, you just need more different domains.
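
To make that ceiling concrete, here is a rough back-of-the-envelope sketch (assuming the 120 pages/minute per-domain cap is the only limit and every queued domain can actually be fetched at full speed):

    # theoretical PPM ceiling = number of simultaneously crawled domains x 120
    for domains in 1 10 100; do
      echo "$domains domains -> at most $((domains * 120)) PPM"
    done

In practice, slow hosts, robots.txt fetches and DNS lookups keep you well below that ceiling.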


I have a list of 9000 universities worldwide, all different domains. It started with up to 3000 PPM, but broke down to 5-20 after 2 hours. Crawls started afterwards look like they have to wait / are enqueued.
Lots of URLs hit my blacklist, which has all the commercial crap on it like Tumblr, Facebook, Twitter and Amazon. I got the impression that many Asian URLs answer fairly slowly, so IMHO a simple increase of parallel threads should help, but where is the parameter?

CPU (12 cores) is at 15%, network at 5% and disk at 3%; 30 GB RAM is allowed, but only 20 GB is used.
3,200,000 URLs are in the local crawler queue.
DNS definitely is a suspect. Too many unknown domain lookups lead to lockouts after a few hours. OpenNIC is a good hint, thanks, I will try it. Have a look at DNSJumper.

I am thinking about a parallel setup: 10 YaCy instances on one machine, with their Solrs connected. If it works: increase the number of machines. Unfortunately the index browser does not work anymore if several Solrs are connected. Even worse: it depends which YaCy instance you use for searching; results are totally different. It looks like the search almost always uses the local Solr.


Also, maybe configure a good firewall and change your IP?
A server scraping the whole web gets attacked very quickly, or its IP can get blacklisted.


Hello Zooom,

have you found the reason for the performance limit in the meantime?
Thank you in advance,

Urs


After a reboot, PPM starts high (up to ~300-500) but falls to ~10 after a few minutes.
I crawl a list of 1 million different domains. I guess that, since many of them time out, I should have a lot more crawler queues. How can I increase their number?

P.S.: I use OpenNIC now, but without any change in performance.


I guess DNS is the problem. Having several million different domain names in the queue will surely get throttled by your setup's DNS provider.

Has anyone thought about this issue?

I am afraid the only solution is to run at least one DNS (cache) server by yourself.

Any experience or recommendations here, e.g. what software / setup to use?


DNS limit is easy to fix. Just run unbound locally.
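
For example, on a Debian/Ubuntu-style box the basic setup is roughly the following (package name, service name and the blunt resolv.conf overwrite are assumptions; adapt to your distribution):

    sudo apt install unbound              # install the caching resolver
    sudo systemctl enable --now unbound   # start it and keep it enabled
    # point the host (and therefore YaCy's JVM) at the local cache:
    echo "nameserver 127.0.0.1" | sudo tee /etc/resolv.conf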


Thanks a lot. I was thinking about running a local BIND DNS, but now I am reading the docs for unbound.
I guess I have to load the domain/IP address data as “domain overrides” somehow first (see the sketch below)…

OK. Looks promising.
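
For reference, such “domain overrides” in unbound would look roughly like this (hypothetical zone name and an RFC 5737 example address; plain forwarding, as described below, turned out to be enough):

    server:
        # answer queries for this name locally, bypassing upstream resolvers
        local-zone: "example.org." redirect
        local-data: "example.org. IN A 192.0.2.10"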

I run a local unbound now with a list of public DNS servers in round-robin mode. 250K different domains crawl at 1500 PPM now. Important: restart YaCy first, otherwise you run out of memory or get stuck in some other way.

Very helpful for me was this unbound tutorial:

https://calomel.org/unbound_dns.html

Unbound is really CPU-consuming. Make sure your box is powerful enough.

Hi all,

I would like to share my unbound.conf here, as this speeds things up incredibly.
The trick is round robin (rrset-roundrobin: yes) and a huge list of DNS resolvers (resolver.inc):

server:
    access-control: 10.0.0.0/8 allow
    access-control: 127.0.0.0/8 allow
    access-control: 192.168.0.0/16 allow
    aggressive-nsec: yes
    cache-max-ttl: 14400
    cache-min-ttl: 1200
    hide-identity: yes
    hide-version: yes
    interface: 0.0.0.0
    prefetch: yes
    rrset-roundrobin: yes
    use-caps-for-id: yes
    verbosity: 1

    num-threads: 24
    msg-cache-slabs: 16
    rrset-cache-slabs: 16
    infra-cache-slabs: 16
    key-cache-slabs: 16
    msg-cache-size: 2048M
    rrset-cache-size: 4096M
    outgoing-range: 16348
    num-queries-per-thread: 4096

    infra-cache-numhosts: 1000000

    serve-expired: yes

forward-zone:
    name: "."
    include: resolver.inc


Command to generate the resolver.inc by downloading a great list from the web:

curl -s "https://public-dns.info/nameservers.txt" | perl -e 'while(<>){chomp;print "forward-addr: " . $_ . "\n";}' > resolver.inc
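
To sanity-check the result before relying on it (unbound-checkconf and dig are standard tools; the systemd service name is an assumption):

    unbound-checkconf                 # validate unbound.conf syntax
    sudo systemctl restart unbound
    dig @127.0.0.1 yacy.net +short    # should answer via the local cache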

There is already a DNS cache inside of YaCy!
You can see it in http://localhost:8090/PerformanceMemory_p.html … search for “DNS” somewhere in the lower part of the page.

The main reason to have the cache was that Java had (has?) a well-known DNS cache deadlock bug. Crawling was deadlocked in very old YaCy versions, so we made a workaround for the JVM DNS cache (which also exists!).
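
If you want to tune the JVM side as well, the relevant knobs are the networkaddress.cache.ttl security property and the legacy sun.net.inetaddr.ttl system properties; a minimal sketch (whether your YaCy start script picks up JAVA_OPTS like this is an assumption):

    # cache successful lookups for 5 minutes, negative answers for 10 seconds
    export JAVA_OPTS="$JAVA_OPTS -Dsun.net.inetaddr.ttl=300 -Dsun.net.inetaddr.negative.ttl=10"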


OK. DNS is not the bottleneck anymore, but PPM is still below 30 when crawling 300K different domains.

Crawling the robots.txt files at the very beginning of a crawl seems to use all resources properly:

  1. The network access grid shows all slots occupied with robots.txt (right side)
  2. CPU load is at full speed (Unix “top” shows unbound and java heavily loaded)

But: when starting the “real” crawl of content, the network grid shows only 2-3 slots occupied and CPU usage falls below 5%.

Why are the robots.txt files crawled so fast, using all queues, but not the rest?

What is the reason? Which parameter do I need to fix?


Deeply hidden in my comment in “YACY Web interface hangs after some time” there is a hint:

  • with the latest commits, the default speed is now 4 pages/s instead of 2 pages/s for a single host
  • you can now select googlebot as user agent; this will increase that to 40 pages/s (see the conversion below)

However, host load balancing is still in place, including the flux rule.
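
Converted into the PPM numbers used in this thread, those single-host rates work out as follows (simple arithmetic per host, before balancing and flux effects):

    for rate in 2 4 40; do
      echo "$rate pages/s per host = $((rate * 60)) PPM per host"
    done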


Unbelievable. Using the googlebot user agent for crawling leads to almost 200 PPM vs. 60 PPM before.
Do you have an explanation?


That's too low; I tested here with up to 1500 PPM.
I also made a change in the host selection process which may or may not be helpful.


I have no idea what I am doing wrong. I run 3 different, independent data centers in 2 countries (DE and CH), with 3 different ISPs.
It's all the same at all sites. I tried everything: physical machines, VMs, 100 GB RAM. All the same.
Lists of single domains as start URLs now run at 200 PPM (level 0). My instance with some hundred different news start URLs (level 3) crawls at ~1200 PPM. My unbound service is very busy doing round robin over several hundred DNS resolvers, as expected; mostly a factor of 10 busier than java.

I run IPFire as a firewall on a dedicated machine. Do I need more sophisticated network equipment? Is it a latency issue?

I would think that at some point a limiting factor would be the speed of the servers and networks that the websites being crawled are hosted on.

Crawling hundreds or thousands of pages a minute seems quite exceptional to me, considering that during ordinary web browsing, in my experience, it often takes quite some time, sometimes a full minute or more, for ONE website to respond.

That could be due to heavy traffic, congestion on the network, distance to the server, speed of the server, etc. (Another bottleneck on the network could be government surveillance of the internet backbone. Presumably.)

OK, I have 450 Mbit and used a wide crawl with many hosts…
The balancer scaled up to about 50 loading threads, while 200 are reserved and 150 had not been used because the balancer stopped them from loading so far.