How to speed up YaCy crawls?

The DNS limit is easy to fix: just run unbound locally.
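For reference, a minimal sketch of what "run unbound locally" can look like on a Debian/Ubuntu box (package and service names are assumptions and may differ on other distros):

# install and start a local unbound instance (Debian/Ubuntu assumed)
sudo apt install unbound
sudo systemctl enable --now unbound

# point the machine running YaCy at the local resolver; adjust this if
# systemd-resolved or NetworkManager manages /etc/resolv.conf for you
echo "nameserver 127.0.0.1" | sudo tee /etc/resolv.conf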

3 Likes

Thanks a lot. I was thinking about running a local BIND DNS server, but now I am reading the docs for unbound.
I guess I have to load the domain/IP Address data as “domain overrides” somehow first…

OK. Looks promising.

I run a local unbound now with a list of public DNS servers in round-robin mode. 250K different domains now crawl at 1500 PPM. Important: restart YaCy first, otherwise you run out of memory or get stuck in some other way.
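As a sketch, a quick sanity check plus the restart sequence (assuming a tarball install with the stock start/stop scripts; a packaged install would restart its service instead):

# verify the local unbound answers queries
dig @127.0.0.1 yacy.net +short

# restart YaCy so it picks up the new resolver and starts with a clean heap
cd /opt/yacy        # hypothetical install path, adjust to yours
./stopYACY.sh && ./startYACY.sh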

This unbound tutorial was very helpful for me:

https://calomel.org/unbound_dns.html

unbound is really CPU-hungry. Make sure your box is powerful enough.
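If you want to see how busy unbound actually is, something along these lines works (pidstat comes from the sysstat package; unbound-control needs control-enable: yes in unbound.conf):

# per-process CPU usage of unbound, sampled every 5 seconds
pidstat -u 5 -C unbound

# cache hit/miss counters, if remote control is enabled
unbound-control stats_noreset | grep -E 'total\.num\.(queries|cachehits|cachemiss)'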

Hi all,

I would like to share my unbound.conf here, as it speeds things up incredibly.
The trick is “round robin” (rrset-roundrobin: yes) and a huge list of DNS resolvers (resolver.inc):

server:
    access-control: 10.0.0.0/8 allow
    access-control: 127.0.0.0/8 allow
    access-control: 192.168.0.0/16 allow
    aggressive-nsec: yes
    cache-max-ttl: 14400
    cache-min-ttl: 1200
    hide-identity: yes
    hide-version: yes
    interface: 0.0.0.0
    prefetch: yes
    # hand out cached record sets in rotating order
    rrset-roundrobin: yes
    use-caps-for-id: yes
    verbosity: 1

    # thread and cache sizing for a large, busy resolver; scale down on smaller boxes
    num-threads: 24
    msg-cache-slabs: 16
    rrset-cache-slabs: 16
    infra-cache-slabs: 16
    key-cache-slabs: 16
    msg-cache-size: 2048M
    rrset-cache-size: 4096M
    outgoing-range: 16348
    num-queries-per-thread: 4096

    infra-cache-numhosts: 1000000

    serve-expired: yes

# forward everything to the resolvers listed in resolver.inc
forward-zone:
    name: "."
    include: resolver.inc


Command to generate resolver.inc by downloading a large public resolver list from the web:

curl -s "https://public-dns.info/nameservers.txt" | perl -e 'while(<>){chomp;print "forward-addr: " . $_ . "\n";}' > resolver.inc
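One caveat with that list: it also contains IPv6 resolvers, which only produce errors on a host without IPv6 connectivity. A variant that keeps IPv4 entries only, plus a config check and reload (paths are the usual Debian defaults and may differ on your system):

# IPv4-only variant of the resolver list
curl -s "https://public-dns.info/nameservers.txt" | grep -E '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$' | sed 's/^/forward-addr: /' > resolver.inc

# validate the configuration and reload unbound with the new list
unbound-checkconf /etc/unbound/unbound.conf
sudo systemctl restart unbound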
2 Likes

There is already a DNS cache inside of YaCy!
You can see it in http://localhost:8090/PerformanceMemory_p.html … search for “DNS” somewhere in the lower part of the page.
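For a quick look without the browser, something like this can pull the relevant rows (assumes the default port 8090 and that localhost is allowed admin access):

# print the lines around the DNS cache counters from the performance page
curl -s "http://localhost:8090/PerformanceMemory_p.html" | grep -i -A 3 "dns"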

The main reason to have the cache was that Java had (has?) a well-known DNS cache deadlock bug. Crawling deadlocked in very old YaCy versions, so we built a workaround around the JVM's own DNS cache (which also exists!).

1 Like

OK. DNS is not the bottleneck anymore, but PPM is still below 30 when crawling 300K different domains.

Crawling the robots.txt at the very beginning of a crawl seems to use all resources properly:

  1. The network access grid shows all slots occupied with robots.txt (right side)
  2. CPU load is at full speed (Unix "top" shows unbound and java heavily loaded)

But when the "real" crawl of content starts, the network grid shows only 2-3 slots occupied and CPU usage falls below 5%.

Why are the robots.txt files crawled so fast, using all queues, but not the rest?

What is the reason? Which parameter do I need to change?

1 Like

Deeply hidden in my comment in "YaCy web interface hangs after some time" there is a hint:

  • with the latest commits, the default speed is now 4 pages/s instead of 2 pages/s for a single host
  • you can now select googlebot as user agent; this will increase that to 40 pages/s

However, host load balancing is still in place, including the flux rule.
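If you want to see what your instance is actually configured to do, the crawl delay and user-agent related entries live in the runtime configuration; key names differ between versions, so grepping is safer than guessing exact names. A sketch, run from the YaCy application directory:

# show throttling and user-agent related settings of the running instance
grep -iE 'delta|agent' DATA/SETTINGS/yacy.conf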

1 Like

Unbelievable. Using the googlebot user agent for crawling leads to almost 200 PPM vs. 60 PPM before.
Do you have an explanation?

1 Like

That's too low; I tested here with up to 1500 PPM.
I also made a change in the host selection process which may or may not be helpful.

1 Like

I have no idea what I am doing wrong. I run 3 different, independent data centers in 2 countries (DE and CH), with 3 different ISPs.
It's all the same at all sites. I tried everything: physical machines, VMs, 100 GB RAM. All the same.
Lists of single domains as start URLs now run at 200 PPM (level 0). My instance with a few hundred different news start URLs (level 3) crawls at ~1200 PPM. My unbound service is very busy doing round robin over several hundred DNS resolvers, as expected; its CPU usage is mostly a factor of 10 higher than Java's.

I run IPFire as a firewall on a dedicated machine. Do I need more sophisticated network equipment? Is it a latency issue?

I would think that at some point a limiting factor would be the speed of the servers and networks that the websites being crawled are hosted on.

Crawling hundreds or thousands of pages a minute seems quite exceptional to me, considering that during ordinary web browsing, in my experience, it often takes quite some time, sometimes a full minute or more, for ONE website to respond.

That could be due to heavy traffic, congestion on the network, distance to the server, speed of the server, etc. (Another bottleneck on the network could be government surveillance of the internet backbone. Presumably.)

OK, I have 450 Mbit and used a wide crawl with many hosts…
The balancer scaled up to about 50 loading threads … while 200 are reserved, so 150 were left unused because the balancer held them back from loading so far.

Is this hardcoded or is there a parameter to tweak?
Where can I watch these numbers?
How can I double the number of crawling slots and the timeout?
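For what it's worth, the loader slots and HTTP client timeout are plain entries in DATA/SETTINGS/yacy.conf. The key names below (crawler.MaxActiveThreads, crawler.clientTimeout) are assumptions based on yacy.init and may differ in your version, so check what is actually there first and edit only while YaCy is stopped:

# see what the current values are (key names assumed, verify in your version)
grep -iE 'crawler\.(maxactivethreads|clienttimeout)' DATA/SETTINGS/yacy.conf

# example: double the loader slots and set a 60 s HTTP timeout (in milliseconds)
sed -i 's/^crawler.MaxActiveThreads *=.*/crawler.MaxActiveThreads = 400/' DATA/SETTINGS/yacy.conf
sed -i 's/^crawler.clientTimeout *=.*/crawler.clientTimeout = 60000/' DATA/SETTINGS/yacy.conf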

I can see tons of timeout errors in the rejected URLs list, although the hosts can be reached, and I do not think that all of them have "I do not want to be crawled automatically" restrictions in place.

(although this Cloudflare garbage is getting more and more popular)

If this is the reason for the poor performance, shouldn't the “Network Access” page show a full workload on the right side (which it only does during GET robots.txt)?

Time URL Fail-Reason
2020/12/26 17:43:48 http://www.www.gasthaus-rosengarten.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - Client can’t execute: www.www.gasthaus-rosengarten.ch duration=144 for url http://www.www.gasthaus-rosengarten.ch/
2020/12/26 17:43:48 https://www.chin-min.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - Client can’t execute: www.chin-min.ch duration=155 for url https://www.chin-min.ch/
2020/12/26 17:43:48 https://www.kulturgut.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - CRAWLER Redirect of URL=https://www.kulturgut.ch/ to https://www.denkmal.ch/ placed on crawler queue for double-check
2020/12/26 17:43:47 https://www.schoenzeit.webstores.ch/robots.txt TEMPORARY_NETWORK_FAILURE no response body (http return code = 404)
2020/12/26 17:43:47 http://cajon.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - CRAWLER Redirect of URL=http://cajon.ch/ to https://cajon.ch/ placed on crawler queue for double-check
2020/12/26 17:43:47 https://www.dalucia.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - Client can’t execute: www.dalucia.ch duration=356 for url https://www.dalucia.ch/
2020/12/26 17:43:47 https://www.sineq.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - Client can’t execute: www.sineq.ch duration=468 for url https://www.sineq.ch/
2020/12/26 17:43:47 https://www.schoenshop.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - CRAWLER Redirect of URL=https://www.schoenshop.ch/ to https://www.schoenzeit.webstores.ch/ placed on crawler queue for double-check
2020/12/26 17:43:46 https://www.heime-consulting.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - CRAWLER Redirect of URL=https://www.heime-consulting.ch/ to https://www.heime-consulting.ch/home.html placed on crawler queue for double-check
2020/12/26 17:43:46 http://www.www.telefonmonteur.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - Client can’t execute: www.www.telefonmonteur.ch duration=135 for url http://www.www.telefonmonteur.ch/
2020/12/26 17:43:46 https://suli.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - CRAWLER Redirect of URL=https://suli.ch/ to https://www.suli.ch/ placed on crawler queue for double-check
2020/12/26 17:43:46 https://www.alpsu.ch/robots.txt TEMPORARY_NETWORK_FAILURE no response body (http return code = 404)
2020/12/26 17:43:46 https://citywettingen.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - CRAWLER Redirect of URL=https://citywettingen.ch/ to https://www.citywettingen.ch/ placed on crawler queue for double-check
2020/12/26 17:43:46 http://www.alpsu.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - CRAWLER Redirect of URL=http://www.alpsu.ch/ to https://www.alpsu.ch/ placed on crawler queue for double-check
2020/12/26 17:43:46 https://www.tdcag.ch/robots.txt TEMPORARY_NETWORK_FAILURE no response body (http return code = 404)
2020/12/26 17:43:46 https://stattboden-riet.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - Client can’t execute: stattboden-riet.ch duration=73 for url https://stattboden-riet.ch/
2020/12/26 17:43:46 https://www.lenzerheide2020.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - CRAWLER Redirect of URL=https://www.lenzerheide2020.ch/ to https://www.biathlon-lenzerheide.swiss/ placed on crawler queue for double-check
2020/12/26 17:43:45 https://www.reklamegrafik.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - CRAWLER Redirect of URL=https://www.reklamegrafik.ch/ to http://www.artatelier.ch/ placed on crawler queue for double-check
2020/12/26 17:43:45 https://www.etude-avocat-belhocine.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - Client can’t execute: www.etude-avocat-belhocine.ch duration=464 for url https://www.etude-avocat-belhocine.ch/
2020/12/26 17:43:45 http://tdcag.ch/ TEMPORARY_NETWORK_FAILURE cannot load: load error - CRAWLER Redirect of URL=http://tdcag.ch/ to https://www.tdcag.ch/ placed on crawler queue for double-check
2020/12/26 17:43:44 https://suxessm
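To check whether those timeouts are real, it helps to time one of the rejected hosts by hand and compare against the crawler's timeout (the host below is taken from the log above):

# measure DNS, connect and total time for one of the rejected URLs
curl -o /dev/null -sS --max-time 30 \
  -w 'dns=%{time_namelookup}s connect=%{time_connect}s total=%{time_total}s\n' \
  https://www.chin-min.ch/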

If this were the case, doubling the number of queues should double the PPM.

Hi.

I had the high speed some of you are referring to here only on autocrawl. I got around 4000 PPM and easily filled up a 500-thread loader. But when I run a manual advanced crawl, it's extremely slow, no matter how many different URLs are processed.

I downloaded some datasets from domainsproject.org, then joined and shuffled the final file. This provided a really wide set of URLs. Then I loaded 1M of them into YaCy. I tried one big file and several small files, with the same result: PPM is 0-10 at best. I know YaCy is working with this domain set, because my internal DNS resolver got hammered really badly when I loaded the file, but after that it stops. When I terminate all the crawls, YaCy catches up again and PPM rises.
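The preparation step described above can be done with standard tools; a rough sketch (the domains/ directory and the 1M cutoff are just the values mentioned here):

# merge the domainsproject.org files, de-duplicate, shuffle and turn
# the first 1M entries into start URLs for YaCy
cat domains/*.txt | sort -u | shuf | head -n 1000000 | sed 's#^#http://#' > starturls.txt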

1 Like

Hi casey,
many Internet providers give you full speed for the first few minutes and then throttle your connection down to a minimum after a while; this makes speed tests show the best results.
The second thing is DNS; check the “unbound” posts earlier in this thread.
Cheers

M

Well, that was not my issue. I have symmetric gigabit speed on an academic network, and there is no aggregation or FUP. Second, I had a dedicated DNS server only for YaCy search. It helped a lot when I loaded a big DNS dataset, but not much for crawl results. I stopped playing with YaCy a while ago, so I don't need a fix for that anymore. But thanks.

I worked again on the crawling code and made two enhancements: one to reduce blocking and another to speed up the start of crawls with very large URL start lists.

Hopefully this helps a bit.

I am having the same issue (with the latest master build).

Specs:

  • CPU: Intel i7-6700 (dedicated)
  • Mem: 32 GB (24 GB allocated to YaCy)
  • Disks: SSD (RAID 1)
  • Network: 1Gbit (no throttling / Hetzner)

Usage

I monitor resource usage via Munin, but nothing gets even remotely close to its maximum.

Observation

Even with many (1000+) different domains in the queue, I hardly get beyond 2-6 crawling threads. Either something further down the line is blocking (maybe the indexing?), or there are so many URLs in the queue that computing the next URL to crawl becomes slow? Currently my queue is at 13M links.
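One way to watch where it stalls is to poll the status API from the shell instead of the web UI (this assumes the default port 8090, localhost admin access, and that your build exposes /api/status_p.xml):

# print PPM and queue sizes every 10 seconds
watch -n 10 'curl -s "http://localhost:8090/api/status_p.xml" | grep -iE "ppm|queue"'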

I managed to get up to 1438 PPM today; the secret was to crawl a LOT of sites simultaneously, in my case 25 sites.

1 Like