YaCy web interface hangs after some time

I run one instance in “Robinson Mode” to crawl a defined list of domains (~1 million), and the GUI comes up but hangs after a few minutes of crawling. The Java process sits at ~700% CPU; maybe this is the reason for the hanging? When I start the crawl job (1 level deep, max. 100 links per domain), it first reads the robots.txt files. No clue why, because I unchecked the option to obey their rules. Then the domains themselves are crawled. When I reboot the machine after the GUI has frozen, it continues to crawl the links at level 1. Then it runs like hell, and the GUI hangs again after a few minutes. Not funny.

Sorry for all the inconvenience. I am working on this, but it looks like a hard problem because there are no hints sufficient to identify the cause. I’m on it.

So what I have done so far is:

  • set up several old versions and ran them to see whether the problem occurs only in recent versions
  • added some configuration in the startup of the HTTP server process to provide more servlet threads
  • added a forced garbage collection every 10 minutes; maybe this helps to reduce load on the host
  • added automated storage of thread dumps to DATA/LOG/threaddump.txt, to be able to get such a dump even when the web interface is not available and kill -3 does not work (e.g. inside Docker containers); see the sketch below
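
In essence, the periodic dump works like the following minimal sketch (illustrative only, not the actual YaCy code; class and method names are made up):

    // Periodically write all thread stack traces to a file, so that a dump
    // is available even when the web UI hangs and kill -3 is not usable.
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.Map;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class ThreadDumpWriter {
        public static void schedule(final String path, final long minutes) {
            Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
                try (PrintWriter out = new PrintWriter(path)) {
                    for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
                        out.println(e.getKey().getName() + " (" + e.getKey().getState() + ")");
                        for (StackTraceElement frame : e.getValue()) out.println("    at " + frame);
                        out.println();
                    }
                } catch (IOException ex) {
                    ex.printStackTrace(); // real code would log this properly
                }
            }, 0, minutes, TimeUnit.MINUTES);
        }
    }

With something like ThreadDumpWriter.schedule("DATA/LOG/threaddump.txt", 10) running, the dump can simply be read from that file.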

Unfortunately the bug did not show up in any of the freshly set-up test peers, so I either have to wait until it happens again or find the cause through debugging (which looks hard right now, as nothing can be seen).

  • updated Jetty from 9.4.17 to 9.4.35 and fixed a (bad!) bug in SSI handling (how was it possible that this ever worked?)

However, that’s just another shot in the dark.

OK, solved for the moment. I guess YaCy runs best when it is alone on a machine, so I decided to run every YaCy instance in a dedicated FreeBSD VM using VMware Player on Windows Server 2008 (verrry old school). But it runs like hell… and… STABLE!! A crazy setup, but after 3 years of trial and error with YaCy, it is the first solution that did not fall apart within days :wink:

BTW: Every VM runs its own Unbound (DNS resolver) service.

This way each of the 4 YaCy VMs on my Windows 2008 host crawls ~1000 PPM (pages per minute), which is fine for now.


Orbiter,

there was no inconvenience at all.

Cheers
Markus

Very good!
Yes, the demo peer at https://yacy.searchlab.eu/ is also operating in an unusually responsive way after the update, whereas before it experienced the same hangs as observed here.

So I hope that was it. It looks like I should make a release, but I want to fix something else before doing that. Let’s observe the behavior for some days before considering this closed.

Orbiter,
please let me know as soon as I can test the new release.

Cheers
Markus

I just got the first hang of the YaCy GUI on a VM. Reason: a Java exception when deleting some domains.
I pressed the “engage deletion” button a second time after having deleted ~4k domains.

Is there a way to start the Jetty part separately? Does it make sense to wait until the crawl is done, or is it better to kill -9 the Java process and start over?

I am just compiling and installing the latest source from Git.

Now the crawler throws tons of these:

    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477)
    at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:179)
    at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:160)
    at org.apache.solr.client.solrj.SolrClient.deleteByQuery(SolrClient.java:895)
    at org.apache.solr.client.solrj.SolrClient.deleteByQuery(SolrClient.java:913)
    at net.yacy.cora.federate.solr.connector.SolrServerConnector.deleteByQuery(SolrServerConnector.java:183)
    at net.yacy.cora.federate.solr.connector.MirrorSolrConnector.deleteByQuery(MirrorSolrConnector.java:183)
    at net.yacy.search.index.Fulltext.deleteDomainWithConstraint(Fulltext.java:475)
    at net.yacy.search.index.Fulltext.deleteDomainErrors(Fulltext.java:456)
    at net.yacy.search.index.ErrorCache.removeHosts(ErrorCache.java:72)
    at net.yacy.crawler.CrawlStacker.enqueueEntries(CrawlStacker.java:210)
    ... 7 more

Hi @zooom, I tried a deletion and also did a crawl to reproduce the problem, without success. Neither caused an error here.
This looks a bit as if the crawler is still trying to do its work while the YaCy instance is shutting down. Sounds strange, but I have no explanation yet.

Hm. I can imagine that it’s a challenge to reproduce this kind of problem.

I feel that big crawls with domain lists > 10K trigger that problem. This makes it a bit complicated if you want to crawl > 500K domains :wink:

At the moment, IMHO there is no stable version of YaCy around.

Could you describe in detail what your latest stable environment is? OS version? Java version? YaCy version? Amount of RAM? Disk type?

E.g. I have 96 GB RAM (30 GB for YaCy), an attached local SAS disk array, and 24 processors.
I tried MS Windows Server 2008, FreeBSD (latest), and Debian (latest), everything both on VMs and on physical hardware.
I compiled the latest YaCy source (on FreeBSD), but it throws exceptions (see my post above).

For non-Java guys like me: How do I deploy a newly compiled version over an existing one?

My guess: tar xvf the fresh RELEASE/yacy…tar.gz into a yacy/ directory and then copy all the files over the existing installation, something like the sketch below?
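
In shell terms, my guess would look something like this (an illustrative sketch only, not an official upgrade procedure; the assumption is that DATA/ holds the index and configuration and must survive the update):

    # stop the running instance first (YaCy ships stopYACY.sh / startYACY.sh)
    /usr/local/yacy/stopYACY.sh
    # unpack the fresh release next to the old one (file name is a placeholder)
    tar xzf yacy_release.tar.gz -C /tmp          # creates /tmp/yacy/
    # copy the new files over the old installation, but keep DATA/
    rsync -a --exclude 'DATA' /tmp/yacy/ /usr/local/yacy/
    /usr/local/yacy/startYACY.sh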
Thanks in advance

Hello all
I was not active for a while. I found a workaround for this problem and was able to keep the UI responsive: I refresh the “status.html” page every 20 seconds (just a JavaScript reload), and it works pretty well. But this is not possible on the Linux systems where I am planning to use YaCy. I learned that the new version is available now; I will try it and get back to you.
Thanks
Vami

Cool idea. On Linux you can easily schedule a wget …/status.html > /dev/null every minute, e.g. via cron. I will test it.
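
For example, as a crontab entry (assuming a peer on its default port 8090, where the status page is Status.html):

    # fetch the status page once a minute to keep the UI responsive
    * * * * * wget -q -O /dev/null http://localhost:8090/Status.html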

@Orbiter

Thanks for making the latest yacy_v1.924_20201214_10042.tar.gz available.

The GUI still hangs temporarily when the crawler is under heavy load while starting a new crawl, i.e. in the phase where a huge list of e.g. 100K domains is loaded into the queue and the robots.txt files are fetched.

BTW: I don’t give a damn about robots.txt, but YaCy insists on loading them. Funnily enough, I guess this is exactly the problem here.

My Unbound process is then at 1450% CPU (!) while Java consumes 102%.

The browser’s error message is now a different one:

“The connection to the server was reset while the page was loading.”

But: after all start URLs (300K domains) had been loaded, and some hours later, the GUI answered again without a restart!!!

It now runs very stably, without any hangs or crashes!


This is connected to the fact that every crawler should identify itself with a well-known user agent. Because that user agent may be addressed in the robots.txt, the server can see whether the crawler behaves correctly. This requires that the robots.txt is loaded in any case.

However.

For quite some time it has been possible to activate a “googlebot” user agent when the peer is running in portal mode. This was necessary because many web pages now categorically deny all crawlers except the googlebot, which makes YaCy effectively unusable for anyone who wants to replace Google or the Google Search Appliance (while it still existed). But it is not acceptable that the market of search engines is restricted in such a way just because site administrators are not aware that they destroy a free search engine market. For this reason (and only this reason) I made the “googlebot” user agent available in YaCy. YaCy still behaves correctly according to robots.txt, but it now considers the rules for googlebot. YaCy also still follows the “don’t load too fast if the server responds too slowly” policy (called “flux” in YaCy). But the googlebot setting does break the “only 2 pages per second” rule (it becomes a 4-pages-per-second rule), which is simply a very conservative limit in YaCy; Google does not have that.
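
A typical robots.txt of the kind described above (a hypothetical example) looks like this:

    # everything is allowed for Googlebot ...
    User-agent: Googlebot
    Disallow:

    # ... while every other crawler, including YaCy's default user agent, is locked out
    User-agent: *
    Disallow: /

With the “googlebot” user agent selected, YaCy evaluates the first block instead of the catch-all block and can crawl such a site.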

The selection option for the user agent is now also available in p2p mode.

I agree that every crawler should identify itself, but that is theory. In practice, the situation is upside down.
The behavior of pages showing content only to Google leads to exactly the situation you describe further down.

Even worse… the trend is to use JavaScript to hide content. This forces crawlers to simulate users by using real browser engines. Funnily enough, Google itself offers tools like “Headless Chrome” for exactly that. Crap.

I prefer AutoHotkey (https://www.autohotkey.com/) scripts simulating users sitting at browsers of your choice, plus a grid of cheap laptops running VMs in VMware Player (8 per laptop). A cheap VPN provider like NordVPN is verrry helpful: change your IP every 5 minutes ;-). If you need scripts, contact me.

Nevertheless, the task of crawling > 300 million homepages is currently almost impossible with YaCy and its host-balancing brakes. Every domain in this task is crawled only once! There is no need to wait between domains.

The G*Bot trick may convince sites to deliver more content, but I do not think that servers deliver content faster when we pretend to be Google.

I will try the trick anyhow.

Cheers

P.S.: Why don’t we use 500 instead of 50 crawl slots and instead allow a timeout of e.g. 60 s for a page to load? BTW: Google prefers fast-loading sites, as they seem to be more commercial (less private junk on cheap infrastructure), and ranks them higher. A good idea, to be honest.

Hi,
Can you please share the automation scripts you have mentioned? I would like to try them.

Thanks

This is all reality. But I could never understand what makes site owners hide their content.
For what purpose is a site created, then? Isn’t it to show content?
It is just as written in the famous book by Lewis Carroll:

But I was thinking of a plan
To dye one’s whiskers green,
And always use so large a fan
That they could not be seen.
:smile: