Crawlers slowly killing management UI accessibility

Hi,
I’m experimenting with YaCy. I installed it on a Debian 11 VM with 16GB RAM dedicated to the VM and started some crawlers. They ran fine, but I had to increase YaCy’s RAM setting to 4GB. Then I started another set of crawlers and ended up with a completely unresponsive UI – e.g. I can’t open the search page at all, nor any other management page. I wasn’t even able to run ./stopYACY.sh, so I had to hard-kill the process. Then I reconfigured using the provided script and gave YaCy even more RAM – IIRC 8GB – restarted, and after some time the management UI was unresponsive again. Kill again, increase RAM, restart.

To me it looks like the crawlers are completely overloading the JVM, and it is not able to do anything in the amount of RAM I’ve given it. It would be good if there were some RAM limit for crawlers, or if crawlers were independent processes altogether, so that they cannot eat and kill the main JVM process.

Yes, I know, I’m being greedy, assigning ~10 crawlers and setting their depth to 7. I just want to see how many links/docs I’m able to parse/index into the dedicated 1TB of SSD space… :slight_smile:

Thanks!
Karel

Looks like this is a more or less long-standing issue. It is also mentioned here: YACY Web interface hangs after some time. The problem is that one never knows how much RAM is enough for the UI to keep working without being killed by the crawlers – especially when the crawling process may be started remotely or automatically on a renewed scan.

YaCy has suffered from OOM and GC hell since its existence. The only remedy so far has been to further increase RAM.
Maybe my latest commit, addressing better GC behavior after removing Xms with earlier heap in… · yacy/yacy_search_server@294d56d · GitHub, prevents GC hell (which may be the cause of the UI locking) a bit.
This works only if the assigned memory is high enough.

Hi Orbiter,

hmm. Requiring ever more RAM on a system that is P2P, and that you want to motivate people to run, is not that fortunate IMHO. For example, I now have 16GB set for the JVM, which itself consumes 45GB vsize and 10GB rsize. I don’t think this is acceptable for a common user machine, which is the target for YaCy usage, is it?
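For anyone who wants to reproduce these numbers, here is a minimal sketch of how the vsize/rsize figures can be read out of ps. The process-matching pattern in the comment is only a guess; the current shell is used as a stand-in PID so the snippet is self-contained:

```shell
# Compare the JVM's configured heap with what the OS actually sees.
# Using the current shell ($$) as a stand-in PID; for YaCy you might use
# something like: pid=$(pgrep -f yacy | head -n 1)   (pattern is a guess)
pid=$$
vsz=$(ps -o vsz= -p "$pid" | tr -d ' ')   # virtual size in KiB
rss=$(ps -o rss= -p "$pid" | tr -d ' ')   # resident size in KiB
echo "vsize=${vsz}KiB rsize=${rss}KiB"
```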

So I think the original YaCy architecture design is not the most fortunate, and it would be good to divide it a bit into separate running processes:

  • UI/admin/search process
  • all crawlers process

This way the RAM limits could be set independently, and crawlers suffering from GC/OOM hell would not kill the search UI.

What do you think about it?

Thanks,
Karel

Things that might help with UI responsiveness, according to my limited experience:

  • Disable the embedded Solr and use a separate one
  • Put the index on a different drive

By the way what virtualization are you using?

  • Is it possible to use an external Solr? I will need to investigate that, since it would indeed mean that part of the load could be off-loaded to a different process.

  • The whole YaCy installation runs on a separate drive (of the VM). Also, the VM is not IO-starved at all. The whole VM is backed by a single SATA SSD (Crucial MX500 2TB). But again, there is no IO load worth discussing…

VBox. I know it is not the most performant solution, but it works well and is used for all company work.

Anyway, the VM now has 64GB RAM allocated, the JVM has 49GB set, and it is currently running at 83GB vsize and 24GB rsize. It looks like this is enough for the crawler load I put on it. The UI/management is more or less responsive, although slow.

BTW, is it possible to use external crawlers? It would be interesting to know what the interface between the crawlers and the indexing engine looks like. So far I’ve just seen the crawler queue full of *.stack files, but due to limited time I have not had a chance to fully understand the file format by investigating the source code.

Thanks! Karel

Hi Karel,

a) Yes, you can use one (or more) external Solr cores from /IndexFederated_p.html
Documentation is here: Dev:Solr – YaCyWiki

I use YaCy for linguistic research and I am still experimenting with various setups. I am a very new user, so others may have better ideas, but I saw a great improvement by disabling the embedded Solr and setting up a stand-alone one on the same machine. Setting up a Solr server proved to be very easy (under 10 minutes of reading and setup), but since I do not have much experience, it might need some fine-tuning later for your needs.

Please check which Solr version your version of YaCy uses, and use the same one.

b) YaCy is a crawler, and Solr is the core search engine. Both can produce very heavy IO. The embedded Solr index is located under yacy/DATA/INDEX/[network]/SEGMENTS/solr_version, e.g. yacy/DATA/INDEX/webportal/SEGMENTS/solr_8_8_1/

By having the index on a separate drive, you can fully parallelize the IO of the index and the crawler, since they are on different drives. You can do this by shutting down YaCy, moving solr_x_x to a different drive, and creating a symbolic link in its place (ln -s /new/full/path/to/solr_x_x) so that YaCy can still find it.
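The move-and-symlink step can be sketched like this. The paths are purely illustrative – a throwaway temp directory stands in for the real DATA folder and the second drive, and in real use you would stop YaCy first:

```shell
# Demonstrate the move-and-symlink pattern on throwaway directories.
# In real use, SRC is the embedded Solr index and DST a mount point on
# the faster drive; stop YaCy before moving anything.
tmp=$(mktemp -d)
SRC="$tmp/DATA/INDEX/freeworld/SEGMENTS/solr_8_8_1"
DST="$tmp/otherdrive/solr_8_8_1"
mkdir -p "$SRC" "$tmp/otherdrive"

mv "$SRC" "$DST"      # move the index to the other drive
ln -s "$DST" "$SRC"   # leave a symlink where YaCy expects the index
readlink "$SRC"       # shows the new location
```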

Bear in mind that a drive can have large bandwidth but also high I/O latency under heavy load, which can greatly affect performance in heavy operations such as large indexes and crawling. Even on SSD drives, latency can be a bottleneck in some situations.

VirtualBox is a good virtualizer for desktop use, but I would suggest qemu-KVM with virtio drivers for this kind of workload, in order to avoid SATA emulation. (I am not sure whether virtio is supported by VirtualBox yet, but it is worth a look.)

Use tools like iotop, htop, sysstat and the like to check for your bottlenecks, and watch your IO wait.

Regarding memory, if that is your bottleneck: YaCy in general is a well-written architecture as far as I have seen, but crawling and indexing are both heavy operations and the internet is vast. I come from a C++ world but write C# for a living, and I don’t want to start a religious war here, but GC can greatly affect performance in heavy-usage scenarios like gaming or crawling millions of sites. I see that YaCy, although well written, is affected by this, and I don’t think it is YaCy’s or Solr’s fault. You might need to fine-tune the Java GC parameters according to your needs. I haven’t tested Orbiter’s latest fix for the startup parameters yet.

I’m testing with JDK 11 now, because it has more advanced GCs than JDK 8. However, JDK 11 is much stricter in honoring the -Xmx/-Xms settings, so I had to lower the heap from 59g to 39g, which is just about manageable in a 64GB VM.
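For reference, a hedged sketch of where this heap setting lives in a YaCy install. The javastart_* key names here are from memory and should be verified against your own DATA/SETTINGS/yacy.conf before relying on them:

```
# DATA/SETTINGS/yacy.conf (key names from memory; verify in your install)
javastart_Xmx=Xmx39g
javastart_Xms=Xms4g
```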

Now, when YaCy is started, it processes the whole crawler queue – probably loading it into RAM – and on JDK 11 I see the following exception:

I 2021/08/06 21:15:54 HostQueue opened HostQueue /yacy/data/stable/yacy/DATA/INDEX/freeworld/QUEUES/CrawlerCoreStacks/www.sprachenzentrum.uni-trier.de-#8wNMhB.80 with 1 urls.
W 2021/08/06 21:15:54 ConcurrentLog java.lang.NullPointerException
java.lang.NullPointerException
	at java.base/java.util.TreeMap.rotateLeft(TreeMap.java:2221)
	at java.base/java.util.TreeMap.fixAfterInsertion(TreeMap.java:2268)
	at java.base/java.util.TreeMap.put(TreeMap.java:580)
	at net.yacy.kelondro.table.Table.<init>(Table.java:273)
	at net.yacy.kelondro.index.OnDemandOpenFileIndex.getIndex(OnDemandOpenFileIndex.java:61)
	at net.yacy.kelondro.index.OnDemandOpenFileIndex.has(OnDemandOpenFileIndex.java:190)
	at net.yacy.kelondro.index.BufferedObjectIndex.has(BufferedObjectIndex.java:179)
	at net.yacy.crawler.HostQueue.has(HostQueue.java:397)
	at net.yacy.crawler.HostBalancer.has(HostBalancer.java:244)
	at net.yacy.crawler.HostBalancer.push(HostBalancer.java:284)
	at net.yacy.crawler.data.NoticedURL.push(NoticedURL.java:193)
	at net.yacy.crawler.CrawlStacker.stackCrawl(CrawlStacker.java:401)
	at net.yacy.crawler.CrawlStacker.process(CrawlStacker.java:140)
	at net.yacy.crawler.CrawlStacker.process(CrawlStacker.java:65)
	at net.yacy.kelondro.workflow.InstantBlockingThread.job(InstantBlockingThread.java:72)
	at net.yacy.kelondro.workflow.AbstractBlockingThread.run(AbstractBlockingThread.java:82)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
I 2021/08/06 21:15:54 HostQueue opened HostQueue /yacy/data/stable/yacy/DATA/INDEX/freeworld/QUEUES/CrawlerCoreStacks/motel-6-savannah-richmond-hill.booked.net-#3ll-5w.443 with 1 urls.
I 2021/08/06 21:15:54 LOADER CRAWLER Redirection detected ('HTTP/1.1 301 Moved Permanently') for URL http://codegolf.meta.stackexchange.com/robots.txt
I 2021/08/06 21:15:54 LOADER CRAWLER ..Redirecting request to: https://codegolf.meta.stackexchange.com/robots.txt
I 2021/08/06 21:15:54 LOADER CRAWLER Redirection detected ('HTTP/1.1 301 Moved Permanently') for URL http://codegolf.stackexchange.com/robots.txt
I 2021/08/06 21:15:54 LOADER CRAWLER ..Redirecting request to: https://codegolf.stackexchange.com/robots.txt
I 2021/08/06 21:15:55 org.apache.solr.update.DirectUpdateHandler2 start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=true,prepareCommit=false}
I 2021/08/06 21:15:55 REJECTED https://ghc.haskell.org/robots.txt - no response body (http return code = 404)
I 2021/08/06 21:15:55 org.apache.solr.update.DirectUpdateHandler2 end_commit_flush
I 2021/08/06 21:15:55 org.apache.solr.core.QuerySenderListener QuerySenderListener sending requests to Searcher@1fee20a2[collection1] main

and it looks like my UI is dead again.

Hi Ian,

thanks for the hint about Solr, I will probably test it, although the current issue seems to be related to the crawler host queue being loaded into RAM as a whole. Imagine I have more than 1.5 million items in the crawler queue – and that was while the UI was still alive, so it may be even more by now.

W.r.t. the architecture, whether it is good or bad: this decision should be driven by the target audience’s usage and expectations. I think we are still living in a world where the majority of computers have just a handful of GB of RAM, and only a small part of that can be donated to keep YaCy running. So, can we keep YaCy running in 1–2GB of RAM? I think the answer is no, since a user can very easily kill it completely by crawling enthusiastically – which is exactly what we want: users really using the system and crawling every hidden corner of the internet as much as possible. So, enough philosophy; from this point of view the YaCy architecture is not good.

W.r.t. the IO load on my VM: you are shooting at completely the wrong target here. IO load is not the issue at all; bug(s) in YaCy are. BTW: the VM is backed by ZFS, I monitor its usage (zpool iostat 1), and there is nothing significant happening there. Reads/writes of a few MB/s are nothing that would put the MX500/SATA into a panic, and the number of IOs is low too. Yes, KVM is better, but in the past I tested both, and KVM and VBox cannot run together; since the absolute majority of my VMs are VBox, I am dragged along by historical decisions. Besides, the difference in IO or memory-mapping speed between the two is just a few percent in this light-usage scenario. BTW: in the worst case, Linux/ZFS buffers as much as possible. The VM has 64GB, the host 128GB, with around 50GB still available for Linux/ZFS buffers. If I run du on the YaCy data on the VM drives, it is only around 50GB, so in the best-optimization case the whole of YaCy could basically run from RAM…

W.r.t. YaCy bugs: there is probably a bug somewhere in the UI code that does not check/catch the OOM exception, and when it happens, the UI stops working. This way an overloaded crawler queue very easily kills the UI. It may be, and probably is, completely unrelated to the JDK 11 exception I reported above.

Thanks,
Karel

Just for the record, the NPE reported above also seems to kill any further crawling, since from that point on my YaCy instance logs only remote-peer communication (more or less) and nothing more.

Hi Karel,

My setup is Debian 10 with OpenJDK Runtime Environment (build 11.0.12+7-post-Debian-2deb10u1) on a bare-metal machine with an AMD Ryzen 5 3600 and 64GB RAM, and the index is on a RAID0 NVMe array. When I used the embedded Solr, I ran YaCy with -Xmx42000m -Xms42000m, so 42GB of RAM was dedicated to YaCy, and I had very few GC events.

I used YaCy v1.924/10069 until very recently and am now migrating my index to 1.925/10120 with Solr 8, because of this fix – enabled crawl starts with very large sets of start urls · yacy/yacy_search_server@e81b770 · GitHub – which had caused me delays when adding large bulks to the queue, plus another performance fix that affected me. I don’t know whether this is related to your issues as well.
Can you give the latest YaCy version a try? My first impression after some tests is that it is a faster experience in general, but I am not fully using it yet as I am in the process of migrating.

My index is ~450GB with around 20 million pages for the time being, and my queue had around 8–9 million entries, but I think only around 1–2GB of RAM was reported as dedicated to queue usage. I cannot verify this right now, but I didn’t have any delay issues.

The null-pointer exception you are getting might be a bug or data corruption in the queue, but I can confirm that I didn’t see anything similar with OpenJDK 11.

I had some hangs similar to the ones you are describing when I ran on an enterprise HDD; they were solved after moving to SSD and NVMe. This is why I suggested checking your I/O latency.

As a very new user of YaCy, using it for a domain-specific purpose, I cannot comment yet on memory usage and architecture in tight-memory scenarios. However, since I am an optimization psycho, I will have a look at the crawling-queue code and do some tests in low-memory scenarios when I have time.

Hi Ian,

you have a nice setup indeed. Your numbers made me think a bit, so I stopped YaCy, increased the VM RAM size and started it again – got to the point of ~2.3 million items in the local crawler queue, where the activity stabilized and YaCy seemed to start crawling and indexing again – and measured RAM consumption:

  • openjdk 8 hotspot : 23-24GB rss
  • openjdk 8 openj9 : 19GB rss
  • openjdk 11 hotspot : 12GB rss – however, it climbs slowly; I also kept it running longer than the previous ones.

So as you can see, quite different numbers, but this was a completely unscientific comparison. For a scientific one I would need to start from exactly the same DATA state, which I don’t: I just start, wait until the queue is loaded, watch the logs, measure, and stop – in 2 loops for every VM.

I am pretty sure you already know this, but just in case: memory will sooner or later climb up to the value stated in the -Xmx Java parameter. -Xms is used to avoid allocations and excessive GC runs. However, the memory YaCy actually needs can be seen on the /PerformanceMemory_p.html page, and only after a full GC has run.

In garbage-collected languages there are trade-offs between how much memory is allocated by the system (what you see in rss) and how often the GC runs, which is usually a performance penalty. There are allocation strategies, like the one Orbiter mentioned in his post: addressing better GC behavior after removing Xms with earlier heap in… · yacy/yacy_search_server@294d56d · GitHub

However, generally speaking, a language’s VM runtime rarely returns free memory to the system, in order to avoid (expensive) reallocations later.

I am not well informed about the Java GC parameters that control JVM behavior and what kinds of sacrifices they make. In my scenario I just preallocated a lot of memory to avoid excessive GC, because I care more about performance and less about memory.

From my experience with other GC languages – and since I do not like the idea of garbage collection at all – I don’t believe there is one ideal GC configuration that covers all use cases and needs, so maybe you should also have a look at the GC configuration parameters of OpenJDK 11, if you haven’t already.
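As a hedged illustration, the knobs in question look roughly like this. These are standard OpenJDK 11 flags, but the values shown are arbitrary examples, not recommendations tuned for YaCy:

```
# G1 (the default collector in OpenJDK 11): trade throughput for pause time
-XX:+UseG1GC -XX:MaxGCPauseMillis=200

# ZGC is available in JDK 11 only behind the experimental-options gate
-XX:+UnlockExperimentalVMOptions -XX:+UseZGC

# Log GC activity to see what the collector is actually doing
-Xlog:gc
```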

Ian,

it is not so simple with GC. :slight_smile: A good implementation uncommits allocated memory. Also, for example, right after my post mentioning OpenJDK 11 staying at 12GB and rising slowly, I noticed it shot up to 40GB in just a few seconds. JDK 8, both HotSpot and OpenJ9, is more stable in this respect.

No, I do not use YaCy’s Performance page, since it is a poor man’s solution compared with the monitoring provided by jconsole. Just try it and see for yourself.
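For anyone wanting to reproduce this kind of monitoring: jconsole can attach to a local JVM directly, or remotely if the JVM is started with the standard JMX system properties. The port number here is an arbitrary example, and this authentication-free setup is only acceptable on a trusted network:

```
# Add to the YaCy JVM startup arguments (example port; never expose publicly)
-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=9010
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false

# Then, from another machine:
jconsole <vm-host>:9010
```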

GC or not GC, that is indeed the question, but if I may choose, I’d rather write Haskell than Rust. Java? Not if I can help it – I’ve written it for 20 years already. :slight_smile: – but honestly, ZGC in OpenJDK 16 is something I like…

No, it’s not simple. This is why I do not like it: not because it is complicated, but because it tries to hide the complexity of resource management under a (non-deterministic) abstraction.
And, generally speaking, when you do that in software engineering, you usually end up with more complexity than you had in the first place.

And this is not about language syntax; i.e., I don’t see why C# or Java could not have RAII and optional GC. They already have destructors and scopes. Instead, such languages ended up with excessive use of things like the finally keyword and the Disposable pattern, which is not ideal.

But at the end of the day, how relaxed you want to be when writing your code always depends on what you are trying to build. I wouldn’t write an HFT system in Java, and I wouldn’t write a website in C++ (at least not yet :slight_smile: )

And this is where the conversation becomes religious :slight_smile:

I will check jconsole, thanks. However, I think you should try the latest GitHub version of YaCy and see whether the latest performance fixes and the changed startYACY.sh Java parameters solve your memory issues.