Index Migration to Solr 8.8.1

LA_FORGE · 6 March 2021 19:46

Hi,

I had an issue bootstrapping YaCy v.1.925/10082. According to the log, the new Solr version doesn’t accept the “old” Index Data Files from Solr version 7.7.3. Do I need to dump the whole Index via the Export feature and put it in /yacy/DATA/SURROGATES/in on the new YaCy release or is there a smoother way for an Index transition to Solr 8.8.1?

Greetings

LA_FORGE

Orbiter · 7 March 2021 15:15

Yes, that code change was made in a bit of a hurry and should have come with better support for migration.
So far there is no migration. I am working on it.

LA_FORGE · 7 March 2021 18:56

Thank you very much. It would be great if the export and import step weren’t necessary, because the PCIe SSD I’m using for YaCy has “only” 70 petabytes of I/O lifespan (MTBF), and 30 terabytes of I/O per week isn’t a rarity when operating the peer with the largest index in Freeworld (I’m continuously recrawling very old index entries, too).

Orbiter · 8 March 2021 12:58

I just committed changes that bring YaCy’s latest git version into a state where it can start up again. There were many more dependency and configuration problems with Solr 8, but it works now.

Sad to say that there is currently no automatic migration of indexes… Here is what you must do:

Use YaCy in version 1.924 or earlier (not the latest git), e.g. https://download.yacy.net/yacy_v1.924_20201214_10042.tar.gz
Open Index Export/Import
Click the “Export” button (leave all other settings at their defaults; export as JSON)
Let it run; it will create a single (large) dump file in /DATA/EXPORT/

If you have already loaded a newer version, don’t worry: the original 6.6 index files are neither touched nor removed. Just move the DATA folder to an older YaCy version and the old index will still be there.

I will make sure that this dump can be imported with the Solr 8.8.1 version.
Sorry for the inconvenience. I searched for embedded Solr tools but could not find one that migrates from 6.6 to 8.8.

LA_FORGE · 8 March 2021 14:15

Thank you very much for your outstanding work!

Orbiter wrote: “Sad to say that there is currently no automatic migration of indexes”

No problem, I’m using an old SAS HDD with a good average access time for the dump and the restore. Is the JSON export method more powerful than the XML one?

Orbiter · 8 March 2021 16:52

The JSON method is the chosen alternative since I started with YaCy Grid, which also has a JSON dump format with almost the same attribute fields.

I just uploaded https://download.yacy.net/yacy_v1.924_20210209_10069.tar.gz as the “latest 1.924” before the commit to solr 8.8.1 happened - to be used for an export.

LA_FORGE · 8 March 2021 17:29

Great! Thanks

Cicero:index stefan$ ls -S -lah
total 8001171072
-rw-r--r--    1 user  staff   1.2T Nov 15 16:48 _1nkhb.fdt
-rw-r--r--    1 user  staff   381G Oct 23 06:47 _1nkhb_Lucene50_0.pos
-rw-r--r--    1 user  staff   260G Nov 15 09:41 _1nkhb_Lucene54_0.dvd

The whole Solr data directory is 3.6 TB. Well, that would take a few weeks, I guess.

Orbiter · 8 March 2021 18:35

Oh wow, that is not good. Looks like I should do some performance enhancements on the export process. But then there would be the need to backport that. Hm.

LA_FORGE · 8 March 2021 19:14

I could try to increase the process priority if you can tell me the filename of the Java class. This boosted the performance of another workflow (recrawling the whole index) remarkably.

Orbiter · 8 March 2021 20:45

Well, everything happens within net.yacy.cora.federate.solr.connector.EmbeddedSolrConnector, but how do you boost a single class?

LA_FORGE · 8 March 2021 21:11

Hello Michael,

I still have zero clue about programming. I continue to admire what you are doing.

I mean the thread, of course. I found this line in RecrawlBusyThread.java:

this.setPriority(Thread.MAX_PRIORITY);

Since I set that to MAX, 200 GB of traffic per week while recrawling is no longer a rarity. Can I influence the priority of the index export as well? If so, what is the name of the source file?

Best regards

Stefan
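
Editor's note: the line quoted above sets the priority of a thread, not of a class, which is why the question "how do you boost a single class?" has no direct answer. The following is a minimal sketch of the same call on a stand-alone thread. The class ExportPriorityDemo and the method newExportThread are hypothetical names for illustration and are not part of YaCy. Thread priorities are only a hint to the scheduler, and how much they matter depends on the JVM and the operating system, so any speed-up should be measured rather than assumed.

```java
public class ExportPriorityDemo {

    /**
     * Hypothetical stand-in for a long-running export job (not YaCy code).
     * Wraps the job in a thread and requests the highest priority for it.
     */
    static Thread newExportThread(Runnable job) {
        Thread t = new Thread(job, "export-demo");
        // Same call as in the quoted line: a scheduling hint, not a guarantee.
        t.setPriority(Thread.MAX_PRIORITY);
        return t;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t = newExportThread(() -> System.out.println("exporting..."));
        t.start();
        t.join(); // wait until the demo job has finished
        System.out.println("requested priority: " + t.getPriority());
    }
}
```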

Orbiter · 8 March 2021 23:05

That is actually not a bad idea. I have now put exactly this line into AbstractSolrConnector.java at line 371. I don’t know yet whether it helps…

LA_FORGE · 9 March 2021 14:25

Once I have finished exporting from my “special peer” (Endeavour; it has only 40 million entries, and I used it to crawl an open source intelligence platform), I will also make the change in the source code of 1.924/10042 and then use it to export the 206 million documents from the Epistemophilia peer. I have mounted a fast SCSI disk via AFP for the dump because I want to go easy on the PCIe SSD used for YaCy: it has “only” 70 PB until MTBF, and I have certainly burned through 20 PB already, because I always have so many crawl jobs and always set the depth to 5.

LA_FORGE · 10 March 2021 10:26

So, the export of the 206 million documents is running:

sh-3.2# iostat -d -n 3
          disk0               disk1               disk3
KB/t  tps  MB/s     KB/t  tps  MB/s     KB/t  tps  MB/s
51.05    4  0.21    41.25   27  1.08   900.86   50 43.94

disk3 is the hard disk for the dump. It runs quite fast considering that it is not an SSD. I also made the changes in the code as you described and recompiled. I am looking forward to the new version with Solr 8.8.1. Thank you very much for everything!

ian · 6 August 2021 08:26

@LA_FORGE how did your export go?

I had similar trouble migrating around 500 GB. (For anyone brave enough to give it a try: do not use the IndexUpgraderTool (see the Apache Solr Reference Guide 8.8.2) unless you are OK with missing a few fields, mostly dates, and with a few Java exceptions in your logs for these incompatible fields.)

The next thing I did was set up a separate Solr 7.7.1 instance and use that for the export, thinking that export speed would increase if Solr ran in a different process space. I also changed the YaCy code to export XML with lz4 compression instead of gzip in order to speed up the export. (I haven’t used JSON yet; let’s hope that the produced XML will work. I will report back in a few hours.)
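
Editor's note: lz4 needs a third-party library in Java. As a different, standard-library-only way to trade compression ratio for speed (a swapped-in technique, not what was done in this thread), gzip can be switched to its fastest deflate level. The sketch below assumes Java 9 or later; FastGzipDemo and its methods are hypothetical names and not part of YaCy, and whether this helps a real export would have to be measured.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class FastGzipDemo {

    /** GZIPOutputStream whose deflater is switched to the fastest level. */
    static final class FastGzipOutputStream extends GZIPOutputStream {
        FastGzipOutputStream(OutputStream out) throws IOException {
            super(out);
            // 'def' is the protected Deflater inherited from DeflaterOutputStream.
            def.setLevel(Deflater.BEST_SPEED);
        }
    }

    /** Compresses the text as UTF-8 with fast gzip settings. */
    static byte[] compress(String text) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (OutputStream gz = new FastGzipOutputStream(buf)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** Reads the gzip data back; any standard gzip reader can do this. */
    static String decompress(byte[] data) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<doc><field name=\"id\">example</field></doc>";
        byte[] packed = compress(xml);
        System.out.println("round trip ok: " + xml.equals(decompress(packed)));
    }
}
```

The output is still ordinary gzip, so an importer that reads gzip files needs no change; only the writer gives up some compression ratio.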

However, @Orbiter, I see the following during export (it behaves the same with gzip, lz4, JSON, or embedded Solr).
It runs fast for 1-2 hours and then the performance degrades:

The communication between yacy and solr over localhost is as follows:

I had a look at your export code in Fulltext.java, and especially at the BlockingQueue that you get from Solr, but I can’t see why it behaves like that. I don’t have much Solr experience, so it’s not obvious to me why Solr’s speed degrades over time during export. (The machine is a high-performance, dedicated YaCy machine with 64 GB of RAM, and nothing else is consuming resources.)
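
Editor's note: for readers unfamiliar with the pattern mentioned here, the following is a generic producer/consumer sketch built on a bounded BlockingQueue with an end-of-stream marker. It is not the code from Fulltext.java; QueueExportDemo, export and POISON are hypothetical names. In such a design the consumer can only go as fast as the producer fills the queue, so a slowdown on the producing side would show up as a slow export even when the writer itself is idle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class QueueExportDemo {

    // End-of-stream marker, compared by identity so it cannot clash with real data.
    private static final String POISON = new String("__END__");

    /** Feeds docs through a bounded queue and returns what the consumer received. */
    static List<String> export(List<String> docs, int capacity) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);

        // Producer: stands in for the side that fetches documents from the index.
        Thread producer = new Thread(() -> {
            try {
                for (String d : docs) {
                    queue.put(d); // blocks while the queue is full
                }
                queue.put(POISON); // signal that no more documents follow
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "producer");
        producer.start();

        // Consumer: stands in for the side that writes the dump file.
        List<String> written = new ArrayList<>();
        try {
            String doc;
            while ((doc = queue.take()) != POISON) { // blocks while the queue is empty
                written.add(doc);
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return written;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("doc-1", "doc-2", "doc-3");
        System.out.println(export(docs, 2));
    }
}
```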

Any ideas? If we manage to somehow improve the export speed, I can write a small migration tool to export from one solr and import to the other if anyone else finds this useful, since you froze your version at 1.924 and you can no longer publish changes.

ian · 8 August 2021 09:00

Just for the record, if anyone wants to do the same, compressed xml export and import works without any issues as well. I successfully migrated my indexes to the 8.8.1 version.

Orbiter · 8 August 2021 09:23

Thank you for the analysis. Right now I have no idea why the performance degrades. Maybe it is connected to caching in Solr: performance could degrade once Solr’s caches are all used up, after which it depends only on I/O. From then on the I/O performance stays the same; it looks a bit like that.