YaCy and Kiwix Index

The Kiwix project helps people download and search content offline:

https://kiwix.org

One facility of the project lets users download a dump of Wikipedia together with an index, so the whole of Wikipedia can be stored and searched without an internet connection.

  1. Might YaCy be able to make use of the Kiwix indexes of Wikipedia (they are available in several languages)?

  2. Could these indexes reduce redundant crawls of Wikipedia among different peers?

  3. Apart from Wikipedia, Kiwix hosts several other content projects that might be useful for searches.

I am trying to crawl a local German Wikipedia and have indexed ~3 million pages and media files. A restore of the previously saved JSON dump (http://127.0.0.1:8090/IndexExport_p.html) didn't work, since the recommended command `curl -XPOST localhost:9200/collection1/yacy/_bulk --data-binary @yacy_dump_XXX.flatjson` (watch out for the wrong port 9200!) needs too much RAM. An upload with curl and other parameters, or with wget, didn't work for me either. A chunked import might get around the RAM limit; see the sketch below.
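Untested sketch: one way around the RAM limit might be to split the flatjson into smaller chunks and POST them one at a time. The chunk size here is arbitrary, and the endpoint is just the one from the recommended command:

```bash
# Split the dump into smaller chunks so each bulk POST stays within RAM.
# 20000 lines per chunk is an arbitrary choice; keep the count even,
# assuming the flatjson pairs an action line with a document line as in
# the elasticsearch bulk format.
split -l 20000 yacy_dump_XXX.flatjson chunk_

for f in chunk_*; do
  # Same endpoint as the recommended command (9200 = elasticsearch, not YaCy).
  curl -XPOST 'localhost:9200/collection1/yacy/_bulk' \
       -H 'Content-Type: application/x-ndjson' \
       --data-binary "@$f"
done
```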

The JSON file looked like it could be modified easily with sed or perl. If you can import the WP dump, export the index, modify it, and import it again, that could be a solution (see the sed sketch below).
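For example, a minimal sed sketch for stripping the date from the Kiwix URLs before re-importing. The dated path pattern (`wikipedia_de_all_2023-05`-style) and its replacement are assumptions, so inspect your own export first to see how the URLs are actually stored:

```bash
# Rewrite dated Kiwix paths to their date-free form; the pattern is an
# assumption -- check how URLs appear in your own dump before running.
sed -E 's#/wikipedia_de_all_[0-9]{4}-[0-9]{2}/#/wikipedia_de_all/#g' \
    yacy_dump_XXX.flatjson > yacy_dump_nodate.flatjson
```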

Did you notice the kiwix-serve parameter `-z, --nodatealiases` ("create URL aliases for each content by removing the date")? With it, each article gets a second URL without the date, so if you switch to a newer version of the Kiwix file, it may be possible to keep using the old index.
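For reference, a minimal kiwix-serve invocation with that flag (the ZIM file name is only an example):

```bash
# Serve a ZIM file with date-free URL aliases: each article is also
# reachable at a URL without the date, so the URLs survive a ZIM update.
kiwix-serve --port 8080 --nodatealiases wikipedia_de_all_maxi_2023-05.zim
```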
