The Story of YaCy Grid

It took a long time to get here … but finally, this story had to be told…

In 2015 YaCy had become a well-recognized and already mature search engine software. The platform was intended for private individuals, but there was also demand from professional users. Designed as peer-to-peer software, the architecture had some inherent design flaws:

  • no stable search: consecutive queries do not produce the same search results, neither on a single peer nor across different peers. If YaCy had stable search results, that would contradict its censorship resistance.
  • incompleteness: we distribute our search index to a set of remote peers and have no control over how long that index is stored. This causes poor recall.
  • lacking speed: peer-to-peer search is meta-search; the retrieval process is only as fast as the slowest peer. If we enforce a time-out, we lose information and increase incompleteness.

All these problems are acceptable if we insist on having a “freedom index”, but for professional use they must be solved. As the demand was there, I set out to find a good concept for a full redesign!

In late 2015 I met Ilya Kreymer in San Francisco (check out his repository, it’s amazing!). He was a former employee of the Internet Archive and worked on a free version of the Wayback Machine: openwayback. He created numerous other WARC-based tools, and this showed me that a redesign of YaCy should not only consist of separate components but also use standards as ‘glue’ between the parts of the software.
This convinced me that we should have a WARC middle layer in a new YaCy architecture and re-use all these amazing tools. The new YaCy architecture could, for example, have a crawler archive which looks like an internet archive of all crawled pages.

A Supercomputer Platform Architecture

So I designed slides to advertise a redesign of YaCy, at that time called “Kaskelix”.

These components would either be constructed from recycled YaCy code or consist of external, standardized software modules. This design also contained optional elements - like “Moderation” and “Add-On Content” - which are not obligatory for the whole construction but leave room for a commercial assignment.

To prove feasibility, I added the following to-do picture:

In January 2016, the OpenWebIndex initiative was started with a conference consisting mostly of members of the suma-ev. The idea was that the OWI would create a large search index but not a search interface. Users of the OWI would have to create their own interfaces, which means a comparison of the OWI with other search indexes could not happen on a user-experience level but only on scientific attributes.

YaCy Grid still existed only as a concept, but I was sure that my approach was best suited for the OWI. Unfortunately, it turned out that no software was ever developed for the OWI; the approach was purely political - at that time. We did not even reach the point where I was able to propose my architecture, which was most disappointing.

Implementation for a Business Partner

At the end of 2016, I actually found a business partner to implement some of the proposed components.
The architecture required an orchestration element which I called (ironically) MCP - “Master Connect (sic!) Program”. It also required a queueing mechanism which provided interfaces and scaling to the other grid elements.
The following modules had been implemented for YaCy Grid:

  • yacy_grid_mcp - grid orchestration. This software not only runs once in the grid as a daemon, but is also deeply integrated into the crawler, parser, loader and search elements as a git submodule. The MCP also includes an elasticsearch client and acts as the indexing client for the grid.
  • yacy_grid_crawler - crawler, which includes the crawl start servlet, host-oriented crawl balancing and filter logic for the crawl jobs.
  • yacy_grid_loader - loader and headless browser. As of today, many (maybe most) web pages no longer have static content; content is loaded dynamically. Because loading web content with a headless browser is very complex, this component must be scalable.
  • yacy_grid_parser - the YaCy Parser as it is implemented in “legacy” YaCy. We have an extremely rich metadata schema in YaCy and YaCy Grid inherits this schema.
  • yacy_grid_search - the query parser and search back-end API for search front-ends. In the fashion of stateless microservices, this component can be scaled up according to the load on the search front-end.
  • yacy_webclient_bootstrap - a demonstration search client that looks exactly like the built-in search front-end of legacy YaCy.

These parts must be combined with the following standard software:

  • elasticsearch - instead of Solr, YaCy Grid now uses elasticsearch as the search index
  • kibana - dashboards for monitoring
  • RabbitMQ - Queues for high-performance computing
  • an FTP server - storage for WARC and flat-index files

Creating a search front-end for YaCy was also part of Google Summer of Code within the FOSSASIA Community - this created the following component:

  • susper.com - a Google look-alike search interface

All together in one picture:

Going Online in a Cloud-Hosted Kubernetes Cluster

A huge YaCy Grid installation went online at the beginning of 2018: our partner who is running YaCy Grid for http://land.nrw hosts the YaCy Grid Docker containers in a cloud-based Kubernetes cluster:


This provides a search index for the public administration documents and web pages of all municipalities (more than 1000 cities and villages) in the state of NRW/Germany.

We can monitor crawling behavior with kibana:

The load status of crawl queues and the queues of other grid components can be monitored with the dashboard of RabbitMQ:

YaCy Grid: A Scalable Search Appliance

YaCy not only provides a rich, OpenSearch-based search API but also an implementation of the Google Search Appliance (GSA) XML API. That means YaCy Grid can be a drop-in replacement for existing GSA users. As Google has abandoned the GSA, its users should switch to YaCy Grid.
With YaCy Grid we finally achieved:

  • index stability - all search results for the same query are the same
  • completeness - we can find everything that was crawled
  • speed - this construction provides unlimited scaling for crawling, indexing and search.

The story is still going on:

  • more monitoring features and operational support (like re-crawling of failed loads) are currently being developed for YaCy Grid
  • we should develop a concept to integrate (or join) YaCy Grid with (the old) “legacy YaCy”.
  • The OpenWebIndex initiative has new people on board and we are currently trying to integrate one part (yacy_grid_parser) of YaCy Grid into the OWI architecture.
  • we need documentation. Creating a platform for documentation is now required…

To be continued…

If you like this story then you are invited to share your ideas in the comments! It is a huge challenge to join old and new YaCy components towards a better platform: would you like to contribute? What can be done by the community? Are you a professional user of the old YaCy software and would you like to switch to YaCy Grid?


Please share: https://www.reddit.com/r/YaCy/comments/c0z9sm/the_story_of_yacy_grid_development_of_a_large/

Yes, I am a professional user of the “old” yacy software and would like to switch to the new yacy grid.

What about data migration? My personal experience with YaCy is that there are two important parts to take care of: 1. start URLs/domains and 2. blacklists (“the opposite”). Both mean a lot of work and manual interaction, so I am working on a management system for these topics.

Another interesting topic is automatic content classification (e.g. part-of-speech tagging). This implies the knowledge and availability of reference data such as taxonomies and structured lists of named entities.

This is the second topic I am working on and I would be happy to contribute to the new yacy grid.

What is the best starting point? Can I set up a sandbox? Is the grid compatible with freeworld?

Best regards

Markus

Hi Markus, a lot of good questions…

Yes! There are several things to consider:

  • index migration: legacy YaCy runs on Solr and the RWI index, YaCy Grid runs on elasticsearch. You can export the Solr index into elasticsearch bulk format with the “Index Export” function in legacy YaCy. Have a look at that option; it explains the full process.
  • index startup migration: the crawler in YaCy Grid can be called with a curl command which accepts the same crawl start attributes as you see in the process scheduler. So instead of migrating the search index, you can restart indexing using the crawl start URLs. However, you must rewrite part of the start URL to the new path of the YaCy Grid Crawler component. I will explain the details later.
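To illustrate the second option, here is a minimal Python sketch of such a crawl start as an HTTP request. The servlet path and port below are assumptions for illustration only - check the yacy_grid_crawler README for the actual endpoint - while the attribute names are assumed to follow the legacy crawl start form:

```python
from urllib.parse import urlencode

# Hypothetical Grid Crawler endpoint - the real path and port may differ,
# see the yacy_grid_crawler documentation.
GRID_CRAWLER = "http://localhost:8300/yacy/grid/crawler/crawlStart.json"

def build_crawl_start(start_url, depth=3, must_not_match=""):
    """Build a crawl-start request URL from legacy-style crawl attributes."""
    params = {
        "crawlingURL": start_url,       # attribute names assumed to follow
        "crawlingDepth": depth,         # the legacy crawl start form
        "mustnotmatch": must_not_match,
    }
    return GRID_CRAWLER + "?" + urlencode(params)

url = build_crawl_start("https://example.org/", depth=2)
print(url)  # pass this URL to curl to trigger the crawl
```

The same request could of course be issued with a plain curl call; the point is that the legacy crawl start attributes map one-to-one onto query parameters.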

As explained above - the second option!

I dropped the blacklists, as their syntax was terrible and confusing all the time. To have blacklists in YaCy Grid, you must translate them into a (maybe very long) must-not-match list, to be used as a parameter in the crawl start.
There is no automatism for this; it’s just a concept.
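As a sketch of what such a translation could look like - the legacy blacklist syntax has more cases than this handles, so treat the host/path splitting and the ‘*’ wildcard handling here as simplifying assumptions:

```python
import re

def blacklist_to_mustnotmatch(entries):
    """Translate simple host/path blacklist entries into one
    must-not-match regular expression for a crawl start."""
    parts = []
    for entry in entries:
        host, _, path = entry.partition("/")
        # '*' is a wildcard in blacklist hosts; everything else is literal
        host_re = re.escape(host).replace(r"\*", ".*")
        path_re = path if path else ".*"  # the path part is already a regex
        parts.append(f"https?://{host_re}/{path_re}")
    return "|".join(parts)

pattern = blacklist_to_mustnotmatch(["ads.example.com/.*", "*.tracker.net/.*"])
print(pattern)
```

The resulting pattern can then be passed as the must-not-match parameter of the crawl start.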

I also dropped content classification for now. I left this out as a possible commercial or user-constructed function which can be applied to the parser result. The parser creates flat JSON files, the same as you get with the legacy YaCy export-to-elasticsearch files. What you must do is: parse the JSON, match the content against your vocabulary and write the classification back to the index dump file.
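A minimal sketch of that parse/match/write-back loop, assuming one JSON document per line in the dump; the field names (“text_t”, “tags_sxt”, “url_s”) are placeholders you would adjust to your actual schema:

```python
import json

# Toy vocabulary: term -> classification label (your own files go here)
VOCABULARY = {"museum": "culture", "stadium": "sports"}

def classify(doc):
    """Match the document text against the vocabulary and attach labels."""
    text = doc.get("text_t", "").lower()
    labels = {label for term, label in VOCABULARY.items() if term in text}
    if labels:
        doc["tags_sxt"] = sorted(labels)
    return doc

def classify_dump(lines):
    """Enrich a flat-JSON index dump, one JSON document per line."""
    return [json.dumps(classify(json.loads(line))) for line in lines]

out = classify_dump(['{"url_s": "http://x/", "text_t": "The city Museum"}'])
print(out[0])
```

A real classifier would of course use proper tokenization and your taxonomy files instead of substring matching, but the dump-in, dump-out shape stays the same.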

Yes, there is no general solution. Every user of a possible content classification brings their own files. We must see where we can go here with a community contribution. In fact, this would be a great project for community work.

Right now the only documentation is the readme at the project repositories. Start with https://github.com/yacy/yacy_grid_mcp/blob/development/README.md and then read the README of the other grid components.
I also want to write a nice manual, but only after we have migrated the home page of yacy.net into a new CMS with the ability to write longer documentation texts.
Or maybe earlier, as an enhancement of the READMEs.

Only in some special ways:

  • the code of the MCP and the parser is largely taken from legacy YaCy, while the crawler and loader are mostly rewritten. So there is compatibility at the code level, and this will be used to merge YaCy Grid code back into legacy YaCy
  • the Loader in YaCy Grid produces WARC files and legacy YaCy can import these with the surrogate framework (just put them into the surrogate/in/ path)
  • legacy YaCy can export into YaCy Grid dump index file format
  • the search API is identical - completely the same. A search front-end for legacy YaCy also fits YaCy Grid and vice versa. This also applies to the GSA (Google Search Appliance) XML API, which is also a search API in YaCy.
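To give an idea of the WARC hand-over in the second point, here is a sketch of a single WARC ‘response’ record written with the Python standard library only; a file of such records is what the Grid Loader produces and what the surrogate import consumes (field layout per the WARC/1.0 format, simplified):

```python
from datetime import datetime, timezone
from uuid import uuid4

def warc_response_record(url, payload: bytes) -> bytes:
    """Serialize one WARC/1.0 'response' record (WARC headers + HTTP payload)."""
    headers = "\r\n".join([
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Target-URI: {url}",
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Record-ID: <urn:uuid:{uuid4()}>",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(payload)}",
    ])
    # record = headers, blank line, payload, two trailing CRLF pairs
    return headers.encode() + b"\r\n\r\n" + payload + b"\r\n\r\n"

record = warc_response_record("http://example.org/", b"HTTP/1.1 200 OK\r\n\r\nhello")
print(record.decode())
```

In practice you would use a WARC library rather than hand-rolling records, but the format is simple enough that the interchange between the loader and the surrogate import is easy to inspect.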

Future functions which may enhance compatibility:

  • I am planning that a crawl start in legacy YaCy can be sent to YaCy Grid as a crawl start, but that is not implemented yet
  • I am planning a “connect to YaCy Grid” function in legacy YaCy to use a YaCy Grid search API as a metasearch element.
  • It may also be possible for the crawler in legacy YaCy to create WARC files and put them into the YaCy Grid indexing queue.

So let’s see where this leads. YaCy Grid is by definition NOT a p2p network, it is ‘just’ a massively scaling search engine tool set. Maybe we have enough experience here to form a collection of such Grid installations which could share data, like:

  • a common repository of crawled WARC files
  • a common repository of parsed WARC files
  • a common list of indexes

Please share your thoughts about that.

A Twitter user asked about YaCy Grid:

So the answer points to required action items here. There is not yet a fixed answer on how to connect legacy YaCy with YaCy Grid. But even if I don’t find time to create something here, anyone could connect the systems using their open APIs. Still, it would be good to have a concept.

  • The first thing we could easily do is have a kind of registry of YaCy Grid installations where YaCy Grid users can subscribe automatically - but voluntarily! Then we could have a meta-search over the Grid installations
  • Another thing could be that users join an existing YaCy Grid with their own Grid Loader & Grid Parser. That would also require the registry mentioned above.

Challenge:

One concept of YaCy Grid is to use simple standard software to implement parts of the platform. So what kind of ready-made URL registry would you suggest?

Do we need something like https://aws.amazon.com/pub-sub-messaging/

Yes, for that purpose RabbitMQ is part of the architecture.

First, thanks for all the important work on YaCy and congrats on the redesign.

I do think that the name “YaCy Grid” is a little confusing. The name implies something that is more P2P than original YaCy, not less, perhaps like a connected “grid” of instances with chosen P2P connection policies. It is more like “YaCy Platform” or “YaCy Engine” than a “grid”.

YaCy Grid refers to the actual definition of “Grid Computing” as you can find in Wikipedia: https://en.wikipedia.org/wiki/Grid_computing

Grid computing is the use of widely distributed computer resources to reach a common goal. A computing grid can be thought of as a distributed system with non-interactive workloads that involve a large number of files. Grid computing is distinguished from conventional high-performance computing systems such as cluster computing in that grid computers have each node set to perform a different task/application.

So YaCy Grid is exactly that: it’s distributed (not necessarily, but potentially, decentralized), the parts do not interact directly (they use a hub - a broker), and the components have different tasks (the grid components).

It’s not about more or less p2p, it’s just another computational approach.