YaCy as a news search engine

okybaca · 5 December 2021 11:22

Hi,
I’m trying to use YaCy as a news media search engine. Over a year or so I have indexed around 14M pages, mainly world news, on a dedicated server. Fresh news from several hundred global and local news sites is acquired every few hours from an RSS aggregator and distributed over the P2P network, and the most important media are crawled in full on a semi-regular basis.

While still in a testing phase, I must say that I love the YaCy concept, mainly the P2P design and decentralisation, and I wish it could supersede Google one day. YaCy is the only P2P search engine I have found usable, and I consider a decentralised search engine crucial for freedom of speech and for weakening the information dominance of the biggest companies. But the implementation is still quite suboptimal (an impressive job so far, all the same!), and after a year of testing the search results are still very unsatisfying for me, sadly still short of real usability.

My main issues and questions are the following:

slow search results
This is the main obstacle to using YaCy. Even in a local peer search, it takes tens to hundreds of seconds to display the results. Sometimes, when a keyword is searched for the first time, it displays zero results, and only after pressing the search button again (and waiting another few tens of seconds) does it finally display some results. For a casual first-time visitor of a service, this is really discouraging: after a long wait, the search returns no results. “I’ll never use this search again!”
Crawling of course slows the search down even more. Is it possible to somehow prioritize search over other YaCy tasks, since search is the most important function and its results should come as fast as possible?
Are there any ways to speed up the search, or room for optimisations in the code?
Would a separate instance, used only as a search front-end, help? Is it possible for such an instance to use just one (“local”) instance (not the whole P2P network) as a backend? Would using some other front-end (e.g. SearX) help, in your experience? I’ve seen some peer names suggesting that some people use separate instances for separate tasks. Any experience with that?
(I have around 14M URLs, a dedicated machine with 4 CPUs, 6 GB RAM reserved for YaCy, and YaCy version 1.924/10069.)
better page structure understanding
News media pages typically contain a lot of links to “related” and “latest” articles besides the main news text. Those links carry a lot of text unrelated to the actual article topic, and in a keyword search, pages that match only through such links often clutter the results page. I believe a better understanding of page structure would help. I have already boosted the H1 relevance in the ranking, which helped a bit, and I already use the webgraph as well. Is it possible to somehow distinguish the “main” text on the page from the clutter around it? Would parsing, for example, Open Graph (og) or Dublin Core metadata help? Other search engines manage this somehow, so it must be possible. How do they do it? Is it possible to customize the parser to extract some new fields and use them in Solr? How?
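To illustrate the kind of separation meant here, below is a minimal Python sketch of a link-density heuristic: the page is split into blocks, and blocks that are short or consist mostly of link text are dropped. This is not how YaCy's parser works; the names `BlockCollector` and `main_text`, the tag lists, and both thresholds are assumptions chosen for illustration only.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries and tags whose content is ignored (assumed lists).
BLOCK_TAGS = {"p", "li", "div", "td", "h1", "h2", "h3", "article", "section"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and counts link characters per block."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # (text, characters inside <a>) for each block
        self._parts = []
        self._link_chars = 0
        self._in_link = 0
        self._skipping = 0

    def flush(self):
        # Close the current block, normalising whitespace.
        text = " ".join("".join(self._parts).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._parts, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skipping += 1
        elif tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skipping:
            self._skipping -= 1
        elif tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        if self._skipping:
            return
        self._parts.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())


def main_text(html, max_link_density=0.3, min_chars=40):
    """Keep blocks that are long enough and not dominated by link text."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    collector.flush()
    kept = [text for text, link_chars in collector.blocks
            if len(text) >= min_chars and link_chars / len(text) <= max_link_density]
    return "\n".join(kept)
```

Only the text returned by something like `main_text` would then be indexed as the article body, while the dropped blocks could still feed the link graph. Real pages would need tuning of the thresholds per site.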
date published problem
The /date search uses the “date indexed” to sort the results. If I crawl a huge news site, all pages, even the really historical ones (the NY Times has archives dating back to the 19th century, for example), are dated “today”. Would it be possible to apply some heuristic to find the real publication date, probably using some combination of metadata? Again, other search engines manage this somehow. Is it possible to switch the /date operator to use the HTTP Last-Modified date indexed in Solr instead?
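As an illustration of such a heuristic, here is a minimal Python sketch that looks for a publication date in common metadata and falls back to the HTTP Last-Modified header. It is not YaCy code: the function name `guess_published`, the list of metadata keys, and their priority order are assumptions, and only ISO 8601 date values are handled.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from html.parser import HTMLParser

# Metadata keys checked in priority order (an assumed ordering, not YaCy's).
DATE_KEYS = ["article:published_time", "dc.date", "dcterms.created", "date"]


class MetaDateParser(HTMLParser):
    """Collects the content of <meta> tags whose name/property is a date key."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        key = (a.get("property") or a.get("name") or "").lower()
        if key in DATE_KEYS and a.get("content"):
            self.found.setdefault(key, a["content"])


def guess_published(html, last_modified=None):
    """Return a datetime from page metadata, else from Last-Modified, else None."""
    parser = MetaDateParser()
    parser.feed(html)
    for key in DATE_KEYS:
        value = parser.found.get(key)
        if value:
            try:
                # Accept a trailing "Z" as UTC; other formats are skipped.
                return datetime.fromisoformat(value.replace("Z", "+00:00"))
            except ValueError:
                continue
    if last_modified:
        try:
            return parsedate_to_datetime(last_modified)
        except (TypeError, ValueError):
            return None
    return None
```

With a result like this stored in its own index field, a date operator could sort on the guessed publication date and use the indexing date only when nothing better is available.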
logical operators in the search field
In the search field of a portal, logical operators such as OR and AND are discarded as stopwords. Is it possible to use them somehow? Maybe it is, but the interface neither describes nor encourages their use. The search does not really “understand” what I need to find the way other search engines do, which may be an advantage, since I can construct the query more precisely, without the fuzziness of “surely you meant…”, but then I need the logical operators.
removing huge sites from queue
If the crawler/loader queue hits some huge site with several million pages and they are all enqueued, terminating the crawl takes several painful hours of disk-intensive operation. Stopping YaCy and removing the host directory from the queue manually takes just a few tens of seconds. Is it possible for YaCy to do the same for me on demand, without restarting?
starting a site crawl is unreliable
Adding a new task with the Advanced Crawler is unpredictable. If the instance is already crawling, sometimes the new job is just added to the Crawler Monitor and never actually executed, even after the other jobs have completed. My usual procedure is to pause the crawl, wait, then add a job; then it usually works, but it’s inconvenient.
searches from other hosts and RWI transfer predictability
This is just out of curiosity: sometimes other hosts send me queries, sometimes not, even when the machine is under low load. What criterion is used for distributing queries? Similarly, since I want to distribute RWIs over the network: sometimes quite a lot of URLs are sent in a batch (like 900 for every peer), sometimes just a few (like 10 per peer). How does the host decide how many RWIs to bundle? Furthermore, I’m seldom connected to more than 50 peers. Is there a limit? Again, how is it set?
instability
YaCy still needs quite a lot of babysitting. Sometimes it just gets stuck or slows down and works better after a restart. Sometimes it runs out of memory (6 GB is plenty, isn’t it? Or what is the optimal amount?) and does nothing other than print out-of-memory messages in the log. Sometimes it just crashes and exits. It is definitely not a start-and-forget service yet. What would help? Are there any stability issues that need to be solved in the code? What are the main problems? I found that Solr optimization sometimes helps massively. I use the default maximum of 10 segments. What are the optimal settings, and what do they depend on?
Czech & Slovak stemming
Since I search texts in these inflected languages, I need stemming. I suppose I can do this easily myself with a synonym dictionary, which I plan to do and contribute.
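As a rough illustration of how such a dictionary could be generated, the Python sketch below groups inflected forms under their base form and emits one comma-separated line per group. The input mapping would have to come from a morphological resource for Czech or Slovak; the function name `synonym_lines` and the one-group-per-line, comma-separated output format are assumptions that should be checked against what YaCy actually expects for synonym files.

```python
from collections import defaultdict


def synonym_lines(form_to_lemma):
    """Group word forms by lemma and return one comma-separated line per group."""
    groups = defaultdict(set)
    for form, lemma in form_to_lemma.items():
        lemma = lemma.lower()
        groups[lemma].update({form.lower(), lemma})
    # Lemma first, then the remaining forms in sorted order (assumed layout).
    return [",".join([lemma] + sorted(forms - {lemma}))
            for lemma, forms in sorted(groups.items())]


# Hypothetical sample data: a few Czech noun forms mapped to their base form.
forms = {
    "zprávy": "zpráva",
    "zprávu": "zpráva",
    "zprávou": "zpráva",
    "vlády": "vláda",
}
for line in synonym_lines(forms):
    print(line)
```

A dictionary built this way only approximates stemming: it covers exactly the forms listed in the input and nothing else.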
further development and YaCy Grid
What is the size of the YaCy developer community? How many active developers are there? I understand that Orbiter started the project and built the whole amazing system; are other developers involved?
I’m also not sure whether YaCy Grid is meant to supersede ‘legacy’ YaCy, or whether any further development of ‘old’ YaCy is planned.
In other words, the question is whether the old YaCy is worth the time investment or whether it is meant to be abandoned and replaced by the Grid. Will the Grid version have a P2P component as well, then? P2P compatibility between the Grid and ‘legacy’ would keep the whole network running.

Thank you for this amazing piece of software. I always marvel at how many functions and gadgets it has; it must have taken a lot of work and passion. I admire that!