Very Large YaCy Folder

After a new installation of YaCy, we ran a few crawls on some publications. This rapidly resulted in a very large folder (over 30GB).

YaCy is engineered so that it doesn't usually need a lot of storage. Why would about 20 crawls produce such a large folder?

How big is your index, i.e. how many sites are indexed? And do you cache the sites?


The index was about 32GB. I was using the default settings and ran crawls on some quite large sites, about 10 or more in total.

I wouldn't mind caching the results if they could be of use to other YaCy users. I could see some improvement in my search results after those few crawls, which were still in progress when the allocated space filled up.

By the way, if I want to reinstall YaCy on a larger drive, is it as easy as copying a folder into the new installation, so I can avoid crawling everything again?
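If I understand the layout correctly, YaCy keeps its index and settings under its DATA folder, so a migration would be roughly the following sketch. The paths here are examples, not my actual install locations:

```shell
# Rough sketch of moving a YaCy instance to a larger drive.
# Assumption: everything YaCy needs to keep (index, settings) lives in DATA.
# /old-drive and /new-drive are placeholder paths.

# 1. Stop the running instance first, so no index files are mid-write.
/old-drive/yacy/stopYACY.sh        # stopYACY.bat on Windows

# 2. Install a fresh YaCy on the larger drive, then replace its DATA
#    folder with the old one (cp -a keeps timestamps and permissions).
cp -a /old-drive/yacy/DATA /new-drive/yacy/

# 3. Start the new instance; it should pick up the copied index.
/new-drive/yacy/startYACY.sh
```

Could someone confirm whether copying DATA like this is all that's needed?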


TW1920 - thank you for asking.

I don't know if we are caching the websites, but if that is the default setting, then we probably are. What are the benefits of caching the crawled websites? If we turned caching off, would the cache be cleared automatically? And could the cache benefit others before we delete it?

This is from crawling about 12 sites to a depth of 4.

Looking at the YaCy folder there are a few gigantic files:

\YaCy\DATA\INDEX\freeworld\SEGMENTS\solr_6_6\collection1\data\index
\YaCy\DATA\INDEX\freeworld\SEGMENTS\default\text.index.20200703184026253.blob
\YaCy\DATA\INDEX\freeworld\SEGMENTS\default\text.index.20200704095302555.blob
And there is a large folder with lots of files:
\YaCy\DATA\HTCACHE\file.array
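To see which parts of the DATA folder are actually taking the space, a quick disk-usage check works (this is a generic Linux/macOS one-liner, not a YaCy command; on Windows, where the paths above are from, folder Properties in Explorer gives the same information; `YACY_DATA` is an example path, point it at your own installation):

```shell
# List the ten largest entries under the YaCy DATA directory,
# human-readable sizes, biggest first.
YACY_DATA=./YaCy/DATA
du -sh "$YACY_DATA"/* 2>/dev/null | sort -hr | head -n 10
```

In my case that is how I found HTCACHE and the SEGMENTS blobs listed above.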

It is all over 300GB now, after just a couple of days of crawling.
This sort of storage usage might discourage new users. If the cache is the problem, perhaps the default setting should be not to cache.