YaCy Crawl Depth Question

There is a Document A with a URL on it that leads to Document B. Document B also has a URL, leading to Document C, which in turn has a URL leading to Document D, etc.

In YaCy, if the Crawl Depth is set to 0, am I correct in thinking that if I start a crawl with Document A’s URL as the parameter, the indexing will only reach as far as Document A?

If Crawl Depth is 1, Documents A and B will be indexed, but not C?
Crawl Depth 2 would index A, B and C, etc.?


Same question here.
I was looking at the Autocrawler (Production -> Advanced Crawler -> Autocrawl) and there are two options, Shallow and Deep, with different values.
I would like to understand how these levels of indexing work, as well as the other options like “Rows to fetch at once”, etc.
Thanks!


EDIT: @isle See this: Clarification on crawling levels?

@transysthor
While I can’t explain crawl depth levels accurately in any meaningful way (for all I know, YaCy applies the set depth level every time it branches outside of the current domain and into a new one), I think I can perhaps shed some light on Rows to fetch at once, etc…

If you go to the Crawler Monitor, you’ll see the Solr search API mentioned, linking to something like this:

https://peach.stembod.online:8443/solr/select?core=collection1&q=*:*&start=0&rows=3 (replace host and port as needed)
(This is also useful for testing any desired Auto Crawler Solr query string instead of the default *:*. See the links below for more.)

The number of rows, at least in most databases, usually refers to the number of database entries to fetch (think of it like a spreadsheet or table, which has rows and columns).

And as you can see, that request asks for 3 results (rows=3). In combination with start=0, I’m guessing that means something like ‘fetch me rows 0 to 2’.

So Rows to fetch at once (with the default setting of 100, and the default *:* query) would mean

  • select?core=collection1&q=*:*&start=0&rows=100 (rows 0 to 99)

And then, when Auto Crawler gets done with that set, it probably does

  • select?core=collection1&q=*:*&start=100&rows=100 (rows 100 to 199), then
  • select?core=collection1&q=*:*&start=200&rows=100 (rows 200 to 299), and so on…
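
To make that concrete, here is a rough sketch of how a client could page through that Solr select API with start and rows. The host, port and core name are just the placeholders from the example URL above, wt=json simply asks Solr for JSON output, and the function names are made up; this is only an illustration, not YaCy’s actual Auto Crawler code:

```python
import json
import urllib.parse
import urllib.request

# Placeholder host/port/core taken from the example URL above; adjust for your peer.
BASE = "https://peach.stembod.online:8443/solr/select"
ROWS = 100  # "Rows to fetch at once"

def fetch_batch(start, rows=ROWS, query="*:*"):
    """Fetch one batch of index entries: `rows` documents starting at offset `start`."""
    params = urllib.parse.urlencode({
        "core": "collection1",
        "q": query,
        "wt": "json",   # ask Solr for a JSON response
        "start": start,
        "rows": rows,
    })
    with urllib.request.urlopen(f"{BASE}?{params}") as resp:
        return json.load(resp)

# Walk the whole result set one batch at a time:
# start=0, start=100, start=200, ... each returning up to 100 rows.
start = 0
while True:
    docs = fetch_batch(start)["response"]["docs"]
    if not docs:
        break
    for doc in docs:
        ...  # hand each document off, e.g. queue its URL for (re)crawling
    start += ROWS
```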

And with Deepcrawl every set to the default of 50, it means that results 50, 100, 150, etc. would get set to be deep crawled (default depth 3), while the others get done at the shallow crawl depth (default 2(?)).

I’m guessing… and I’m not sure how it deals with the various custom collections, e.g. user.
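
If that guess is right, the shallow/deep split would amount to something like the following (the depth values and the every-50th rule are just the defaults mentioned above, and pick_depth is a made-up name; this is my reading of the settings, not YaCy’s actual logic):

```python
DEEPCRAWL_EVERY = 50  # "Deepcrawl every" (default 50)
SHALLOW_DEPTH = 2     # default shallow crawl depth (2, if I read the panel right)
DEEP_DEPTH = 3        # default deep crawl depth

def pick_depth(result_number):
    """Every DEEPCRAWL_EVERY-th result gets the deep depth, the rest the shallow one."""
    if result_number > 0 and result_number % DEEPCRAWL_EVERY == 0:
        return DEEP_DEPTH
    return SHALLOW_DEPTH
```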

Regarding the Query setting, I’ve found this useful:

in combination with looking at the fields present in YaCy’s IndexSchema_p.html page.

I think that in this thread there is some confusion and misunderstanding between “crawl” and “query”, and between “page” and “domain”.

Think of the internet as a library of physical books.

Books sometimes have an index, a list of the individual words used throughout the book. Sometimes, with a multi-volume work like an encyclopedia, an entire book will be an index listing the words used throughout all 30 volumes. An internet search engine is like an index of words for every book in every library in the world.

A “web crawler” is like a librarian who is in charge of reading through all the library books to find out and record on which pages words appear, adding those words to the index in some lookup table like:

Apple: vol 1 page 137, vol 12 page 68, vol 14 page 52, 67, 95
Art: vol 7 page 74
Etc.

On the internet, however, instead of volume numbers of bound copies of books there are URLs, which point to individual pages.

A “crawl” reads some page and adds all the words found to the common index of words and their associated pages, like:

Apple: http://johnnyappleseed.com/johny/history/intro.html, http://applesauce.com/mackintosh/applesauce/howitsmade.html

Art: http://www.themetmuseum.org/artdeco/artists.html

Etc. In alphabetical order or some other tabular arrangement that makes words easy to look up so someone can look in the index and go find the actual book as referenced in the table (index).

So a “crawl” is building up an index, adding words and references (page numbers or URLs where the words can be found).

A “query” is USING the index already compiled: locating a book on a particular subject by finding words in the index related to that subject, getting the volume and page number (or URL), and fetching the actual page that contains the word, to read it in context in some actual book or actual website.

YaCy performs both functions: as a web crawler it builds an index, and as a search engine it uses the index to look up specific websites based on a reference table, or index, built up from previous crawls.
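
To put the librarian analogy into code: here is a minimal sketch of an inverted index, where “crawling” adds the words of a page to the table and “querying” reads the table back. The URLs and page texts are made up; this shows the idea, not YaCy’s implementation:

```python
from collections import defaultdict

# The "card catalogue": word -> set of URLs where the word appears.
index = defaultdict(set)

def crawl_page(url, text):
    """'Crawling': break a fetched page into words and record word -> URL."""
    for word in text.lower().split():
        index[word].add(url)

def query(word):
    """'Querying': use the already-built index to find pages containing the word."""
    return index.get(word.lower(), set())

# Made-up pages, just to show the two directions:
crawl_page("http://example.org/apples.html", "apple orchards and apple sauce")
crawl_page("http://example.org/art.html", "art deco artists")

print(query("apple"))  # {'http://example.org/apples.html'}
```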

To most people a search engine is the librarian who tells a patron where to find a book on a particular subject, but part of the librarian’s job is to keep all the books in some kind of order.

The internet is far more complicated and chaotic than any public library, so the index is generally broken down by individual pages (not so much by domains, which are collections of files, or pages).

So crawling is building an index while querying is essentially the opposite; using the index to find which pages contain a “key word”. A key word is just a word in the index that is somewhat special in that it kind of represents a particular subject or category of information.

Most indexes in most libraries do not bother to index every word in every book, just the “special” words that have some significance. There is no real point in indexing very common words like it, to, the, a, so, from, him, her, she, he, for, … and so forth; that would make the index too big.

Some words are very special to the degree that they represent actual categories of information, like: automotive, agricultural, medical, theological, etc.

So, there are some rules associated with how words are indexed and in what order so as to keep the index as useful and efficient as possible. Such rules are different for different search engines and are what might make one search engine a bit better than another, but they all do essentially the same job of building an index and then using that index to go out and fetch the actual pages the words in the index have been in one way or another associated with.

YaCy’s index is in the form of a “distributed hash table”, so the list of words is not all in one place. Everybody running YaCy shares in both building the index (crawling) and helping each other find stuff (querying).
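
Very roughly, that means each word (or rather its hash) belongs to some peer that holds that slice of the index. The sketch below uses a plain hash-modulo over a list of made-up peer names, which is a big simplification of how YaCy actually places and replicates index entries:

```python
import hashlib

peers = ["peer-a", "peer-b", "peer-c"]  # made-up peer names

def responsible_peer(word):
    """Hash the word and map it onto one of the peers.
    Real DHTs (YaCy included) use smarter placement and redundancy."""
    h = int(hashlib.sha256(word.lower().encode()).hexdigest(), 16)
    return peers[h % len(peers)]

print(responsible_peer("apple"))  # the same word always lands on the same peer
```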

Very complex, but really quite simple.

So crawl depth has to do with building the index: how many pages will the librarian add to the card catalogue today? How many pages will be broken down into individual words to add to the index?

Crawl depth is like cross referencing.

If one page has a footnote that mentions another book (that is a “link”, or hyperlink, or URL), then that page with the footnote on it is page zero. Why not one? I guess because today the librarian is not interested in all the footnotes and cross references (which could include things like “see page 55” in the same book, or on the same “domain”). Today only one page will be shredded up into individual words to add to the index.

A computer can accomplish that task in a fraction of a second. It may not appear to have done anything, since it is only indexing the words on one page. That one page may or may not have links to other pages, but at zero, that is not much of a concern. Depth zero just indexes the words on one page, faster than the blink of an eye.

Level 1 follows the links on the first page and also indexes all the words on those pages. The first page might have no links at all, or it could have thousands of links to thousands of other pages, but that doesn’t matter much. Whether there is one link or a thousand, it still takes almost no time at all for a computer to analyze and index one page and all the pages cross-referenced on that page, if any. Some pages are dead ends with no links out; in that case “depth” is meaningless, because there is nowhere else to go.
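
In code form, the depth rule is just a breadth-first walk that stops after a set number of hops from the start page. The fetch_links and index_page functions here are placeholders, not YaCy’s crawler:

```python
def crawl(start_url, max_depth, fetch_links, index_page):
    """Breadth-first, depth-limited crawl.

    depth 0: index only start_url (Document A).
    depth 1: also index the pages A links to (B, ...).
    depth 2: one hop further (C, ...), and so on.

    fetch_links(url) and index_page(url) are placeholders standing in for
    "download the page and extract its links" and "add its words to the index".
    """
    seen = {start_url}
    frontier = [start_url]
    for depth in range(max_depth + 1):
        for url in frontier:
            index_page(url)
        if depth == max_depth:
            break
        next_frontier = []
        for url in frontier:
            for link in fetch_links(url):
                if link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
```

With max_depth=0 only A gets indexed; with 1, A and B; with 2, A, B and C.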

The OP is correct.