I think there is some confusion in this thread between “crawl” and “query”, and between “page” and “domain”.
Think of the internet as a library of physical books.
Books sometimes have an index: a list of individual words used throughout the book. With a multi-volume work like an encyclopedia, sometimes an entire book is an index listing words used throughout all 30 volumes. An internet search engine is like an index of words for every book in every library in the world.
A “web crawler” is like a librarian in charge of reading through all the library books, recording which pages each word appears on, and adding those words to the index in a lookup table like:
Apple: vol 1 page 137, vol 12 page 68, vol 14 page 52, 67, 95
Art: vol 7 page 74
On the internet, however, instead of volume and page numbers of bound books there are URLs, each of which identifies an individual page.
A “crawl” reads a page and adds all the words found on it to the common index of words and their associated pages, like:
Apple: http://johnnyappleseed.com/johny/history/intro.html, http://applesauce.com/mackintosh/applesauce/howitsmade.html
Etc., in alphabetical order or some other tabular arrangement that makes words easy to look up, so someone can look in the index and then go find the actual book referenced in the table (index).
So a “crawl” is building up the index: adding words and references (page numbers or URLs) where the words can be found.
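To make the crawl half concrete, here is a minimal sketch in Python. It is only an illustration of the library analogy, not how YaCy actually stores things: the function name, the example.com/example.org URLs, and passing the page text in directly (instead of fetching it over the network) are all my own assumptions.

```python
import re
from collections import defaultdict

def add_page_to_index(index, url, text):
    """'Crawl' one page: record that each word on it can be found at this URL.

    Hypothetical helper for illustration; a real crawler would fetch the
    page itself and also follow the links it finds there.
    """
    for word in set(re.findall(r"[a-z]+", text.lower())):
        index[word].add(url)

# The shared index: word -> set of URLs where that word appears.
index = defaultdict(set)

# Made-up pages standing in for real websites.
add_page_to_index(index, "http://example.com/history.html",
                  "Johnny planted apple trees")
add_page_to_index(index, "http://example.org/sauce.html",
                  "How apple sauce is made")
```

After these two calls the entry for "apple" lists both URLs, while "sauce" lists only the second one, which is the same shape as the Apple/Art table above.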
A “query” is USING the already compiled index to locate a book on a particular subject: finding words in the index related to that subject, getting the volume and page number (or URL), and fetching the actual page that contains the word so it can be read in context in the actual book or website.
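The query half can be sketched the same way. Again this is just an illustration under my own assumptions: the small hand-written index, the made-up URLs, and the rule that a page must contain every search word are choices I made for the example, not a description of YaCy's search.

```python
# A tiny, already compiled index: word -> set of URLs (all made up).
index = {
    "apple": {"http://example.com/history.html", "http://example.org/sauce.html"},
    "sauce": {"http://example.org/sauce.html"},
    "art": {"http://example.net/gallery.html"},
}

def query(index, search_text):
    """Look up each search word and keep only the pages that contain all of them."""
    words = search_text.lower().split()
    if not words:
        return []
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return sorted(results)

print(query(index, "apple"))
print(query(index, "Apple sauce"))
```

Note that the query never reads any page. It only reads the index and hands back URLs, and fetching the page to read the word in context is a separate step.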
YaCy performs both functions. As a web crawler it builds an index, and as a search engine it uses that index, built up from previous crawls, to look up specific websites.
To most people a search engine is the librarian who tells a patron where to find a book on a particular subject, but part of the librarian’s job is also to keep all the books in some kind of order.
The internet is far more complicated and chaotic than any public library, so the index is generally broken down by individual page rather than by domain (a domain being a collection of files, or pages).
So crawling is building an index, while querying is essentially the opposite: using the index to find which pages contain a “key word”. A key word is just a word in the index that is somewhat special in that it represents a particular subject or category of information.
Most indexes in most libraries do not bother to index every word in every book, just the “special” words that have some significance. There is no real point in indexing very common words like it, to, the, a, so, from, him, her, she, he, for, and so forth; that would make the index too big.
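Skipping the common words is easy to picture in code. This is a minimal sketch, assuming a fixed list of words to ignore taken from the examples above; the names are hypothetical, and I am not claiming this is the exact list or mechanism any particular search engine uses.

```python
# Very common words that are not worth an index entry (list taken from the post).
STOP_WORDS = {"it", "to", "the", "a", "so", "from", "him", "her", "she", "he", "for"}

def words_worth_indexing(text):
    """Return the words of a text in order, leaving out the very common ones."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]

print(words_worth_indexing("She gave the apple to him"))  # ['gave', 'apple']
```

Six words go in and only two would end up in the index, which is the whole point of leaving the common ones out.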
Some words are very special to the degree that they represent actual categories of information, like: automotive, agricultural, medical, theological, etc.
So there are rules for how words are indexed and in what order, to keep the index as useful and efficient as possible. These rules differ between search engines and are what might make one search engine a bit better than another, but they all do essentially the same job: building an index and then using it to go out and fetch the actual pages that the words in the index have been associated with.
YaCy’s index takes the form of a “distributed hash table”, so the list of words is not all in one place. Everybody running YaCy shares in both building the index (crawling) and helping each other find stuff (querying).
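The general idea behind a distributed hash table can be sketched in a few lines. To be clear, this is NOT YaCy's actual scheme: the peer names, the use of SHA-256, and the simple modulo rule are all assumptions I picked to show the principle that a word's hash decides which peer keeps that word's entry.

```python
import hashlib

# Hypothetical peers, each holding one slice of the shared index.
PEERS = ["peer-a", "peer-b", "peer-c"]

def peer_for_word(word, peers=PEERS):
    """Pick the peer responsible for a word by hashing the word.

    Illustrative only: real DHTs use more careful placement so that
    peers can join and leave without reshuffling everything.
    """
    digest = hashlib.sha256(word.lower().encode("utf-8")).digest()
    return peers[int.from_bytes(digest[:8], "big") % len(peers)]

for word in ["apple", "art", "automotive"]:
    print(word, "->", peer_for_word(word))
```

Because every peer computes the same hash, a crawling peer knows where to send a new entry for "apple", and a querying peer knows where to ask for it, without anyone keeping the complete list.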
Very complex, but really quite simple.