Can (or could) YaCy do conceptual indexing? (or be the search engine for "the semantic web")

What does “the semantic web” mean? To me it simply means searching, spidering, indexing and serving web pages by concepts rather than words in any particular language.

A simple and fairly well known example of this is most, if not all, public library book indexing systems.

Take DDS, the Dewey Decimal System.

In the DDS system, as an example, the number 006.3 represents the concept of Artificial Intelligence in computing. This conceptual representation cuts across language barriers and eliminates ambiguity.

So why not something like <meta name="dds" content="006.3"> as metadata for a website about artificial intelligence? And why not have search engines that can read such conceptual metadata and index and serve up websites accordingly?
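To make the idea concrete, here is a minimal sketch of how an indexer might read such a tag from a page. It uses only Python's standard library. The tag name "dds" comes from the example above; the class and function names are made up for illustration, and nothing here describes how YaCy itself parses pages.

```python
from html.parser import HTMLParser


class ConceptMetaParser(HTMLParser):
    """Collect the content of <meta name="dds" ...> tags (hypothetical tag name)."""

    def __init__(self, meta_name="dds"):
        super().__init__()
        self.meta_name = meta_name
        self.codes = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if name == self.meta_name and attributes.get("content"):
            self.codes.append(attributes["content"])


def extract_concept_codes(html, meta_name="dds"):
    """Return every concept code found in the page's meta tags."""
    parser = ConceptMetaParser(meta_name)
    parser.feed(html)
    return parser.codes


page = '<html><head><meta name="dds" content="006.3"><title>AI</title></head></html>'
print(extract_concept_codes(page))  # ['006.3']
```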

Well, the number one reason is probably that OCLC (Online Computer Library Center, Inc.), which maintains the Dewey Decimal System and keeps it up to date, seems in the past to have been a bit litigation prone. https://www.nytimes.com/2003/09/23/nyregion/where-did-dewey-file-those-law-books.html

Well, there are other conceptual classification systems. The Library of Congress, for example, uses the code Q334-342 for the same concept (Artificial Intelligence).

Maybe, but it is rather cumbersome and limited. Dewey had 10 primary categories represented by the numbers 0 through 9. So, can we count the categories of all the diverse information on the web on our fingers? That’s another reason not to use the Dewey system. Well, the LOC system uses A through Z. Again, can all the information on the web be categorized by just 26 categories? I mean 21. For some reason, or no reason, the entire alphabet is not utilized. Can all human knowledge and information be divided into such a limited number of basic concepts?

There is one library conceptual system that might work for the entire web, with some modification. Colon Classification of S. R. Ranganathan or some similar faceted classification system https://en.wikipedia.org/wiki/Faceted_classification.

But the conceptual categories or facets of the internet far exceed anything imagined by Ranganathan… Libraries of books do not have any such thing as an IRC chat room or internet message board or blogs or Live streams or shopping carts or auction sites or upcoming event calendars or social networks etc. etc.

A faceted classification system nevertheless packs a maximum amount of metadata into an incredibly compact form. Basic information like subject, time, location, use/purpose, rank: importance, urgency, and more can be represented by a rather concise string of characters in a way that is both computer readable and unambiguous.

Did I mention that this kind of conceptual indexing cuts across language barriers? Suppose for example we were to use a two character symbol system to represent concepts, and we choose the letters AG as the designated representation for the concept of Agriculture.

If this were to be agreed upon and used as a standard system of representation across the internet worldwide (AG=concept:Agriculture) then search interfaces (in any language) and indexing programs, databases etc, can all use this character code as representing the concept of Agriculture in any language, just as the number 630 in the Dewey system. The selection of a code to represent a concept is pretty much arbitrary. It doesn’t really matter much. What is important is that it is standardized (agreed on universally) and properly formatted so as to be computer readable.

With that, any search interface in any language can look for the specified concept (using some relatively simple character matching or “regular expression”), use the specified code in the background, and retrieve the same data/websites that have included that code in their metadata, or that have otherwise been indexed as pertaining to that field of knowledge.


Concepts can be combined. AG:FO might be used to designate the subject facet of Agriculture AND the “Site Purpose” facet of Discussion forum (FO=concept:forum). Combined, these two facets represent “Agricultural Discussion Forum”.
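A small sketch of how such a combined code could be expanded again. The two codes are the ones from the example; the tables and the function name are hypothetical, and the fixed "subject:purpose" order is an assumption.

```python
# Hypothetical code tables; the two entries come from the example above.
SUBJECTS = {"AG": "Agriculture"}
PURPOSES = {"FO": "Discussion Forum"}


def describe(combined):
    """Expand a 'SUBJECT:PURPOSE' code such as 'AG:FO' into readable labels."""
    subject_code, purpose_code = combined.split(":")
    return {
        "subject": SUBJECTS[subject_code],
        "purpose": PURPOSES[purpose_code],
    }


print(describe("AG:FO"))  # {'subject': 'Agriculture', 'purpose': 'Discussion Forum'}
```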

Remarkably, a simple system of encoding of this sort has managed to organize all the books in all the libraries around the world on every conceivable subject using an encoding system consisting of a simple character string that fits on the binding on the back of any book.

Hi Tom, welcome to our new forum - and YaCy!

Your thoughts above give me the opportunity to tell another story: my work for the German Digital Library DDB (http://ddb.de), which is the effort to provide digital assets about German culture in the context of a Europe-wide approach, europeana.eu.

I was once the first architect of the DDB and worked for the Deutsche Nationalbibliothek / German National Library as a consultant. Here I learned about the Dewey Decimal Classification - which I found to be a surprising approach to classify anything within 1000 classes. This is a bit of the root of my contact with semantic classification and the source of inspiration for work both for the DDB and YaCy as well.

The point is that the DDC is a rather old approach. Other approaches use controlled vocabularies. One of the most common vocabulary ontologies today was created by Google, which is http://schema.org
This contains a lot of commercial-oriented attributes, like location, events, products with prices etc.

For YaCy I took the inspiration of the DDC to enable controlled Vocabularies which can be either imported or created with YaCy itself. For example you can create a vocabulary from the titles of a mediawiki. Open the menu “Content Semantic” to see what I mean. YaCy is able to auto-annotate pages based on keywords which appear on the page. This does not work well for everything, but not bad for locations and names.

If you like to, you can even use the classification names from DDC and make a YaCy vocabulary out of it.

The result of semantic annotation is then used in search results for search facets. If you create a new Vocabulary, then you get a facet with the name of the vocabulary.

This is very exciting, cutting edge stuff. I’m so pleased that you have had the foresight to incorporate such functionality into YaCy.

The instructions in the content semantic section of the YaCy admin area mention: “A vocabulary must be created before content is indexed”.

I’m assuming that means it is just as well to start with a fresh index?

I’ve been looking into schema.org for years now with high hopes of making use of it some day, but have not found it useful.

Is there a video tutorial or something on how to use the Vocabulary Administration features? YaCy is so incredibly feature rich it is a bit overwhelming.

“Vocabularies can be used to produce a search navigation.”

I’m not entirely sure what that means. I think maybe it means, in effect, that it is possible to add “Tags” to links in the database and then search for those “tags”. My assumption is that this would only work for URLs in the local index, or could this work with remote YaCy peers if they had annotated their index using the same vocabulary? Or perhaps I’m reaching and have completely misunderstood.

"A vocabulary must be created before content is indexed. The vocabulary is used to annotate the indexed content with a reference to the object that is denoted by the term of the vocabulary. "

“The object can be denoted by a url stub that, combined with the term, becomes the url for the object.”

This appears to be similar to the schema.org methodology, using a url, I’ve never been able to figure out what purpose that is supposed to serve.

You wrote:

"If you like to, you can even use the classification names from DDC and make a YaCy vocabulary out of it.

“The result of semantic annotation is then used in search results for search facets. If you create a new Vocabulary, then you get a facet with the name of the vocabulary.”

I’m basically drawing a blank trying to understand what that means. By “names from DDC” do you mean subject titles like Science, Philosophy, Religion… etc.? or is it possible to use numbers like 006.3 ?

What is a “facet” in YaCy? I’m not sure we are talking about the same thing.

To me a facet, in a faceted search would be, (taking the example of a website for a brick and mortar business selling office supplies), things like TYPE OF WEBSITE: brick and mortar business, LOCATION: Geo-code (Latitude, longitude), SUBJECT: Office Supplies

These are key identifiable aspects of the website. Things people might need or want to know about a website before bothering to click a link. The specific aspects of an internet resource someone might have in mind when conducting a search.

A general DDC classification would only be one facet that might be applicable to some websites.

"If you create a new Vocabulary, then you get a facet with the name of the vocabulary.”

I’m thinking that this is what I’m intensely interested in.

Let’s say for example I want to index “Website Types” as a theoretical search parameter. “Website Type” becomes the “facet” or vocabulary name, presumably.

Now the actual controlled vocabulary for that facet might be something like:

Facet#1-Site Type
Controlled Vocabulary, Description
BL, Blog
BM, Brick and Mortar Establishment
SC, Online Shopping Cart
PH, Personal Home Page
PO, Portal
AR, Age Restricted
CA, Classified Advertising
AU, Auction Site
LG, Local Government Site
NS, News
ED, Educational
MB, Message Board
EV, Event Announcement
…etc …etc

A second facet might be Subject

SCI, Science
AGR, Agriculture
PHL, Philosophy
SPI, Spiritual
COK, Cooking
AUT, Automotive
POL, Politics
etc. etc.

A third might be Accessibility

CC, Credit Card Verification Required
RG, Registration required
OA, Open to all or Open Access (No restrictions)
PY, Payment required for full access
PT, Payment required with Free Trial
etc.

Target Audience
CH, Children
ME, Men
WM, Women
PA, Parents
ST, Students
RP, Retired Persons
FR, Farmers
GW, Government Workers

Location
Coordinates Latitude, Longitude

And so forth
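As a rough sketch, the facets and controlled vocabularies above could be held in a simple lookup structure and used to check an annotation before it is stored. Only a few of the listed codes are included, and the facet keys and function name are illustrative choices, not anything defined by YaCy.

```python
# A few entries from the lists above; the facet keys are illustrative names.
FACETS = {
    "site_type": {"BL": "Blog", "SC": "Online Shopping Cart", "MB": "Message Board"},
    "subject": {"SCI": "Science", "AGR": "Agriculture", "POL": "Politics"},
    "access": {"OA": "Open Access", "RG": "Registration required"},
    "audience": {"ST": "Students", "FR": "Farmers"},
}


def validate(annotation):
    """Return a list of problems found in a {facet: code} annotation."""
    problems = []
    for facet, code in annotation.items():
        if facet not in FACETS:
            problems.append(f"unknown facet: {facet}")
        elif code not in FACETS[facet]:
            problems.append(f"unknown code {code!r} for facet {facet}")
    return problems


good = {"site_type": "MB", "subject": "AGR", "access": "OA", "audience": "FR"}
bad = {"site_type": "XX", "colour": "RED"}
print(validate(good))  # []
print(validate(bad))
```

Rejecting unknown codes at annotation time is what keeps the vocabulary "controlled": peers can only exchange codes that everyone can look up.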

These example Facets and corresponding Vocabularies are off the top of the head just to give an idea what I have in mind. I’m not sure if what you describe is the same or similar or something else altogether. i.e. are we on the same page so far or on two different planets?

Certainly, if people are going to annotate links in such a way, or any other way, some common syntax and common controlled vocabulary would be desirable. Otherwise this is only, in effect, a “proprietary” methodology that perhaps I can run locally but that would be incomprehensible to other YaCy peers.

Adopting schema.org’s syntax and methodology would have limited usefulness but might be better than nothing. It would be nice if it wasn’t so… how should I say it? unworkable, unwieldy, bloated, complicated, nonsensical, unnecessarily verbose, language biased and full of useless redundancies to the point where it will never be adopted by anyone outside some large company with an IT team to work on developing an essentially proprietary system. I don’t think I could ever explain schema.org to my neighbors so they might benefit by marking up their own home pages to make them search engine friendly. But maybe I’m missing something. At any rate someone would first have to explain it better to me. I find Schema.org mind numbingly complicated and sprawling. Granted, it has some theoretically useful elements but implementing them seems not worth the effort. Ordinary people, non-professional web designers, are unlikely to ever bother with it. I’ve been studying it, or looking aghast at it, more accurately, for years and as far as I understand it, don’t like it.

I think (correct me if I’m wrong), what I’m getting at or dabbling with is the reverse or opposite of YaCy’s vocabulary. A kind of reverse vocabulary.

In YaCy I can say:

GA=Gardening,Agriculture,Farming

Perhaps I can add additional languages

GA=Gardening,Agriculture,Farming,Gartenarbeit,jardinage,tuinieren,trädgårdsarbete

I can then search for GA and get results that include sites that include the word(s) Gardening, Farming, Agriculture, in English as well as The concept of gardening in other languages.
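A minimal sketch of that kind of keyword-based annotation, assuming a vocabulary that maps one code to a list of terms. This only illustrates the idea; it is not YaCy's actual implementation, and the function name is made up.

```python
import re

# One vocabulary entry taken from the post: code -> terms in several languages.
VOCABULARY = {
    "GA": ["Gardening", "Agriculture", "Farming", "Gartenarbeit",
           "jardinage", "tuinieren", "trädgårdsarbete"],
}


def annotate(text):
    """Return the sorted codes whose terms occur as whole words in the text."""
    found = set()
    for code, terms in VOCABULARY.items():
        for term in terms:
            if re.search(rf"\b{re.escape(term)}\b", text, flags=re.IGNORECASE):
                found.add(code)
                break
    return sorted(found)


print(annotate("Tipps zur Gartenarbeit im Herbst"))  # ['GA']
print(annotate("Used cars for sale"))                # []
```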

That is something, but what I’d like to see, and have been working on is:

I have an English search engine interface. I search for “Gardening” which is transposed to GA.

The search is conducted on the controlled vocabulary across peers using GA without requiring that I know any language other than my own, English.

Likewise there are German, French, Dutch, Spanish and all other language search interfaces.

Now nobody has to create any extensive vocabulary. Rather, people using the search engine simply use the agreed-upon code, or, import a standard code file in their own language.

Very simply, my English YaCy search engine has the vocabulary GA=Gardening

My German speaking peer has only GA=Gartenarbeit

The “GA” controlled language works well for both languages in this case

My Spanish speaking friend however also uses GA=Jardinería

And so forth for other languages. GA is the glue that brings together all the sites that pertain to the concept represented by GA regardless of the language spoken, and, probably pages returned could also be translated on the fly.
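The "reverse vocabulary" can be sketched as one small code file per interface language, all pointing at the same shared codes. The terms are the ones given above; the index contents and URLs are invented for illustration, and no peer-to-peer exchange is modelled here.

```python
# Hypothetical per-language code files; the terms are the ones from the post.
CODE_FILES = {
    "en": {"gardening": "GA"},
    "de": {"gartenarbeit": "GA"},
    "es": {"jardinería": "GA"},
}

# A toy shared index: concept code -> URLs annotated with it (made-up URLs).
INDEX = {
    "GA": ["http://example.org/garden-tips", "http://example.org/huerto"],
}


def search(query, language):
    """Translate a query term to its concept code, then look the code up."""
    code = CODE_FILES[language].get(query.strip().lower())
    if code is None:
        return []
    return INDEX.get(code, [])


# The English and Spanish interfaces reach the same results through GA.
print(search("Gardening", "en") == search("Jardinería", "es"))  # True
```

Each user only needs the code file for their own language; the index itself never stores a word in any particular language.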

Language barriers have been flattened world wide.

Also people can use something like Meta Name=YaCycode:Subject Content=GA

But that is just to get across the idea. In practice many facets could be condensed into a “colon classification” type string that could include any number of facets.

It might be an interesting experiment if YaCy were configured, or configurable, so as to give websites with semantic metadata priority.

The programs I’ve written in Perl use semantic, conceptual indexing and retrieval exclusively.

Currently I’ve been experimenting with a dozen or so facets I thought might be important and could be subject to standardization: Latitude/Longitude, Website Purpose, Subject, Target Audience, Media Type, Start Date (for future events) including a flag for ongoing/recurring events, Access requirements (open or members only etc.), Ratings on a scale of 0 - 9 for importance, urgency and credibility, a field for specifically identifying information associated with online collaborative projects, and also page update frequency (how often the page should be spidered and re-indexed on a schedule).

Another useful facet might be Language, but I have not implemented this. Probably it should be included.

The result is a string of characters that looks something like this:

SDM50FRMPOL…42N075W.LIB534DOCHMs7yebhd8ao

Also included are page title, url and a page description, though strictly speaking these things are not included as conceptual data.

What that string of metadata says is:

Update frequency - SD This site is typically updated about two times a day
Accessibility - M This is a Membership Site Access Requires sign up but no fee
Technical Accessibility - 50 Site is accessible to Most Common Web Browsers.
Site Purpose - FRM Forum (Discussion Board, A place to post messages)
Subject - POL… - POLITICS (additional “…” are placeholders for potential unused sub-topics)
Location - 42N075W within the region 42 +1 degrees North Lat. and 075 +1 West Long.
Date-Time - ( . ) Time Not specified. “.” placeholder since this site is not about an event.
Target Audience - LIB Libertarian
Importance - 5 Average Importance
Urgency - 3 Less than average urgency
Credibility - 4 May be Credible but Credibility has not been established
Media Type - DOCHM HTML document
Group Project Key: s7yebhd8ao Site relates to or is associated with a collaborative Project.
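The location field lends itself to a small worked example. The sketch below turns a code like 42N075W into a one-degree bounding box, following the "42 +1 degrees North, 075 +1 West" reading given above. How southern and eastern cells extend is an assumption (one degree away from the equator and prime meridian), and the function name is made up.

```python
import re

# Two latitude digits + hemisphere, three longitude digits + hemisphere.
LOCATION = re.compile(r"^(\d{2})([NS])(\d{3})([EW])$")


def bounding_box(code):
    """Return (south, west, north, east) in signed decimal degrees."""
    match = LOCATION.match(code)
    if match is None:
        raise ValueError(f"not a location code: {code!r}")
    lat, ns, lon, ew = match.groups()
    lat, lon = int(lat), int(lon)
    # Assumption: each cell spans one degree away from the equator / prime meridian.
    south, north = (lat, lat + 1) if ns == "N" else (-(lat + 1), -lat)
    west, east = (lon, lon + 1) if ew == "E" else (-(lon + 1), -lon)
    return (south, west, north, east)


print(bounding_box("42N075W"))  # (42, -76, 43, -75)
```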

There is also a field indicating the “authority” that did the indexing of the site, or whether the site is “self indexed” by its own metadata field; a generic unspecified indexer can also be used. This is not so much any kind of metadata, but rather a kind of branding, i.e. index generated by Google / Godaddy / YaCy Peer / or whatever, for “bragging rights”.

This could also narrow results, if for example you want to only search sites that were indexed by a particular indexing “authority” such as the National Science Foundation or some other specialized body of information.

There is also an index version number and a “handle” to differentiate such a metadata string from other information on a website.

My “search spider” is just for testing purposes to verify that the metadata placed on a particular website is formatted correctly and computer readable.

BTW for brevity, I have not used any delimiters, such as a colon or comma between fields, rather I’m using field length and location or placement.

That is, for example; the digit 2 means something different according to its placement within a string of numbers.

2 = two
20 = twenty
200 = two hundred
2000 = two thousand

In other words the meaning of a bit of metadata code depends on its location within the overall string of metadata. This allows the entire metadata string to be slurped up and analyzed using a single standard RegEx. The data can then be picked apart according to need.
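A sketch of what such a single positional regular expression might look like in Python. The field widths are inferred from the decoded example string, not from a published specification. The subject placeholders are taken to be three literal periods (the forum shows them as one ellipsis character), and the 8-digit form for a specified date is purely an assumption, since only the "." placeholder is shown.

```python
import re

# One named group per field, in string order. Widths are inferred from the example.
METADATA = re.compile(
    r"(?P<update>[A-Z]{2})"            # update frequency, e.g. SD
    r"(?P<access>[A-Z])"               # accessibility, e.g. M
    r"(?P<tech>\d{2})"                 # technical accessibility, e.g. 50
    r"(?P<purpose>[A-Z]{3})"           # site purpose, e.g. FRM
    r"(?P<subject>[A-Z.]{6})"          # subject plus sub-topic placeholders
    r"(?P<location>\d{2}[NS]\d{3}[EW])"
    r"(?P<date>\.|\d{8})"              # "." placeholder; 8 digits is an assumption
    r"(?P<audience>[A-Z]{3})"
    r"(?P<importance>\d)"
    r"(?P<urgency>\d)"
    r"(?P<credibility>\d)"
    r"(?P<media>[A-Z]{5})"
    r"(?P<project>[a-z0-9]{10})"
)


def parse(code):
    """Split a metadata string into named fields, or return None if malformed."""
    # Undo the forum's typographic ellipsis so the placeholders are plain dots.
    match = METADATA.fullmatch(code.replace("\u2026", "..."))
    return match.groupdict() if match else None


fields = parse("SDM50FRMPOL...42N075W.LIB534DOCHMs7yebhd8ao")
print(fields["subject"], fields["location"], fields["audience"])
```

Because every field has a fixed position, one match yields all fields at once and a malformed string is rejected as a whole.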

For example, If I want to find items within the events start date field, entries with a “.” placeholder in that field can be skipped if no data exists.

If I’m only interested in finding events related to computer technology then I can confine a search to the subject, date and perhaps location fields.
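A sketch of such a field-restricted search over records that have already been split into fields. The subject code CMP for computer technology, the dates and the URLs are all invented for illustration.

```python
# Records as they might look after a metadata string has been split into
# fields. Codes, dates and URLs are invented for illustration.
RECORDS = [
    {"url": "http://example.org/a", "subject": "CMP...", "date": "20250614", "location": "42N075W"},
    {"url": "http://example.org/b", "subject": "CMP...", "date": ".", "location": "42N075W"},
    {"url": "http://example.org/c", "subject": "POL...", "date": "20250701", "location": "40N074W"},
]


def find_events(records, subject_prefix, location=None):
    """Return URLs of dated entries whose subject starts with the given code."""
    hits = []
    for record in records:
        if record["date"] == ".":  # placeholder: not an event, skip it
            continue
        if not record["subject"].startswith(subject_prefix):
            continue
        if location is not None and record["location"] != location:
            continue
        hits.append(record["url"])
    return hits


print(find_events(RECORDS, "CMP"))  # ['http://example.org/a']
```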

One reason I think that a very concise metadata structure is, or would be more useful in some ways than something like XML or JSON aside from simplicity and speed, is that it would allow for very deep indexing within such things as internal HTML target links (or subheadings on a page), images, alongside or included within image descriptions, alt tags, and external links.

For example <a href=“http://www.whatever.com” YaCycode=“SDM50FRMPOL…42N075W.LIB534DOCHMs7yebhd8ao”>

A browser enabled to read the code could then show in advance what the link is all about. Or some javascript might be used, onhover or whatever .

This is a hypothetical use, but my point is that such meaningful deep indexing would be possible.
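To illustrate, a minimal sketch of a reader that collects the code from each link, using Python's standard HTML parser. The YaCycode attribute is the hypothetical one from the example above, not a recognised HTML attribute, and the class name is made up.

```python
from html.parser import HTMLParser


class LinkCodeParser(HTMLParser):
    """Collect (href, code) pairs from <a> tags carrying a YaCycode attribute."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # HTMLParser lowercases attribute names, so YaCycode arrives as yacycode.
        code = attributes.get("yacycode")
        if code:
            self.links.append((attributes.get("href"), code))


html = (
    '<p><a href="http://www.whatever.com" '
    'YaCycode="SDM50FRMPOL...42N075W.LIB534DOCHMs7yebhd8ao">forum</a> '
    '<a href="/plain">no code</a></p>'
)
parser = LinkCodeParser()
parser.feed(html)
print(parser.links)
```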

Sharing indexes would also be possible. Perhaps an entire server hosting thousands of websites could have a separate index file on the server, which might consist of a simple flat file of URL, metadata pairs.

Deep conceptual indexing of all the websites on a server could be accomplished by accessing a single small file.
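A sketch of reading such a flat file. The layout (one URL and one metadata string per line, separated by a tab, with "#" comment lines) is an assumption made for this example, as are the URLs.

```python
import io


def load_index(lines):
    """Read 'URL<TAB>metadata' lines into a dict, skipping blanks and comments."""
    index = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        url, code = line.split("\t", 1)
        index[url] = code
    return index


# An in-memory stand-in for the index file on the server.
sample = io.StringIO(
    "# url<TAB>metadata\n"
    "http://example.org/forum\tSDM50FRMPOL...42N075W.LIB534DOCHMs7yebhd8ao\n"
    "\n"
    "http://example.org/other\tSDM50FRMPOL...42N075W.LIB534DOCHMs7yebhd8ao\n"
)
index = load_index(sample)
print(len(index))  # 2
```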

My ambition or dream has always been to see something like this implemented on a free, open source, peer-to-peer type search engine, not that the metadata, added to web pages, could not also be picked up and utilized by existing search engines as well.

In that respect YaCy is the only ballgame in town.

I should make a semantics tutorial. Unfortunately this is very difficult because it would require some time to make a video. I once gave a talk about Semantics in YaCy at the Humboldt University in Berlin, but unfortunately it’s in German.


Have a look at the slides titled “Linked Open Data” (Slides Section 4) which contains the theoretical context about semantics in YaCy.

Thanks, I may be able to use a PDF translator.

I also found this document, which seems to have some in-depth information.

Yes, great paper. The authors of this paper approached me and asked if they could interview me to get enough information to write down the theoretical basics. I proofread their paper before publishing. It’s a good explanation of what’s going on on the p2p side, but it is not about the semantics aspect.

I noticed this section in the “expert crawl start” area of the admin.

Does this mean it is possible to crawl the web with YaCy looking for pages with content that matches a specific regular expression?

What I have in mind is, IF a webpage was marked with some specific metadata in a particular format, as I’ve previously described above, like say:

<!-- YaCycode=“SDM50FRMPOL…42N075W.LIB534DOCHMs7yebhd8ao” -->

or something like:

<xml>
    <YaCy_code>
            SDM50FRMPOL…42N075W.LIB534DOCHMs7yebhd8ao
     </YaCy_code>
</xml>

or as above:

<a href=“http://www.whatever.com” YaCycode=“SDM50FRMPOL…42N075W.LIB534DOCHMs7yebhd8ao”>

A regEx could be used to tell YaCy to crawl pages looking for such a pattern?
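For illustration only, here is a sketch of one pattern that finds the code in all three of the markup forms above when run over raw page source. Whether YaCy's crawl filters apply a regex to the raw source or to extracted text is not something this sketch assumes or shows. The allowed characters and the 20 to 60 character length range are guesses based on the example string.

```python
import re

# Characters and length range are guesses based on the example string.
CODE = r"[A-Za-z0-9.]{20,60}"
PATTERN = re.compile(
    rf"YaCycode\s*=\s*[\"']?(?P<attr>{CODE})"              # comment or attribute form
    rf"|<YaCy_code>\s*(?P<elem>{CODE})\s*</YaCy_code>"     # element form
)


def find_codes(source):
    """Return every metadata string found in the page source."""
    return [m.group("attr") or m.group("elem") for m in PATTERN.finditer(source)]


page = """
<!-- YaCycode="SDM50FRMPOL...42N075W.LIB534DOCHMs7yebhd8ao" -->
<xml>
    <YaCy_code>
        SDM50FRMPOL...42N075W.LIB534DOCHMs7yebhd8ao
    </YaCy_code>
</xml>
<a href="http://www.whatever.com" YaCycode="SDM50FRMPOL...42N075W.LIB534DOCHMs7yebhd8ao">link</a>
"""
print(len(find_codes(page)))  # 3
```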

Well, I used a fairly simple regex (\bXprc_([A-Z]{3,8})_codeX) to search a flat file database on my site, to see what would happen, and got this error:

Crawling of “http://peoplesresearchcenter.com/XPRC_File.prc” failed. Reason: scraper cannot load URL: no parser support:no parser found; url = http://peoplesresearchcenter.com/XPRC_File.prc/

Is this because YaCy does not like the .prc extension? maybe?

I’m guessing, due to the forward slash appended to the url in the error, that YaCy is looking for a stub? The server, however, does not recognize the URL with a “/” appended and so returns a 404

yes!

Well, YaCy looks for a matching extension and a matching MIME type. I looked up the URL and found that the server does not return any MIME type inside its HTTP header! That is a bit strange.

That looks like a bug

The idea there was to have a special type of metadata file on the server that search spiders could look for or that could be shared. Theoretically the file could contain metadata for many websites, thousands, tens of thousands, millions perhaps.

I had plans for updating the program to index websites in different categories in different files. Mostly to reduce file size and speed up retrieval time.

PRC is simply abbreviation for “People’s Research Center”. The .prc extension is something I made up to represent such files containing metadata in “prc_code” format. It was not being used at the time I was experimenting with this. Such a file would be a gold mine for a search spider and would require some special handling, therefore I thought some sort of special extension might be appropriate. Essentially however it is just a text or html file.

Apparently it is the .prc extension causing the server to not know what mime type to send. I checked other pages on the server with .html .xhtml .pl or whatever and they are ok. Since I just made up the .prc extension it’s not surprising that the server doesn’t recognize it, which may be a good thing in a way. What I wanted was an extension that was not otherwise being used, that could not be confused with anything else.

Just add the right MIME type and the parser will recognize them!
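The real fix belongs in the web server's configuration, which decides the Content-Type header it sends. As a small illustration of the extension-to-MIME-type mapping involved, the sketch below registers .prc in a private Python table. Choosing text/plain is an assumption, based on the remark that the file is essentially just a text or html file.

```python
import mimetypes

# Use a private table so the example does not touch the global registry.
types = mimetypes.MimeTypes()

# Assumption: .prc files are plain text, so text/plain is a sensible type.
types.add_type("text/plain", ".prc")

mime_type, encoding = types.guess_type("XPRC_File.prc")
print(mime_type)  # text/plain
```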