OAI-PMH Servers

Wow, this is fantastic.

One question, to start with.

There seems to be a discrepancy between the YaCy list of OAI-PMH Servers here:

http://localhost:8090/IndexImportOAIPMH_p.html

Which totals 7081 servers in all

and the list provided by openarchives.org

Only 4224 currently.

Also, I might report something of a bug, or annoyance.

While attempting to compare the two lists, I was trying to highlight a URL in the YaCy list and suddenly a download started. I was a little alarmed because I did not intend this, had no idea how big the file might be and could find no way to cancel the process. I ended up just letting it run. I have no idea what exactly I downloaded but the files do not seem to have been too big.

I’m not sure why the download started, as I was only trying to highlight the link. The “bug” that seems like it could be a bigger problem is this: once the download finished, I was transferred to another page showing the results of the download.

I hit the back button in the browser (Firefox) to get back to the list I was reviewing and the same download started all over again!

So now it appears that I have the same index downloaded twice, but with two different “token” identifiers.

It seems like this could be a bigger problem if I had downloaded numerous files and then forgot not to hit the back button.

Does this sort of thing result in duplicate files in the index?

Apparently Google does not support the OAI-PMH format.

So is the YaCy list more extensive than the openarchives.org list, or does it perhaps include outdated, unofficial or no-longer-existent resources?

It is also difficult to assess what a resource might contain.

I was perhaps lucky to have accidentally transferred a relatively small resource. What if I had accidentally or unknowingly clicked on, or started a download of, a 50-gigabyte file or something?

Anyway, the fact that YaCy supports this kind of open index/resource sharing blows my mind!

What about mod_oai?

OAI-PMH is not a file format; it’s a protocol for harvesting metadata from libraries. They provide chunks of bibliographic data in the form of (mostly!) Dublin Core metadata. YaCy translates the Dublin Core metadata into its Web Document format and indexes that. Because OAI-PMH records contain a URL as the identifier of the resource they describe (a document, which may be PDF or HTML), we can use a standard search result to point to that resource.
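To make the translation step concrete, here is a minimal sketch in Python (not YaCy's actual Java code) that turns the Dublin Core records of an OAI-PMH response into flat key-value documents. The XML namespaces are the standard OAI-PMH and Dublin Core ones; the sample response, the function name `records_to_documents` and the field names `url`, `title`, `author` and `description` are assumptions made up for illustration and do not reflect YaCy's real schema.

```python
import xml.etree.ElementTree as ET

# Standard namespaces used by OAI-PMH responses carrying Dublin Core metadata.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical, hand-written response with a single record.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An Example Thesis</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
          <dc:identifier>http://example.org/docs/thesis.pdf</dc:identifier>
          <dc:description>A short abstract.</dc:description>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
""".strip()


def records_to_documents(xml_text):
    """Map each Dublin Core record to a flat dict (illustrative field names)."""
    root = ET.fromstring(xml_text)
    docs = []
    for dc in root.iterfind(".//oai_dc:dc", NS):

        def first(tag):
            # Text of the first dc:<tag> element, or "" if it is absent.
            el = dc.find(f"dc:{tag}", NS)
            return el.text.strip() if el is not None and el.text else ""

        # A record may carry several identifiers; keep the first that is a URL.
        url = next(
            (
                e.text.strip()
                for e in dc.findall("dc:identifier", NS)
                if e.text and e.text.strip().startswith(("http://", "https://"))
            ),
            "",
        )
        docs.append(
            {
                "url": url,
                "title": first("title"),
                "author": first("creator"),
                "description": first("description"),
            }
        )
    return docs


if __name__ == "__main__":
    for doc in records_to_documents(SAMPLE_RESPONSE):
        print(doc)
```

Each resulting dict is the kind of small record that can be indexed like a web page, with the `url` field serving as the search result link.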

Unfortunately, it is always unknown how many records a single OAI-PMH resource provides. But even if there were millions (which is rare), it would not be a problem for YaCy because the transmitted data is small: it’s only metadata about the resource (title, author, link, sometimes a short description).
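For anyone curious how a harvester walks through a resource of unknown size: the OAI-PMH protocol delivers records in pages, and each page may end with a `resumptionToken` that is sent back to request the next one. The following is a rough Python sketch of that loop, not YaCy's implementation. The function names, the `fetch` callback and the `max_pages` safety cap are assumptions for illustration; only the `verb`, `metadataPrefix` and `resumptionToken` request parameters come from the protocol itself.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace prefix for elements defined by the OAI-PMH protocol.
OAI = "{http://www.openarchives.org/OAI/2.0/}"


def request_url(base_url, token=None):
    """Build a ListRecords request; a resumption token replaces metadataPrefix."""
    if token:
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    return base_url + "?" + urlencode(params)


def count_records(base_url, fetch, max_pages=1000):
    """Follow resumption tokens page by page and count the records seen.

    `fetch` is a hypothetical callback taking a URL and returning the
    response body as text; `max_pages` is an arbitrary safety cap.
    """
    total, token = 0, None
    for _ in range(max_pages):
        root = ET.fromstring(fetch(request_url(base_url, token)))
        total += len(root.findall(f".//{OAI}record"))
        token_el = root.find(f".//{OAI}resumptionToken")
        # A missing or empty token means the list is complete.
        token = token_el.text.strip() if token_el is not None and token_el.text else None
        if not token:
            break
    return total
```

For a real server, `fetch` could be something like `lambda url: urllib.request.urlopen(url).read().decode("utf-8")`. Passing it in as a parameter keeps the loop testable without network access.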

When I first implemented this, I also gave a presentation about it: https://yacy.net/material/YaCy_OAI_20100507.pdf (in German, unfortunately)

What is YaCy’s “Web Document format”?

The YaCy-internal representation of web pages, a key-value pair list.