Launch of

Hi @ll.

In 2016 I started developing a database structure to store statistics of yacy peers since the original project shut down. But then I had to stop for personal reasons.

From time to time I ran a yacy instance on my private server. Unfortunately, it sent my private internet connection to the graveyard, so I had to shut it down again (damn Fritzbox). :slight_smile:

About a month ago I started developing a new version of the database structure and made good progress with the scripts that fetch the data, import it into the database, create the statistics, and so on.
The new site I’ve put together has now been online for two weeks.

Here is what I have:

  • The network stats that can be collected from a single peer installation, showing current values like ppm, qph, links, words, etc.
  • The “seedlist” with all “public” peers
  • And… index browser pages…

The first two things should be clear: some values and the official names of the peers. I am collecting these pages every hour.
But what I was really interested in over the last two days was the index browser page. It shows all web pages that have been indexed by a peer. If you collect this page from every peer, you can build an overall index of every indexed website.

I could write a lot more about it but now I’d like to hear your thoughts and maybe ideas!?

Thanks & greetings


Thanks very much for your work and for sharing it with us.

I think the GUI would be more intuitive if you could hover the mouse over each data feature and see a more detailed explanation, with a short example to illustrate.

Thanks, interesting stats.
I have noticed that the number of inactive peers has been growing for a month, while the number of active ones is steady. Is that an artifact of the monitoring methodology, or is it real?


Nice to hear that you like the page, even if it still does not show everything I have in mind. :slight_smile:

You are right. Explanations are very important. I am working on it.

The numbers of active and inactive peers are calculated, of course.
I decided to call a peer active if it fulfills the following criteria over a 7-day window. The data is collected every hour:

  • The ppm and qph values must have changed
  • The links and words values must have changed, either positively or negatively

The ppm and qph criterion is quite hard for a peer to satisfy, because it has to show a changed value right at the moment my script runs.

If these values are zero the whole time, the peer appears to be inactive.
The problem is that a peer could be crawling pages between samples and happen to show the same values every time my script checks, so it would still be classified as inactive. Right?
That’s what my script cannot detect. Well, it could if it collected data every second, but I don’t have such resources. :slight_smile:
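To illustrate, here is a minimal sketch of how that activity check could look. This is my own illustration, not the actual script: the sample format (a list of hourly readings with `ppm`, `qph`, `links`, and `words` fields) is an assumption.

```python
# Hypothetical sketch of the "active peer" check described above.
# Assumes hourly samples over a 7-day window; each sample is a dict
# with the fields 'ppm', 'qph', 'links', and 'words'.

def is_active(samples):
    """Return True if the peer counts as active over the window.

    Criteria (as described in the post):
      - ppm or qph changed at least once between consecutive samples
      - links or words changed at least once (up or down)
    """
    if len(samples) < 2:
        return False
    pairs = list(zip(samples, samples[1:]))
    rates_changed = any(
        a['ppm'] != b['ppm'] or a['qph'] != b['qph'] for a, b in pairs
    )
    index_changed = any(
        a['links'] != b['links'] or a['words'] != b['words'] for a, b in pairs
    )
    return rates_changed and index_changed
```

Note that this has exactly the blind spot described above: a peer that does all its work between two hourly samples and ends up with identical values at each sampling moment would still be classified as inactive.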


Hey @akdk7, that’s fantastic! Nicely done!
Tweeted it.


Hi @Orbiter. Thanks a lot. :slight_smile:

For a few days now I have been collecting the “Index Browser” page from peers that allow connections from the internet. So far, my script has been able to connect to a total of 221 peers and has found 13,728 domains. It is quite interesting to see how many peers are reachable from the open internet.
The overview page shows the top 5 of domains/links, tlds/links and peers/domains. You can also browse through the complete list of domains here.

Interesting, this brings up one element that fits into a “data collection idea” I have had in mind for some time. Here is my posting about it: Self-hosted S3 Buckets for distributed Data Collection
Your collection of domains would be one piece that could be shared in that S3/minio place.

That is an interesting idea @Orbiter. I’m curious about the results.

I’ve released a big update. You can now expand each row to see more details, for example the peers that are crawling the address.
I also changed the way the data is loaded from the server. It should be a lot faster now, as it only preloads a small part of the complete list.
If you’d like to see more or other details, let me know.