Domain list for easier search bootstrapping

Hey guys,

I’ve been running Yacy on and off for quite some time. Every time I installed a
new node I had to search for domains to start a crawl with all over again.

After some consideration I’ve decided to upload the resulting list to GitHub: https://github.com/tb0hdan/domains

I hope this will be useful.


Hi tb0hdan,

that’s a huge list!

Great, thanks!

It’s far from high quality, but I’m doing my best to keep it properly
sorted and updated.

Update:

TLD kinds: 1507
Country TLDs: 244
Generic TLDs: 1263
Total domains in dataset: 220,011,651

wow! tweeted this…

Hi Bohdan, good work!

Did you really crawl all of that yourself? Quite a few of them time out, and there are tons of subdomains, especially in the porn sector.

I downloaded similar stuff from domainlists.io. To filter out the ones with actual content behind them, I use “subbrute” and “massdns” together with a Perl script:

# grep_ipaddrs.pl - reads dig-style massdns output on stdin and prints
# "domain<TAB>ip" for every A record found right after an ";; ANSWER" header.
# Note: the pattern only matches bare second-level names (label.tld).
while (<>) {
    next unless /^;; ANSWER/;   # skip ahead to an answer section
    $_ = <>;                    # the record is on the following line
    # e.g. "example.com. 300 IN A 93.184.216.34"
    if (/([a-z0-9\-]+)\.(\w+)\. (\d+) IN A (.+)/) {
        print "$1.$2\t$4\n";    # print the name and its IP address
    }
}

called by:

./bin/massdns -r resolvers.txt $1 | perl grep_ipaddrs.pl > ipaddrs.txt

where $1 is the domain list

tb0hdan was scanning for domain names, not web services. So the right approach would be to feed the list to nmap to find out which servers actually respond on port 80, and then feed only those to YaCy.
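Something along these lines should do it (an untested sketch; it assumes a one-domain-per-line list in domains.txt and relies on nmap’s grepable output, where the hostname appears in parentheses on each “Host:” line):

# keep only hosts that answer on port 80
nmap -p 80 --open -iL domains.txt -oG - | awk -F'[()]' '/80\/open/ {print $2}' > responding.txt

responding.txt then contains only the names that answered on port 80 and can be turned into URLs for YaCy.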

I just filtered for all names starting with “www.” and crawl them. Getting pretty good results.
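For anyone who wants to do the same, the filter plus URL conversion fits in one line (a sketch, assuming a one-domain-per-line list in domains.txt):

# keep only www.* names and turn them into crawlable URLs
grep '^www\.' domains.txt | awk '{print "http://"$1}' > www-only.urls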

@zooom Yes, I do. The crawler code itself is open source - https://github.com/tb0hdan/domains-crawler - only the file reader and the TLDs used to configure it are not. There are bugs (as always) but I’m working on getting them fixed.

I’ve used additional sources as crawler input to speed up dataset growth; all of them are listed in the dataset readme.

Regarding subdomains - there are some limits in place, but I still wanted to keep them so that others can do doorway detection. I’m working on an autovacuum process that
will filter out invalid (i.e. expired) domain names.

Regarding domainlists.io - I strongly believe that a domain list should be publicly available and not sold.

@TheHolm Yes, your approach with nmap seems to be the best so far.

What is the recommended way of using the list?

Hi there, @vasyugan

  1. Pick a TLD you like; for this example it will be https://dataset.domainsproject.org/afghanistan/domain2multi-af.txt
  2. Use this command to convert the domain list to a URL list (a loop version for multiple files is sketched after these steps): cat domain2multi-af.txt | awk '{print "http://"$1}' > /tmp/domain2multi-af.txt.urls
  3. Go to http://127.0.0.1:8090/CrawlStartExpert.html
  4. Pick “From File (enter a path within your local file system)”
  5. Point it to the URL list you’ve generated - in this case /tmp/domain2multi-af.txt.urls
  6. Hit the “Start New Crawl Job” button

Yacy will automatically skip hosts that are not available for crawling.
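If you want to prepare several TLD files at once, a small loop along these lines should work (a sketch; it assumes the domain2multi-*.txt files have already been downloaded into the current directory):

# convert every downloaded domain list into a URL list under /tmp
for f in domain2multi-*.txt; do
    awk '{print "http://"$1}' "$f" > "/tmp/$f.urls"
done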

You can browse dataset here: https://dataset.domainsproject.org/


Here is a funny thing: I once applied to the Deutsche Nationalbibliothek to build a German domain search engine (that was a government request). They did not accept my proposal, but as a reference for a good harvesting start point I submitted a 1095-page, 477,119-domain start list as a PDF (a PDF was requested).

You can download that document here:

… maybe I did not get the job because the list started with “0-24-sex.de, 0-strom.de, 0.bild.poppen.de”??
(do not click on the links)


Thanks, @Orbiter

Here are the relevant commands to extract the domain list from that PDF:

pdftotext Top-Level-Domain-Harvesting-DE-Seedlist.pdf
cat Top-Level-Domain-Harvesting-DE-Seedlist.txt | grep '\.de' | sed 's/, /\n/g' | egrep '^[a-z0-9](.+)\.de$' > dotde.txt

I’m going to verify them and import into my dataset.
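To verify them, a simple resolver loop is probably enough for a one-off run (a rough sketch; host exits non-zero when a name does not resolve, and for a list this size the massdns approach mentioned above will be much faster):

# keep only names that still resolve to an A record
while read -r d; do
    host -t A "$d" > /dev/null 2>&1 && echo "$d"
done < dotde.txt > dotde-live.txt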


Hi Orbiter,
Thanks a lot for the list. I’m doing something similar in Switzerland. I made a proposal to implement YaCy which is still pending.

Would you be available for a project to implement YaCy?
Cheers M

Update:

TLD kinds: 1522
Country TLDs: 245
Generic TLDs: 1277
Total domains in dataset: 1,789,946,688

yes, maybe

wow, that’s huge!

@tb0hdan
bro, the last list update was 7 months ago - when will you upload a new one?

Sir, I got your dataset, but there are lots of hostnames. I don’t want hostnames, I only want domains and subdomains - would you help me extract them?

Cool, big list.
