Domain list for easier search bootstrapping

Hey guys,

I’ve been running YaCy on and off for quite some time. Every time I installed a
new node I had to search for domains to start a crawl with all over again.

After some consideration I’ve decided to upload the resulting list to GitHub: https://github.com/tb0hdan/domains

I hope this will be useful.


Hi tb0hdan,

that’s a huge list!

Great, thanks!

It’s far from high quality, but I’m doing my best to keep it properly
sorted and updated.

Update:

TLD kinds: 1,507
Country TLDs: 244
Generic TLDs: 1,263
Total domains in dataset: 220,011,651

wow! tweeted this…

Hi Bohdan, good work!

Did you really crawl the stuff yourself? Quite a few of them time out, and there are tons of subdomains, especially in the porn sector.

I downloaded similar lists from domainlists.io. To filter out the ones with actual content behind them, I use “subbrute” and “massdns” together with a Perl script:

while (<>) {
    # massdns prints an ";; ANSWER SECTION" header before each set of records
    if (/^;; ANSWER/) {
        # read the first record line after the header
        $_ = <>;
        # keep "name.tld" plus the IPv4 address of an A record, drop the TTL
        if (/([a-z0-9\-]+)\.(\w+)\. (\d+) IN A (.+)/) {
            print "$1.$2\t$4\n";
        }
    }
}

called by:

./bin/massdns -r resolvers.txt $1 | perl grep_ipaddrs.pl > ipaddrs.txt

where $1 is the domain list
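
For reference, the Perl script assumes massdns’s dig-style text output, where each resolved name is preceded by an answer header, roughly like this (the name and address here are just placeholders):

;; ANSWER SECTION:
example.com. 300 IN A 93.184.216.34

Only the first A record after each header is kept, and the TTL is dropped.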

tb0hdan was scanning for domain names, not web services. So the right approach would be to feed the list to nmap to find out which servers are responding on port 80, and then feed only those to YaCy.
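
A rough sketch of that (untested; hosts.txt and crawlable.txt are just placeholder names):

# scan a slice of the domain list for an open port 80 and keep the responding names
nmap -p 80 --open -iL hosts.txt -oG - \
  | awk '/80\/open/ {gsub(/[()]/, "", $3); print $3}' > crawlable.txt

The greppable output (-oG) prints one “Host: <ip> (<name>)” line per responding host, so the awk part just pulls the name back out.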

I just filtered for all names starting with “www.” and crawled them. Getting pretty good results.
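
E.g. something like this over one of the per-TLD files (file names are just examples):

grep '^www\.' domain2multi-af.txt | awk '{print "http://"$1}' > www-urls.txt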

@zooom Yes, I do. The crawler code itself is open source - https://github.com/tb0hdan/domains-crawler - just the file reader and the TLDs used to configure it are not. There are bugs (as always), but I’m working on getting them fixed.

I’ve used additional sources as crawler input to speed up dataset growth; all of them are listed in the dataset README.

Regarding subdomains - there are some limits in place, but I still wanted to keep those as well to allow others to do doorway detection. I’m working on an autovacuum process that
will filter out invalid (i.e. expired) domain names.
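
Just to illustrate the idea (this is not the actual autovacuum code; file names are placeholders, and for a list this size you would batch it through massdns instead):

# keep only names that still resolve to NS records; slow, but easy to follow
while read -r name; do
    host -t NS "$name" > /dev/null 2>&1 && echo "$name"
done < domains.txt > still-valid.txt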

Regarding domainlists.io - I strongly believe that domain lists should be publicly available, not sold.

@TheHolm Yes, your approach with nmap seems to be the best so far.

What is the recommended way of using the list?

Hi there, @vasyugan

  1. Pick a TLD you like; for this example it will be https://dataset.domainsproject.org/afghanistan/domain2multi-af.txt
  2. Use this command to convert the domain list to a URL list: cat domain2multi-af.txt | awk '{print "http://"$1}' > /tmp/domain2multi-af.txt.urls (a loop for converting several lists at once is shown after these steps)
  3. Go to http://127.0.0.1:8090/CrawlStartExpert.html
  4. Pick “From File (enter a path within your local file system)”
  5. Point it to the URL list you’ve generated - in this case /tmp/domain2multi-af.txt.urls
  6. Hit the Start New Crawl Job button

Yacy will automatically skip hosts that are not available for crawling.
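
If you want to prepare several lists at once, a small loop over the downloaded files should do (assuming they all follow the domain2multi-*.txt naming from step 1):

for f in domain2multi-*.txt; do
    awk '{print "http://"$1}' "$f" > "/tmp/$f.urls"
done

Each resulting .urls file can then be fed to the crawl start form as in steps 4-6.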

You can browse the dataset here: https://dataset.domainsproject.org/