How to avoid collecting all links to YouTube, Facebook, Wikipedia and the like?

Hello, friends. Please tell me how to avoid collecting all links to YouTube, Facebook, Wikipedia and the like. I need the robot to take only the YouTube page that the page being scanned links to. For example, if a page on my-name.com links to a video on YouTube, I need the robot to take only that one page with that video, and not scan all of YouTube and then the entire Internet along with Amazon…

Friends, please help me with this. :slight_smile:

I think that if you set the crawl depth to 0 in the admin area, YaCy will index only the page at the URL you entered and will not crawl any additional pages. Set to depth = 1, YaCy will also index the pages linked from that first page, but go no further.
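For completeness, here is a minimal sketch of starting such a crawl over HTTP instead of clicking through the admin UI. Assumptions on my part: the peer runs at localhost:8090, no admin password is set, and the Crawler_p.html endpoint accepts the same field names as the crawl-start form (crawlingstart, crawlingMode, crawlingURL, crawlingDepth); verify all of that against your own instance.

```python
# Minimal sketch: start a depth-0 crawl on a YaCy peer via its HTTP API.
# Assumed: peer at localhost:8090, no admin auth, and that Crawler_p.html
# takes the same parameter names as the crawl-start form -- check your peer.
import urllib.parse
import urllib.request

YACY = "http://localhost:8090"  # assumed address of the local peer

def start_crawl(url: str, depth: int = 0) -> None:
    """Ask the peer to index `url`, following links only `depth` hops out."""
    params = urllib.parse.urlencode({
        "crawlingstart": "1",          # assumed field names, mirroring the form
        "crawlingMode": "url",
        "crawlingURL": url,
        "crawlingDepth": str(depth),   # 0 = only this page; 1 = plus its links
    })
    with urllib.request.urlopen(f"{YACY}/Crawler_p.html?{params}") as resp:
        print(resp.status, resp.reason)

# Index just the one page from the question, nothing beyond it:
start_crawl("https://my-name.com/", depth=0)
```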

Personally, I almost never set the depth to more than 1, or at most 2, as I only want to index a select set of known URLs in specific topic areas and keep resource use to a minimum.

This should be doable using regex-based blacklist rules, under the Filter & Blacklists settings
(or a regex rule set in the advanced crawler start, perhaps?)

Unfortunately, regex is all but arcane witchcraft to me :rofl:, so I’m unable to provide the specific regex rules for your need.
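For what it’s worth, here is a rough sketch of the kind of pattern one might start from, with the domains taken from the question. The pattern and the test URLs are illustrative only, so do try it against your real links (the testers mentioned below are good for that):

```python
# Rough sketch of a regex matching links into the big "collect-everything"
# sites, as a starting point for a blacklist rule. Pattern and test URLs
# are illustrative only -- verify with the regex testers mentioned below.
import re

BLOCK = re.compile(
    r"^https?://([^/]+\.)?"                          # scheme + any subdomain
    r"(youtube\.com|facebook\.com|wikipedia\.org)"   # domains from the question
    r"(/.*)?$",                                      # any path on those hosts
    re.IGNORECASE,
)

for url in [
    "https://www.youtube.com/watch?v=example",  # hypothetical video link
    "https://my-name.com/about",                # hypothetical own page
]:
    print(url, "->", "blocked" if BLOCK.match(url) else "allowed")
```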

Personally, I’ve simply blacklisted e.g. Facebook at the domain level.

Maybe these could be of use to you, if only as tips:

https://www.regextester.com/ (this one’s very nice for editing and trying out regex rules; I’ve often used it for grab-site WARC archiving efforts)

It’s also probably a good idea to double-check the rules using YaCy’s own regex tester, found at /RegexTest.html, just to be sure.


/CrawlStartExpert.html

This will prevent the crawler from going outside the given list of domain(s) of the desired start pages (I think). Probably the easiest way, without applying more advanced filtering.
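And if you script crawl starts, the same restriction can presumably be passed along in the request. A small sketch, where the range parameter name and value are my assumption based on the CrawlStartExpert form, so double-check it on your own peer:

```python
# Sketch: a depth-1 crawl confined to the start page's domain, via the same
# Crawler_p.html endpoint as in the sketch above. The "range" field is an
# assumed name taken from the CrawlStartExpert form -- verify before use.
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "crawlingstart": "1",
    "crawlingMode": "url",
    "crawlingURL": "https://my-name.com/",  # example start page from the question
    "crawlingDepth": "1",   # index the page plus the pages it links to...
    "range": "domain",      # ...but stay inside my-name.com (assumed field)
})
with urllib.request.urlopen(f"http://localhost:8090/Crawler_p.html?{params}") as resp:
    print(resp.status, resp.reason)
```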