Ignoring robots.txt

What would be the best way to ignore robots.txt of a website?

I unchecked “Obey html-robots-noindex” and “Obey html-robots-nofollow” in the Advanced Crawler settings.
But the website is still not crawlable with a robots.txt like this:

```
User-agent: *
Disallow: /
```
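That file blocks everything: “Disallow: /” covers every path, and “User-agent: *” applies it to every crawler. A minimal sketch with Python’s built-in urllib.robotparser shows the kind of check a compliant crawler performs before each fetch (the agent name and URLs below are just placeholders):

```python
from urllib.robotparser import RobotFileParser

# Parse the robots.txt shown above.
rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /"])

# A compliant crawler asks before every fetch; this file answers "no" to everything.
print(rp.can_fetch("YaCy", "https://example.com/"))          # False
print(rp.can_fetch("YaCy", "https://example.com/any/page"))  # False
```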


Also, pages from sites with a Disallow-all robots.txt are not included at all, not even the front page.
This is a big issue right now; being able to configure the web crawler and to change the user-agent name (with more choices) should be added.
I’m just gonna have to make my own search engine from scratch now.
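For what it’s worth, outside of YaCy this part is trivial, since robots.txt is purely advisory and nothing enforces it on the client side. A minimal sketch, assuming Python with the third-party requests library, that fetches a page directly with a custom user-agent and no robots.txt check at all (the URL and agent string are placeholders):

```python
import requests

# robots.txt is advisory: a client that never reads it is never bound by it.
# The user-agent string is whatever you choose to send.
headers = {"User-Agent": "MyOwnCrawler/0.1 (+https://example.com/bot-info)"}

resp = requests.get("https://example.com/", headers=headers, timeout=10)
resp.raise_for_status()
print(resp.status_code, len(resp.text))
```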

I think it’s more complicated than just ignoring it, as my homemade search engine has the same issue even when I don’t specify anything related to robots.txt.
Someone suggested using a MITM proxy, or maybe a headless browser on top of it; this is a more complicated issue than I first thought.
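To illustrate the headless-browser route: a sketch assuming Playwright is installed (pip install playwright, then playwright install chromium). Because it drives a real browser engine, it sends genuine browser headers and executes JavaScript, which gets past many blocks aimed at plain HTTP clients rather than at robots.txt compliance:

```python
from playwright.sync_api import sync_playwright

# Drive a real (headless) Chromium: genuine browser headers, JS execution.
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/", wait_until="domcontentloaded")
    html = page.content()  # the fully rendered HTML
    browser.close()

print(len(html))
```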

I’m not a big fan of Facebook, but as a courtesy to users of my YaCy local Kiosk, I thought it would be a good idea to at least spider the Facebook login page (and those of other portals).

However, this appears to be a problem: Facebook’s robots.txt disallows crawling.

I’m not sure why any portal would not want its main page to be accessible, especially since there is a Facebook link on nearly every page on the internet, along with Twitter, Instagram, and other major portals.

So search results, when looking for “Facebook”, return everything and anything with a Facebook link somewhere on the page, but not Facebook itself.

Probably just as well.

P.S. I forgot, I did have a question for anyone who knows:

If such a page or portal has a robots.txt preventing it from being crawled, is there some way to enter the URL into the index manually and, also, designate it to be the top result returned?
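I can’t speak to YaCy’s admin pages in enough detail to give exact steps for the manual-entry part, but the “top result” half is easy to sketch generically. A toy ranking pass (all names here are hypothetical, not YaCy’s API) that sorts pinned URLs ahead of everything else, regardless of relevance score:

```python
# Hypothetical sketch: pin chosen URLs to the top of a result list.
PINNED = {"https://www.facebook.com/"}

def rank(results):
    """results: list of (url, score) pairs. Pinned URLs come first;
    the rest are ordered by descending score."""
    return sorted(results, key=lambda r: (r[0] not in PINNED, -r[1]))

hits = [("https://example.com/a", 0.9),
        ("https://www.facebook.com/", 0.1),
        ("https://example.com/b", 0.5)]
print(rank(hits))
# [('https://www.facebook.com/', 0.1),
#  ('https://example.com/a', 0.9),
#  ('https://example.com/b', 0.5)]
```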