How to re-crawl documents with an error status

Yesterday we had an issue that prevented the GSA crawler from reaching our site. As a result, many URLs were indexed as the login page: the search results show many entries titled "Please Login" (the name of the login page). When I check Index Diagnostics, the crawl status for these URLs is "Retrying URL: Connection reset by peer during fetch.".

The login issue has now been fixed, and once a page is crawled again its crawl status becomes successful, the page content is picked up, and the search results show the correct title. But since I have no control over what gets crawled when, there are still pages that have not yet been re-crawled and remain broken.

This isn't a single URL that I could force to re-crawl by hand. Hence my question: is there a way to force re-crawling based on the crawl status ("Retrying URL: Connection reset by peer during fetch."), or, more generally, based on the status type (Errors/Successful/Excluded)?



2 answers


  • Export all of the error URLs as a CSV file via "Index > Diagnostics > Index Diagnostics".

  • Open the CSV, filter on the crawl status column, and collect the URLs you are looking for (a scripted version of this filter follows the list).

  • Copy those URLs, go to "Content Sources > Web Crawl > Freshness Tuning > Recrawl These URL Patterns", paste them in, and click Recrawl.
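
If the export is large, filtering by hand is tedious. Here is a minimal sketch of the filtering step in Python; the file name and the "URL" / "Crawl Status" column headers are assumptions, so adjust them to match your actual Index Diagnostics export:

```python
import csv

# Hypothetical file name and column headers; adjust them to match
# the actual Index Diagnostics CSV export.
EXPORT_FILE = "index_diagnostics_export.csv"
ERROR_STATUS = "Retrying URL: Connection reset by peer during fetch."

with open(EXPORT_FILE, newline="", encoding="utf-8") as f:
    rows = csv.DictReader(f)
    urls = [row["URL"] for row in rows
            if row.get("Crawl Status", "").strip() == ERROR_STATUS]

# One URL per line, ready to paste into
# Freshness Tuning > "Recrawl These URL Patterns".
print("\n".join(urls))
```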

That's it, all done!

PS: If there are many error URLs (more than 10,000, if I'm not mistaken), you won't be able to get all of them into one CSV file. In that case, do it in batches (see the sketch below).
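
If you end up with more URLs than you want to paste in one go, a simple way to split the filtered list into batch files, continuing the `urls` list from the sketch above (the 10,000 chunk size just mirrors the limit mentioned here):

```python
# Continues the previous sketch: write the filtered URLs into
# batch files of at most 10,000 entries each, one URL per line.
BATCH_SIZE = 10000

for n, start in enumerate(range(0, len(urls), BATCH_SIZE), start=1):
    batch = urls[start:start + BATCH_SIZE]
    with open(f"recrawl_batch_{n}.txt", "w", encoding="utf-8") as out:
        out.write("\n".join(batch))
```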



Regards,

Mohan



You can use this tool to submit a batch of URLs for re-crawling: https://github.com/google/gsa-admin-toolkit/blob/master/interactive-feed-client.html



I have tested batches of 80K URLs at a time.
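
The interactive feed client is essentially an HTML form around the GSA feeds protocol, so you can also script the push. Below is a minimal sketch, assuming a web feed with crawl-immediately set on each record; the appliance hostname and URL list are placeholders, and the datasource/feedtype header values may need adjusting for your setup, so check the feeds protocol documentation:

```python
import requests

# Placeholders: replace with your appliance hostname and the URLs
# collected from Index Diagnostics.
GSA_HOST = "gsa.example.com"
urls = ["http://www.example.com/page1", "http://www.example.com/page2"]

# Build a feed with crawl-immediately="true" so the URLs jump the
# crawl queue. URLs containing XML special characters would need
# escaping first.
records = "\n".join(
    f'    <record url="{u}" mimetype="text/html" crawl-immediately="true"/>'
    for u in urls
)
feed = f"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "gsafeed.dtd">
<gsafeed>
  <header>
    <datasource>web</datasource>
    <feedtype>incremental</feedtype>
  </header>
  <group>
{records}
  </group>
</gsafeed>"""

# Feeds are posted as a multipart form to port 19900 on the appliance.
resp = requests.post(
    f"http://{GSA_HOST}:19900/xmlfeed",
    files={
        "datasource": (None, "web"),
        "feedtype": (None, "incremental"),
        "data": ("feed.xml", feed, "text/xml"),
    },
)
print(resp.status_code, resp.text)
```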
