Google is ignoring my robots.txt file

Here is the content of my robots.txt file:

User-agent: *
Disallow: /images/
Disallow: /upload/
Disallow: /admin/


As you can see, I explicitly forbade all robots to index the folders images, upload and admin. The problem is that one of my clients sent a request to remove content from the images folder, because a .pdf document from that folder appeared in Google's search results. Can anyone explain to me what I am doing wrong here and why Google indexed my folders?

Thanks!



1 answer


Quoting the Google Webmaster documentation:

If I block Google from crawling a page with a robots.txt Disallow directive, will it disappear from the search results?

Blocking Google from crawling a page will likely decrease the page's ranking or cause it to drop out of the results altogether over time. It can also reduce the amount of detail shown to users in the text below the search result, because without the content of the page the search engine has much less information to work with.


However, a robots.txt Disallow does not guarantee that a page will not appear in the results. Google can still decide, based on external information such as inbound links, that the page is relevant. If you want to explicitly block a page from being indexed, you must instead use the noindex robots meta tag or the X-Robots-Tag HTTP header. In that case you must not disallow the page in robots.txt, because the page has to be crawled for the tag to be seen and obeyed.
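For example, here is a sketch of how your robots.txt could look once the noindex tags are in place, so that Googlebot is allowed to crawl the PDFs and see them (keeping /admin/ disallowed is an assumption, on the premise that it is protected by authentication anyway):

User-agent: *
Disallow: /admin/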

Set the X-Robots-Tag header with noindex for all files in those folders. You can set this header per folder in your web server configuration. See https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag?hl=de
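A minimal sketch for Apache, assuming the folders live under /var/www/html and mod_headers is enabled (both are assumptions, not from the original answer):

<Directory "/var/www/html/images">
    # Every response served from this folder carries the noindex header
    Header set X-Robots-Tag "noindex, nofollow"
</Directory>
<Directory "/var/www/html/upload">
    Header set X-Robots-Tag "noindex, nofollow"
</Directory>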



  • Set the header from the Apache config for .pdf files:

    <Files ~ "\.pdf$">
        Header set X-Robots-Tag "noindex, nofollow"
    </Files>

  • Disable the directory index / listing for these folders (see the sketch after this list).

  • Add an empty index.html with the "noindex" meta tag:

    <meta name="robots" content="noindex, nofollow" />
    <meta name="googlebot" content="noindex" />

  • Remove the already indexed pages manually using the URL removal tool in Google Webmaster Tools.
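For the directory-listing bullet above, a sketch under the same assumed Apache layout:

<Directory "/var/www/html/images">
    # Do not generate an automatic file listing for this folder
    Options -Indexes
</Directory>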


Question from the comments: how do I block access to all files in a folder?

# 1) Deny access to the folder completely
<Directory /var/www/denied_directory>
    Order allow,deny
    Deny from all
</Directory>

# 2) Inside the folder, place a .htaccess denying access to everything except index.html
Order allow,deny
Deny from all
<FilesMatch "index\.html">
    Allow from all
</FilesMatch>

# 3) Allow the directory, but deny requests whose User-Agent matches a bot
BrowserMatch "Googlebot" go_away_badbot
BrowserMatch ^BadRobot/0.9 go_away_badbot

<Directory /deny_access_for_badbot>
    Order allow,deny
    Allow from all
    Deny from env=go_away_badbot
</Directory>

# 4) Or redirect bots that request the front page to /main/, sending HTTP status 301
BrowserMatch Googlebot badbot=1
RewriteEngine on
RewriteCond %{ENV:badbot} =1
RewriteRule ^/$ /main/  [R=301,L]

(These snippets use Apache 2.2-style access control and require mod_setenvif and mod_rewrite; the Header directives above additionally need mod_headers.)
