Heritrix: Ignore robots.txt file for one site only
I am using Heritrix 3.2.0.
I want to grab everything from one site, including pages that are usually protected by robots.txt.
However, I don't want to ignore the robots.txt files of other sites. (We don't want Facebook or Google to be mad at us, you know.)
I tried to set up a sheet overlay very similar to the one in the 3.0/3.1 manual (see the end of this post).
The job builds without complaint, but the overlay doesn't seem to fire: the site's robots.txt file is still respected.
So what am I doing wrong?
Stig Hemmer
<beans>
  ... all the normal default crawler-beans.cxml stuff ...

  <bean id="sheetOverLayManager" autowire="byType"
        class="org.archive.crawler.spring.SheetOverlaysManager">
  </bean>

  <bean class='org.archive.crawler.spring.SurtPrefixesSheetAssociation'>
    <property name='surtPrefixes'>
      <list>
        <value>
          http://(no,kommune,trondheim,)/
          https://(no,kommune,trondheim,)/
        </value>
      </list>
    </property>
    <property name='targetSheetNames'>
      <list>
        <value>noRobots</value>
      </list>
    </property>
  </bean>

  <bean id='noRobots' class='org.archive.spring.Sheet'>
    <property name='map'>
      <map>
        <entry key='metadata.robotsPolicyName' value='ignore'/>
      </map>
    </property>
  </bean>
</beans>
Original poster here. As always, the problem existed between the keyboard and the chair.
It turns out I didn't understand how SURTs work.
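For the record: a SURT (Sort-friendly URI Reordering Transform) rewrites a URL so the host labels appear in reverse order, which is what makes per-domain prefix matching work. A small illustration (the www host here is just a made-up example, not from my config):

  http://www.trondheim.kommune.no/page
    has the SURT form
  http://(no,kommune,trondheim,www,)/page

So the closed prefix http://(no,kommune,trondheim,)/ from my first attempt matches only URLs on the host trondheim.kommune.no itself, while the open prefix http://(no,kommune,trondheim, also matches every subdomain, such as www.trondheim.kommune.no.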
New and improved configuration:
<property name='surtPrefixes'>
  <list>
    <value>http://(no,kommune,trondheim,</value>
    <value>https://(no,kommune,trondheim,</value>
  </list>
</property>
The important change was leaving the end of each SURT open, because I wanted the rules to cover the site's subdomains (children) as well.
I also split the two SURTs into separate <value> elements. I'm not sure whether that is strictly necessary, but it is certainly more readable.
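Put together, the overlay section of my crawler-beans.cxml now looks like this (a sketch, using the same bean ids and sheet name as in the question):

<bean id="sheetOverLayManager" autowire="byType"
      class="org.archive.crawler.spring.SheetOverlaysManager">
</bean>

<bean class='org.archive.crawler.spring.SurtPrefixesSheetAssociation'>
  <!-- Open-ended SURT prefixes: match the domain and all subdomains -->
  <property name='surtPrefixes'>
    <list>
      <value>http://(no,kommune,trondheim,</value>
      <value>https://(no,kommune,trondheim,</value>
    </list>
  </property>
  <property name='targetSheetNames'>
    <list>
      <value>noRobots</value>
    </list>
  </property>
</bean>

<bean id='noRobots' class='org.archive.spring.Sheet'>
  <property name='map'>
    <map>
      <!-- Overlay: ignore robots.txt for the associated SURTs only -->
      <entry key='metadata.robotsPolicyName' value='ignore'/>
    </map>
  </property>
</bean>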
I still have problems, but at least I have new problems!