Heritrix: Ignore robots.txt file for one site only
I am using Heritrix 3.2.0.
I want to grab everything from one site, including pages that are usually protected by robots.txt.
However, I don't want to ignore the robots.txt files of other sites. (We don't want Facebook or Google to be mad at us, you know.)
I tried to set up a sheet overlay very similar to the one in the 3.0/3.1 manual (see the end of this post).
The job builds without complaint, but the overlay doesn't seem to fire: the site's robots.txt file is still respected.
So what am I doing wrong?
Stig Hemmer
<beans>
  ... all the normal default crawler-beans.cxml stuff ...

  <bean id="sheetOverLayManager" autowire="byType"
        class="org.archive.crawler.spring.SheetOverlaysManager">
  </bean>

  <bean class='org.archive.crawler.spring.SurtPrefixesSheetAssociation'>
    <property name='surtPrefixes'>
      <list>
        <value>
          http://(no,kommune,trondheim,)/
          https://(no,kommune,trondheim,)/
        </value>
      </list>
    </property>
    <property name='targetSheetNames'>
      <list>
        <value>noRobots</value>
      </list>
    </property>
  </bean>

  <bean id='noRobots' class='org.archive.spring.Sheet'>
    <property name='map'>
      <map>
        <entry key='metadata.robotsPolicyName' value='ignore'/>
      </map>
    </property>
  </bean>
</beans>
Original poster here. As always, the problem existed between the keyboard and the chair.
It turns out I didn't understand how SURTs work.
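For the record: a SURT (Sort-friendly URI Reordering Transform) rewrites a URL so the host labels appear in reverse order, which is what makes per-domain prefix matching work. A small illustration (the www host here is just a made-up example, not from my config):

  http://www.trondheim.kommune.no/page
    has the SURT form
  http://(no,kommune,trondheim,www,)/page

So the closed prefix http://(no,kommune,trondheim,)/ from my first attempt matches only URLs on the host trondheim.kommune.no itself, while the open prefix http://(no,kommune,trondheim, also matches every subdomain, such as www.trondheim.kommune.no.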
New and improved configuration:
<property name='surtPrefixes'>
  <list>
    <value>http://(no,kommune,trondheim,</value>
    <value>https://(no,kommune,trondheim,</value>
  </list>
</property>
The important change was leaving the end of each SURT open, because I wanted the rules to cover the site's subdomains (children) as well.
I also split the two SURTs into separate <value> elements. I'm not sure whether that is strictly necessary, but it is certainly more readable.
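Put together, the overlay section of my crawler-beans.cxml now looks like this (a sketch, using the same bean ids and sheet name as in the question):

<bean id="sheetOverLayManager" autowire="byType"
      class="org.archive.crawler.spring.SheetOverlaysManager">
</bean>

<bean class='org.archive.crawler.spring.SurtPrefixesSheetAssociation'>
  <!-- Open-ended SURT prefixes: match the domain and all subdomains -->
  <property name='surtPrefixes'>
    <list>
      <value>http://(no,kommune,trondheim,</value>
      <value>https://(no,kommune,trondheim,</value>
    </list>
  </property>
  <property name='targetSheetNames'>
    <list>
      <value>noRobots</value>
    </list>
  </property>
</bean>

<bean id='noRobots' class='org.archive.spring.Sheet'>
  <property name='map'>
    <map>
      <!-- Overlay: ignore robots.txt for the associated SURTs only -->
      <entry key='metadata.robotsPolicyName' value='ignore'/>
    </map>
  </property>
</bean>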
I still have problems, but at least I have new problems!