Using Scrapy to crawl an anonymous FTP server

How can I get Scrapy to crawl an FTP server that does not require a username and password? I tried adding the URL to start_urls, but Scrapy requires a username and password to access FTP. I've overridden start_requests() to provide the defaults (the username "anonymous" with an empty password works when I connect with the Linux ftp command), but now I'm getting 550 responses from the server.

How can I crawl FTP servers with Scrapy - ideally in a way that works with all FTP servers that don't require a username or password?

1 answer


This is not documented, but Scrapy has the functionality built in. There is an FTPDownloadHandler which handles FTP downloads using twisted's FTPClient. You don't need to call it directly; it is used automatically whenever an ftp:// URL is requested.
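For reference, the scheme-to-handler mapping lives in Scrapy's default settings. A simplified excerpt (the exact module paths vary between Scrapy versions):

# Simplified excerpt from scrapy/settings/default_settings.py;
# contents differ between Scrapy versions.
DOWNLOAD_HANDLERS_BASE = {
    'file': 'scrapy.core.downloader.handlers.file.FileDownloadHandler',
    'http': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler',
    'https': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler',
    'ftp': 'scrapy.core.downloader.handlers.ftp.FTPDownloadHandler',
}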

In your spider, keep using the regular scrapy.http.Request class, but provide the FTP credentials in the meta dictionary under the ftp_user and ftp_password keys:

yield Request(url, meta={'ftp_user': 'user', 'ftp_password': 'password'})

Aside from ftp_user and ftp_password, there are two additional keys that you can provide:

  • ftp_passive (enabled by default) sets passive FTP connection mode
  • ftp_local_filename:
    • If not specified, the file data will go into response.body, as with a normal Scrapy Response, which implies the entire file will be held in memory.
    • If specified, the file data will be saved to a local file with the given name. This helps when downloading very large files, to avoid memory problems. In addition, for convenience, the local filename is also put in the response body.

The latter is useful when you need to download a file and save it locally without processing the response in the spider callback.
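Here is a minimal sketch combining these meta keys (the spider name, URL, and local path are hypothetical placeholders):

import scrapy
from scrapy.http import Request

class BigFileSpider(scrapy.Spider):
    name = "bigfile"  # hypothetical spider name

    def start_requests(self):
        yield Request(
            'ftp://ftp.example.com/pub/archive.tar.gz',  # hypothetical URL
            meta={
                'ftp_user': 'anonymous',
                'ftp_password': '',
                # Save straight to disk instead of keeping the file in memory:
                'ftp_local_filename': '/tmp/archive.tar.gz',
                # 'ftp_passive': False,  # uncomment to disable passive mode
            })

    def parse(self, response):
        # With ftp_local_filename set, response.body holds the local
        # filename rather than the file contents.
        self.logger.info('Saved to %s', response.body)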

As for anonymous use, which credentials to provide depends on the FTP server itself. The user is "anonymous", and the password is usually your email address, any password, or blank.

FYI, a quote from the spec (RFC 1635):



Anonymous FTP is a means by which archive sites allow general access to their archives of information. These sites create a special account called "anonymous". User "anonymous" has limited access rights to the archive host, as well as some operating restrictions. In fact, the only operations allowed are logging in using FTP, listing the contents of a limited set of directories, and retrieving files. Some sites limit the contents of a directory listing an anonymous user can see as well. Note that "anonymous" users are not usually allowed to transfer files to the archive site, but can only retrieve files from such a site.

Traditionally, this special anonymous user account accepts any string as a password, although it is common to use either the password "guest" or one's electronic mail (e-mail) address. Some archive sites now explicitly ask for the user's e-mail address and will not allow login with the "guest" password. Providing an e-mail address is a courtesy that allows archive site operators to get some idea of who is using their services.

Trying it in the console usually helps you figure out what password you should be using; the welcome message often explicitly states the password requirements. Real-world example:

$ ftp anonymous@ftp.stratus.com
Connected to icebox.stratus.com.
220 Stratus-FTP-server
331 Anonymous login ok, send your complete email address as your password.
Password: 

      
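In Scrapy terms, a server like this wants your email address as the password; something along these lines (the path and address are hypothetical placeholders):

yield Request('ftp://ftp.stratus.com/pub/some/file',  # hypothetical path
              meta={'ftp_user': 'anonymous', 'ftp_password': 'you@example.com'})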


Here's a working example for the Mozilla public FTP server:

import scrapy
from scrapy.http import Request

class FtpSpider(scrapy.Spider):
    name = "mozilla"
    allowed_domains = ["ftp.mozilla.org"]

    # Let 404 responses through to the callback instead of filtering them out.
    handle_httpstatus_list = [404]

    def start_requests(self):
        yield Request('ftp://ftp.mozilla.org/pub/mozilla.org/firefox/releases/README',
                      meta={'ftp_user': 'anonymous', 'ftp_password': ''})

    def parse(self, response):
        print(response.body)

If you run the spider, you will see the contents of the README file in the console:

Older releases have known security vulnerablities, which are disclosed at 

  https://www.mozilla.org/security/known-vulnerabilities/

Mozilla strongly recommends you do not use them, as you are at risk of your computer 
being compromised. 
...

      
