Multithreaded Python Web Crawler Got Stuck

I am writing a Python web crawler and want to make it multithreaded. I have finished the main part; here is what it does:

  • each thread gets a URL from the queue;

  • the thread fetches the links on that page, checks whether each link already exists in the pool (a set), and puts any new links into the queue and the pool;

  • the thread writes the URL and its HTTP response status to a CSV file.

But when I run the crawler, it always gets stuck at the end instead of exiting properly. I have gone through the official Python documentation but still can't figure it out.

Below is the code:

#!/usr/bin/env python
#!coding=utf-8

import requests, re, urlparse
import threading
from Queue import Queue
from bs4 import BeautifulSoup

#custom modules and files
from setting import config


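#Page fetches a URL and records its HTTP status, the raw response
#and, via outlinks(), the links found on the page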
class Page:

    def __init__(self, url):

        self.url = url
        self.status = ""
        self.rawdata = ""
        self.error = False

        r = ""

        try:
            r = requests.get(self.url, headers={'User-Agent': 'random spider'})
        except requests.exceptions.RequestException as e:
            self.status = e
            self.error = True
        else:
            if not r.history:
                self.status = r.status_code
            else:
                self.status = r.history[0]

        self.rawdata = r

    def outlinks(self):

        self.outlinks = []

        #links, contains URL, anchor text, nofollow
        raw = self.rawdata.text.lower()
        soup = BeautifulSoup(raw)
        outlinks = soup.find_all('a', href=True)

        for link in outlinks:
            d = {"follow":"yes"}
            d['url'] = urlparse.urljoin(self.url, link.get('href'))
            d['anchortext'] = link.text
            if link.get('rel'):
                if "nofollow" in link.get('rel'):
                    d["follow"] = "no"
            if d not in self.outlinks:
                self.outlinks.append(d)


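#shared state used by all worker threads: URL queue, set of seen URLs,
#lock and output file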
pool = Queue()
exist = set()
thread_num = 10
lock = threading.Lock()
output = open("final.csv", "a")

#the domain is the start point
domain = config["domain"]
pool.put(domain)
exist.add(domain)


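#worker: each thread repeatedly takes a URL from the queue, records it,
#and, for pages on the start domain, queues any links not seen before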
def crawl():

    while True:

        p = Page(pool.get())

        #write data to output file
        lock.acquire()
        output.write(p.url+" "+str(p.status)+"\n")
        print "%s crawls %s" % (threading.currentThread().getName(), p.url)
        lock.release()

        if not p.error:
            p.outlinks()
            outlinks = p.outlinks
            if urlparse.urlparse(p.url)[1] == urlparse.urlparse(domain)[1] :
                for link in outlinks:
                    if link['url'] not in exist:
                        lock.acquire()
                        pool.put(link['url'])
                        exist.add(link['url'])
                        lock.release()
        pool.task_done()            


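#start the worker threads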
for i in range(thread_num):
    t = threading.Thread(target = crawl)
    t.start()

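#wait until every queued URL has been marked done with task_done()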
pool.join()
output.close()

      

Any help would be appreciated!

Thanks,

Mark



1 answer


The crawl function has an infinite while loop with no possible exit path. The condition True always evaluates to True, so the loop keeps running and, as you say, the program never finishes properly.

Modify the crawl function so that its while loop has an exit condition. For example, once the number of links collected and written to the CSV file exceeds some chosen minimum, exit the while loop.



i.e.,

def crawl():
    while len(exist) <= min_links:
        ...
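
As a minimal sketch of that idea (min_links here is a hypothetical cut-off that is not in the original code, so choose a value that suits your crawl; the body of the loop stays the same as in the question):

min_links = 500  #hypothetical cut-off, not part of the original code

def crawl():
    #each worker keeps taking URLs only while the set of seen links
    #is still below the cut-off; after that the loop ends and the thread exits
    while len(exist) <= min_links:
        p = Page(pool.get())
        #...same body as in the question: write the CSV line, collect
        #outlinks, queue unseen URLs, then call pool.task_done()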

      
