Beautiful Soup nested div (adding an extra function)

I am trying to extract the company name, address, and zipcode from [www.quicktransportsolutions.com][1]. I wrote the following code to scrape the site and return the information I need.

import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.quicktransportsolutions.com/carrier/missouri/adrian.php'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        # Each company listing sits in its own div with class "well well-sm"
        for link in soup.findAll('div', {'class': 'well well-sm'}):
            title = link.string
            print(link)
        page += 1

trade_spider(1)

After running the code, I can see the information I need, but I'm confused about how to print it out without all of the extraneous markup.

In place of

print(link)

I thought I could use link.string to pull the company names, but that failed. Any suggestions?

Output:

<div class="well well-sm">
<b>2 OLD BOYS TRUCKING LLC</b><br><a href="/truckingcompany/missouri/2-old-boys-trucking-usdot-2474795.php" itemprop="url" target="_blank" title="Missouri Trucking Company 2 OLD BOYS TRUCKING ADRIAN"><u><span itemprop="name"><b>2 OLD BOYS TRUCKING</b></span></u></a><br> <span itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress"><a href="http://maps.google.com/maps?q=227+E+2ND,ADRIAN,MO+64720&amp;ie=UTF8&amp;z=8&amp;iwloc=addr" target="_blank"><span itemprop="streetAddress">227 E 2ND</span></a>
<br>
<span itemprop="addressLocality">Adrian</span><span itemprop="addressRegion">MO</span> <span itemprop="postalCode">64720</span></br></span><br>
                Trucks: 2       Drivers: 2<br>
<abbr class="initialism" title="Unique Number to identify Companies operating commercial vehicles to transport passengers or haul cargo in interstate commerce">USDOT</abbr> 2474795                <br><span class="glyphicon glyphicon-phone"></span><b itemprop="telephone"> 417-955-0651</b>
<br><a href="/inspectionreports/2-old-boys-trucking-usdot-2474795.php" itemprop="url" target="_blank" title="Trucking Company 2 OLD BOYS TRUCKING Inspection Reports">

      

All,

Thanks for the help so far ... I am trying to add an additional feature to my little crawler. I wrote the following code:

def Crawl_State_Page(max_pages):
    i = 0
    url = 'http://www.quicktransportsolutions.com/carrier/alabama/trucking-companies.php'
    while i <= len(url):
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        # The list of city pages lives in one big table
        table = soup.find("table", {"class": "table table-condensed table-striped table-hover table-bordered"})
        for link in table.find_all(href=True):
            print(link['href'])

Output:

    abbeville.php
    adamsville.php
    addison.php
    adger.php
    akron.php
    alabaster.php
    alberta.php
    albertville.php
    alexander-city.php
    alexandria.php
    aliceville.php
    alpine.php
    ... # goes all the way to Z; I cut the output short for spacing.

What I'm trying to do here is pull out all of the hrefs that end in city.php and write them to a file ... But right now, I'm stuck in an infinite loop where it keeps requesting the same url. Any advice on how to increment it? My ultimate goal is to create another function that feeds back into my trade_spider using www.site.com/state/city.php URLs and then loops through all 50 states ... Something to the effect of:

while i < len(cities):
    url = "http://www.quicktransportsolutions.com/carrier/" + state + "/" + cities[i]

And then it will loop over my trade_spider function, pulling in all the information I need.
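
To make that concrete, here is a rough, untested sketch of what I'm picturing; the states list, cities dict, and crawl_city helper are all placeholders I haven't actually written yet:

import requests
from bs4 import BeautifulSoup

# Placeholder data -- eventually these would come from crawling the state pages
states = ['alabama', 'missouri']
cities = {'alabama': ['abbeville', 'adamsville'], 'missouri': ['adrian']}

def crawl_city(url):
    # Hypothetical helper: the same parsing as trade_spider, just for one city URL
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    for company in soup.find_all('div', {'class': 'well well-sm'}):
        print(company)

for state in states:
    for city in cities[state]:
        crawl_city('http://www.quicktransportsolutions.com/carrier/'
                   + state + '/' + city + '.php')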

But before I get to that part, I need a little help getting out of my endless loop. Any suggestions? Or do you see any problems with my approach?

I tried to create a crawler that would cycle through each link on the page and, if it found content that trade_spider could crawl, write it to a file ... However, that was a bit beyond my skill level so far, so I'm trying this approach instead.
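
For the writing-to-a-file part mentioned above, this is roughly what I have in mind once the looping is sorted out; untested, and city_links is just placeholder data standing in for the hrefs collected from the table:

# Untested sketch: keep only the city pages and write them to a file
city_links = ['abbeville.php', 'adamsville.php', 'addison.php']  # placeholder values

with open('cities.txt', 'w') as out:
    for href in city_links:
        if href.endswith('.php'):
            out.write(href + '\n')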

1 answer


I would rely on the `itemprop` attributes of the different tags for each company. They are conveniently set for `name`, `url`, `address`, etc.:

import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.quicktransportsolutions.com/carrier/missouri/adrian.php'
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        for company in soup.find_all('div', {'class': 'well well-sm'}):
            # Each field has its own itemprop inside the company block
            link = company.find('a', itemprop='url').get('href').strip()
            name = company.find('span', itemprop='name').text.strip()
            address = company.find('span', itemprop='address').text.strip()

            print(name, link, address)
            print("----")
        page += 1

trade_spider(1)

      



It prints:

2 OLD BOYS TRUCKING /truckingcompany/missouri/2-old-boys-trucking-usdot-2474795.php 227 E 2ND

Adrian, MO 64720
----
HILLTOP SERVICE & EQUIPMENT /truckingcompany/missouri/hilltop-service-equipment-usdot-1047604.php ROUTE 2 BOX 453

Adrian, MO 64720
----
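
If you need the zipcode as its own field, the same `itemprop` approach should work; a small sketch, assuming the `postalCode` span shown in your HTML output above:

# Inside the same for-company loop as above
zipcode = company.find('span', itemprop='postalCode').text.strip()
print(zipcode)  # e.g. 64720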

      
