Get data from multiple URLs using Python

I am trying to do the following:

  • Go to a web page and enter a search term.
  • Get some data from the results.
  • The results in turn contain multiple URLs. I need to parse each one to get some data from them.

I can do the first two steps. I don't understand how to visit all the URLs and get data from each of them (the data is similar across URLs, but not identical).

EDIT: additional information. I read search terms from a CSV file and get multiple IDs (with URLs) from each results page. I would like to visit each of these URLs to get more IDs from the next page, and write the whole thing to a CSV file. Basically, I want my output to look something like this:

Level1 ID1   Level2 ID1   Level3 ID
             Level2 ID2   Level3 ID
             .
             .
             .
             Level2 IDN   Level3 ID
Level1 ID2   Level2 ID1   Level3 ID
             Level2 ID2   Level3 ID
             .
             .
             .
             Level2 IDN   Level3 ID

      

For each Level1 ID, there can be several Level2 IDs. But there will be only one corresponding Level3 ID for each Level2 ID.

CODE I have written so far:

import re  # needed for re.compile below
import pandas as pd
from bs4 import BeautifulSoup
from urllib import urlopen

colnames = ['A','B','C','D']
data = pd.read_csv('file.csv', names=colnames)
listofdata= list(data.A)
id = '\n'.join(listofdata[1:]) #to skip header


def download_gsm_number(gse_id):
    url = "http://www.example.com" + id
    readurl = urlopen(url)
    soup = BeautifulSoup(readurl)
    soup1 = str(soup)
    gsm_data = readurl.read()
    #url_file_handle.close()
    pattern=re.compile(r'''some(.*?)pattern''')  
    data = pattern.findall(soup1)
    col_width = max(len(word) for row in data for word in row)
    for row in data:
        lines = "".join(row.ljust(col_width))
        sequence = ''.join([c for c in lines])
        print sequence

      

But this sends all the IDs to the URL at once. As I mentioned earlier, I need to get the Level 2 IDs from the Level 1 IDs given as input, and then the Level 3 IDs from the Level 2 IDs. Basically, if I can get just one part working (getting either Level 2 or Level 3), I can figure out the rest.

1 answer


I believe the answer is urllib.

It's actually as easy as:

web_page = urllib.urlopen(url_string)

      

With that object, you can call the usual file-like methods, plus a few response-specific ones:

read()
readline()
readlines()
fileno()
close()
info()
getcode()
geturl()
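
On Python 3 the same function lives in `urllib.request`, and the response still behaves like a file object. Here is a minimal sketch, assuming Python 3. The `fetch` helper name is hypothetical, the UTF-8 decoding is an assumption about the page, and the `data:` URL is used only so the example runs without network access:

```python
from urllib.request import urlopen


def fetch(url):
    """Return the body of `url` as text (hypothetical helper, assumes UTF-8)."""
    # The response is file-like: read() returns bytes, and the
    # with-block closes the connection when we are done.
    with urlopen(url) as web_page:
        return web_page.read().decode("utf-8")


# A data: URL stands in for a real page so this runs offline.
body = fetch("data:text/plain,hello")
print(body)
```

For a real page you would pass an ordinary `http://` URL, one ID at a time.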

      



From there, I would suggest using BeautifulSoup for parsing, which is as simple as:

soup = BeautifulSoup(web_page.read())

      

And then you can do all the wonderful BeautifulSoup operations.
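
For example, pulling every link out of a page might look like the sketch below. The `extract_links` name and the sample HTML are made up for illustration; the real page structure will differ, so treat the tag and attribute choices as assumptions:

```python
from bs4 import BeautifulSoup


def extract_links(html):
    """Return the href of every <a> tag in `html` (hypothetical helper)."""
    # Naming the parser explicitly avoids bs4's "no parser specified" warning.
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]


# Made-up markup standing in for a search results page.
sample = '<ul><li><a href="/id/1">ID1</a></li><li><a href="/id/2">ID2</a></li></ul>'
print(extract_links(sample))
```

Each returned link can then be fetched and parsed the same way to reach the next level.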

I would suggest that Scrapy is overkill here and carries a lot of overhead. BeautifulSoup has great documentation and examples, and is simple to use.
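
To go from Level 1 to Level 3, the key is to loop over the IDs and request one page per ID, rather than joining all the IDs into a single URL. The following is only a rough sketch under stated assumptions: `BASE_URL`, the two regular expressions, and the function names are hypothetical placeholders, and `fetch` is any callable that takes a URL and returns the page text (for example a thin wrapper around `urlopen`). Fake pages are used so it runs offline:

```python
import csv
import io
import re

# Hypothetical URL prefix and ID patterns; replace with the real site's.
BASE_URL = "http://www.example.com/"
LEVEL2_PATTERN = re.compile(r"L2-\d+")
LEVEL3_PATTERN = re.compile(r"L3-\d+")


def crawl(level1_ids, fetch):
    """Yield (level1, level2, level3) rows, requesting one ID at a time."""
    for level1 in level1_ids:
        # One request per Level 1 ID, not one request for all IDs joined.
        level1_page = fetch(BASE_URL + level1)
        for level2 in LEVEL2_PATTERN.findall(level1_page):
            level2_page = fetch(BASE_URL + level2)
            # Each Level 2 ID has a single Level 3 ID, so the first match is enough.
            match = LEVEL3_PATTERN.search(level2_page)
            level3 = match.group(0) if match else ""
            yield level1, level2, level3


def write_rows(rows, out_file):
    """Write a header line followed by the rows as CSV."""
    writer = csv.writer(out_file)
    writer.writerow(["Level1", "Level2", "Level3"])
    writer.writerows(rows)


# Fake pages stand in for the network so the sketch runs offline.
fake_pages = {
    BASE_URL + "A": "ids: L2-1 L2-2",
    BASE_URL + "L2-1": "id: L3-10",
    BASE_URL + "L2-2": "id: L3-20",
}
buffer = io.StringIO()
write_rows(crawl(["A"], fake_pages.__getitem__), buffer)
print(buffer.getvalue())
```

In real use you would pass your own page-fetching function as `fetch` and open the output with `open("out.csv", "w", newline="")` instead of the in-memory buffer. This writes the Level 1 ID on every row; leaving it blank on repeated rows, as in your example layout, is a small change in `write_rows`.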
