Get data from multiple URLs using Python

I am trying to do the following:

  1. Go to a web page and enter a search term.
  2. Get some data from the result.
  3. That data in turn contains multiple URLs. I need to parse each one to get some data from it.

I can do steps 1 and 2. What I don't understand is how to visit all of the URLs and get data from them (the data is similar across the URLs, but not identical).

EDIT: additional information. I read the search terms from a CSV file and get multiple IDs (with URLs) from each result page. I would like to visit each of these URLs to get more IDs from the next page, and write the whole thing to a CSV file. Basically, I want my output to look something like this:

Level1 ID1   Level2 ID1   Level3 ID
             Level2 ID2   Level3 ID
             Level2 IDN   Level3 ID
Level1 ID2   Level2 ID1   Level3 ID
             Level2 ID2   Level3 ID
             Level2 IDN   Level3 ID


For each Level1 ID there can be several Level2 IDs, but there will be only one corresponding Level3 ID for each Level2 ID.

The code I have written so far:

import pandas as pd
from bs4 import BeautifulSoup
from urllib import urlopen

colnames = ['A','B','C','D']
data = pd.read_csv('file.csv', names=colnames)
listofdata= list(data.A)
id = '\n'.join(listofdata[1:]) #to skip header

def download_gsm_number(gse_id):
    url = "" + id
    readurl = urlopen(url)
    soup = BeautifulSoup(readurl)
    soup1 = str(soup)
    gsm_data =
    data = pattern.findall(soup1)
    col_width = max(len(word) for row in data for word in row)
    for row in data:
        lines = "".join(row.ljust(col_width))
        sequence = ''.join([c for c in lines])
        print sequence


But this sends all the IDs to the URL at once. As I mentioned earlier, I need to get the Level2 IDs from the Level1 IDs given as input, and then get the Level3 IDs from those Level2 IDs. Basically, if I can get just one part working (fetching either Level2 or Level3), I can figure out the rest.



1 answer

I believe your answer is urllib.

It's actually as easy as:

web_page = urllib.urlopen(url_string)


And then with that, you can do normal file operations on it, such as web_page.read().

From there, I would suggest using BeautifulSoup for parsing, which is as simple as:

soup = BeautifulSoup(web_page)


And then you can do all the usual BeautifulSoup operations on it, such as finding every link on the page.
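
To make the "one request per ID, then follow each link" idea concrete, here is a minimal sketch. It is not the asker's code: it is written for Python 3 and uses the standard library's html.parser in place of BeautifulSoup so that it has no dependencies (with BeautifulSoup, soup.find_all('a') would play the same role). The names LinkCollector, extract_links, fetch and crawl are made up for illustration, base_url stands for whatever search URL you are using, and the sketch assumes that every link on a page is an ID link whose text is the ID. A real page will almost certainly need filtering.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html, page_url):
    """Return (absolute_url, link_text) pairs found in the HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    return [(urljoin(page_url, href), text) for href, text in parser.links]


def fetch(url):
    """Download a page and return it as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(level1_ids, base_url, fetch_page=fetch):
    """Yield (level1_id, level2_id, level3_id), making one request per ID."""
    for level1_id in level1_ids:  # one URL per ID, not all IDs joined together
        level1_url = base_url + level1_id
        level2_links = extract_links(fetch_page(level1_url), level1_url)
        for level2_url, level2_id in level2_links:
            level3_links = extract_links(fetch_page(level2_url), level2_url)
            # Assumption: the first link on the Level2 page is the Level3 ID.
            level3_id = level3_links[0][1] if level3_links else ""
            yield level1_id, level2_id, level3_id
```

The important difference from the code in the question is the loop in crawl: each ID gets its own URL and its own request, instead of all the IDs being joined into one string and appended to a single URL. Passing fetch_page as a parameter is only there so the traversal can be tried against canned pages without hitting the network.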

I would suggest that Scrapy is overkill for this and carries a lot of overhead. BeautifulSoup has great documentation and examples, and is simple to use.
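
For the output layout described in the question (the Level1 ID shown only on the first row of its group), something like the following could work once you have (Level1, Level2, Level3) triples. This is only a sketch in Python 3: write_rows and the sample IDs are invented for illustration, and the header names are taken from the layout in the question.

```python
import csv
import io


def write_rows(rows, out_file):
    """Write (level1, level2, level3) rows, blanking repeated Level1 IDs."""
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["Level1 ID", "Level2 ID", "Level3 ID"])
    previous_level1 = None
    for level1_id, level2_id, level3_id in rows:
        # Show the Level1 ID only when it differs from the row above.
        first_column = level1_id if level1_id != previous_level1 else ""
        writer.writerow([first_column, level2_id, level3_id])
        previous_level1 = level1_id


# Hypothetical sample data, grouped by Level1 ID.
example_rows = [
    ("L1-A", "L2-A1", "L3-A1"),
    ("L1-A", "L2-A2", "L3-A2"),
    ("L1-B", "L2-B1", "L3-B1"),
]

buffer = io.StringIO()
write_rows(example_rows, buffer)
print(buffer.getvalue())
```

To write to disk instead of a string buffer, open the file with open('output.csv', 'w', newline='') and pass that file object as out_file. Note that this relies on the rows for one Level1 ID arriving together; if they might be interleaved, sort them first.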


