Get data from multiple URLs using Python
I am trying to do the following:
1. Go to a web page and enter a search term.
2. Get some data from the results page.
3. The results page in turn contains multiple URLs, and I need to parse each of those to get some data from them.
I can do steps 1 and 2. What I don't understand is how to visit all of those URLs and get the data from each of them (the data is similar across the URLs, but not identical).
EDIT: additional information - I read the search terms from a CSV file and get multiple IDs (with URLs) from each results page. I would then like to go to each of those URLs to get more IDs from the next page, and write the whole thing out to a CSV file. Basically, I want my output to look something like this:
Level1 ID1 Level2 ID1 Level3 ID
Level2 ID2 Level3 ID
.
.
.
Level2 IDN Level3 ID
Level1 ID2 Level2 ID1 Level3 ID
Level2 ID2 Level3 ID
.
.
.
Level2 IDN Level3 ID
For each Level1 identifier, there can be several Level2 identifiers. But there will be only one corresponding Level3 ID for each Level2 ID.
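In pseudo-code terms, the nesting I am after looks roughly like this (get_level2_ids and get_level3_id are just placeholders for the two lookups I don't know how to do yet):

import csv

# placeholders: I don't yet know how to implement these two lookups
# get_level2_ids(level1_id) -> list of Level2 IDs found on that page
# get_level3_id(level2_id)  -> the single Level3 ID found on that page

with open('output.csv', 'wb') as out:
    writer = csv.writer(out)
    for level1_id in level1_ids:
        for level2_id in get_level2_ids(level1_id):
            writer.writerow([level1_id, level2_id, get_level3_id(level2_id)])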
The code I have written so far:
import re
import pandas as pd
from bs4 import BeautifulSoup
from urllib import urlopen

colnames = ['A', 'B', 'C', 'D']
data = pd.read_csv('file.csv', names=colnames)
listofdata = list(data.A)
id = '\n'.join(listofdata[1:])   # join all IDs, skipping the header row

def download_gsm_number(gse_id):
    # this appends the whole joined string of IDs to the URL at once
    url = "http://www.example.com" + id
    readurl = urlopen(url)
    soup = BeautifulSoup(readurl)
    soup1 = str(soup)
    gsm_data = readurl.read()
    #url_file_handle.close()
    pattern = re.compile(r'''some(.*?)pattern''')
    data = pattern.findall(soup1)
    col_width = max(len(word) for row in data for word in row)
    for row in data:
        lines = "".join(row.ljust(col_width))
        sequence = ''.join([c for c in lines])
        print sequence
But this means that all of the IDs get appended to the URL at once. As I mentioned earlier, I need to get the Level2 IDs from the Level1 IDs given as input, and then get the Level3 IDs from those Level2 IDs. If I can get either part working (going from Level1 to Level2, or from Level2 to Level3), I can figure out the rest.
I believe urllib is your answer.
It's actually as easy as:
web_page = urllib.urlopen(url_string)
And then with that, you can do normal file operations like:
read()
readline()
readlines()
fileno()
close()
info()
getcode()
geturl()
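For example (the URL here is just a stand-in):

import urllib

web_page = urllib.urlopen("http://www.example.com/search?term=foo")
print web_page.getcode()   # HTTP status code, e.g. 200
print web_page.geturl()    # the final URL, after any redirects
html = web_page.read()     # the raw page contents as a string
web_page.close()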
From there, I would suggest using BeautifulSoup for parsing, which is as simple as:
soup = BeautifulSoup(web_page.read())
And then you can do all the wonderful BeautifulSoup operations.
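For example, pulling every link out of the page (which tag and attribute you actually need will depend on the site's markup):

from bs4 import BeautifulSoup
import urllib

web_page = urllib.urlopen("http://www.example.com/search?term=foo")  # stand-in URL
soup = BeautifulSoup(web_page.read())

# print the href of every anchor tag on the page
for link in soup.find_all('a'):
    print link.get('href')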
I would suggest that Scrapy is overkill here and carries a lot of extra overhead. BeautifulSoup has great documentation and examples, and is simple to use.
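Applied to your Level1 / Level2 / Level3 problem, the overall shape could look like the sketch below. The base URL and the two regexes are placeholders (I am reusing the placeholder pattern from your own code), so you will need to swap in whatever the real site actually uses, but the structure (visit each Level1 ID one at a time, pull the Level2 IDs out of that page, then visit each of those for its Level3 ID) is the part that matters:

import csv
import re
import urllib
import pandas as pd
from bs4 import BeautifulSoup

BASE_URL = "http://www.example.com/"                      # placeholder, not the real site
level2_pattern = re.compile(r'''some(.*?)pattern''')      # placeholder regex for Level2 IDs
level3_pattern = re.compile(r'''another(.*?)pattern''')   # placeholder regex for Level3 IDs

def extract_ids(page_url, pattern):
    # fetch one page and return every ID on it that matches the given regex
    page = urllib.urlopen(page_url)
    soup = BeautifulSoup(page.read())
    return pattern.findall(str(soup))

data = pd.read_csv('file.csv', names=['A', 'B', 'C', 'D'])
level1_ids = list(data.A)[1:]                             # skip the header row

with open('output.csv', 'wb') as out:
    writer = csv.writer(out)
    for level1_id in level1_ids:                          # one request per ID, not all joined at once
        for level2_id in extract_ids(BASE_URL + level1_id, level2_pattern):
            level3_ids = extract_ids(BASE_URL + level2_id, level3_pattern)
            level3_id = level3_ids[0] if level3_ids else ''
            writer.writerow([level1_id, level2_id, level3_id])

If the site is slow or throttles repeated requests, you may also want to add a short time.sleep() between fetches.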