Retrieving Data with Python

I am interested in retrieving historical prices from this link: https://pakstockexchange.com/stock2/index_new.php?section=research&page=show_price_table_new&symbol=KEL

For this I am using the following code:

import requests
import pandas as pd
import time as t

t0 = t.time()

symbols = ['HMIM', 'CWSM', 'DSIL', 'RAVT', 'PIBTL', 'PICT', 'PNSC', 'ASL',
           'DSL', 'ISL', 'CSAP', 'MUGHAL', 'DKL', 'ASTL', 'INIL']

header = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest"
}

for symbol in symbols:
    r = requests.get('https://pakstockexchange.com/stock2/index_new.php?section=research&page=show_price_table_new&symbol={}'.format(symbol), headers=header)
    dfs = pd.read_html(r.text)
    df = dfs[6]        # the price table is the seventh table on the page
    df = df.iloc[2:]   # drop the header rows (.ix is deprecated; use .iloc)
    df.columns = ['Date', 'Open', 'High', 'Low', 'Close', 'Volume']
    df.set_index('Date', inplace=True)
    df.to_csv('/home/furqan/Desktop/python_data/{}.csv'.format(symbol),
              columns=['Open', 'High', 'Low', 'Close', 'Volume'],
              index_label='Date')

    print(symbol)

t1 = t.time()
print('exec time is', t1 - t0, 'seconds')

      

The above code extracts the data from the link, converts it to a pandas DataFrame, and saves it as CSV.

The problem is that it takes a long time and scales poorly as more symbols are added. Can anyone suggest a more efficient way to achieve the same result?

Also, is there another programming language that would do the same job in less time?



1 answer


Normal GET requests with requests are "blocking": one request is sent, one response is received, and only then is it processed. At least some of your running time is spent waiting on responses. We can instead send all of the requests asynchronously with requests-futures and collect the responses as they become ready.

However, I think DSIL is timing out or something similar (I need to look into it further). Although I managed to get a decent speedup with a random selection from symbols, both methods take roughly the same time whenever DSIL is in the list.

EDIT: It seems I was wrong; that was just an unlucky coincidence with DSIL a few times. The more tickers you have in symbols, the faster the async method becomes relative to standard requests.

import requests
from requests_futures.sessions import FuturesSession
import time

start_sync = time.time()

symbols =['HMIM','CWSM','RAVT','ASTL','INIL']

header = {
  "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
  "X-Requested-With": "XMLHttpRequest"
}

for symbol in symbols:
    r = requests.get('https://pakstockexchange.com/stock2/index_new.php?section=research&page=show_price_table_new&symbol={}'.format(str(symbol)), headers=header)

end_sync = time.time()

start_async = time.time()
# Setup
session = FuturesSession(max_workers=10)
pooled_requests = []

# Gather request URLs
for symbol in symbols:
    request = 'https://pakstockexchange.com/stock2/index_new.php?section=research&page=show_price_table_new&symbol={}'.format(symbol)
    pooled_requests.append(request)

# Fire the requests
fire_requests = [session.get(url, headers=header) for url in pooled_requests]
responses = [item.result() for item in fire_requests]

end_async = time.time()

print("Synchronous requests took: {}".format(end_sync - start_sync))
print("Async requests took:       {}".format(end_async - start_async))

      



With the above code I get roughly a 3x speedup when fetching the responses. You can then iterate over the responses list and process each response as usual.

EDIT 2: Iterating over the responses from the asynchronous requests and saving them as you did before:

for i, r in enumerate(responses):
    dfs = pd.read_html(r.text)
    df = dfs[6]
    df = df.iloc[2:]   # .ix is deprecated; use positional indexing
    df.columns = ['Date', 'Open', 'High', 'Low', 'Close', 'Volume']
    df.set_index('Date', inplace=True)
    df.to_csv('/home/furqan/Desktop/python_data/{}.csv'.format(symbols[i]),
              columns=['Open', 'High', 'Low', 'Close', 'Volume'],
              index_label='Date')
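If you would rather avoid the extra requests-futures dependency, the same fan-out can be sketched with the standard library's concurrent.futures. This is only a sketch: the fetch function below is a placeholder standing in for the real requests.get call, so the pattern runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor

BASE_URL = ('https://pakstockexchange.com/stock2/index_new.php'
            '?section=research&page=show_price_table_new&symbol={}')

def fetch(symbol):
    # Real use would be: requests.get(BASE_URL.format(symbol), headers=header)
    # A placeholder response keeps this sketch runnable offline.
    return 'response for {}'.format(symbol)

symbols = ['HMIM', 'CWSM', 'RAVT', 'ASTL', 'INIL']

# Fan the (placeholder) requests out over 10 worker threads.
with ThreadPoolExecutor(max_workers=10) as pool:
    responses = list(pool.map(fetch, symbols))

print(responses[0])  # -> response for HMIM
```

Because pool.map preserves input order, responses[i] still corresponds to symbols[i], so the enumerate-based saving loop above works unchanged.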

      
