Python parses data from website using regular expression

Question

Python parses data from website using regular expression

I am trying to parse some data from this site: http://www.csfbl.com/freeagents.asp?leagueid=2237

I wrote the code:

import urllib
import re

name = re.compile('<td><a href="[^"]+" onclick="[^"]+">(.+?)</a>')
player_id = re.compile('<td><a href="(.+?)" onclick=')
#player_id_num = re.compile('<td><a href=player.asp?playerid="(.+?)" onclick=')
stat_c = re.compile('<td class="[^"]+" align="[^"]+"><span class="[^"]?">(.+?)</span><br><span class="[^"]?">')
stat_p = re.compile('<td class="[^"]+" align="[^"]+"><span class="[^"]?">"[^"]+"</span><br><span class="[^"]?">(.+?)</span></td>')

url = 'http://www.csfbl.com/freeagents.asp?leagueid=2237'

sock = urllib.request.urlopen(url).read().decode("utf-8")

#li = name.findall(sock)
name = name.findall(sock)
player_id = player_id.findall(sock)
#player_id_num = player_id_num.findall(sock)
#age = age.findall(sock)
stat_c = stat_c.findall(sock)
stat_p = stat_p.findall(sock)

First question: player_id returns the whole url "player.asp?playerid=4209661"

. I was unable to get only part of the number. How can i do this? (my attempt is described in #player_id_num

)

Second question: I cannot get stat_c when span_class

empty, as in ""

.

Is there a way that I can solve this problem? I'm not very familiar with RE (regular expressions), I've searched for tutorials online, but it's still unclear what I'm doing wrong.

+3

python

Euisoo Yoo 03 June 15 at 12:11 am

source to share

1 answer

Manhattan · Answer 1 · 2015-06-03T00:28:25+0000

It is very easy to use the library . pandas

Code:

import pandas as pd

url = "http://www.csfbl.com/freeagents.asp?leagueid=2237"
dfs = pd.read_html(url)

# print dfs[3]
# dfs[3].to_csv("stats.csv") # Send to a CSV file.
print dfs[3].head()

Result:

    0                 1    2  3      4     5     6     7     8      9     10  \
0  Pos              Name  Age  T     PO    FI    CO    SY    HR     RA    GL   
1    P    George Pacheco   38  R   4858  7484  8090  7888  6777   4353  6979   
2    P     David Montoya   34  R   3944  5976  6673  8699  6267   6685  5459   
3    P       Robert Cole   34  R   5769  7189  7285  5863  6267   5868  5462   
4    P  Juanold McDonald   32  R  69100  5772  4953  4866  5976  67100  5362   

     11    12  13       14      15          16  
0    AR    EN  RL  Fatigue  Salary         NaN  
1  3747  6171  -3     100%     ---  $3,672,000  
2  5257  5975  -4      96%      2%  $2,736,000  
3  4953  5061  -4      96%      3%  $2,401,000  
4  5982  5263  -4     100%     ---  $1,890,000

You can apply whatever cleaning methods you want from here on out. The code is rudimentary, so you can improve it.

More Code:

import pandas as pd
import itertools

url = "http://www.csfbl.com/freeagents.asp?leagueid=2237"
dfs = pd.read_html(url)
df = dfs[3] # "First" stats table.

# The first row is the actual header.
# Also, notice the NaN at the end.
header = df.iloc[0][:-1].tolist()
# Fix that atrocity of a last column.
df.drop([15], axis=1, inplace=True)

# Last row is all NaNs. This particular
# table should end with Jeremy Dix.
df = df.iloc[1:-1,:]
df.columns = header
df.reset_index(drop=True, inplace=True)

# Pandas cannot create two rows without the
# dataframe turning into a nightmare. Let's
# try an aesthetic change.
sub_header = header[4:13]
orig = ["{}{}".format(h, "r") for h in sub_header]
clone = ["{}{}".format(h, "p") for h in sub_header]

# http://stackoverflow.com/a/3678930/2548721
comb = [iter(orig), iter(clone)]
comb = list(it.next() for it in itertools.cycle(comb))

# Construct the new header.
new_header = header[0:4]
new_header += comb
new_header += header[13:]

# Slow but does it cleanly.
for s, o, c in zip(sub_header, orig, clone):
    df.loc[:, o] = df[s].apply(lambda x: x[:2])
    df.loc[:, c] = df[s].apply(lambda x: x[2:])

df = df[new_header] # Drop the other columns.

print df.head()

Result:

  Pos              Name Age  T POr  POp FIr FIp COr COp     ...      RAp GLr  \
0   P    George Pacheco  38  R  48   58  74  84  80  90     ...       53  69   
1   P     David Montoya  34  R  39   44  59  76  66  73     ...       85  54   
2   P       Robert Cole  34  R  57   69  71  89  72  85     ...       68  54   
3   P  Juanold McDonald  32  R  69  100  57  72  49  53     ...      100  53   
4   P      Trevor White  37  R  61   66  62  64  67  67     ...       38  48   

  GLp ARr  ARp ENr ENp  RL Fatigue      Salary  
0  79  37   47  61  71  -3    100%  $3,672,000  
1  59  52   57  59  75  -4     96%  $2,736,000  
2  62  49   53  50  61  -4     96%  $2,401,000  
3  62  59   82  52  63  -4    100%  $1,890,000  
4  50  70  100  62  69  -4    100%  $1,887,000

Obviously I have extracted the Real values from the Potential values instead. Some tricks have been used, but it gets the job done for at least the first table of players. The next few require some degree of manipulation.

Python parses data from website using regular expression

More articles: