Python parses data from website using regular expression

I am trying to parse some data from this site: http://www.csfbl.com/freeagents.asp?leagueid=2237

I wrote the code:

import urllib
import re

name = re.compile('<td><a href="[^"]+" onclick="[^"]+">(.+?)</a>')
player_id = re.compile('<td><a href="(.+?)" onclick=')
#player_id_num = re.compile('<td><a href=player.asp?playerid="(.+?)" onclick=')
stat_c = re.compile('<td class="[^"]+" align="[^"]+"><span class="[^"]?">(.+?)</span><br><span class="[^"]?">')
stat_p = re.compile('<td class="[^"]+" align="[^"]+"><span class="[^"]?">"[^"]+"</span><br><span class="[^"]?">(.+?)</span></td>')

url = 'http://www.csfbl.com/freeagents.asp?leagueid=2237'

sock = urllib.request.urlopen(url).read().decode("utf-8")

#li = name.findall(sock)
name = name.findall(sock)
player_id = player_id.findall(sock)
#player_id_num = player_id_num.findall(sock)
#age = age.findall(sock)
stat_c = stat_c.findall(sock)
stat_p = stat_p.findall(sock)

      

First question: player_id returns the whole url "player.asp?playerid=4209661"

. I was unable to get only part of the number. How can i do this? (my attempt is described in #player_id_num

)

Second question: I cannot get stat_c when span_class

empty, as in ""

.

Is there a way that I can solve this problem? I'm not very familiar with RE (regular expressions), I've searched for tutorials online, but it's still unclear what I'm doing wrong.

+3


source to share


1 answer


It is very easy to use the library . pandas

Code:

import pandas as pd

url = "http://www.csfbl.com/freeagents.asp?leagueid=2237"
dfs = pd.read_html(url)

# print dfs[3]
# dfs[3].to_csv("stats.csv") # Send to a CSV file.
print dfs[3].head()

      

Result:

    0                 1    2  3      4     5     6     7     8      9     10  \
0  Pos              Name  Age  T     PO    FI    CO    SY    HR     RA    GL   
1    P    George Pacheco   38  R   4858  7484  8090  7888  6777   4353  6979   
2    P     David Montoya   34  R   3944  5976  6673  8699  6267   6685  5459   
3    P       Robert Cole   34  R   5769  7189  7285  5863  6267   5868  5462   
4    P  Juanold McDonald   32  R  69100  5772  4953  4866  5976  67100  5362   

     11    12  13       14      15          16  
0    AR    EN  RL  Fatigue  Salary         NaN  
1  3747  6171  -3     100%     ---  $3,672,000  
2  5257  5975  -4      96%      2%  $2,736,000  
3  4953  5061  -4      96%      3%  $2,401,000  
4  5982  5263  -4     100%     ---  $1,890,000 

      

You can apply whatever cleaning methods you want from here on out. The code is rudimentary, so you can improve it.



More Code:

import pandas as pd
import itertools

url = "http://www.csfbl.com/freeagents.asp?leagueid=2237"
dfs = pd.read_html(url)
df = dfs[3] # "First" stats table.

# The first row is the actual header.
# Also, notice the NaN at the end.
header = df.iloc[0][:-1].tolist()
# Fix that atrocity of a last column.
df.drop([15], axis=1, inplace=True)

# Last row is all NaNs. This particular
# table should end with Jeremy Dix.
df = df.iloc[1:-1,:]
df.columns = header
df.reset_index(drop=True, inplace=True)

# Pandas cannot create two rows without the
# dataframe turning into a nightmare. Let's
# try an aesthetic change.
sub_header = header[4:13]
orig = ["{}{}".format(h, "r") for h in sub_header]
clone = ["{}{}".format(h, "p") for h in sub_header]

# http://stackoverflow.com/a/3678930/2548721
comb = [iter(orig), iter(clone)]
comb = list(it.next() for it in itertools.cycle(comb))

# Construct the new header.
new_header = header[0:4]
new_header += comb
new_header += header[13:]

# Slow but does it cleanly.
for s, o, c in zip(sub_header, orig, clone):
    df.loc[:, o] = df[s].apply(lambda x: x[:2])
    df.loc[:, c] = df[s].apply(lambda x: x[2:])

df = df[new_header] # Drop the other columns.

print df.head()

      

Result:

  Pos              Name Age  T POr  POp FIr FIp COr COp     ...      RAp GLr  \
0   P    George Pacheco  38  R  48   58  74  84  80  90     ...       53  69   
1   P     David Montoya  34  R  39   44  59  76  66  73     ...       85  54   
2   P       Robert Cole  34  R  57   69  71  89  72  85     ...       68  54   
3   P  Juanold McDonald  32  R  69  100  57  72  49  53     ...      100  53   
4   P      Trevor White  37  R  61   66  62  64  67  67     ...       38  48   

  GLp ARr  ARp ENr ENp  RL Fatigue      Salary  
0  79  37   47  61  71  -3    100%  $3,672,000  
1  59  52   57  59  75  -4     96%  $2,736,000  
2  62  49   53  50  61  -4     96%  $2,401,000  
3  62  59   82  52  63  -4    100%  $1,890,000  
4  50  70  100  62  69  -4    100%  $1,887,000  

      

Obviously I have extracted the Real values ​​from the Potential values ​​instead. Some tricks have been used, but it gets the job done for at least the first table of players. The next few require some degree of manipulation.

+1


source







All Articles