Findall regex method malfunction

Question

Findall regex method malfunction

I'm new to Python and my professor at school thinks everyone understands the code he is posting, but I am having trouble using the searchnames methods to find a specific pattern in an HTML file. This is the code he posted and claims to do it. I don't know what the characters in the call to findall mean.

def searchnames(cont):
try:
    info = re.findall('(\d+)\s(\w+)(\d+,\d+|\d+)\n\s(\w+)\n(\d+,\d+|\d+)', cont)
    return info
except:
    print "couldn't find child info"
pass

where cont is the HTML file containing this

<head><title>Popular Baby Names</title>
<meta name="dc.language" scheme="ISO639-2" content="eng">
<meta name="dc.creator" content="OACT">
<meta name="lead_content_manager" content="JeffK">
<meta name="coder" content="JeffK">
<meta name="dc.date.reviewed" scheme="ISO8601" content="2006-03-10">
<link rel="stylesheet" href="../OACT/templatefiles/master.css" type="text/css" media="screen">
<link rel="stylesheet" href="../OACT/templatefiles/print.css" type="text/css" media="print">
</head>
<body bgcolor="#ffffff" text="#000000" topmargin="1" leftmargin="0">
<table width="100%" border="0" cellspacing="0" cellpadding="4">
  <tbody>
  <tr> 
    <td class="sstop" valign="bottom" align="left" width="25%">
      Social Security Online
    </td>
    <td valign="bottom" class="titletext">
      <!-- sitetitle -->Popular Baby Names
    </td>
  </tr>
  <tr bgcolor="#333366"><td colspan="2" height="1"></td></tr>
  <tr>
    <td class="graystars" width="25%" valign="top">
       <a href="../OACT/babynames/">Popular Baby Names</a></td>
    <td valign="top"> 
      <a href="http://www.ssa.gov/"><img src="/templateimages/tinylogo.gif"
      width="52" height="47" align="left"
      alt="SSA logo: link to Social Security home page" border="0"></a>
      <h1>Popular Names by Birth Year</h1>September 11, 2014</td>
  </tr>
</tbody></table>
<script type="text/javascript" src="../OACT/babynames/chkinput.js"></script>

<table width="100%" border="0" cellspacing="0" cellpadding="4" summary="formatting">
  <tr valign="top">
    <td width="25%" class="greycell">
      <a href="../OACT/babynames/background.html">Background information</a>
      <p><br />
      &nbsp; Select another <label for="yob">year of birth</label>?<br />
      <form name="popnames" method="post" action="/cgi-bin/popularnames.cgi"
       onSubmit="return submitIt();">
      &nbsp; <input type="text" name="year" id="yob" size="4" value="2012"><input type="hidden" name="top" value="25"><input type="hidden" name="number" value="">
      &nbsp; <input type="submit" value="   Go  "></form>
    </td>
    <td><p align="center"><table width="$tablewidth" border="1" bordercolor="#aaabbb" cellpadding="2" cellspacing="0" summary="Popularity for top 25">
    <caption><h2>Popularity in 2012</h2></caption>
    <tr align="center" valign="bottom">
      <th scope="col" width="12%" bgcolor="#efefef">Rank</th>
      <th scope="col" width="$colwidth" bgcolor="#99ccff">Male name</th>
    <th scope="col" bgcolor="pink" width="41%">Female name</th></tr>
<tr align="right">
 <td>1</td> <td>Jacob</td> <td>Sophia</td>
</tr>
<tr align="right">
 <td>2</td> <td>Mason</td> <td>Emma</td>
</tr>
<tr align="right">
 <td>3</td> <td>Ethan</td> <td>Isabella</td>
</tr>
<tr align="right">
 <td>4</td> <td>Noah</td> <td>Olivia</td>
</tr>
<tr align="right">
 <td>5</td> <td>William</td> <td>Ava</td>
</tr>
<tr align="right">
 <td>6</td> <td>Liam</td> <td>Emily</td>
</tr>
<tr align="right">
 <td>7</td> <td>Michael</td> <td>Abigail</td>
</tr>
<tr align="right">
 <td>8</td> <td>Jayden</td> <td>Mia</td>
</tr>
<tr align="right">
 <td>9</td> <td>Alexander</td> <td>Madison</td>
</tr>
<tr align="right">
 <td>10</td> <td>Aiden</td> <td>Elizabeth</td>
</tr>
<tr align="right">
 <td>11</td> <td>Daniel</td> <td>Chloe</td>
</tr>
<tr align="right">
 <td>12</td> <td>Matthew</td> <td>Ella</td>
</tr>
<tr align="right">
 <td>13</td> <td>Elijah</td> <td>Avery</td>
</tr>
<tr align="right">
 <td>14</td> <td>James</td> <td>Addison</td>
</tr>
<tr align="right">
 <td>15</td> <td>Anthony</td> <td>Aubrey</td>
</tr>
<tr align="right">
 <td>16</td> <td>Benjamin</td> <td>Lily</td>
</tr>
<tr align="right">
 <td>17</td> <td>Joshua</td> <td>Natalie</td>
</tr>
<tr align="right">
 <td>18</td> <td>Andrew</td> <td>Sofia</td>
</tr>
<tr align="right">
 <td>19</td> <td>Joseph</td> <td>Charlotte</td>
</tr>
<tr align="right">
 <td>20</td> <td>David</td> <td>Zoey</td>
</tr>
<tr align="right">
 <td>21</td> <td>Jackson</td> <td>Grace</td>
</tr>
<tr align="right">
 <td>22</td> <td>Logan</td> <td>Hannah</td>
</tr>
<tr align="right">
 <td>23</td> <td>Christopher</td> <td>Amelia</td>
</tr>
<tr align="right">
 <td>24</td> <td>Gabriel</td> <td>Harper</td>
</tr>
<tr align="right">
 <td>25</td> <td>Samuel</td> <td>Lillian</td>
</tr>
<tr><td colspan="3"><small>Note: Rank 1 is the most popular,
rank 2 is the next most popular, and so forth. 
</table></p>
</td></tr></table>
<table class="printhide" width="100%" border="0" cellpadding="1" cellspacing="0">
  <tr bgcolor="#333366"><td height="1" colspan="2"></td></tr>
  <tr>
    <td width="26%" valign="middle">&nbsp;</td>
    <td valign="top" class="seventypercent">
       <a href="http://www.ssa.gov/privacy.html">Privacy Policy</a>&nbsp;
     | <a href="http://www.ssa.gov/websitepolicies.htm">Website Policies
        &amp; Other Important Information</a>&nbsp;
     | <a href="http://www.ssa.gov/sitemap.htm">Site Map</a></td>
  </tr>
</table>
</body></html>

I can't figure out how to find the rank or child that is, rank or name, when I try to run the program it is an empty set. I do not understand why. Any help would be nice This is my program

    import re
def searchtitle(cont):
    try:        
        title = re.search('Popularity\sin\s(\d\d\d\d)', cont)
        return title.group(0)
    except:
        print "couldn't find title"  
    pass
def searchnames(cont):
    try:
        info = re.findall('(\d+)\s(\w+)(\d+,\d+|\d+)\n\s(\w+)\n(\d+,\d+|\d+)', cont)
        return info
    except:
        print "couldn't find child info"
    pass
if __name__ == '__main__':
    try:
        file = open('Popular_Baby_Names.html')
        cont = file.read()
        file.close()
        ti = searchtitle(cont)
        info = searchnames(cont)
        print ti
        print info
    except:
        print "file couldn't be found"
        SystemExit
    ranks = []
    boysnames = []
    girlsnames = []
    girlsfreq = []
    boysfreq = []
    for info in info:
        ranks.append(int(info[0]))
        boysnames.append(info[1])
        boysfreq.append(int(info[2].replace(',', '')))
        girlsnames.append(info[3])
        girlsfreq.append(int(info[4].replace(',', '')))
    print ranks
    print boysnames
    print boysfreq
    print girlsnames
    print girlsfreq
    print ranks[0] + ranks[1]

    pass

+3

python html regex html-parsing findall

user3325783 15 Sep '14 at 1:16

source to share

2 answers

Alternatively, if you don't want to dive into the wonderful world of regular expressions, parse your HTML with an HTML parser.

Usage example BeautifulSoup

:

from bs4 import BeautifulSoup

soup = BeautifulSoup(open('Popular_Baby_Names.html'))
table = soup.find('table', summary='Popularity for top 25')

boys = []
girls = []
for row in table('tr')[1:-1]:
    cells = row('td')
    boys.append(cells[1].text)
    girls.append(cells[2].text)

print boys
print girls

Printing

[u'Jacob', u'Mason', u'Ethan', u'Noah', u'William', ... ]
[u'Sophia', u'Emma', u'Isabella', u'Olivia', u'Ava', ... ]

Also see the many reasons why you shouldn't parse HTML with regex:

Open RegEx tags, excluding standalone XHTML tags

+3

alecxe 15 Sep '14 at 1:34

source to share

Avinash Raj · Accepted Answer · 2014-09-15T01:26:45+0000

I think you need something like this,

<.*?>(\d{1,2})<.*?>\s*<.*?>(.*?)<.*?>\s*<.*?>(.*?)<.*?>

DEMO

It will capture and store ranks in group 1, the name of the boys in group 2, and the name of the girls in group 3.

Explanation

Findall regex method malfunction

More articles: