Extracting text from script tag with BeautifulSoup in Python

Question

Could you please help me with this? I am looking to extract the email, phone and name from the code below, which sits inside a SCRIPT tag (not the body), using BeautifulSoup (Python). I'm new to Python, and the blogs I have read recommend BeautifulSoup for extraction.

I tried to get the page using the following code -

fileDetails = BeautifulSoup(urllib2.urlopen('http://www.example.com').read())
results = fileDetails.find("email:")

This Ajax request code appears only once on the page. Can we also add a try/except so that no error is raised if it is not found on the page?

<script type="text/javascript" language='javascript'> 
$(document).ready( function (){

   $('#message').click(function(){
       alert();
   });

    $('#addmessage').click(function(){
        $.ajax({ 
            type: "POST",
            url: 'http://www.example.com',
            data: { 
                email: 'abc@g.com', 
                phone: '9999999999', 
                name: 'XYZ'
            }
        });
    });
});

Once I get this, I also want to save it to an Excel file.

Thanks in anticipation.

python urllib2 beautifulsoup

Chopra 04 Aug At 4:26 am

2 answers

alecxe · Answer 1 · 2014-08-04T04:49:04+0000

The content of the script tag can be retrieved via BeautifulSoup, and then a regular expression can be applied to get the required data.

Working example (based on what you described in the question):

import re
from bs4 import BeautifulSoup

data = """
<html>
    <head>
        <title>My Sample Page</title>
        <script>
        $.ajax({
            type: "POST",
            url: 'http://www.example.com',
            data: {
                email: 'abc@g.com',
                phone: '9999999999',
                name: 'XYZ'
            }
        });
        </script>
    </head>
    <body>
        <h1>What a wonderful world</h1>
    </body>
</html>
"""

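# Parse the HTML (naming the parser explicitly) and take the first <script> tag
soup = BeautifulSoup(data, 'html.parser')
script = soup.find('script')

# Capture key: 'value' pairs; only single-quoted values match
pattern = re.compile(r"(\w+): '(.*?)'")
fields = dict(re.findall(pattern, script.text))
print fields['email'], fields['phone'], fields['name']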

Prints:

abc@g.com 9999999999 XYZ

I don't really like this solution, because the regex approach is fragile: any small change in how the script is written can break it. I still think there is a better solution and that we are not seeing the bigger picture here. A link to the actual site would help a lot, but it is what it is.
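The question also asks how to avoid an error when the data is not on the page. A minimal sketch of one way to do that, not part of the original answer: the helper name extract_fields is hypothetical, and script_text is assumed to be whatever script.text returned, or None when soup.find('script') found nothing. It is written with the Python 3 print() function.

```python
import re

# Same pattern as in the answer, written as a raw string.
PATTERN = re.compile(r"(\w+): '(.*?)'")


def extract_fields(script_text):
    """Return the key/value pairs found, or an empty dict if there are none.

    script_text is assumed to be script.text, or None if no <script> matched.
    """
    if not script_text:
        return {}
    return dict(PATTERN.findall(script_text))


# Script body taken from the sample page in the answer.
sample = """
$.ajax({
    type: "POST",
    url: 'http://www.example.com',
    data: {
        email: 'abc@g.com',
        phone: '9999999999',
        name: 'XYZ'
    }
});
"""

fields = extract_fields(sample)
# .get() returns None instead of raising KeyError when a key is absent.
print(fields.get('email'), fields.get('phone'), fields.get('name'))
print(extract_fields(None))
```

Checking for a missing tag up front and reading values with .get() avoids both the AttributeError on a None tag and the KeyError on a missing field, so a try/except is not strictly needed here.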

UPD (fixing OP code):

soup = BeautifulSoup(data, 'html.parser')
script = soup.html.find_next_sibling('script', text=re.compile(r"\$\(document\)\.ready"))

pattern = re.compile(r"(\w+): '(.*?)'")
fields = dict(re.findall(pattern, script.text))
print fields['email'], fields['phone'], fields['name']

prints:

abcd@gmail.com 9999999999 Shamita Shetty
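The question also mentions saving the result to an Excel file, which the code above does not cover. One low-dependency option, offered as a sketch rather than as part of the original answer, is to write a CSV file with the standard csv module, since Excel opens CSV files directly. The file name contacts.csv, the column order, and the fields literal (sample values from the question) are assumptions made for illustration; the code targets Python 3.

```python
import csv

# Assumed result of the extraction step (sample values from the question).
fields = {'email': 'abc@g.com', 'phone': '9999999999', 'name': 'XYZ'}
columns = ['name', 'email', 'phone']

# newline='' is what the csv module documentation recommends on Python 3.
with open('contacts.csv', 'w', newline='') as handle:
    writer = csv.DictWriter(handle, fieldnames=columns)
    writer.writeheader()       # header row: name,email,phone
    writer.writerow(fields)    # one data row for this page
```

A native .xlsx file would need a third-party library, so CSV is used here to keep the sketch within the standard library.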

alecxe · Answer 2 · 2014-08-04T05:03:32+0000

As an alternative to the regex-based approach, you can parse the JavaScript code with the slimit module, which builds an abstract syntax tree and lets you collect all the assignments and put them into a dictionary:

from bs4 import BeautifulSoup
from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor


data = """
<html>
    <head>
        <title>My Sample Page</title>
        <script>
        $.ajax({
            type: "POST",
            url: 'http://www.example.com',
            data: {
                email: 'abc@g.com',
                phone: '9999999999',
                name: 'XYZ'
            }
        });
        </script>
    </head>
    <body>
        <h1>What a wonderful world</h1>
    </body>
</html>
"""

# get the script tag contents from the html
soup = BeautifulSoup(data)
script = soup.find('script')

# parse js
parser = Parser()
tree = parser.parse(script.text)
fields = {getattr(node.left, 'value', ''): getattr(node.right, 'value', '')
          for node in nodevisitor.visit(tree)
          if isinstance(node, ast.Assign)}

print fields

Prints:

{u'name': u"'XYZ'", u'url': u"'http://www.example.com'", u'type': u'"POST"', u'phone': u"'9999999999'", u'data': '', u'email': u"'abc@g.com'"}

Among other fields, there are the email, name and phone values that interest you.
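Note that the values in the printed dictionary still carry their JavaScript quotes (for example "'XYZ'"). A small sketch of one way to strip them, using the dictionary from the output above as input (with the u'' prefixes dropped); the helper name unquote is hypothetical, and the code targets Python 3:

```python
# Dictionary copied from the output shown above.
fields = {
    'name': "'XYZ'",
    'url': "'http://www.example.com'",
    'type': '"POST"',
    'phone': "'9999999999'",
    'data': '',
    'email': "'abc@g.com'",
}


def unquote(value):
    """Remove one pair of matching surrounding quotes, if present."""
    if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
        return value[1:-1]
    return value


# Keep only the three fields of interest, without the quotes.
wanted = {key: unquote(fields[key]) for key in ('email', 'phone', 'name')}
print(wanted)
```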

Hope it helps.
