Extracting text from script tag with BeautifulSoup in Python
Could you please help me with this? I am looking to extract the email, phone and name from the code below, which sits inside a SCRIPT tag (not the body), using BeautifulSoup (Python). I'm new to Python, and a blog recommended BeautifulSoup for this kind of extraction.
I tried to get the page using the following code:
fileDetails = BeautifulSoup(urllib2.urlopen('http://www.example.com').read())
results = fileDetails.find("email:")
This Ajax request code appears only once on the page. Can we also add a try/except so that no error is thrown if it isn't found on the page?
<script type="text/javascript" language='javascript'>
$(document).ready( function (){
$('#message').click(function(){
alert();
});
$('#addmessage').click(function(){
$.ajax({
type: "POST",
url: 'http://www.example.com',
data: {
email: 'abc@g.com',
phone: '9999999999',
name: 'XYZ'
}
});
});
});
</script>
Once I have this, I also want to save the data to an Excel file.
Thanks in anticipation.
The contents of the script tag can be retrieved with BeautifulSoup, and then a regular expression can be applied to extract the required data.
Working example (based on what you described in the question):
import re
from bs4 import BeautifulSoup
data = """
<html>
<head>
<title>My Sample Page</title>
<script>
$.ajax({
type: "POST",
url: 'http://www.example.com',
data: {
email: 'abc@g.com',
phone: '9999999999',
name: 'XYZ'
}
});
</script>
</head>
<body>
<h1>What a wonderful world</h1>
</body>
</html>
"""
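# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(data, 'html.parser')
script = soup.find('script')
# raw string, so \w reaches the regex engine unchanged; matches  key: 'value'
pattern = re.compile(r"(\w+): '(.*?)'")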
fields = dict(re.findall(pattern, script.text))
print fields['email'], fields['phone'], fields['name']
This prints:
abc@g.com 9999999999 XYZ
I don't really like this solution, as the regex approach is really fragile: all sorts of things can happen that would break it. I still think there is a better solution and that we are missing the bigger picture here. A link to the actual site would help a lot, but it is what it is.
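The question also asks for a way to avoid errors when the data is not on the page. Below is a minimal sketch of one way to do that, assuming the same `key: 'value'` layout as in the example above. Rather than wrapping everything in try/except, it checks for missing pieces and returns an empty dict. The helper name `extract_fields` and its `wanted` parameter are made up for illustration, and the snippet is written so that it also runs on Python 3.

```python
import re
from bs4 import BeautifulSoup

# Matches pairs such as  email: 'abc@g.com'  (single-quoted values only).
FIELD_PATTERN = re.compile(r"(\w+): '(.*?)'")


def extract_fields(html, wanted=('email', 'phone', 'name')):
    """Return the wanted fields from the first script tag that has them all.

    Hypothetical helper: returns an empty dict instead of raising
    when no script tag contains every wanted field.
    """
    soup = BeautifulSoup(html, 'html.parser')
    for script in soup.find_all('script'):
        # .string is None for empty or external (src=...) script tags
        source = script.string or ''
        found = dict(FIELD_PATTERN.findall(source))
        if all(key in found for key in wanted):
            return {key: found[key] for key in wanted}
    return {}


page = """<html><head><script>
$.ajax({type: "POST", url: 'http://www.example.com',
        data: {email: 'abc@g.com', phone: '9999999999', name: 'XYZ'}});
</script></head><body></body></html>"""

fields = extract_fields(page)
print(fields.get('email'))

# A page without the script simply yields an empty dict.
missing = extract_fields('<html><body><p>No script here</p></body></html>')
print(missing)
```

Using `fields.get('email')` instead of `fields['email']` keeps the calling code free of `KeyError` as well. The regex is still as fragile as described above; the guard only makes a miss harmless.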
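UPD (fixing the OP's code; here the script tag is a sibling that follows the html element, and it is located by its $(document).ready text):
soup = BeautifulSoup(data, 'html.parser')
script = soup.html.find_next_sibling('script', text=re.compile(r"\$\(document\)\.ready"))
pattern = re.compile(r"(\w+): '(.*?)'")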
fields = dict(re.findall(pattern, script.text))
print fields['email'], fields['phone'], fields['name']
prints:
abcd@gmail.com 9999999999 Shamita Shetty
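The question also mentions saving the result to an Excel file, which none of the snippets above cover. Here is a minimal sketch, assuming a CSV file is acceptable (Excel opens CSV directly). It uses only the standard library and is written for Python 3. The `write_rows` helper, the column order and the sample `fields` dict are assumptions made for illustration.

```python
import csv
import io

# Sample data in the shape produced by the regex approach above.
fields = {'email': 'abc@g.com', 'phone': '9999999999', 'name': 'XYZ'}
columns = ['name', 'email', 'phone']  # assumed column order


def write_rows(handle, rows, columns):
    """Write a header line followed by one CSV line per dict in rows."""
    writer = csv.DictWriter(handle, fieldnames=columns, lineterminator='\n')
    writer.writeheader()
    writer.writerows(rows)


# An in-memory buffer keeps the sketch free of side effects.
buffer = io.StringIO()
write_rows(buffer, [fields], columns)
print(buffer.getvalue())
```

To write a real file, pass a handle opened with `open(path, 'w', newline='')` in place of the buffer, where `path` is whatever file name you choose. A native .xlsx file would need a third-party library, which is outside what this sketch covers.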
As an alternative to the regex-based approach, you can parse the JavaScript code with the slimit module, which builds an abstract syntax tree, and then collect all the assignments and put them in a dictionary:
from bs4 import BeautifulSoup
from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor
data = """
<html>
<head>
<title>My Sample Page</title>
<script>
$.ajax({
type: "POST",
url: 'http://www.example.com',
data: {
email: 'abc@g.com',
phone: '9999999999',
name: 'XYZ'
}
});
</script>
</head>
<body>
<h1>What a wonderful world</h1>
</body>
</html>
"""
# get the script tag contents from the html
soup = BeautifulSoup(data, 'html.parser')
script = soup.find('script')
# parse js
parser = Parser()
tree = parser.parse(script.text)
fields = {getattr(node.left, 'value', ''): getattr(node.right, 'value', '')
for node in nodevisitor.visit(tree)
if isinstance(node, ast.Assign)}
print fields
This prints:
{u'name': u"'XYZ'", u'url': u"'http://www.example.com'", u'type': u'"POST"', u'phone': u"'9999999999'", u'data': '', u'email': u"'abc@g.com'"}
Among other fields, there are the email, name and phone that interest you. Note that the values keep their original JavaScript quote characters.
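The values in that dictionary are the raw JavaScript literals, so they still carry their quotes. A small post-processing sketch, assuming you only want the three contact fields and that the literals contain no escaped quotes. The `raw` dict simply copies the output shown above so the snippet runs without slimit, and `strip_js_quotes` is a made-up helper name.

```python
# Copy of the dictionary printed above (without the Python 2 u'' prefixes).
raw = {'name': "'XYZ'", 'url': "'http://www.example.com'", 'type': '"POST"',
       'phone': "'9999999999'", 'data': '', 'email': "'abc@g.com'"}


def strip_js_quotes(value):
    """Drop one pair of matching single or double quotes around value."""
    if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
        return value[1:-1]
    return value


wanted = ('email', 'phone', 'name')
# Keys that are absent from raw are skipped rather than raising KeyError.
contact = {key: strip_js_quotes(raw[key]) for key in wanted if key in raw}
print(contact)
```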
Hope it helps.