How can I consistently extract English text from different web pages on a Tamil site?
The Naalayira Divya Prabandham is a collection of 4,000 Hindu devotional poems written in Tamil. The website http://dravidaveda.org has a web page for each of the 4,000 verses. Each page contains the Tamil verse, a step-by-step commentary on it, and an English translation. For example, here is the web page for verse 1008.
My question is: is there any way to get the English translations of all 4,000 verses, so that I can assemble the complete English translation of the Naalayira Divya Prabandham in one document? For example, from the web page I linked above, I want to extract "Singavel-Kundram is the place where the pure Lord came as a lion man - while the world was stunned - and tore apart Asura Hiranyas chest Red lion lions offer worship, bowing down the elephant's tusks at his feet." along with the number 1008, and put it at position 1008 in my document.
So how would I go about it? I am guessing it might require some kind of programming, but I have no technical background, so can someone tell me what I need to do? Note that the article IDs, such as the number 1379 in the URL "dravidaveda.org/index.php?option=com_content&view=article&id=1379&ml=1", do not match the verse numbers.
You can use software/commands that dump the content of web pages to the terminal or console, e.g. lynx, w3m, links, etc. (although this is also possible with wget, curl, aria2, etc.). For more information, visit the manual pages of the respective commands.
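For instance, a quick one-liner like this (a minimal sketch; it assumes lynx is installed, and uses the example article id 1379 from the question) shows what a rendered dump of one page looks like:

# Render the page as plain text and show its first 20 lines
lynx -dump "http://dravidaveda.org/index.php?option=com_content&view=article&id=1379&ml=1" | head -n 20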
Here I provide an example with lynx:
#!/bin/bash
# Loop over the range of article ids that covers the whole collection
for i in {47..4568}
do
    # The first line of the dumped page is the verse number, e.g. (1008)
    lynx -dump "http://dravidaveda.org/index.php?option=com_content&view=article&id=$i&ml=1" | head -n 1 >> ndp.txt
    echo -e "\n" >> ndp.txt
    # The translation appears under the "English Translation" heading;
    # grab that heading plus the 10 lines that follow it
    lynx -dump "http://dravidaveda.org/index.php?option=com_content&view=article&id=$i&ml=1" | grep -A 10 'English Translation' >> ndp.txt
    echo -e "\n\n" >> ndp.txt
done
Here {47..4568} automatically expands to 47, 48, ..., 4568 in sequence. (I found that the whole Naalayira Divya Prabandham can be pulled out of this range of ids.)
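If brace expansion is new to you, a quick way to see what it does (plain bash, nothing site-specific):

# Prints: 47 48 49 50
echo {47..50}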
The 1st lynx command will write the verse number, e.g. (1008), to a file named ndp.txt. The 2nd lynx command will write the "English Translation" section for that verse to ndp.txt. Hence, with the for loop and the range provided, you will get all the verses with their English translations in the file ndp.txt.
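Assuming the page layout matches the verse 1008 example from the question, an entry in ndp.txt should look roughly like this (the exact spacing and line breaks depend on how lynx renders the page):

(1008)

English Translation

Singavel-Kundram is the place where the pure Lord came as a lion man -
while the world was stunned - and tore apart Asura Hiranyas chest ...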
Note that, as you mentioned, the page ids do not follow the verse numbers, and it is difficult to predict which ids are skipped on the site. Anyway, I think you can easily remove the lines that came from unwanted page ids from ndp.txt afterwards.
However, if you want, you can skip dumping those pages by adding a check, for example:
if [[ $(lynx -dump "http://dravidaveda.org/index.php?option=com_content&view=article&id=$i&ml=1" | head -c 1) = "(" ]]
then
    [Your commands here]
fi
Here the expression in the if condition checks whether the first character of the page we are about to dump is "(" or not.
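You can convince yourself of what head -c 1 does with a quick test (plain shell, no site access needed):

# Prints the single character: (
printf '(1008)\n' | head -c 1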
So the following script might work, depending on the content of the web pages:
#!/bin/bash
for i in {47..4568}
do
    # Only dump pages whose first character is "(", i.e. pages that
    # start with a verse number like (1008)
    if [[ $(lynx -dump "http://dravidaveda.org/index.php?option=com_content&view=article&id=$i&ml=1" | head -c 1) = "(" ]]
    then
        lynx -dump "http://dravidaveda.org/index.php?option=com_content&view=article&id=$i&ml=1" | head -n 1 >> ndp.txt
        echo -e "\n" >> ndp.txt
        lynx -dump "http://dravidaveda.org/index.php?option=com_content&view=article&id=$i&ml=1" | grep -A 10 'English Translation' >> ndp.txt
        echo -e "\n\n" >> ndp.txt
    fi
done
I checked and the above script works fine on my PC.
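As a side note, the script above downloads each page three times (once for the check and twice for the extraction). A sketch that fetches each page only once, by storing the dump in a shell variable, could look like this; it is the same idea, just untested against the site:

#!/bin/bash
for i in {47..4568}
do
    # Fetch the rendered page once and reuse it
    page=$(lynx -dump "http://dravidaveda.org/index.php?option=com_content&view=article&id=$i&ml=1")
    # ${page:0:1} is the first character, same test as head -c 1
    if [[ ${page:0:1} = "(" ]]
    then
        printf '%s\n' "$page" | head -n 1 >> ndp.txt
        echo -e "\n" >> ndp.txt
        printf '%s\n' "$page" | grep -A 10 'English Translation' >> ndp.txt
        echo -e "\n\n" >> ndp.txt
    fi
done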
Update / Improvement:
The file ndp.txt has the verses in an inconsistent order, because we receive them from the site in an inconsistent order (the ids do not follow the verse numbers). So, finally, it can be sorted with the following command (thanks @terdon for the perl code):
perl -ne 'if(/^\((\d+)\)\s*$/){$d=$1;} push @{$k{$d}},$_; END{print "@{$k{$_}}\n" for sort { $a <=> $b} keys(%k)}' ndp.txt
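The one-liner groups every line under the most recent "(verse number)" line it saw, then prints the groups in numeric order of the verse numbers. To keep the sorted result, redirect the output to a new file (ndp-sorted.txt is just an example name):

perl -ne 'if(/^\((\d+)\)\s*$/){$d=$1;} push @{$k{$d}},$_; END{print "@{$k{$_}}\n" for sort { $a <=> $b} keys(%k)}' ndp.txt > ndp-sorted.txt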