How can I consistently extract English text from different web pages on a Tamil site?
The Naalayira Divya Prabandham is a collection of 4,000 Hindu devotional poems written in Tamil. The website http://dravidaveda.org has a web page for each of the 4,000 verses. Each page contains the Tamil verse, a step-by-step commentary on it, and an English translation. For example, here is the web page for verse 1008.
My question is: is there any way to get the English translations of all 4,000 verses, so that I can assemble the complete English translation of the Naalayira Divya Prabandham in one document? For example, from the web page I linked above, I want to extract "Singavel-Kundram is the place where the pure Lord came as a lion man - while the world was stunned - and tore apart Asura Hiranyas chest Red lion lions offer worship, bowing down the elephant's tusks at his feet." along with the number 1008, and put it at position 1008 in my document.
So how would I go about it? I am guessing it might require some kind of programming, but I have no technical background, so can someone tell me what I need to do? Note that the article IDs, such as the number 1379 in the URL "dravidaveda.org/index.php?option=com_content&view=article&id=1379&ml=1", do not match the verse numbers.
You can use software/commands that dump the content of web pages to the terminal or console, e.g. lynx, w3m, links, etc. (although this is also possible with wget, curl, aria2, etc.). For more information, visit the manual pages of the respective commands.
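For instance, a quick one-liner like this (a minimal sketch; it assumes lynx is installed, and uses the example article id 1379 from the question) shows what a rendered dump of one page looks like:

# Render the page as plain text and show its first 20 lines
lynx -dump "http://dravidaveda.org/index.php?option=com_content&view=article&id=1379&ml=1" | head -n 20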
Here I provide an example with lynx:
#!/bin/bash
# Loop over the range of article ids that covers the whole collection
for i in {47..4568}
do
    # The first line of the dumped page is the verse number, e.g. (1008)
    lynx -dump "http://dravidaveda.org/index.php?option=com_content&view=article&id=$i&ml=1" | head -n 1 >> ndp.txt
    echo -e "\n" >> ndp.txt
    # The translation appears under the "English Translation" heading;
    # grab that heading plus the 10 lines that follow it
    lynx -dump "http://dravidaveda.org/index.php?option=com_content&view=article&id=$i&ml=1" | grep -A 10 'English Translation' >> ndp.txt
    echo -e "\n\n" >> ndp.txt
done
Here {47..4568} automatically expands to 47, 48, ..., 4568 in sequence. (I found that the whole Naalayira Divya Prabandham can be pulled out of this range of ids.)
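If brace expansion is new to you, a quick way to see what it does (plain bash, nothing site-specific):

# Prints: 47 48 49 50
echo {47..50}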
The 1st lynx command will write the verse number, e.g. (1008), to a file named ndp.txt. The 2nd lynx command will write the "English Translation" section for that verse to ndp.txt. Hence, with the for loop and the range provided, you will get all the verses with their English translations in the file ndp.txt.
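Assuming the page layout matches the verse 1008 example from the question, an entry in ndp.txt should look roughly like this (the exact spacing and line breaks depend on how lynx renders the page):

(1008)

English Translation

Singavel-Kundram is the place where the pure Lord came as a lion man -
while the world was stunned - and tore apart Asura Hiranyas chest ...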
Note that, as you mentioned, the page ids do not follow the verse numbers, and it is difficult to predict which ids are skipped on the site. Anyway, I think you can easily remove the lines that came from unwanted page ids from ndp.txt afterwards.
However, if you want, you can skip dumping those pages by adding a check, for example:
if [[ $(lynx -dump "http://dravidaveda.org/index.php?option=com_content&view=article&id=$i&ml=1" | head -c 1) = "(" ]]
then
    [Your commands here]
fi
Here the expression in the if condition checks whether the first character of the page we are about to dump is "(" or not.
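You can convince yourself of what head -c 1 does with a quick test (plain shell, no site access needed):

# Prints the single character: (
printf '(1008)\n' | head -c 1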
So the following script might work, depending on the content of the web pages:
#!/bin/bash
for i in {47..4568}
do
    # Only dump pages whose first character is "(", i.e. pages that
    # start with a verse number like (1008)
    if [[ $(lynx -dump "http://dravidaveda.org/index.php?option=com_content&view=article&id=$i&ml=1" | head -c 1) = "(" ]]
    then
        lynx -dump "http://dravidaveda.org/index.php?option=com_content&view=article&id=$i&ml=1" | head -n 1 >> ndp.txt
        echo -e "\n" >> ndp.txt
        lynx -dump "http://dravidaveda.org/index.php?option=com_content&view=article&id=$i&ml=1" | grep -A 10 'English Translation' >> ndp.txt
        echo -e "\n\n" >> ndp.txt
    fi
done
I checked and the above script works fine on my PC.
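As a side note, the script above downloads each page three times (once for the check and twice for the extraction). A sketch that fetches each page only once, by storing the dump in a shell variable, could look like this; it is the same idea, just untested against the site:

#!/bin/bash
for i in {47..4568}
do
    # Fetch the rendered page once and reuse it
    page=$(lynx -dump "http://dravidaveda.org/index.php?option=com_content&view=article&id=$i&ml=1")
    # ${page:0:1} is the first character, same test as head -c 1
    if [[ ${page:0:1} = "(" ]]
    then
        printf '%s\n' "$page" | head -n 1 >> ndp.txt
        echo -e "\n" >> ndp.txt
        printf '%s\n' "$page" | grep -A 10 'English Translation' >> ndp.txt
        echo -e "\n\n" >> ndp.txt
    fi
done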
Update / Improvement:
The file ndp.txt has the verses in an inconsistent order, because we receive them from the site in an inconsistent order (the ids do not follow the verse numbers). So, finally, it can be sorted with the following command (thanks @terdon for the perl code):
perl -ne 'if(/^\((\d+)\)\s*$/){$d=$1;} push @{$k{$d}},$_; END{print "@{$k{$_}}\n" for sort { $a <=> $b} keys(%k)}' ndp.txt
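The one-liner groups every line under the most recent "(verse number)" line it saw, then prints the groups in numeric order of the verse numbers. To keep the sorted result, redirect the output to a new file (ndp-sorted.txt is just an example name):

perl -ne 'if(/^\((\d+)\)\s*$/){$d=$1;} push @{$k{$d}},$_; END{print "@{$k{$_}}\n" for sort { $a <=> $b} keys(%k)}' ndp.txt > ndp-sorted.txt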