How do I split a txt file on html tags or a regex and save the pieces as separate txt files in R?

I have a batch download of LexisNexis articles in html and txt format. The file itself contains headers, metadata, and many news articles that I need to systematically separate and save as independent txt files. The head of the txt version looks like this:

> head(textz, 100)
 [1] ""
 [2] "                               1 of 103 DOCUMENTS"
 [3] ""
 [4] ""
 [5] "                                Foreign Affairs"
 [6] ""
 [7] "                              May 2013 - June 2013"
 [8] ""
 [9] "Why the U.S. Army Needs Armor Subtitle: The Case for a Balanced Force"
[10] ""
[11] "BYLINE: Chris McKinney, Mark Elfendahl, and H. R. McMaster Authors BIOS: CHRIS"
[12] "MCKINNEY is a Lieutenant Colonel in the U.S. Army and an adviser to the Saudi"
[13] "Arabian National Guard. MARK ELFENDAHL is a Colonel in the U.S. Army and a"
[14] "student at the Joint Advanced Warfighting School in Norfolk, Virginia. H. R."
[15] "MCMASTER is a Major General in the U.S. Army and Commander of the Maneuver"
[16] "Center of Excellence at Fort Benning, Georgia."
[17] ""
[18] "SECTION: Vol. 92 No. 4 PAGE: 129"
[19] ""
[20] "LENGTH: 2856 words"
[21] ""
[22] ""
[23] "Ever since World War II, the United States has depended on armored forces --"
[24] "forces equipped with tanks and other protected vehicles -- to wage its wars."
....
....

      

A snapshot of the html version looks like this:

<DOC NUMBER=103>
<DOCFULL> -->
<br><div class="c0">
<p class="c1"><span class="c2">103 of 103 DOCUMENTS</span></p>
</div>
<br><div class="c0">
<br><p class="c1"><span class="c2">The New York Times</span></p>
</div>
<br><div class="c3">
<p class="c1"><span class="c4">July</span>
<span class="c2"> 26, 2011 Tuesday</span>
<span class="c2">Â </span>
<span class="c2">Â <br>Late Edition - Final</span></p>
</div>
<br><div class="c5">
<p class="c6"><span class="c7">A Step Toward Trust With China</span></p>
</div>
<br><div class="c5">
<p class="c6"><span class="c8">BYLINE: </span><span class="c2">By MIKE MULLEN. </span></p>
<p class="c9"><span class="c2">Mike Mullen, a </span>
<span class="c4">Navy admiral,</span><span class="c2"> is the chairman of the Joint Chiefs of Staff.
</span></p>
</div>
<br><div class="c5">
<p class="c6"><span class="c8">SECTION: </span>
<span class="c2">Section A; Column 0; Editorial Desk; OP-ED CONTRIBUTOR; Pg. 23</span></p>
</div>
<br><div class="c5">
<p class="c6"><span class="c8">LENGTH: </span>
<span class="c2">794 words</span></p>
</div>
<br><div class="c5">
<p class="c9"><span class="c2">Washington</span></p>
<p class="c9"><span class="c2">THE military relationship between the United States and China is one of the world's most important. And yet, clouded by some misunderstanding and suspicion, it remains among the most challenging. There are issues on which we disagree and are tempted to confront each other. But there are crucial areas where our interests coincide, on which we must work together.
</span></p>

      

Individual documents are separated by "[0-9] of [0-9] DOCUMENTS" in both versions, but with the grep and strsplit families I couldn't find a way to split the txt (or html) file in R that cleanly separates the component articles and lets me save them as independent txt files. Searching other questions turned up either nothing useful or answers that required Python. Any advice would be great!



2 answers


To split the txt version, assume the text is in doc_text as a single string, split it on the "N of M DOCUMENTS" header, and write each piece to sequentially named files file1.txt, file2.txt, etc.

The lapply for writing the files is adapted from @P Lapointe.



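# doc_text must be one string; if it was read with readLines(), collapse it first:
# doc_text <- paste(textz, collapse = "\n")
texts <- unlist(strsplit(doc_text, "\\s+\\d+\\sof\\s\\d+\\sDOCUMENTS"))
texts <- texts[-1]  # drop the first empty split

# write each article to file1.txt, file2.txt, ...
invisible(lapply(seq_along(texts), function(i) write(texts[i], paste0("file", i, ".txt"))))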
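
If the text is a character vector of lines, as in the head(textz, 100) output in the question, the same split can probably be done without collapsing to one string. What follows is a minimal sketch, assuming every article starts with a line of the form "N of M DOCUMENTS"; the sample textz, the out_dir folder and the article file names are made up for illustration.

```r
# Hypothetical stand-in for the lines returned by readLines() on the txt export
textz <- c("", "          1 of 2 DOCUMENTS", "", "First article body.",
           "", "          2 of 2 DOCUMENTS", "", "Second article body.")

# TRUE on every "N of M DOCUMENTS" header line
is_header <- grepl("^\\s*\\d+ of \\d+ DOCUMENTS\\s*$", textz)

# cumsum() gives each line the index of the article it belongs to;
# lines before the first header get index 0 and are dropped, as are the headers
article_id <- cumsum(is_header)
keep <- article_id > 0 & !is_header
articles <- split(textz[keep], article_id[keep])

# write each article to its own file (out_dir is an assumption, not from the question)
out_dir <- tempdir()
for (i in seq_along(articles)) {
  writeLines(articles[[i]], file.path(out_dir, paste0("article", i, ".txt")))
}
```

Working line by line keeps the original line breaks of each article, which the single-string strsplit approach only does if the lines were joined with "\n".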

      



The rvest library makes parsing html easy. Your documents don't quite line up with the <DOCFULL> and <DOC NUMBER> headers, so the code below strips the first and splits on the second. In the answer below, the document you provided has been extended with a following document (104) to show the split. You can use the lapply structure to do other things, like writing each article to a text file. Notice the css selector in html_nodes. There isn't a lot of structure in the html, but if you find some patterns, you can target the bits of each article with selectors.



library(rvest)
library(stringr)

articles  <- str_replace_all(doc, "\\n", " ") %>%    # remove new line to simplify
  str_replace_all("<DOCFULL>\\s+\\-\\->", " " ) %>%  # remove redundant header
  strsplit("<DOC NUMBER=\\d+>") %>%                  # split on DOC NUMBER header
  unlist()                                           # to a vector

# drop the first empty result from the split
articles <- articles[-1]

# use lapply to traverse all articles
c2_texts <- lapply(articles, function (article) {
  article %>% 
    read_html() %>%           # character input parsed as html
    html_nodes(css=".c2") %>% # find nodes with CSS selector, ex: c2
    html_text() })            # extract text from within the node

c2_texts
# [[1]]
# [1] "103 of 103 DOCUMENTS"                                                                                                                                                                                                                                                                                                                                                           
# [2] "The New York Times"                                                                                                                                                                                                                                                                                                                                                             
# [3] " 26, 2011 Tuesday"                                                                                                                                                                                                                                                                                                                                                              
# [4] "Â "                                                                                                                                                                                                                                                                                                                                                                             
# [5] "Â Late Edition - Final"                                                                                                                                                                                                                                                                                                                                                         
# [6] "By MIKE MULLEN. "                                                                                                                                                                                                                                                                                                                                                               
# [7] "Mike Mullen, a "                                                                                                                                                                                                                                                                                                                                                                
# [8] " is the chairman of the Joint Chiefs of Staff.     "                                                                                                                                                                                                                                                                                                                            
# [9] "Section A; Column 0; Editorial Desk; OP-ED CONTRIBUTOR; Pg. 23"                                                                                                                                                                                                                                                                                                                 
# [10] "794 words"                                                                                                                                                                                                                                                                                                                                                                      
# [11] "Washington"                                                                                                                                                                                                                                                                                                                                                                     
# [12] "THE military relationship between the United States and China is one of the world's most important. And yet, clouded by some misunderstanding and suspicion, it remains among the most challenging. There are issues on which we disagree and are tempted to confront each other. But there are crucial areas where our interests coincide, on which we must work together.     "
# 
# [[2]]
# [1] "104 of 104 DOCUMENTS" "The Added Item"      
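
Writing each article's extracted text to its own file could look roughly like the sketch below. It assumes a list shaped like c2_texts above; the sample list, the out_dir folder and the article_001.txt naming scheme are placeholders for illustration, not part of the original answer.

```r
# Hypothetical stand-in for the c2_texts list produced by the lapply above
c2_texts <- list(
  c("103 of 103 DOCUMENTS", "The New York Times", "Washington"),
  c("104 of 104 DOCUMENTS", "The Added Item")
)

out_dir <- tempdir()  # assumption: replace with your own output folder

paths <- vapply(seq_along(c2_texts), function(i) {
  path <- file.path(out_dir, sprintf("article_%03d.txt", i))
  # trimws() strips the padding left over from the html; empty pieces are dropped
  pieces <- trimws(c2_texts[[i]])
  writeLines(pieces[nzchar(pieces)], path)
  path
}, character(1))
```

vapply returns the file paths as a character vector, so it is easy to check afterwards which files were written.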

      







