How to parse 380k HTML pages (what is the best approach from a performance standpoint)?

What do I have:

  • 380,000 HTML pages saved from a news portal (2010 to 2017).
  • 65.19 GB (68,360,000 KB) in total.
  • ~170 KB per page on average.
  • Server 1: 2.7 GHz i5, 16 GB RAM, SSD, macOS 10.12, R 3.4.0 (my laptop).
  • Server 2: 3.5 GHz Xeon E3-1240 v5, 32 GB RAM, SSD, Windows Server 2012, R 3.4.0.

What I need:

  • I need to parse these HTML pages and extract data that looks like this: Articles | Articles | News | Articles | ArticleContent | ... |

What I am doing:

library(rvest)  # read_html() comes from xml2, which rvest loads

# every saved page whose file name contains "index"
files <- list.files(file.path(getwd(), "data"),
                    full.names = TRUE, recursive = TRUE, pattern = "index")

downloadedLinks <- c()
for (i in seq_along(files)) {
  currentFile <- files[i]
  pg <- read_html(currentFile, encoding = "UTF-8")
  # canonical URL stored in the page's <link rel="canonical"> tag
  fileLink <- html_nodes(pg, xpath = ".//link[@rel='canonical']") %>% html_attr("href")
  downloadedLinks <- c(downloadedLinks, fileLink)
}
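
One change I am already considering (this ties into my first question below) is to stop growing downloadedLinks with c() on every iteration and let vapply() allocate the full result up front. A minimal, untested sketch of that variant; extract_link is just a helper name I made up, and files / library(rvest) are the same as above:

extract_link <- function(f) {
  pg <- read_html(f, encoding = "UTF-8")
  link <- html_nodes(pg, xpath = ".//link[@rel='canonical']") %>% html_attr("href")
  # return exactly one string per file so vapply() can check the shape
  if (length(link) == 0) NA_character_ else link[1]
}

# character(1) tells vapply() to preallocate one string per file
downloadedLinks <- vapply(files, extract_link, character(1), USE.NAMES = FALSE)

I do not know how much of the runtime is actually spent growing the vector, so this may or may not matter at 380,000 files.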

I ran the original loop on 40,000 pages and got these results:

  • Server 1: 800 sec.
  • Server 2: 1000 sec.

This means that processing all 380,000 pages would take about 7,600 seconds (800 s × 380,000 / 40,000), or roughly 127 minutes, on Server 1, and about 9,500 seconds, or roughly 158 minutes, on Server 2.

Because of this, I have a few questions and hope the community can help me. I would be glad to hear any ideas, suggestions, or criticism.

  • How can I improve the code above to reduce processing time? (A rough sketch of a parallel approach I am considering is shown after this list.)
  • Why does Server 2 (the real server) perform worse than Server 1 (my laptop)?
  • Is there a way to make R on Server 2 use more of the CPU and RAM? At the moment the Rgui.exe process uses only about 10% CPU and 300 MB of RAM.
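
For the first and third questions together, this is the rough parallel sketch I have in mind, assuming a PSOCK cluster from the base parallel package (which, as far as I know, is the option that also works on Windows Server); I have not benchmarked it on the full data set:

library(parallel)

files <- list.files(file.path(getwd(), "data"),
                    full.names = TRUE, recursive = TRUE, pattern = "index")

# one worker per core, leaving one core for the OS; PSOCK workers are separate R processes
cl <- makeCluster(detectCores() - 1)
clusterEvalQ(cl, library(rvest))   # each worker has to load rvest/xml2 itself

downloadedLinks <- unlist(parLapply(cl, files, function(f) {
  pg <- read_html(f, encoding = "UTF-8")
  html_attr(html_nodes(pg, xpath = ".//link[@rel='canonical']"), "href")
}))

stopCluster(cl)

If a page has no canonical link, the worker returns character(0) and unlist() silently drops that file, so the result can end up shorter than length(files); I would need to handle that case if I want exactly one entry per page.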