How to parse 380k pages of HTML (best approach from a performance standpoint)?
What do I have:
- 380,000 HTML pages (saved from a news portal, 2010 to 2017).
- 65.19 GB (68,360,000 KB) total.
- ~170 KB per page on average.
- Server 1: 2.7 GHz i5, 16 GB RAM, SSD, macOS 10.12, R 3.4.0 (my laptop).
- Server 2: 3.5 GHz Xeon E3-1240 v5, 32 GB RAM, SSD, Windows Server 2012, R 3.4.0.
What I need:
- I need to parse these HTML pages and extract data that looks like this: Articles | Articles | News | Articles | ArticleContent | ... |
What I currently do:
require(rvest)

files <- list.files(file.path(getwd(), "data"), full.names = TRUE,
                    recursive = TRUE, pattern = "index")
downloadedLinks <- c()
for (i in seq_along(files)) {
  currentFile <- files[i]
  pg <- read_html(currentFile, encoding = "UTF-8")
  # Extract the canonical URL recorded in each page's <head>
  fileLink <- html_nodes(pg, xpath = ".//link[@rel='canonical']") %>% html_attr("href")
  downloadedLinks <- c(downloadedLinks, fileLink)
}
I ran this code on 40,000 pages and got this result:
- Server 1: 800 sec.
- Server 2: 1000 sec.
Extrapolating, processing all 380,000 pages should take about 7,600 seconds (roughly 127 minutes) on Server 1 and about 9,500 seconds (roughly 158 minutes) on Server 2.
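One hardware-independent win: the loop above grows downloadedLinks with c() on every iteration, which copies the entire vector each time and gets slower as the result grows. A sketch of the same extraction without that repeated copying (assuming the same data/ directory layout; untested against these specific files):

```r
library(rvest)

files <- list.files(file.path(getwd(), "data"), full.names = TRUE,
                    recursive = TRUE, pattern = "index")

# Pull the canonical link out of one saved page
extract_link <- function(f) {
  pg <- read_html(f, encoding = "UTF-8")
  html_attr(html_node(pg, xpath = ".//link[@rel='canonical']"), "href")
}

# vapply preallocates the full result vector once, instead of
# reallocating and copying it on every iteration like c() does
downloadedLinks <- vapply(files, extract_link, character(1), USE.NAMES = FALSE)
```

The I/O and parsing still dominate the runtime, but removing the quadratic vector growth keeps the per-page cost constant as the corpus grows.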
Because of this, I have a few questions and hope the community can help me. I would be glad to hear any ideas, suggestions, or criticism.
- How can I improve my code above to reduce processing time?
- Why does Server 2 (an actual server) show worse performance than Server 1 (my laptop)?
- Is there a way to make Server 2 use more CPU and RAM? (The Rgui.exe process sits at about 10% CPU usage and 300 MB RAM.)
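On the last two questions: a single R session is single-threaded, so Rgui.exe hovering near 10% CPU on a quad-core (8-thread) Xeon is expected behavior rather than a fault of Server 2. One way to use the extra cores is the base parallel package; PSOCK clusters also work on Windows Server. A sketch (cluster size and the per-file function are illustrative, not tuned):

```r
library(parallel)

files <- list.files(file.path(getwd(), "data"), full.names = TRUE,
                    recursive = TRUE, pattern = "index")

cl <- makeCluster(detectCores() - 1)  # leave one core for the OS
clusterEvalQ(cl, library(rvest))      # load rvest on every worker

# Each worker parses a subset of the files independently
downloadedLinks <- parSapply(cl, files, function(f) {
  pg <- read_html(f, encoding = "UTF-8")
  html_attr(html_node(pg, xpath = ".//link[@rel='canonical']"), "href")
}, USE.NAMES = FALSE)

stopCluster(cl)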