Web scraping with R: cleaning up the content
I am just getting started with web scraping in R. I wrote this code:
library(rvest)
mps <- read_html("http://tunisie-annonce.com/AnnoncesImmobilier.asp")
mps %>%
  html_nodes("tr") %>%
  html_text()
This gets the content I need, which I write to a text file. My problem is that I want to eliminate those red dots, but I cannot. Could you help me? I think these dots stand in for the <b> and <br> tags in the HTML code.
2 answers
Whoever built this page, very frustratingly, laid the data out as table rows that are not enclosed in their own <table> tag, so the easiest thing to do is to wrap the rows in one yourself to make them easier to parse:
library(rvest)

mps <- read_html("http://tunisie-annonce.com/AnnoncesImmobilier.asp")

df <- mps %>%
  html_nodes("tr.Entete1, tr.Tableau1") %>%    # get correct rows
  paste(collapse = '\n') %>%    # paste nodes back to a single string
  paste('<table>', ., '</table>') %>%    # add enclosing table node
  read_html() %>%    # reread as HTML
  html_node('table') %>%
  html_table(fill = TRUE) %>%    # parse as table
  { setNames(.[-1,], make.names(.[1,], unique = TRUE)) }    # grab names from first row

head(df)
#> X Région NA. Nature NA..1 Type NA..2
#> 2 Prix <NA> NA <NA> NA <NA> NA
#> 3 Modifiée NA <NA> NA <NA> NA
#> 4 Kelibia NA Terrain NA Terrain nu NA
#> 5 Cite El Ghazala NA Location NA App. 4 pièc NA
#> 6 Le Bardo NA Location NA App. 1 pièc NA
#> 7 Le Bardo NA Location vacance NA App. 3 pièc NA
#> Texte.annonce NA..3 Prix Prix.1 X.1 Modifiée
#> 2 <NA> NA <NA> <NA> <NA> <NA>
#> 3 <NA> NA <NA> <NA> <NA> <NA>
#> 4 Terrain a 5 km de kelibi NA 80 000 07/05/2017
#> 5 S plus 3 haut standing c NA 790 07/05/2017
#> 6 Appartements meubles NA 40 000 07/05/2017
#> 7 Un bel appartement au bardo m NA 420 07/05/2017
#> Modifiée.1 NA..4 NA..5
#> 2 <NA> NA NA
#> 3 <NA> NA NA
#> 4 <NA> NA NA
#> 5 <NA> NA NA
#> 6 <NA> NA NA
#> 7 <NA> NA NA
Note that there are many NA columns and other junk here yet to be cleaned up, but at least the data is usable at this point.
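The remaining clean-up could look something like the following. This is only a minimal sketch in base R, assuming the blank cells come through as NA or empty strings. The data frame `toy` is a hypothetical stand-in for the scraped table, with made-up column names and only a few values borrowed from the output above, so the real `df` may need extra steps.

```r
# Hypothetical stand-in for the scraped table (not the real output).
toy <- data.frame(
  Region = c(NA, "Kelibia", "Le Bardo"),
  NA.    = c(NA, NA, NA),
  Nature = c(NA, "Terrain", "Location"),
  Prix   = c(NA, "80 000", "420"),
  stringsAsFactors = FALSE
)

# Mark cells that carry no information: NA or an empty string.
is_blank <- is.na(toy) | toy == ""

# Keep only rows and columns that have at least one non-blank cell.
clean <- toy[rowSums(!is_blank) > 0, colSums(!is_blank) > 0, drop = FALSE]

# Strip everything except digits from the price, then convert to numeric.
clean$Prix <- as.numeric(gsub("[^0-9]", "", clean$Prix))

print(clean)
```

Filtering on a logical matrix keeps the row and column tests symmetrical, and removing every non-digit character from the price also handles thousands separators that are not plain spaces.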