Web scraping with R: cleaning up the content
I am just getting started with web scraping in R. I wrote this code:
library(rvest)
mps <- read_html("http://tunisie-annonce.com/AnnoncesImmobilier.asp")
mps %>%
html_nodes("tr") %>%
html_text()
This gets the content I need, which I write to a text file. My problem is that I want to remove those red dots, but I cannot. Could you help me? I think these dots take the place of the <b> and <br> tags in the HTML code.
Whoever built this page, very frustratingly, laid the data out as a table within a table but did not mark it up with a <table> tag, so the easiest thing to do is to wrap the rows in one yourself to make them easier to parse:
library(rvest)

mps <- read_html("http://tunisie-annonce.com/AnnoncesImmobilier.asp")

df <- mps %>%
    html_nodes("tr.Entete1, tr.Tableau1") %>%    # get correct rows
    paste(collapse = '\n') %>%    # paste nodes back to a single string
    paste('<table>', ., '</table>') %>%    # add enclosing table node
    read_html() %>%    # reread as HTML
    html_node('table') %>%
    html_table(fill = TRUE) %>%    # parse as table
    { setNames(.[-1,], make.names(.[1,], unique = TRUE)) }    # grab names from first row

head(df)
#> X Région NA. Nature NA..1 Type NA..2
#> 2 Prix <NA> NA <NA> NA <NA> NA
#> 3 Modifiée NA <NA> NA <NA> NA
#> 4 Kelibia NA Terrain NA Terrain nu NA
#> 5 Cite El Ghazala NA Location NA App. 4 pièc NA
#> 6 Le Bardo NA Location NA App. 1 pièc NA
#> 7 Le Bardo NA Location vacance NA App. 3 pièc NA
#> Texte.annonce NA..3 Prix Prix.1 X.1 Modifiée
#> 2 <NA> NA <NA> <NA> <NA> <NA>
#> 3 <NA> NA <NA> <NA> <NA> <NA>
#> 4 Terrain a 5 km de kelibi NA 80 000 07/05/2017
#> 5 S plus 3 haut standing c NA 790 07/05/2017
#> 6 Appartements meubles NA 40 000 07/05/2017
#> 7 Un bel appartement au bardo m NA 420 07/05/2017
#> Modifiée.1 NA..4 NA..5
#> 2 <NA> NA NA
#> 3 <NA> NA NA
#> 4 <NA> NA NA
#> 5 <NA> NA NA
#> 6 <NA> NA NA
#> 7 <NA> NA NA
Note that there are still a lot of NAs and other cruft here to be cleaned up, but at least the data is usable at this point.
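One possible first cleanup pass is to drop the columns and rows that contain nothing but NA. The sketch below uses a small made-up data frame as a stand-in for the parsed table (the column names and values are hypothetical, not the real scrape), and it assumes empty cells really are NA and not empty strings:

```r
# Hypothetical stand-in for the parsed table, with one empty column and one empty row
sample_df <- data.frame(
  Region = c(NA, "Kelibia", "Le Bardo"),
  Empty  = c(NA, NA, NA),
  Nature = c(NA, "Terrain", "Location"),
  stringsAsFactors = FALSE
)

# Keep only columns that contain at least one non-NA value
no_na_cols <- sample_df[, colSums(!is.na(sample_df)) > 0, drop = FALSE]

# Keep only rows that contain at least one non-NA value
no_na_rows <- no_na_cols[rowSums(!is.na(no_na_cols)) > 0, , drop = FALSE]

print(no_na_rows)
```

If the real table holds empty strings instead of NA, they would need to be converted to NA first for this to have any effect.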
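You can always use regular expressions to remove unwanted characters from the extracted text (gsub works on a character vector, not on the parsed HTML document itself), for example:
txt <- mps %>% html_nodes("tr") %>% html_text()
txt <- gsub("•", " ", txt)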
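As a rough sketch of how that cleanup could be wrapped up, the helper below replaces the bullet character, collapses runs of whitespace, and trims the result. The function name clean_text and the sample strings are made up for illustration, and it assumes the red dots arrive as the Unicode bullet U+2022 in the extracted text:

```r
# Hypothetical sample of scraped row text; the real input would come from html_text()
sample_txt <- c("\u2022 Kelibia\r\n  Terrain nu", "\u2022Le Bardo   Location")

clean_text <- function(x) {
  x <- gsub("\u2022", " ", x, fixed = TRUE)  # replace bullet characters with a space
  x <- gsub("[[:space:]]+", " ", x)          # collapse runs of whitespace into one space
  trimws(x)                                  # strip leading and trailing whitespace
}

cleaned_txt <- clean_text(sample_txt)
print(cleaned_txt)
```

Writing the bullet as the escape "\u2022" avoids depending on the encoding of the script file itself.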