Web scraper with R, content

Question

I am just getting started with web scraping in R. I wrote this code:

library(rvest)

mps <- read_html("http://tunisie-annonce.com/AnnoncesImmobilier.asp")

mps %>%
    html_nodes("tr") %>%
    html_text()

This gets the content I need, which I put in a text file. My problem is that I want to eliminate those red dots, but I cannot. Could you help me? I think these dots stand in for the <b> and <br> tags in the HTML code.

+3

r html-parsing web screen-scraping rvest

sanjanasan 07 May '17 at 12:01

2 answers

You can always use regular expressions to remove unwanted characters from the extracted text, for example:

txt <- mps %>% html_nodes("tr") %>% html_text()
txt <- gsub("•", " ", txt)
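
As a minimal sketch of that idea, the snippet below applies gsub() to a character vector rather than to the parsed document. The sample strings are made up to stand in for the output of html_text(), and it assumes the red dots are the bullet character U+2022, which the question does not confirm.

```r
# Hypothetical sample standing in for the result of html_text();
# the real page text will differ.
txt <- c("Kelibia \u2022 Terrain \u2022 Terrain nu",
         "Le Bardo \u2022 Location")

# Assumption: the unwanted dots are the bullet character U+2022.
no_dots <- gsub("\u2022", " ", txt, fixed = TRUE)

# Collapse the runs of spaces left behind and trim the ends.
clean <- trimws(gsub("[[:space:]]+", " ", no_dots))

print(clean)
```

If the dots turn out to be a different character, swap the code point in the first gsub() call.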

0

mkearney May 07 '17 at 0:52

alistaire · Accepted Answer · 2017-05-07T00:40:43+0000

Whoever built this page, very frustratingly, put the table inside another table but did not define it with its own <table> tag, so the easiest thing to do is to wrap the rows in one yourself to make them easier to parse:

library(rvest)

mps <- read_html("http://tunisie-annonce.com/AnnoncesImmobilier.asp")

df <- mps %>%
    html_nodes("tr.Entete1, tr.Tableau1") %>%    # get correct rows
    paste(collapse = '\n') %>%     # paste nodes back to a single string
    paste('<table>', ., '</table>') %>%     # add enclosing table node
    read_html() %>%    # reread as HTML
    html_node('table') %>% 
    html_table(fill = TRUE) %>%    # parse as table
    { setNames(.[-1,], make.names(.[1,], unique = TRUE)) }    # grab names from first row

head(df)
#>          X          Région NA.           Nature NA..1        Type NA..2
#> 2     Prix            <NA>  NA             <NA>    NA        <NA>    NA
#> 3 Modifiée                  NA             <NA>    NA        <NA>    NA
#> 4                  Kelibia  NA          Terrain    NA  Terrain nu    NA
#> 5          Cite El Ghazala  NA         Location    NA App. 4 pièc    NA
#> 6                 Le Bardo  NA         Location    NA App. 1 pièc    NA
#> 7                 Le Bardo  NA Location vacance    NA App. 3 pièc    NA
#>                   Texte.annonce NA..3   Prix Prix.1        X.1 Modifiée
#> 2                          <NA>    NA   <NA>   <NA>       <NA>     <NA>
#> 3                          <NA>    NA   <NA>   <NA>       <NA>     <NA>
#> 4      Terrain a 5 km de kelibi    NA 80 000        07/05/2017         
#> 5      S plus 3 haut standing c    NA    790        07/05/2017         
#> 6          Appartements meubles    NA 40 000        07/05/2017         
#> 7 Un bel appartement au bardo m    NA    420        07/05/2017         
#>   Modifiée.1 NA..4 NA..5
#> 2       <NA>    NA    NA
#> 3       <NA>    NA    NA
#> 4       <NA>    NA    NA
#> 5       <NA>    NA    NA
#> 6       <NA>    NA    NA
#> 7       <NA>    NA    NA

Note that there are still many NAs and other junk here that have yet to be cleaned up, but at least the data is usable at this point.
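
One possible follow-up cleanup, sketched on a small made-up data frame rather than the live page: drop columns and rows that are entirely NA or empty, then convert the price strings to numbers. The column names and values below are illustrative assumptions, and all_na is a hypothetical helper, not part of rvest.

```r
# Hypothetical miniature of the scraped data frame; values are illustrative.
df <- data.frame(
  Region = c(NA, "Kelibia", "Le Bardo"),
  NA.    = c(NA, NA, NA),
  Prix   = c(NA, "80 000", "420"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when every value is NA or an empty string.
all_na <- function(x) all(is.na(x) | x == "")

keep_cols <- !vapply(df, all_na, logical(1))   # columns with some content
keep_rows <- !apply(df, 1, all_na)             # rows with some content
clean <- df[keep_rows, keep_cols, drop = FALSE]

# Strip everything but digits (e.g. thousands separators), then convert.
clean$Prix <- as.numeric(gsub("[^0-9]", "", clean$Prix))

print(clean)
```

On the real data the same filters would need checking against the actual column contents, since some columns hold empty strings rather than NA.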
