Extract data from multiple web pages of a site that loads its pages automatically, in R

I have seen other posts that show how to retrieve data from multiple web pages of a site.

The problem is that, for my site, when I browse it to check how many pages the data is split across, the page automatically loads the next batch of data, so I cannot tell how many pages there are. I don't know HTML and JavaScript well enough to identify the attribute that triggers this loading, so I worked out another way to get the number of pages: when the website is loaded in the browser it shows the total number of records, and dividing that number by 30 (the number of records per page) gives the page count. For example, if there are 90 records, then 90 / 30 = 3 pages.

Here is the code to get the number of records shown on the page:

library(rvest)
library(stringr)                                       # word()

active_name_data1 <- html_nodes(webpage, '.active')    # 'webpage' comes from read_html(url), see below
active1 <- html_text(active_name_data1)
as.numeric(gsub("[^\\d]+", "", word(active1[1], start = 1, end = 1), perl = TRUE))   # keep digits only = record count
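
From that record count the number of pages follows directly. A minimal sketch of the arithmetic described above (30 records per page, so 90 records means 3 pages), reusing `active1` from the previous snippet:

records <- as.numeric(gsub("[^\\d]+", "", word(active1[1], start = 1, end = 1), perl = TRUE))
n_pages <- ceiling(records / 30)   # e.g. 90 records / 30 per page = 3 pages
n_pages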

      

Another approach is to read the page numbers from the pagination links, i.e.:

library(rvest)

url <- 'http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=10-Lacs'
webpage <- read_html(url)
active_data_html <- html_nodes(webpage, 'a.act')   # pagination links
active <- html_text(active_data_html)

      

Here `active` gives me the page numbers, i.e. "1", " 2", " 3", " 4".
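
From that vector, the total number of pages would simply be its largest value. A minimal sketch, assuming `active` is the character vector shown above:

n_pages <- max(as.integer(trimws(active)))   # "1" " 2" " 3" " 4" -> 4
n_pages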

What I cannot figure out is how to go beyond the currently active page and iterate over the remaining page numbers so that I get all of the data.

Here is what I tried (`uuu_df2` is a data frame with the links whose data I want to traverse):

library(rvest)
library(xml2)    # xml_find_first(), xml_find_all(), xml_text()
library(plyr)    # llply(), ldply() -- load before dplyr
library(dplyr)   # bind_rows()

uuu_df2 <- data.frame(
  x = c('http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=5-Lacs',
        'http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=10-Lacs',
        'http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=10-Lacs'),
  stringsAsFactors = FALSE)

urlList <- llply(uuu_df2[, 1], function(url) {

  this_pg <- read_html(url)

  # total number of records reported on the page
  results_count <- this_pg %>%
    xml_find_first(".//span[@id='resultCount']") %>%
    xml_text() %>%
    as.integer()

  if (!is.na(results_count) && results_count > 0) {

    # one SRCard per listing on the currently loaded page
    cards <- this_pg %>%
      xml_find_all('//div[@class="SRCard"]')

    # extract the fields of interest from each card
    df <- ldply(cards, .fun = function(x) {
      data.frame(wine     = x %>% xml_find_first('.//span[@class="agentNameh"]') %>% xml_text(),
                 excerpt  = x %>% xml_find_first('.//div[@class="postedOn"]') %>% xml_text(),
                 locality = x %>% xml_find_first('.//span[@class="localityFirst"]') %>% xml_text(),
                 society  = x %>% xml_find_first('.//div[@class="labValu"]') %>% xml_text() %>% gsub('\\n', '', .),
                 stringsAsFactors = FALSE)
    })

  } else {
    df <- NULL   # no results for this link, skip it
  }

  df
}, .progress = 'text')

names(urlList) <- uuu_df2[, 1]

a <- bind_rows(urlList)

      

But this code only gives me the data from the active page and does not iterate through the other pages of each link.

PS: If a link has no entries, the code skips that link and moves on to the next link in the list.
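
To make the intent concrete, this is the shape of iteration I am after: for each link, work out the page count from the record count and then fetch every page. The sketch below is only an illustration; the `&page=` query parameter is a placeholder I made up (identifying the real pagination mechanism is exactly what I cannot do), and the column names are just examples.

library(rvest)
library(dplyr)

per_page <- 30

scrape_link <- function(url) {
  first_pg <- read_html(url)

  # record count shown on the first page (same element as used above)
  records <- first_pg %>%
    html_node("span#resultCount") %>%
    html_text() %>%
    {as.numeric(gsub("[^0-9]", "", .))}

  if (is.na(records) || records == 0) return(NULL)   # skip links with no entries

  n_pages <- ceiling(records / per_page)

  # '&page=' is a made-up placeholder -- the site's real pagination parameter is unknown to me
  page_urls <- paste0(url, "&page=", seq_len(n_pages))

  bind_rows(lapply(page_urls, function(u) {
    pg    <- read_html(u)
    cards <- html_nodes(pg, "div.SRCard")
    data.frame(agent    = cards %>% html_node("span.agentNameh")    %>% html_text(),
               posted   = cards %>% html_node("div.postedOn")       %>% html_text(),
               locality = cards %>% html_node("span.localityFirst") %>% html_text(),
               stringsAsFactors = FALSE)
  }))
}

# all_data <- bind_rows(lapply(uuu_df2$x, scrape_link))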

Any suggestion on what changes should be made to the code would be helpful. Thanks in advance.
