Extract data from multiple pages of a website that loads results automatically, in R
I have seen other posts that show how to retrieve data from multiple web pages.
My problem is that when I browse the site to see how many pages the results are split across, the page automatically loads the next batch of data, so I cannot tell how many pages there are. I do not know HTML and JavaScript well enough to identify which attribute or call triggers that loading, so I worked out another way to get the number of pages. When loaded in a browser, the website shows the number of records found; dividing that number by 30 (the number of records per page) and rounding up gives the page count. For example, 90 records gives 90/30 = 3 pages.
Here is the code that gets the number of records found on the page:
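library(rvest)
library(stringr)  # word() comes from stringr
# webpage is the result of read_html(url), as in the next snippet
active_name_data1 <- html_nodes(webpage, '.active')
active1 <- html_text(active_name_data1)
as.numeric(gsub("[^\\d]+", "", word(active1[1], start = 1, end = 1), perl = TRUE))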
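The division by 30 described above can be written as a small helper, sketched here in base R. The function names pages_from_count and count_from_label are made up for illustration, and the label text in the usage lines is an invented example rather than text copied from the site. The sketch rounds up so that a partial last page is still counted.

```r
# Number of result pages implied by a record count.
# per_page = 30 is the page size stated in the question.
pages_from_count <- function(n_records, per_page = 30) {
  if (is.na(n_records) || n_records <= 0) return(0L)
  as.integer(ceiling(n_records / per_page))
}

# Take the first whitespace-separated word of a label and keep only its digits.
# Hypothetical helper: mirrors the word() + gsub() step above using base R only.
count_from_label <- function(label) {
  first_word <- strsplit(trimws(label), "\\s+")[[1]][1]
  as.numeric(gsub("[^0-9]+", "", first_word))
}

# Usage with an invented label (not real site output)
n_records <- count_from_label("90 results found")
n_pages <- pages_from_count(n_records)
```

With a count that is not a multiple of 30, for instance 91, the helper returns one extra page for the remainder.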
Another approach is to read the pagination links to get the number of pages, i.e.:
url='http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=10-Lacs'
webpage <- read_html(url)
active_data_html <- html_nodes(webpage,'a.act')
active <- html_text(active_data_html)
Here, active gives me the page numbers, i.e. "1" " 2" " 3" " 4".
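As a rough sketch, those strings could be turned into a page count by trimming the whitespace and taking the largest number. The vector below just repeats the values quoted above. This assumes the pagination links list every page, which may not hold if the site only shows the first few links.

```r
# Example values as returned by html_text() in the question
active <- c("1", " 2", " 3", " 4")

# Trim whitespace and convert to integers; non-numeric labels become NA
page_numbers <- suppressWarnings(as.integer(trimws(active)))
page_numbers <- page_numbers[!is.na(page_numbers)]

# Fall back to a single page when no numeric label is found
n_pages <- if (length(page_numbers) > 0) max(page_numbers) else 1L
```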
However, I cannot figure out how to scrape the active page and then iterate over the remaining page numbers to get all the data.
Here is what I tried (uuu_df2 is a data frame holding the links I want to scrape):
library(rvest)
library(xml2)   # xml_find_first(), xml_find_all(), xml_text()
library(plyr)   # llply(), ldply()
library(dplyr)  # bind_rows()
uuu_df2 <- data.frame(x = c('http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=5-Lacs',
'http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=10-Lacs',
'http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=10-Lacs'), stringsAsFactors = FALSE)
urlList <- llply(uuu_df2[, 1], function(url) {
  this_pg <- read_html(url)
  results_count <- this_pg %>%
    xml_find_first(".//span[@id='resultCount']") %>%
    xml_text() %>%
    as.integer()
  if (!is.na(results_count) && (results_count > 0)) {
    cards <- this_pg %>%
      xml_find_all('//div[@class="SRCard"]')
    df <- ldply(cards, .fun = function(x) {
      y <- data.frame(wine = x %>% xml_find_first('.//span[@class="agentNameh"]') %>% xml_text(),
                      excerpt = x %>% xml_find_first('.//div[@class="postedOn"]') %>% xml_text(),
                      locality = x %>% xml_find_first('.//span[@class="localityFirst"]') %>% xml_text(),
                      society = x %>% xml_find_first('.//div[@class="labValu"]') %>% xml_text() %>% gsub('\\n', '', .))
      return(y)
    })
  } else {
    df <- NULL
  }
  return(df)
}, .progress = 'text')
names(urlList) <- uuu_df2[, 1]
a <- bind_rows(urlList)
But this code just gives me data from the active page and does not iterate through other pages of the given link.
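What I think is missing is one URL per page for each link. I do not know how this site addresses its pages, so the sketch below assumes a hypothetical query parameter named page; the helper build_page_urls and the example.com address are likewise made up for illustration, and the real parameter (or the request the page makes when it loads more results) would have to be found in the browser's network tab.

```r
# Build one URL per result page by appending a page-number query parameter.
# ASSUMPTION: the parameter name "page" is hypothetical, not confirmed for this site.
build_page_urls <- function(base_url, n_pages, param = "page") {
  if (n_pages < 1) return(character(0))
  # Use "&" when the URL already has a query string, otherwise start one with "?"
  sep <- if (grepl("?", base_url, fixed = TRUE)) "&" else "?"
  paste0(base_url, sep, param, "=", seq_len(n_pages))
}

# Usage with a placeholder address
page_urls <- build_page_urls("http://example.com/list?city=Thane", 3)
```

Each element of page_urls could then be passed through the same llply call as above in place of the single link.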
PS: If the link has no entry, the code skips that link and moves to another link from the list.
Any suggestion on what changes should be made to the code would be helpful. Thanks in advance.