Use rvest (or another R package) to scrape all <p> after an <h3>?

I am new to the html scraping world and am having a hard time pulling in paragraphs under certain headings using rvest in R.

I want to scrape information from several pages, all of which have a relatively similar layout. They all have the same headings, but the number of paragraphs under each heading may vary. I was able to scrape a specific paragraph under the heading with the following code:

unitCode <- data.frame(unit = c('SLE010', 'SLE115', 'MAA103'))

html <- sapply(unitCode, function(x) paste("http://www.deakin.edu.au/current-students/courses/unit.php?unit=", 
                                          x,
                                          "&return_to=%2Fcurrent-students%2Fcourses%2Fcourse.php%3Fcourse%3DS323%26version%3D3", 
                                          sep = ''))
assessment <- html[3] %>%
              html() %>%
              html_nodes(xpath='//*[@id="main"]/div/div/p[3]') %>%
              html_text()

      

The xpath expression retrieves the first paragraph under the Assessment heading. Some pages have multiple paragraphs under the Assessment heading, which I can get if I change the xpath to point to them specifically, e.g. p[4] or p[5]. Unfortunately, I want to repeat this process for hundreds of pages, so changing the xpath each time is not practical, and I don't even know how many paragraphs there will be on each page.

I think pulling all the <p>s after the heading I'm interested in is the best option, given the uncertainty in page layout.

I was wondering if there is a way to scrape all the <p>s after <h3>Assessment</h3> using rvest or some other R scraping package?



1 answer


I have expanded this purely for demonstration purposes. You should be able to apply it to your original code. It is not a good idea to reuse names from the namespaces you are loading (your code assigns to a variable called html, which is also the name of an rvest function). Also note that I am using the latest (GitHub / devtools) version of rvest, which uses xml2 and deprecates html() in favor of read_html().

The key is xpath="//h3[contains(., 'Assessment')]/following-sibling::p", used like this:



library(rvest)

unitCode <- data.frame(unit = c('SLE010', 'SLE115', 'MAA103'))

sites <- sapply(unitCode, function(x) paste("http://www.deakin.edu.au/current-students/courses/unit.php?unit=", 
                                          x,
                                          "&return_to=%2Fcurrent-students%2Fcourses%2Fcourse.php%3Fcourse%3DS323%26version%3D3", 
                                          sep = ''))

pg <- read_html(sites[1])
pg_2 <- read_html(sites[2])
pg_3 <- read_html(sites[3])

pg %>% html_nodes(xpath="//h3[contains(., 'Assessment')]/following-sibling::p")

## {xml_nodeset (2)}
## [1] <p>This unit is assessed on a pass/fail basis. Multiple-choice on-line test   ...
## [2] <p style="margin-top: 2em;">\n  <a href="/current-students/courses/course.php ...

pg_2 %>% html_nodes(xpath="//h3[contains(., 'Assessment')]/following-sibling::p")

## {xml_nodeset (3)}
## [1] <p>Mid-trimester test 20%, three assignments (3 x 10%) 30%, examination 50%.</p>
## [2] <p>* Rate for all CSP students, except for those who commenced Education and  ...
## [3] <p style="margin-top: 2em;">\n  <a href="/current-students/courses/course.php ...

pg_3 %>% html_nodes(xpath="//h3[contains(., 'Assessment')]/following-sibling::p")

## {xml_nodeset (6)}
## [1] <p>Assessment 1 (Group of 3 students) - Student video presentation (5-7 mins) ...
## [2] <p>Assessment 2 (Group of 3 students) - Business plan (3500-4000 words) - 30% ...
## [3] <p>Examination (2 hours) - 60%</p>
## [4] <p><a href="http://www.deakin.edu.au/glossary?result_1890_result_page=H" targ ...
## [5] <p>* Rate for all CSP students, except for those who commenced Education and  ...
## [6] <p style="margin-top: 2em;">\n  <a href="/current-students/courses/course.php ...
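
Since the goal is to run this over hundreds of pages, the same query can be wrapped in a function and mapped over the pages instead of creating one variable per page. This is only a minimal sketch: the helper name get_assessment is hypothetical (it is not part of rvest), and inline HTML strings stand in for the live pages so that it runs offline, on the assumption that the <h3> and the <p>s are siblings as in the output above.

```r
library(rvest)

# Hypothetical helper: return the text of every <p> that follows the
# <h3> whose text contains `heading`, as a character vector.
get_assessment <- function(page, heading = "Assessment") {
  xp <- sprintf("//h3[contains(., '%s')]/following-sibling::p", heading)
  page %>% html_nodes(xpath = xp) %>% html_text()
}

# Stand-ins for the real pages (assumed structure: h3 and p are siblings).
pages <- list(
  a = "<div><h3>Content</h3><p>intro</p><h3>Assessment</h3><p>Exam 100%</p></div>",
  b = "<div><h3>Assessment</h3><p>Test 20%</p><p>Exam 80%</p></div>"
)

# One character vector per page, however many paragraphs each page has.
results <- lapply(pages, function(x) get_assessment(read_html(x)))
```

With the real site, passing each URL in sites to read_html() in place of the HTML strings should work the same way. Note that following-sibling::p also returns paragraphs that belong to any later heading in the same container.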

      

You can probably use that <p style="margin-top: 2em;"> as a stop marker as well. You should check out xml2's as_list() to help.
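
One possible way to apply the stop-marker idea is sketched below. It uses html_attr() to find the marker rather than as_list(), which is a swapped-in approach and not necessarily what was meant above. The helper name paragraphs_until_marker and the demo HTML are assumptions made for illustration; only the style value is taken from the output shown earlier.

```r
library(rvest)

# Hypothetical helper: keep the <p> nodes after the heading, up to (but
# not including) the first one whose style attribute matches the marker.
paragraphs_until_marker <- function(page, heading = "Assessment",
                                    marker_style = "margin-top: 2em;") {
  xp <- sprintf("//h3[contains(., '%s')]/following-sibling::p", heading)
  nodes <- html_nodes(page, xpath = xp)
  styles <- html_attr(nodes, "style")   # NA where there is no style attribute
  stop_at <- which(!is.na(styles) & styles == marker_style)
  if (length(stop_at) > 0) nodes <- nodes[seq_len(stop_at[1] - 1)]
  html_text(nodes)
}

# Made-up page fragment that mimics the structure seen in the output above.
demo <- read_html(paste0(
  "<div><h3>Assessment</h3><p>Exam 60%</p><p>Essay 40%</p>",
  "<p style=\"margin-top: 2em;\"><a href=\"#\">Back</a></p>",
  "<p>unrelated</p></div>"
))

paragraphs_until_marker(demo)
```

On the real pages this would only drop the final link paragraph and anything after it; the glossary link and the "* Rate for all CSP students" note appear before the marker, so they would still need to be filtered separately.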
