Find cell in html table containing specific icon

I am looking for code that can tell me which cell of the html table a particular icon is in. This is what I am working with:

u <- "http://www.transfermarkt.nl/lionel-messi/leistungsdaten/spieler/28003/saison/2014/plus/1"
doc <- rvest::html(u)
tab <- rvest::html_table(doc, fill = TRUE)[[6]]

      

Column "Pos." indicates the player's position in the field. Some of them have an additional icon. I see the presence of these icons on the page as follows:

rvest::html_nodes(doc, ".kapitaenicon-table")

      

but that doesn't tell me where they are. I would like my code to return that the icon occurs in rows 2, 10, 11 and 27 of the "Pos." column of the table. How can I do this?



1 answer


A bit more rvest and XPath magic can get you the indexes:

library(rvest)
library(magrittr)
library(XML)

pg <- html("http://www.transfermarkt.nl/lionel-messi/leistungsdaten/spieler/28003/saison/2014/plus/1")

pg %>% 
  html_nodes("table") %>% 
  extract2(6) %>% 
  html_nodes("tbody > tr") %>% 
  sapply(function(x) {
    length(xpathSApply(x, "./td[8]/span[@class='kapitaenicon-table icons_sprite']")) == 1
  }) %>% which

## [1]  2 10 11 27

      

This gets the 6th table, fetches its tr elements, then loops through them looking for an 8th td with the correct span/class in it. If the XPath search fails, it returns an empty list, so you can use length to determine which rows have a td with an icon in it and which don't.
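The length-based test described above can be tried offline on a small inline table. This is only a sketch: the four-row HTML string is a hypothetical stand-in for the real page, the icon sits in the 2nd column rather than the 8th, and it uses xml2 (version 1.0.0 or later assumed) instead of the XML package.

```r
library(xml2)

# Hypothetical miniature of the page: the icon span appears in the
# second cell of rows 2 and 4 only.
html <- '<table><tbody>
  <tr><td>1</td><td>RW</td></tr>
  <tr><td>2</td><td>CF<span class="kapitaenicon-table icons_sprite"></span></td></tr>
  <tr><td>3</td><td>CF</td></tr>
  <tr><td>4</td><td>RW<span class="kapitaenicon-table icons_sprite"></span></td></tr>
</tbody></table>'

rows <- xml_find_all(read_html(html), "//table/tbody/tr")

# For each row, search the 2nd cell for the icon span. A failed search
# yields an empty node set, so its length tells us whether the icon is there.
icon_rows <- which(vapply(rows, function(tr) {
  hits <- xml_find_all(tr, "./td[2]/span[@class='kapitaenicon-table icons_sprite']")
  length(hits) == 1
}, logical(1)))

icon_rows
```

Matching on the full class string, as the answer does, only works when the attribute is exactly 'kapitaenicon-table icons_sprite'; contains(@class, 'kapitaenicon-table') would be a looser alternative.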

This:

pg %>% 
  html_nodes(xpath="//table[6]/tbody/tr/td[8]") %>% 
  xmlSApply(xpathApply, "boolean(./span[@class='kapitaenicon-table icons_sprite'])") %>% 
  which

      

also works, and it's a little tighter (and faster). It uses the XPath boolean operation to test for existence, which is convenient if you have no other operations to perform on the node(s).
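As a rough illustration of the boolean() idea with a newer toolchain, the sketch below assumes xml2 1.0.0 or later, where xml_find_lgl() evaluates an XPath expression that returns a logical. The three-row table is made up for the example and is not the real page.

```r
library(xml2)

# Hypothetical stand-in for the real table: only row 2 carries the icon.
html <- '<table><tbody>
  <tr><td>A</td><td>RW</td></tr>
  <tr><td>B</td><td>CF <span class="kapitaenicon-table icons_sprite"></span></td></tr>
  <tr><td>C</td><td>CF</td></tr>
</tbody></table>'

doc <- read_html(html)

# Select the 2nd cell of every row, then ask XPath directly whether the
# icon span exists inside each cell.
cells <- xml_find_all(doc, "//table[1]/tbody/tr/td[2]")

has_icon <- vapply(
  cells,
  function(td) xml_find_lgl(td, "boolean(./span[contains(@class, 'kapitaenicon-table')])"),
  logical(1)
)

which(has_icon)
```

Because boolean() yields TRUE or FALSE directly, no try() wrapper or length check is needed.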

Here is an xml2 version, although I have to believe there must be a better way to do it in xml2:



library(xml2)
library(magrittr)

pg2 <- read_html("http://www.transfermarkt.nl/lionel-messi/leistungsdaten/spieler/28003/saison/2014/plus/1")
pg2 %>% 
  xml_find_all("//table[6]/tbody/tr/td[8]") %>% 
  as_list %>% 
  sapply(function(x) {
    inherits(try(xml_find_one(x, "./span"), silent=TRUE), "xml_node")
  }) %>% which

      

UPDATE

For version 0.1.0.9000 of xml2, I had to do the following:

pg2 %>% xml_find_all("//table") %>% 
  as_list %>% 
  extract2(6) %>% 
  xml_find_all("./tbody/tr/td[8]") %>% 
  as_list %>% 
  sapply(function(x) {
    inherits(try(xml_find_one(x, "./span"), silent=TRUE), "xml_node")
  }) %>% which

      

That shouldn't be necessary, and I filed a bug report.
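A side note on the XPath itself: "find all tables, then take the 6th" can also be written as a single expression, (//table)[6]. This differs from //table[6], which selects any table that is the 6th table child of its own parent. Whether that difference has anything to do with the behavior reported above is not established here. The sketch below only demonstrates the general XPath rule on a made-up two-table document, assuming xml2 1.0.0 or later.

```r
library(xml2)

# Hypothetical document: two tables, each inside its own div.
html <- '
<div><table><tr><td>first</td></tr></table></div>
<div><table><tr><td>second</td></tr></table></div>'
doc <- read_html(html)

# //table[2] asks for a table that is the 2nd table child of its parent.
# Each div holds only one table, so nothing matches.
n_positional <- length(xml_find_all(doc, "//table[2]"))

# (//table)[2] collects every table in document order, then takes the 2nd.
second_cell <- xml_text(xml_find_first(doc, "(//table)[2]//td"))

n_positional
second_cell
```

For the page in the question, the analogous expression would be "(//table)[6]/tbody/tr/td[8]", which has not been tested against the live site.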

Session info -------------------------------------------------------------------------
 setting  value                       
 version  R version 3.2.0 (2015-04-16)
 system   x86_64, darwin13.4.0        
 ui       RStudio (0.99.441)          
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       America/New_York            

Packages -----------------------------------------------------------------------------
 package    * version date       source        
 curl       * 0.5     2015-02-01 CRAN (R 3.2.0)
 devtools   * 1.7.0   2015-01-17 CRAN (R 3.2.0)
 magrittr     1.5     2014-11-22 CRAN (R 3.2.0)
 Rcpp       * 0.11.5  2015-03-06 CRAN (R 3.2.0)
 rstudioapi * 0.3.1   2015-04-07 CRAN (R 3.2.0)
 xml2         0.1.0   2015-04-20 CRAN (R 3.2.0)

      
