Golang web scraper ignoring specific table cells
I am working on a small web scraper just to get a feel for Go. It currently fetches a wiki page that contains a large table and then pulls information out of specific cells. I don't have the code with me (not at home), but it looks something like this:
func main() {
    doc, err := goquery.NewDocument("http://monsterhunter.wikia.com/wiki/MH4:_Item_List")
    if err != nil {
        log.Fatal(err)
    }
    doc.Find("tbody").Each(func(i int, s *goquery.Selection) {
        title := s.Find("td").Text()
        fmt.Printf(title)
    })
}
The problem is that on this website the first cell of each row is an image, so it prints the image source, which I don't want. How can I ignore the first cell in every row of a large table?
Let me clear up some things. A Selection is a collection of nodes that match some criteria. doc.Find() is Selection.Find(), which returns a new Selection containing the elements that match the criteria. Selection.Each() iterates over each element of the collection and calls the function value passed to it.
So, in your case, Find("tbody") will find all the tbody elements, and Each() will iterate over all of those tbody elements and call your anonymous function for each one.
Inside your anonymous function, s is a Selection holding a single tbody element. You call s.Find("td"), which returns a new Selection containing all the td elements of the current table. So when you call Text() on it, you get the combined text content of every td element, including their descendants. This is not what you want.
What you need to do is call another Each() on the Selection returned by s.Find("td"), and check whether the Selection passed to the second anonymous function has an img child.
Sample code:
doc.Find("tbody").Each(func(i int, s *goquery.Selection) {
    // s here is a tbody element
    s.Find("td").Each(func(j int, s2 *goquery.Selection) {
        // s2 here is a td element.
        // Find never returns nil, so checking Length() is enough.
        if s2.Find("img").Length() > 0 {
            return // This TD has at least one img descendant, skip it
        }
        // Print, not Printf: the cell text must not be treated as a format string
        fmt.Print(s2.Text())
    })
})
Alternatively, you can search for the tr elements and skip the first td child of each row by checking whether the index passed to the third anonymous function is 0 (first child), something like this:
doc.Find("tbody").Each(func(i int, s *goquery.Selection) {
    // s here is a tbody element
    s.Find("tr").Each(func(j int, s2 *goquery.Selection) {
        // s2 here is a tr element
        s2.Find("td").Each(func(k int, s3 *goquery.Selection) {
            // s3 here is a td element; k is its index within this row
            if k == 0 {
                return // This is the first TD in the row
            }
            // Print, not Printf: the cell text must not be treated as a format string
            fmt.Print(s3.Text())
        })
    })
})
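If you want to try the skip-by-index idea without fetching the wiki page or installing goquery, here is a rough standard-library sketch. It swaps goquery for encoding/xml, so it only works on well-formed markup without nested tables; for the real page, goquery's HTML parser is still the right tool. The sample table and the name cellsSkippingFirst are made up for illustration and are not taken from the wiki or from goquery.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// sampleTable is a hypothetical, well-formed stand-in for the wiki table:
// the first cell of each row holds an image, the others hold text.
const sampleTable = `<table><tbody>
<tr><td><img src="potion.png"/></td><td>Potion</td><td>Heals a little</td></tr>
<tr><td><img src="herb.png"/></td><td>Herb</td><td>A common plant</td></tr>
</tbody></table>`

// cellsSkippingFirst returns, for each tr, the text of every td except
// the first one. Assumes well-formed markup and no nested tables.
func cellsSkippingFirst(markup string) ([][]string, error) {
	dec := xml.NewDecoder(strings.NewReader(markup))
	var rows [][]string
	var row []string
	var text strings.Builder
	col := -1 // index of the current td within its row
	inCell := false
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return rows, nil
		}
		if err != nil {
			return nil, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			switch t.Name.Local {
			case "tr":
				row, col = nil, -1 // start a new row
			case "td":
				col++
				inCell = true
				text.Reset()
			}
		case xml.CharData:
			if inCell {
				text.Write(t) // collect text inside the current cell
			}
		case xml.EndElement:
			switch t.Name.Local {
			case "td":
				inCell = false
				if col > 0 { // col == 0 is the image cell, skip it
					row = append(row, strings.TrimSpace(text.String()))
				}
			case "tr":
				rows = append(rows, row)
			}
		}
	}
}

func main() {
	rows, err := cellsSkippingFirst(sampleTable)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, r := range rows {
		fmt.Println(strings.Join(r, " | "))
	}
}
```

The col counter plays the same role as the index k in the goquery version: it resets at every tr, so the cell with index 0 is dropped in each row rather than only once for the whole table.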