Goquery - Extract text from one html tag and add it to next tag

Sorry, the title doesn't explain much, so I need to use an example.

This is a follow-up to another question I posted, which solved one problem but not all of them. I've put most of the background information from that question into this one. Also, I've only been looking at Go for about 5 days (and I only started learning to code a couple of months ago), so I'm 90% sure that I'm close to figuring out what I want and that my problem is some stupid syntax error.

Situation

I am trying to use goquery to parse a web page. (I want to put some of its data into a database eventually.) This is what it looks like:

<html>
    <body>
        <h1>
            <span class="text">Go </span>
        </h1>
        <p>
            <span class="text">totally </span>
            <span class="post">kicks </span>
        </p>
        <p>
            <span class="text">hacks </span>
            <span class="post">its </span>
        </p>
        <h1>
            <span class="text">debugger </span>
        </h1>
        <p>
            <span class="text">should </span>
            <span class="post">be </span>
        </p>
        <p>
            <span class="text">called </span>
            <span class="post">ogle </span>
        </p>
        <h3>
            <span class="statement">true</span>
        </h3>
    </body>
</html>

      

Purpose

I would like to:

  • Extract the content of <h1..."text".
  • Insert (and merge) this extracted content into the content of <p..."text".
  • Do this only for the <p> tag that immediately follows the <h1> tag.
  • Do this for all <h1> tags on the page.

Once again, an example explains this better. Here's what I want it to look like:

<html>
    <body>
        <p>
            <span class="text">Go totally </span>
            <span class="post">kicks </span>
        </p>
        <p>
            <span class="text">hacks </span>
            <span class="post">its </span>
        </p>
        <p>
            <span class="text">debugger should </span>
            <span class="post">be </span>
        </p>
        <p>
            <span class="text">called </span>
            <span class="post">ogle</span>
        </p>
        <h3>
            <span class="statement">true</span>
        </h3>
    </body>
</html>

      

Attempts to solve

Since further distinguishing the <h1> tags from the <p> tags would give me more parsing options, I figured out how to change the class attributes of the <h1> tags to this:

<html>
    <body>
        <h1>
            <span class="title">Go </span>
        </h1>
        <p>
            <span class="text">totally </span>
            <span class="post">kicks </span>
        </p>
        <p>
            <span class="text">hacks </span>
            <span class="post">its </span>
        </p>
        <h1>
            <span class="title">debugger </span>
        </h1>
        <p>
            <span class="text">should </span>
            <span class="post">be </span>
        </p>
        <p>
            <span class="text">called </span>
            <span class="post">ogle </span>
        </p>
        <h3>
            <span class="statement">true</span>
        </h3>
    </body>
</html>

      

with this code:

html_code := strings.NewReader(`
code_example_above
`)
doc, _ := goquery.NewDocumentFromReader(html_code)
doc.Find("h1").Each(func(i int, s *goquery.Selection) {
    s.SetAttr("class", "title")
    class, _ := s.Attr("class")
    if class == "title" {
        fmt.Println(class, s.Text())
    }
})

      

I know that I can select the <p..."text" after the <h1..."title" using doc.Find("h1+p") or s.Next() inside the doc.Find("h1").Each function:

doc.Find("h1").Each(func(i int, s *goquery.Selection) {
    s.SetAttr("class", "title")
    class, _ := s.Attr("class")
    if class == "title" {
        fmt.Println(class, s.Text())
        fmt.Println(s.Next().Text())
    }
})

      

I cannot figure out how to insert the text from <h1..."title" into <p..."text". I've tried several variations of s.After(), s.Before(), and s.Append(), for example:

doc.Find("h1").Each(func(i int, s *goquery.Selection) {
    s.SetAttr("class", "title")
    class, _ := s.Attr("class")
    if class == "title" {
        s.After(s.Text())
        fmt.Println(s.Next().Text())
    }
})

      

but I cannot figure out how to do exactly what I want.

If I use s.After(s.Next().Text()) instead, I get this error output:

panic: expected identifier, found 5 instead

goroutine 1 [running]:
code.google.com/p/cascadia.MustCompile(0xc2082f09a0, 0x62, 0x62)
    /home/*/go/src/code.google.com/p/cascadia/selector.go:59 +0x77
github.com/PuerkitoBio/goquery.(*Selection).After(0xc2082ea630, 0xc2082f09a0, 0x62, 0x5)
    /home/*/go/src/github.com/PuerkitoBio/goquery/manipulation.go:18 +0x32
main.func·001(0x0, 0xc2082ea630)
    /home/*/go/test2.go:78 +0x106
github.com/PuerkitoBio/goquery.(*Selection).Each(0xc2082ea600, 0x7cb678, 0x2)
    /home/*/go/src/github.com/PuerkitoBio/goquery/iteration.go:7 +0x173
main.ExampleScrape()
    /home/*/go/test2.go:82 +0x213
main.main()
    /home/*/go/test2.go:175 +0x1b

goroutine 9 [runnable]:
net/http.(*persistConn).readLoop(0xc208047ef0)
    /usr/lib/go/src/net/http/transport.go:928 +0x9ce
created by net/http.(*Transport).dialConn
    /usr/lib/go/src/net/http/transport.go:660 +0xc9f

goroutine 17 [syscall, locked to thread]:
runtime.goexit()
    /usr/lib/go/src/runtime/asm_amd64.s:2232 +0x1

goroutine 10 [select]:
net/http.(*persistConn).writeLoop(0xc208047ef0)
    /usr/lib/go/src/net/http/transport.go:945 +0x41d
created by net/http.(*Transport).dialConn
    /usr/lib/go/src/net/http/transport.go:661 +0xcbc
exit status 2

      

(The line numbers of my script don't match those of the examples above, but "line 72" of my script contains the code s.After(s.Next().Text()). I don't know exactly what panic: expected identifier, found 5 instead means.)

Summary

All in all, my problem is that I cannot quite work out how to use goquery to add text to a tag.

I think I'm close. Would any gopher Jedi be able and willing to help this Padawan?

1 answer


Something like this code gets the job done. It finds all the <h1> nodes and then all the <span> nodes within those <h1> nodes, looking for one with the class text. Then it gets the element that follows the <h1> node. If that element is a <p> with a <span> inside, it replaces that <span> with a new <span> containing the merged text and deletes the <h1>.

I wonder whether it is possible to create nodes using goquery without writing HTML...



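package main

import (
    "fmt"
    "log"
    "strings"

    "github.com/PuerkitoBio/goquery"
)

var htmlCode = `<html>
...
</html>`

func main() {
    doc, err := goquery.NewDocumentFromReader(strings.NewReader(htmlCode))
    if err != nil {
        log.Fatal(err)
    }
    doc.Find("h1").Each(func(i int, h1 *goquery.Selection) {
        // Look at every <span class="text"> inside this <h1>.
        h1.Find("span").Each(func(j int, s *goquery.Selection) {
            if !s.HasClass("text") {
                return
            }
            // goquery never returns a nil *Selection, so test Length()
            // instead of comparing against nil.
            p := h1.NextFiltered("p")
            if p.Length() == 0 {
                return
            }
            ps := p.Children().First()
            if ps.Length() > 0 && ps.HasClass("text") {
                // Replace the first <span> of the <p> with the merged
                // text, then drop the <h1>.
                ps.ReplaceWithHtml(
                    fmt.Sprintf("<span class=\"text\">%s%s</span>", s.Text(), ps.Text()))
                h1.Remove()
            }
        })
    })
    htmlResult, err := doc.Html()
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(htmlResult)
}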
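
One caveat about the fmt.Sprintf call in the code above: s.Text() and ps.Text() return plain, unescaped text, while ReplaceWithHtml parses its argument as markup, so a heading containing < or & could end up being interpreted as HTML. Here is a minimal sketch of one way around that, using only the standard library. The helper name mergedSpan is made up for illustration and is not part of goquery.

```go
package main

import (
	"fmt"
	"html"
)

// mergedSpan is a hypothetical helper (not part of goquery). It builds the
// replacement <span class="text"> markup from the heading text and the
// paragraph text, escaping both so that characters such as '<' and '&'
// stay literal text instead of being parsed as markup.
func mergedSpan(headingText, paragraphText string) string {
	return fmt.Sprintf(`<span class="text">%s</span>`,
		html.EscapeString(headingText+paragraphText))
}

func main() {
	fmt.Println(mergedSpan("Go ", "totally "))
	fmt.Println(mergedSpan("a < b ", "& c"))
}
```

With such a helper, the replacement call would read ps.ReplaceWithHtml(mergedSpan(s.Text(), ps.Text())).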

      
