Goquery - combine a tag with the one that follows

For some background: I'm new to Go (3 or 4 days in), but I'm starting to get more comfortable with it.

I am trying to use goquery to parse a web page (ultimately I want to put some of the data into a database). An example is the simplest way to explain my problem:

<html>
    <body>
        <h1>
            <span class="text">Go </span>
        </h1>
        <p>
            <span class="text">totally </span>
            <span class="post">kicks </span>
        </p>
        <p>
            <span class="text">hacks </span>
            <span class="post">its </span>
        </p>
        <h1>
            <span class="text">debugger </span>
        </h1>
        <p>
            <span class="text">should </span>
            <span class="post">be </span>
        </p>
        <p>
            <span class="text">called </span>
            <span class="post">ogle </span>
        </p>
        <h3>
            <span class="statement">true</span>
        </h3>
    </body>
</html>

      

I would like to:

  • Extract the content of <h1..."text".
  • Insert (and merge) this extracted content into the content of <p..."text".
  • Do this only for the <p> tag that immediately follows the <h1> tag.
  • Do this for all <h1> tags on the page.

So, I want it to look like this:

<html>
    <body>
        <p>
            <span class="text">Go totally </span>
            <span class="post">kicks </span>
        </p>
        <p>
            <span class="text">hacks </span>
            <span class="post">its </span>
        </p>
        <p>
            <span class="text">debugger should </span>
            <span class="post">be </span>
        </p>
        <p>
            <span class="text">called </span>
            <span class="post">ogle </span>
        </p>
        <h3>
            <span class="statement">true</span>
        </h3>
    </body>
</html>

      

My code starts like this:

package main

import (
    "fmt"
    "strings"
    "github.com/PuerkitoBio/goquery"
)

func main() {
    // `code_example_above` stands for the HTML shown earlier.
    htmlCode := strings.NewReader(`code_example_above`)
    doc, err := goquery.NewDocumentFromReader(htmlCode)
    if err != nil {
        panic(err)
    }

      

I know that I can read <h1..."text" with:

h1_tag := doc.Find("h1 .text")

      

I also know that I can add the content of <h1..."text" to the content of <p..."text" like this:

doc.Find("p .text").Before("h1 .text")

      

^ But this command inserts the content of every <h1..."text" before every <p..."text".

Then I found a way to get a step closer to what I want:

doc.Find("p .text").First().Before("h1 .text")

      

^ This command inserts the content of every <h1..."text" before only the first <p..."text" (which is closer to what I want).

I also tried using goquery's Each() function, but I couldn't get close to what I wanted with it (although I'm sure there is a way to do it using Each(), right?).

My biggest problem is that I can't figure out how to tie each <h1..."text" instance to the <p..."text" instance that immediately follows it.

If it helps, <h1..."text" is always followed by <p..."text" on the web pages that I am trying to parse.

My brain is out of juice. Do any Go geniuses out there know how to do this, and are they willing to explain it? Thanks in advance.

EDIT

I found out something else that I can do:

doc.Find("h1").Each(func(i int, s *goquery.Selection) {
    nex := s.Next().Text()
    fmt.Println(s.Text(), nex, "\n\n")
})

      

^ This prints out what I want: the contents of each <h1..."text" instance followed by the <p..."text" instance right after it. I thought s.Next() would return the next <h1> instance, but it returns the next tag in doc, the *goquery.Selection it is iterating through. Is that right?

Or, as mattn pointed out, I might as well use doc.Find("h1+p").

I am still having a hard time adding <h1..."text" to <p..."text". I'll post that as a separate question, because this can be broken down into multiple questions and mattn has already answered one of them.


1 answer


I don't know what code you are writing with goquery, but maybe what you want is the adjacent sibling selector:

h1+p

This selects each p tag that immediately follows an h1 tag.
