Is it possible to parse HTML selectively with jsoup? After parsing, I need to keep some of the tags in the output.

I need to parse the HTML body below and produce the result shown further down. The tags {p, i, b, br} must be kept in the output; all other tags must be removed so that only their text remains.

This is my input.

<!DOCTYPE HTML>
<html>
    <head>
        <title>Introduction</title>
    </head>
    <body>
        <article id="mobi_content">
            <h1 class="mobi-page-title">Introduction</h1>
            <section id="dataSectionInstanceId-431331" class="body-text">This book is about creating a great career. <p>You might be saying to yourself, "I don't want to talk about a career, much less a great career. Right now I just need a job. I need to eat!" <p>Well, if you're looking, we're going to show you how to get that great job now. That the first, short-term step. <p>But the day will come when you'll want to do more than just eat. And beyond that day will come another day when you look back at your life and take measure of your entire professional contribution to the world. <p>This book is about today and tomorrow. It about getting a great job now and enjoying a great career for life. <p>When we say a person has had a great career, what do we mean? That he or she made a lot of money? moved spectacularly up the corporate ladder? became famous or renowned in his or her profession? What about the familiar comment from every movie star on every talk show: "I can't believe I get paid for doing this!" Are only a few people entitled to feel that way, but not the rest of us? <p>And what about you? Are you looking forward to a great career? Would you describe your current career as "great"? When you get to the end of your productive life, will you be looking back on a mediocre career? a good career? a great career? And how will you know? <p>Furthermore, just how do you create a great career for yourself? <p>As coauthors of this book, we are fascinated by these provocative questions. We have been associated in our work for many years as avid students of what it takes to build a great life and career. And we bring two different sets of experiences to the issue, so occasionally, we will speak to you directly in our own voices. We'll share with you our discoveries and provide tools and insights that will help you find answers for yourself. Whether you're looking for a job or want to make the job you have more meaningful, this book is for you.
            </section>
        </article>
    </body>
</html>

      

The expected output is:

This book is about creating a great career.
<P>You might be saying to yourself, "I don't want to talk about a career, much less a great career. Right now I just need a job. I need to eat!" 
<P>Well, if you're looking, we're going to show you how to get that great job now. That the first, short-term step. 
<P>But the day will come when you'll want to do more than just eat. And beyond that day will come another day when you look back at your life and take measure of your entire professional contribution to the world. 
<P>This book is about today and tomorrow. It about getting a great job now and enjoying a great career for life. 
<P>When we say a person has had a great career, what do we mean? That he or she made a lot of money? moved spectacularly up the corporate ladder? became famous or renowned in his or her profession? What about the familiar comment from every movie star on every talk show: "I can't believe I get paid for doing this!" Are only a few people entitled to feel that way, but not the rest of us? 
<P>And what about you? Are you looking forward to a great career? Would you describe your current career as "great"? When you get to the end of your productive life, will you be looking back on a mediocre career? a good career? a great career? And how will you know? 
<P>Furthermore, just how do you create a great career for yourself? 
<P>As coauthors of this book, we are fascinated by these provocative questions. We have been associated in our work for many years as avid students of what it takes to build a great life and career. And we bring two different sets of experiences to the issue, so occasionally, we will speak to you directly in our own voices. We'll share with you our discoveries and provide tools and insights that will help you find answers for yourself. Whether you're looking for a job or want to make the job you have more meaningful, this book is for you.

      

My code:

doc.body().traverse(new NodeVisitor() {

    @Override
    public void head(Node node, int depth) {

        String name = node.nodeName();

        if (node instanceof TextNode) {

            // A TextNode's nodeName() is always "#text", so it can never equal "p"
            finalHtml += ((TextNode) node).text();

        } else {

            // Compare strings with equals(), not ==; nodeName() contains no angle brackets
            if (name.equals("p")) {
                System.out.println("fnbdnv" + node.toString());
            }
            if (name.equals("h1")) {
                // finalHtml+="<p>"+node.toString()+"<p>";
            } else if (name.equals("div")) {
                node.removeAttr("class");
                finalHtml += node.toString();
            } else if (name.equals("section")) {
                // toString() returns the element's outer HTML, children included
                finalHtml += node.toString();
            } else if (name.equals("b")) {
                finalHtml += node.toString();
            } else if (name.equals("i")) {
                // node.toString() already includes the <i> and </i> tags
                finalHtml += node.toString();
            }
        }

    }

    @Override
    public void tail(Node node, int depth) {
        // Do Nothing
    }
});

      

1 answer


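A regular expression may be simpler in this case: parse the document with jsoup, then strip every tag from the body except the ones you want to keep.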

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Main {

    public static void main(String[] args) {
        try {
            String html = "<!DOCTYPE HTML>" +
                            "<html>" +
                                "<head>" +
                                    "<title>Introduction</title>" +
                                "</head>" +
                                "<body>" +
                                    "<article id=\"mobi_content\">" +
                                        "<h1 class=\"mobi-page-title\">Introduction</h1>" +
                                        "<section id=\"dataSectionInstanceId-431331\" class=\"body-text\">This <i>book</i> is about creating a great career. <p>You might be saying to yourself, \"I don't want to talk about a career, much less a great career. Right now I just need a job. I need to eat!\" <p>Well, if you're looking, we're going to show you how to get that great job now. That the first, short-term step. <p>But the day will come when you'll want to do more than just eat. And beyond that day will come another day when you look back at your life and take measure of your entire professional contribution to the world. <p>This book is about today and tomorrow. It about getting a great job now and enjoying a great career for life. <p>When we say a person has had a great career, what do we mean? That he or she made a lot of money? moved spectacularly up the corporate ladder? became famous or renowned in his or her profession? What about the familiar comment from every movie star on every talk show: \"I can't believe I get paid for doing this!\" Are only a few people entitled to feel that way, but not the rest of us? <p>And what about you? Are you looking forward to a great career? Would you describe your current career as \"great\"? When you get to the end of your productive life, will you be looking back on a mediocre career? a good career? a great career? And how will you know? <p>Furthermore, just how do you create a great career for yourself? <p>As coauthors of this book, we are fascinated by these provocative questions. We have been associated in our work for many years as avid students of what it takes to build a great life and career. And we bring two different sets of experiences to the issue, so occasionally, we will speak to you directly in our own voices. We'll share with you our discoveries and provide tools and insights that will help you find answers for yourself. " +
                                        "Whether you're looking for a job or want to make the job you have more meaningful, this book is for you." +
                                        "</section>" +
                                    "</article>" +
                                "</body>" + 
                                "</html>";

            Document doc = Jsoup.parse(html);


            System.out.println(removeTags(doc.body().toString()));

        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static String removeTags(String source) {    
        return source.replaceAll("(?!(</?p>|</?i>|</?b>|<br/?>))(</?.*?>)", " ");
    }
}
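
To see what this pattern keeps and what it replaces without setting up jsoup, here is a minimal sketch that applies the same expression to a short fragment. The class name RemoveTagsDemo and the sample fragment are made up for illustration; only the regular expression comes from the code above.

```java
import java.util.regex.Pattern;

class RemoveTagsDemo {

    // Same expression as in removeTags above, compiled once.
    // The negative lookahead protects <p>, <i>, <b>, <br> (and their closing forms);
    // any other tag is matched by </?.*?> and replaced with a single space.
    private static final Pattern OTHER_TAGS =
            Pattern.compile("(?!(</?p>|</?i>|</?b>|<br/?>))(</?.*?>)");

    static String removeTags(String source) {
        return OTHER_TAGS.matcher(source).replaceAll(" ");
    }

    public static void main(String[] args) {
        // Made-up sample fragment, not the full input from the question
        String fragment = "<section class=\"body-text\">This <i>book</i> is great."
                + "<p>Next<br>line</p></section>";
        // Prints the fragment with <section ...> and </section> each replaced by a space
        System.out.println("[" + removeTags(fragment) + "]");
    }
}
```

Each removed tag becomes a single space, so the result can start or end with whitespace. Also, `.` does not match line breaks by default, so a tag that is split across two lines would not be matched by this expression.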

      

Update

If the <p> tags carry attributes, as in the input below, the pattern has to allow for them:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Main {

    public static void main(String[] args) {
        try {
            String html = "<!DOCTYPE HTML>" +
                            "<html>" +
                                "<head>" +
                                    "<title>Introduction</title>" +
                                "</head>" +
                                "<body> <article id=\"mobi_content\"> <h1 class=\"mobi-page-title\">\"Build Your Village\" Tool</h1> <section id=\"dataSectionInstanceId-431408\" class=\"body-text\"><p class=\"nonindent\">Your great career depends not only on you,</p> <p class=\"nonindent\">Sample deposits in the Emotional Bank Account:</p> <ul class=\"bullet\"> <li><p class=\"nonindent\">Congratulate the person on a job well done.</p></li> <li><p class=\"nonindent\">Send birthday greetings.</p></li></section></article></body>" +
                                "</html>";

            Document doc = Jsoup.parse(html);


            System.out.println(removeTags(doc.body().toString()));

        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static String removeTags(String source) {    
        return source.replaceAll("(?!(</p>|<p .*?>|</?i>|</?b>|<br/?>))(</?.*?>)", " ");
    }
}
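
One thing to watch with this pattern: `<p .*?>` requires a space after the `p`, so a bare `<p>` (which is what the input in the question contains) is no longer protected and gets stripped. If both forms need to survive, one possible variant is sketched below. The pattern BOTH_FORMS and the class name are my own assumptions, not part of the code above, and like the original it treats anything between `<` and `>` as a tag.

```java
import java.util.regex.Pattern;

class KeepParagraphsDemo {

    // Pattern from the update: protects <p ...> with attributes, but not a bare <p>.
    static final Pattern UPDATED =
            Pattern.compile("(?!(</p>|<p .*?>|</?i>|</?b>|<br/?>))(</?.*?>)");

    // Assumed variant: <p(\s[^>]*)?> protects <p> with or without attributes.
    static final Pattern BOTH_FORMS =
            Pattern.compile("(?!(</p>|<p(\\s[^>]*)?>|</?i>|</?b>|<br/?>))(</?.*?>)");

    static String removeTags(Pattern pattern, String source) {
        // Replace every unprotected tag with a space, then trim the ends
        return pattern.matcher(source).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String withAttrs = "<li><p class=\"nonindent\">A</p></li>";
        String bare = "<li><p>B</p></li>";
        System.out.println(removeTags(UPDATED, withAttrs));
        System.out.println(removeTags(UPDATED, bare));      // the opening <p> is lost here
        System.out.println(removeTags(BOTH_FORMS, withAttrs));
        System.out.println(removeTags(BOTH_FORMS, bare));
    }
}
```

Both patterns keep the attributes on `<p>` as they are; if the output must contain plain `<p>` tags, the attributes would have to be removed in a separate step.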

      

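Update 2

This version also turns every <img> into a <graphic> element that contains only the image file name, without path or extension: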

import java.util.ListIterator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Attributes;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Main {

    public static void main(String[] args) {
        try {
            Pattern pattern = Pattern.compile("/(((?!/).)*)[.]");

            String html = "<!DOCTYPE HTML>" +
                            "<html>" +
                                "<head>" +
                                    "<title>Introduction</title>" +
                                "</head>" +
                                "<body> <article id=\"mobi_content\"> <h1 class=\"mobi-page-title\">\"Build Your Village\" Tool</h1> <section id=\"dataSectionInstanceId-431408\" class=\"body-text\"><p class=\"nonindent\">Your great career depends not only on you,</p> <p class=\"center\"><img src=\"mpla/multimedia/Cove_9781936111107_epub_005_r1.png\" id=\"mobi_image_12776\" class=\"inline-img\" alt=\"PNG\"/></p><p class=\"nonindent\">Sample deposits in the Emotional Bank Account:</p> <ul class=\"bullet\"> <li><p class=\"nonindent\">Congratulate the person on a job well done.</p></li> <li><p class=\"nonindent\">Send birthday greetings.</p></li></section></article></body>" +
                                "</html>";

            Document doc = Jsoup.parse(html);
            Elements imgs = doc.select("img");
            System.out.println(imgs);
            ListIterator<Element> iter = imgs.listIterator();
            while(iter.hasNext()) {
                Element img = iter.next();
                String src = img.attr("src");     
                Matcher matcher = pattern.matcher(src);
                if (matcher.find()) {
                    img.tagName("graphic").text(matcher.group(1)); 
                    removeAttr(img);
                }         
            }

            System.out.println(removeTags(doc.body().toString()));

        } catch (Exception e) {
            e.printStackTrace();
        }
    }

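    public static void removeAttr(Element e) {
        // Iterate over a copy (asList()): removing from the live Attributes while
        // iterating it can skip entries or throw, depending on the jsoup version
        for (Attribute a : e.attributes().asList()) {
            e.removeAttr(a.getKey());
        }
    }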

    public static String removeTags(String source) {    
        return source.replaceAll("(?!(</p>|<p .*?>|</?graphic>|</?i>|</?b>|<br/?>))(</?.*?>)", " ").trim();
    }
}
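
The file name extraction in this version is done by the pattern applied to each src value: it looks for a `/` followed by characters that are not `/`, and captures them up to a `.` in that path segment. A small sketch of just that step, using only the standard library, is shown below. The helper baseName and the class name are hypothetical, and returning null when nothing matches is my choice for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImageNameDemo {

    // Same expression as above: group 1 is the text between a '/' and the
    // last '.' of the first slash-prefixed path segment that contains a dot.
    private static final Pattern FILE_NAME = Pattern.compile("/(((?!/).)*)[.]");

    // Hypothetical helper: returns the captured name, or null if nothing matches.
    static String baseName(String src) {
        Matcher matcher = FILE_NAME.matcher(src);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // The src value from the sample input above
        System.out.println(baseName("mpla/multimedia/Cove_9781936111107_epub_005_r1.png"));
        // No '/' in the path, so the pattern does not match
        System.out.println(baseName("image.png"));
    }
}
```

Because the pattern requires a leading `/`, a relative src such as image.png does not match. In the loop above such an <img> is left unconverted, and removeTags then strips it like any other tag.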

      
