Parsing information from a URL with Jsoup

I need help with my Java project using Jsoup (if you think there is a more efficient way to achieve the goal, please let me know). The goal of my program is to parse certain useful information from different URLs and put it in a text file. I am not an expert in HTML or JavaScript, so it has been difficult for me to code in Java exactly what I want to parse. On the website used as an example in the code below, the information I want to extract with Jsoup is everything in the table under Routing (route, location, ship or voyage, container arrival date, container departure date; for example: Origin, Seattle SSA Terminal T18, June 26 15 A, June 26 15 A, and so on). So far with Jsoup I have only been able to get the title of the page, and have had no success getting anything from the body. Here is the code I used, which I got from an online source:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Jsouptest71115 {

    public static void main(String[] args) throws Exception {
        // No whitespace is allowed between the path and the query string.
        String url = "http://google.com/gentrack/trackingMain.do"
                + "?trackInput01=999061985";
        Document document = Jsoup.connect(url).get();

        String title = document.title();
        System.out.println("title : " + title);

        String body = document.select("body").text();
        System.out.println("Body: " + body);
    }
}

      



2 answers


Working code:

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.ArrayList;

public class Sample {
    public static void main(String[] args) {
        String url = "http://homeport8.apl.com/gentrack/blRoutingPopup.do";

        try {
            Connection.Response response = Jsoup.connect(url)
                    .data("blNbr", "999061985")  // tracking number
                    .method(Connection.Method.POST)
                    .execute();

            Element tableElement = response.parse().getElementsByTag("table")
                    .get(2).getElementsByTag("table")
                    .get(2);

            Elements trElements = tableElement.getElementsByTag("tr");
            ArrayList<ArrayList<String>> tableArrayList = new ArrayList<>();

            for (Element trElement : trElements) {
                ArrayList<String> columnList = new ArrayList<>();
                for (int i = 0; i < 5; i++) {
                    columnList.add(i, trElement.children().get(i).text());
                }
                tableArrayList.add(columnList);
            }

            // Indices are row, then column.
            System.out.println("Origin/Location: "
                    + tableArrayList.get(1).get(1));

            System.out.println("Discharge Port/Container Arrival Date: "
                    + tableArrayList.get(5).get(3));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

      



Output:

Origin/Location: SEATTLE SSA TERMINAL (T18), WA

Discharge Port/Container Arrival Date: 23 July 15 E
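
Since the question also asks to put the result in a text file, here is a minimal sketch of that last step. It assumes the rows have already been collected into a list of lists, like tableArrayList above; the class name RoutingFileSketch, the helper toLines, the " | " separator and the file name routing.txt are all made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RoutingFileSketch {

    // Joins the cells of each row with " | " and returns one line per row.
    static List<String> toLines(List<? extends List<String>> rows) {
        List<String> lines = new ArrayList<>();
        for (List<String> row : rows) {
            lines.add(String.join(" | ", row));
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Sample rows for illustration only; in the code above this
        // would be tableArrayList.
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("Routing", "Location"),
                Arrays.asList("Origin", "SEATTLE SSA TERMINAL (T18), WA"));

        // Hypothetical output file; Files.write creates or overwrites it.
        Path out = Paths.get("routing.txt");
        Files.write(out, toLines(rows));
    }
}
```

Because toLines accepts any list of string lists, the ArrayList<ArrayList<String>> from the code above can be passed to it directly.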



You need to use the select method, as in document.select("body"). Its argument is a CSS selector. To learn more about CSS selectors, just search the web for an introduction. With CSS selectors you can easily pick out specific parts of a web page.
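
As a small illustration of how selectors narrow things down, here is a sketch that runs a few of them against an inline HTML string instead of a live page. The class name SelectorSketch, the helper textOf and the HTML itself are invented for the example; only Jsoup.parse, Document.select and Elements.text are jsoup calls.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class SelectorSketch {

    // Made-up HTML standing in for a downloaded page.
    static final String HTML =
            "<html><head><title>Demo</title></head><body>"
            + "<div id='main'><p class='note'>first</p><p>second</p></div>"
            + "</body></html>";

    // Returns the combined text of every element matched by the CSS selector.
    static String textOf(String cssSelector) {
        Document document = Jsoup.parse(HTML);
        Elements matches = document.select(cssSelector);
        return matches.text();
    }

    public static void main(String[] args) {
        System.out.println(textOf("body"));      // whole body text
        System.out.println(textOf("p.note"));    // <p> elements with class "note"
        System.out.println(textOf("#main > p")); // <p> children of the element with id "main"
    }
}
```

Selecting by tag, class or id like this only helps when the page actually uses classes and ids.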

In your specific case you will run into a different problem, though: the table you are after is inside an iframe. If you look at the HTML of the page you are visiting, the URL of that iframe is "http://homeport8.apl.com/gentrack/blRoutingFrame.do". If you visit this URL directly to access its content, you will get an exception, which is possibly some kind of restriction on the server. To get the content properly you need to visit two URLs through jsoup: 1. http://homeport8.apl.com/gentrack/trackingMain.do?trackInput01=999061985 and 2. <a3>

For the first URL you won't get anything useful, but for the second URL you will get the tables you are interested in. Try using document.select("table"), which will give you a list of tables; iterate over that list and find the table you are interested in. Once you have the table, use Element.select("tr") on it to get the table rows, then for each tr use Element.select("td") to get the table cell data.
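
A rough sketch of that table, row and cell walk, run against a made-up inline HTML string rather than the real page. The class name TableWalkSketch, the helper rowsOfTableContaining and the idea of finding the table by a marker word are assumptions for illustration. Because layout pages often nest tables, the sketch skips any table that contains another table.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class TableWalkSketch {

    // Made-up page: an outer layout table wrapping the table of interest.
    static final String HTML =
            "<table><tr><td>"
            + "<table>"
            + "<tr><td>Routing</td><td>Location</td></tr>"
            + "<tr><td>Origin</td><td>SEATTLE</td></tr>"
            + "</table>"
            + "</td></tr></table>";

    // Returns the cell texts, row by row, of the first innermost table
    // whose text contains the marker; an empty list if none matches.
    static List<List<String>> rowsOfTableContaining(String html, String marker) {
        Document document = Jsoup.parse(html);
        List<List<String>> rows = new ArrayList<>();
        for (Element table : document.select("table")) {
            // select() can match the element itself, so a size above 1
            // means this table wraps other tables.
            boolean hasNestedTable = table.select("table").size() > 1;
            if (hasNestedTable || !table.text().contains(marker)) {
                continue;
            }
            for (Element tr : table.select("tr")) {
                List<String> cells = new ArrayList<>();
                for (Element td : tr.select("td")) {
                    cells.add(td.text());
                }
                rows.add(cells);
            }
            break; // stop at the first matching table
        }
        return rows;
    }

    public static void main(String[] args) {
        for (List<String> row : rowsOfTableContaining(HTML, "Origin")) {
            System.out.println(row);
        }
    }
}
```

Collecting however many td cells each row has avoids hard-coding a column count, so short header or spacer rows do not cause an index error.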



The web page you are visiting does not use CSS classes or IDs, which would have made it easier to read with jsoup, so I am afraid that iterating over the result of document.select("table") is your best and easiest option.

Good luck.
