Android Jsoup parser is very slow on KitKat

Jsoup seems to parse things a lot slower on KitKat than on anything before KitKat. I'm not sure whether the ART runtime is the cause, but I ran a speed test on a parsing method and found it to be about 5x slower, and I don't know why.

This piece of my code runs in the doInBackground of an AsyncTask.

    JsoupParser parser = new JsoupParser();
    parser.setPath(String.valueOf(application.getCacheDir()));

    Collection<Section> allSections = eguide.getSectionMap().values();
    for (Section section : allSections) {
         parser.createNewAssetList();
         parser.setContent(section.color, section.name, section.text, section.slug);
         if (!TextUtils.isEmpty(section.text)) {
            section.text = parser.setWebViewStringContent();
            section.assets = parser.getAssets();
            for (Asset asset : section.assets)
                asset.heading = section.heading;
         }
    } 

      

I wrote this ages ago and it is probably not very efficient. It sets up a parser and loads a list of section objects; for each section it parses the tables and images, extracting their HTML into a list of Asset objects that is handed back to the original section object.

This is my parser class.

public class JsoupParser{

private List<Asset> assets;
private int assetCount;
private String slug,name,color,path;
private Document doc;

public JsoupParser() {
    assetCount = 0;
    assets = new ArrayList<Asset>();
}

public void setPath(String path) {
    this.path = path;
}

public void setContent(String color, String name, String text, String slug){
    this.color = color;
    this.name = name;
    this.slug = slug;
    doc = Jsoup.parse(text);
}

public void createNewAssetList(){
    assetCount = 0;
    assets = new ArrayList<Asset>();
}

public String setWebViewStringContent() {

    addScriptsAndDivTags();

    //parse images
    Elements images  = doc.select("img[src]");
    parseImages(images);

    //parse tables
    Elements tableTags = doc.select("table");
    parseTables(tableTags);

    return doc.toString();
}

private void addScriptsAndDivTags() {

    Element bodyReference = doc.select("body").first(); //grab head and body ref's
    Element headReference = doc.select("head").first();

    Element new_body = doc.createElement("body");
    //wrap content in an extra div and add the accordion id
    bodyReference.tagName("div");
    bodyReference.attr("id", "accordion");
    new_body.appendChild(bodyReference);
    headReference.after(new_body);
}

private void parseTables(Elements tableTags) {
    if (tableTags != null) {
        int count = 1;
        for (Element table : tableTags) {
            Asset item = new Asset();
            item.setContent(table.toString());
            item.setColor(color);
            item.id = (int) Math.ceil(Math.random() * 10000);
            item.isAsset=1;
            item.keywords = table.attr("keywords");
            String linkHref = table.attr("table_name");
            item.slug = "t_" + slug + " " + count ;
            if(!TextUtils.isEmpty(linkHref)){
               item.name = linkHref;
            }
            else{
               item.name ="Table-" + (assetCount + 1) + " in " + name;
            }
            // replace tables
            String inline = table.attr("inline");
            String button = ("<p>Dummy Button</p>");

            if(!TextUtils.isEmpty(inline)&& inline.contentEquals("false") || TextUtils.isEmpty(inline) )
            {
              table.replaceWith(new DataNode(button, ""));
            }
            else{
                Element div = doc.createElement("div");
                div.attr("class","inlineTableWrapper");
                div.attr("onclick", "window.location ='table://"+item.slug+"';");
                table.replaceWith(div);
                div.appendChild(table);
            }
            assets.add(item);
            assetCount++;
            count++;
        }
    }
}

private void parseImages(Elements images) {
    for (Element image : images) {
        Asset item = new Asset();

        String slug = image.attr("src");
        //remove first forward slash from slug to account for img:// protocol in image linking
        if(slug.startsWith("/")) // startsWith also copes with an empty src, where charAt(0) would throw
            slug = slug.substring(1);
        image.attr("src", path +"/images/" + slug.substring(slug.lastIndexOf("/")+1, slug.length()));
        image.attr("style", "px; border:1px solid #000000;");
        String image_name = image.attr("image_name");
        if(!TextUtils.isEmpty(image_name)){
           item.name = image_name;
        }
        else{
           item.name ="Image " + (assetCount + 1) + " in " + name;
        }

        // replace images
        String inline = image.attr("inline");

        String button = ("<p>Dummy Button</p>");
        item.setContent(image.toString()+"<br/><br/><br/><br/>");
        if(!TextUtils.isEmpty(inline)&& inline.contentEquals("false"))
        {
            image.replaceWith(new DataNode(button, ""));
        }
        else{
           image.attr("onclick", "window.location ='img://"+slug+"';");
        }

        item.keywords = image.attr("keywords");
        item.setColor(color);
        item.id = (int) Math.ceil(Math.random() * 10000);
        item.slug = slug;
        item.isAsset =2;
        assets.add(item);
        assetCount++;
    }
}

public String getName() {
    return name;
}

public List<Asset> getAssets() {
    return assets;
}
}

      

Again, it's probably not very efficient, but so far I haven't been able to figure out why it takes such a performance hit on KitKat. Any information would be greatly appreciated. Thanks!


2 answers


Update, April 7, 2015: The author of jsoup has merged my suggestion into the main trunk. It now checks for ASCII or UTF encodings and skips the slow (on Android 4.4 and 5) canEncode() call, so just update your jsoup source tree and rebuild, or pull the latest jar.

Previous comments and explanation of the problem follow. I found what the problem was, at least in my application. The Entities.java module of jsoup has an escape() function, used for example by Element.outerHtml(), which calls it for all text nodes. Among other things, it checks, for every character of every text node, whether that character can be encoded with the current encoder:

    if (encoder.canEncode(c))
        accum.append(c);
    else ...
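
To get a feel for what this per-character check costs on a particular device, it can be timed in isolation. The following is only a rough sketch using the standard java.nio.charset API, not jsoup's code; the class name, the method name and the sample text are made up for illustration, and the timing it prints will vary by device and runtime.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

class CanEncodeTiming {
    // Calls canEncode once per character, as the escape loop does,
    // and returns how many characters the encoder accepted.
    static int countEncodable(CharSequence text, CharsetEncoder encoder) {
        int encodable = 0;
        for (int i = 0; i < text.length(); i++) {
            if (encoder.canEncode(text.charAt(i))) {
                encodable++;
            }
        }
        return encodable;
    }

    public static void main(String[] args) {
        // Hypothetical sample input: a small snippet repeated many times.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200000; i++) {
            sb.append("<p>caf\u00e9</p>");
        }
        CharsetEncoder encoder = Charset.forName("UTF-8").newEncoder();

        long start = System.nanoTime();
        int encodable = countEncodable(sb, encoder);
        long elapsedMs = (System.nanoTime() - start) / 1000000;

        System.out.println(encodable + " of " + sb.length()
                + " chars encodable, took " + elapsedMs + " ms");
    }
}
```

Running the same code on a pre-KitKat device and on a KitKat or Lollipop device should show whether canEncode() is the bottleneck in a given setup.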

      

The canEncode() call is extremely slow on Android KitKat and Lollipop. Since my HTML output is only ever UTF-8, and the Unicode encodings can encode pretty much any character, this check is not required. I changed it by adding a check at the beginning of the escape() function:

    boolean encIsUnicode = encoder.charset().name().toUpperCase().startsWith("UTF-");

      

and then when a test is required:



    if (encIsUnicode || encoder.canEncode(c))
        accum.append(c);
    else ...
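
Put together, the two fragments above could look roughly like the sketch below. This is a simplified stand-in and not jsoup's actual escape() implementation: the class name, the three hard-coded entities and the hex fallback are assumptions made for illustration, and surrogate pairs are not handled specially.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.util.Locale;

class EscapeSketch {
    // Simplified escape loop: decides once per call whether the target
    // charset is a UTF-* encoding, and only falls back to the expensive
    // canEncode(c) check when it is not.
    static String escape(String text, CharsetEncoder encoder) {
        boolean encIsUnicode = encoder.charset().name()
                .toUpperCase(Locale.ROOT).startsWith("UTF-");
        StringBuilder accum = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': accum.append("&amp;"); break;
                case '<': accum.append("&lt;"); break;
                case '>': accum.append("&gt;"); break;
                default:
                    if (encIsUnicode || encoder.canEncode(c)) {
                        accum.append(c);
                    } else {
                        // Illustrative fallback: numeric character reference.
                        accum.append("&#x").append(Integer.toHexString(c)).append(';');
                    }
            }
        }
        return accum.toString();
    }

    public static void main(String[] args) {
        String text = "a<b & caf\u00e9";
        System.out.println(escape(text, Charset.forName("UTF-8").newEncoder()));
        System.out.println(escape(text, Charset.forName("US-ASCII").newEncoder()));
    }
}
```

With a UTF-8 encoder the loop never calls canEncode(); with US-ASCII it still does, so non-ASCII characters are replaced by a numeric reference.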

      

Now my app works like a charm on KitKat and Lollipop: what used to take 10 seconds now takes less than 1 second. I've issued a pull request to the main jsoup repository with this change and a few smaller optimizations I've made. I'm not sure whether the author of jsoup will merge it. If you like, check my fork at:

https://github.com/gregko/jsoup

If you are working with some other encodings that you know beforehand, you can add your own tests (for example, checking whether the character is ASCII) to avoid the expensive canEncode(c) call.
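
As one possible shape for such a test, the sketch below short-circuits on plain ASCII characters before consulting the encoder. It assumes the output charset can represent every ASCII character (true for ISO-8859-1, used here as an example); the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

class AsciiFastPath {
    // Assumption: the target charset can represent every ASCII character,
    // so anything below 0x80 needs no encoder lookup.
    static boolean canEncodeFast(char c, CharsetEncoder encoder) {
        if (c < 0x80) {
            return true; // cheap path, no canEncode() call
        }
        return encoder.canEncode(c); // expensive path, non-ASCII only
    }

    public static void main(String[] args) {
        CharsetEncoder latin1 = Charset.forName("ISO-8859-1").newEncoder();
        System.out.println(canEncodeFast('a', latin1));
        System.out.println(canEncodeFast('\u00e9', latin1));
        System.out.println(canEncodeFast('\u20ac', latin1));
    }
}
```

For mostly-ASCII HTML this avoids the slow call for the bulk of the text, while still giving the encoder the final say on everything else.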

Greg



You use a lot of string concatenation, which can be a killer for large amounts of data:

    item.name = "Table-" + (assetCount + 1) + " in " + name;

      



According to this post, "Is it always a bad idea to use + for string concatenation", you should avoid concatenation in loops, which is what your code does. How about:

    item.name = String.format("Table-%s in %s", assetCount + 1, name);
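
For comparison, the same name can also be built with an explicit StringBuilder. The sketch below is only an illustration: the class and helper names are made up, and whether any of these variants makes a measurable difference is something to profile in the actual app.

```java
class AssetNames {
    // Builds the same name as the concatenation in the question,
    // using an explicit StringBuilder.
    static String tableName(int assetCount, String sectionName) {
        return new StringBuilder()
                .append("Table-")
                .append(assetCount + 1)
                .append(" in ")
                .append(sectionName)
                .toString();
    }

    public static void main(String[] args) {
        // Both lines build the same string for the same inputs.
        System.out.println(tableName(0, "Intro"));
        System.out.println(String.format("Table-%s in %s", 0 + 1, "Intro"));
    }
}
```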
