How do I find out which text Jsoup removed?

I have the following text:

text<html/>text

      

I use the Jsoup library to strip HTML content from the text, with code like this:

Document clean = new Cleaner(none()).clean(myDirtyDoc);

      

I want to log an error for the user: Malicious content was specified: "<html/>".

But I don't know how to correctly identify the text that Jsoup removed.

I tried using StringUtils.difference(cleanedValue, value), but this method works differently. Its documentation says:

Compares two Strings, and returns the portion where they differ.
(More precisely, return the remainder of the second String,
starting from where it's different from the first.)

      

As a result, it returns the following string: <html/>text
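To make the behaviour described above concrete, here is a minimal sketch using only plain String calls. It assumes the cleaned value is texttext (which is consistent with the result shown above, but is an assumption) and that exactly one contiguous piece was removed. The class and method names are hypothetical. The first method mimics what the quoted documentation describes. The second also trims the common suffix, which isolates the removed piece.

```java
final class RemovedTextFinder {

    /** Mimics the documented behaviour: the remainder of 'second' from the first differing index. */
    static String remainderFromFirstDifference(String first, String second) {
        int limit = Math.min(first.length(), second.length());
        int i = 0;
        while (i < limit && first.charAt(i) == second.charAt(i)) {
            i++;
        }
        return second.substring(i);
    }

    /** Returns the part of 'original' missing from 'cleaned', assuming ONE contiguous removal. */
    static String removedPortion(String original, String cleaned) {
        int limit = Math.min(original.length(), cleaned.length());

        // Length of the common prefix.
        int prefix = 0;
        while (prefix < limit && original.charAt(prefix) == cleaned.charAt(prefix)) {
            prefix++;
        }

        // Length of the common suffix, never overlapping the prefix.
        int suffix = 0;
        while (suffix < limit - prefix
                && original.charAt(original.length() - 1 - suffix)
                   == cleaned.charAt(cleaned.length() - 1 - suffix)) {
            suffix++;
        }

        return original.substring(prefix, original.length() - suffix);
    }

    public static void main(String[] args) {
        String value = "text<html/>text";
        String cleanedValue = "texttext"; // assumed output of the cleaner

        System.out.println(remainderFromFirstDifference(cleanedValue, value)); // prints <html/>text
        System.out.println(removedPortion(value, cleanedValue));               // prints <html/>
    }
}
```

This only works when a single piece was removed. If several tags are stripped from different places, everything between the first and the last removal is returned, including the text that was kept.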

It would be good to know of any diff tools that can easily be used in Java to compare strings.

1 answer


google-diff-match-patch

The Diff Match and Patch libraries offer robust algorithms to perform the operations required for synchronizing plain text.

Diff: Compare two blocks of plain text and efficiently return a list of differences.

Match: Given a search string, find its best fuzzy match in a block of plain text. Weighted for both accuracy and location.

Patch: Apply a list of patches onto plain text. Use best effort to apply the patch even when the underlying text doesn't match.

Currently available in Java, JavaScript, Dart, C++, C#, Objective-C, Lua, and Python. Regardless of language, each library features the same API and the same functionality. All versions also have comprehensive test harnesses.

There is a "Line or word diffs" wiki page that describes how to do line-by-line diffs.
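To illustrate the kind of result a diff gives for the question's case, here is a minimal standard-library sketch that lists every piece of the original text missing from the cleaned text. It is not the library's API or its algorithm. It is a simple stand-in that repeatedly finds the longest common substring and recurses on the text to its left and right. The class and method names are hypothetical. With the library itself you would call its diff_main method and keep only the entries marked as deletions.

```java
import java.util.ArrayList;
import java.util.List;

final class RemovedSegments {

    /** Collects, in order, the pieces of 'original' that do not appear in 'cleaned'. */
    static List<String> removedSegments(String original, String cleaned) {
        List<String> removed = new ArrayList<>();
        collect(original, cleaned, removed);
        return removed;
    }

    private static void collect(String original, String cleaned, List<String> removed) {
        if (original.isEmpty()) {
            return; // nothing left that could have been removed
        }
        if (cleaned.isEmpty()) {
            removed.add(original); // the whole remaining chunk was dropped
            return;
        }

        // Longest common substring via a simple O(n*m) table.
        int bestLen = 0;
        int bestOrigEnd = 0;
        int bestCleanEnd = 0;
        int[][] len = new int[original.length() + 1][cleaned.length() + 1];
        for (int i = 1; i <= original.length(); i++) {
            for (int j = 1; j <= cleaned.length(); j++) {
                if (original.charAt(i - 1) == cleaned.charAt(j - 1)) {
                    len[i][j] = len[i - 1][j - 1] + 1;
                    if (len[i][j] > bestLen) {
                        bestLen = len[i][j];
                        bestOrigEnd = i;
                        bestCleanEnd = j;
                    }
                }
            }
        }

        if (bestLen == 0) {
            removed.add(original); // no shared text at all
            return;
        }

        // Recurse on the text to the left and to the right of the shared block.
        collect(original.substring(0, bestOrigEnd - bestLen),
                cleaned.substring(0, bestCleanEnd - bestLen), removed);
        collect(original.substring(bestOrigEnd),
                cleaned.substring(bestCleanEnd), removed);
    }

    public static void main(String[] args) {
        System.out.println(removedSegments("text<html/>text", "texttext")); // prints [<html/>]
        System.out.println(removedSegments("a<b>hello</b>c", "ahelloc"));   // prints [<b>, </b>]
    }
}
```

Text that exists only in the cleaned value (for example, characters the cleaner escaped) is ignored by this sketch. For long inputs a real diff library is the better choice, since the table here grows with the product of both lengths.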
