Quoting via ArrayList with another Arraylist in Java

I have a large list of sentence sentences and another list of words.

My program goes through a list of arrays and removes an item from that list of arrays if the sentence contains any of the words of the other.

The array list of sentences can be very large and I have coded a quick and dirty nested loop. While this works when there are not many offers, where there are, the time it takes to complete this operation is ridiculously long.

for (int i = 0; i < SENTENCES.size(); i++) {

        for (int k = 0; k < WORDS.size(); k++) {

            if (SENTENCES.get(i).contains(" " + WORDS.get(k) + " ") == true) {

                //Do something
            }
        }
    }

      

Is there a more efficient way to do this and then nested for loop?

+3


source to share


9 replies


There is some inefficiency in your code, but at the end of the day, if you have to search for sentences containing words, then you can't get away from loops.

However, there are a few things to try.

Do WORDS

a first HashSet

, the method contains

will be much faster than for ArrayList

because it does a hash lookup to get the value.

Second, change the logic a little like this:



Iterator<String> sentenceIterator = SENTENCES.iterator();

sentenceLoop:
while (sentenceIterator.hasNext())
{
  String sentence = sentenceIterator.next();

  for (String word : sentence.replaceAll("\\p{P}", " ").toLowerCase().split("\\s+"))
  {
    if (WORDS.contains(word))
    {
      sentenceIterator.remove();
      continue sentenceLoop;
    }
  }      
}    

      

This code (which assumes you are trying to remove sentences containing certain words) uses Iterator

and avoids the concatenation and syntax parsing logic string

you used in your original code (replacing it with a single regex), as from which it should be faster.

But keep in mind that, as with all performance, you will need to test these changes to see how they improve the situation.

+6


source


I̶ ̶w̶o̶u̶l̶d̶ ̶s̶a̶y̶ ̶n̶o̶, ̶ ̶b̶u̶t̶, what you should change is how you handle data deletion. This is stated in this part of the explanation for your problem:

The list of arrays of sentences can be very large (...). While this works when there are not many offers, where there are, the time it takes to complete this operation is ridiculously long.

The reason for this is that the removal time in ArrayList

takes O (N), and since you are doing this inside a loop, then it will take at least O (N ^ 2).



I recommend using LinkedList

instead ArrayList

for storing sentences and using Iterator

rather than your naive List#get

one as it already offers Iterator#remove

O (1) time for LinkedList

.

If you cannot change the design to LinkedList

, I recommend storing the sentences that are valid in the new List

one and replacing the original content List

with this new List

one at the end , thus saving a lot of time.

Apart from this big improvement, you can improve the algorithm even further by using Set

to store words to search for rather than using another List

, since search in Set

is O (1).

+4


source


I'll create a set of words from the second ArrayList:

Set<String> listOfWords = new HashSet<String>();
listOfWords.add("one");
listOfWords.add("two");

      

Then I will iterate over the set and the first ArrayList and use Contains:

for (String word : listOfWords) {
     for(String sentence : Sentences) {
           if (sentence.contains(word)) {
                // do something
           }
     }
 }

      

Also, if you can use any open source open source framework, check this:

search string in another string

+1


source


First, your program has a bug: it won't count words at the beginning and end of a sentence.

Your current program has an execution complexity of O (s * w), where s is the length, in characters, of all sentences, and w is the length of all words, also in characters.

If it is words

relatively small (a few hundred elements or so), you can use a regex to speed things up: build a pattern like this and use it in a loop:

StringBuilder regex = new StringBuilder();
boolean first = true;
// Let say WORDS={"quick", "brown", "fox"}
regex.append("\\b(?:");
for (String w : WORDS) {
    if (!first) {
        regex.append('|');
    } else {
        first = false;
    }
    regex.append(w);
}
regex.append(")\\b");
// Now regex is "\b(?:quick|brown|fox)\b", i.e. your list of words
// separated by OR signs, enclosed in non-capturing groups
// anchored to word boundaries by '\b on both sides.
Pattern p = Pattern.compile(regex.toString());
for (int i = 0; i < SENTENCES.size(); i++) {
    if (p.matcher(SENTENCES.get(i)).find()) {
        // Do something
    }
}

      

Since the regex is precompiled into a structure that is more suitable for fast searches, your program will run in O (s * max (w)), where s

is the length, in characters, of all sentences and w is the length of the longest word. Considering that the number of words in your collection is around 200 or 300, this can lead to an order of magnitude decrease.

+1


source


What you could do is put all your words in a HashSet. This allows you to quickly check if a word is in the set. See https://docs.oracle.com/javase/8/docs/api/java/util/HashSet.html for documentation.

HashSet<String> wordSet = new HashSet();
for (String word : WORDS) {
    wordSet.add(word);
}

      

Then it is simply a matter of dividing each sentence into the words that make it up and checking if any of those words are in the set.

for (String sentence : SENTENCES) {
    String[] sentenceWords = sentence.split(" "); // You probably want to use a regex here instead of just splitting on a " ", but this is just an example.
    for (String word : sentenceWords) {
        if (wordSet.contains(word)) {
            // The sentence contains one of the special words.
            // DO SOMETHING
            break;
        }
    }
}

      

+1


source


If you have enough memory, you can fake SENTENCES and put them in a Set. It would then be better in performance and also more correct than the current implementation.

0


source


Ok, looking at your code, I would suggest two things that will improve the performance of each iteration:

  • Remove "== true". The contains operation already returns a boolean, so it is enough that if, comparing it to true, adds one additional operation for each iteration that is not needed.
  • Do not concatenate strings inside a loop ( " " + WORDS.get(k) + " "

    ), as this is quite an expensive operation because the + operator creates new objects. Better to use a string buffer / builder and flush it after each iteration with stringBuffer.setLength(0);

    .

Also, I don't know any other approach for this case, maybe you can use regular expressions if you can abstract the pattern from the words you want to remove and only have one loop.

Hope it helps!

0


source


If you are into efficiency, I believe the most efficient way to do this is to use the Aho-Corasick algorithm . While you have 2 nested loops and method contains()

(which, it seems to me, takes the best sentence length + length of the word ), Aho-Corasick gives you one cycle on the proposals and to check out contain the words you want to sentence length , which is the word length times faster ( + preprocessing time to create a finite state machine, which is relatively short).

0


source


I will approach this in a more theoretical way. If you don't have a memory limit, you can try to simulate the logic in counting the sort

say M1 = sentences .size, M2 = word counts per sentence and N = word.size
Assume all sentences have the same word count just for simplicity
your current approach complexity is O (M1.M2.N)

We can create word matching - position in sentences. Scroll through your sentence array and change them to two flickering word arrays. Loop through the new array, create a HashMap where key, value = words, arraylist of the position of the word (say with length X).
What is O (2M1.M2.X) = O (M1.M2.X)

Then loop through your arraylist words, go to your hashmap of words, go through the word position list. remove each one. What's O (NX)

Say you need to give the result in a string arraylist, we need another loop and whatever. It's O (M1.M2)

The total complexity is O (M1.M2.X) + O (NX) + O (M1.M2)
assumming X less than N, you will probably get better performance

0


source







All Articles