Conjunctive regex matching of tokens in Java

I have a large List<String> in which each string is a sentence containing one or more "tokens" (an "a" or "b" prefix followed by a positive integer):

List<String> tokenList = new ArrayList<String>();
tokenList.add("How now a1 cow.");
tokenList.add("The b1 has oddly-shaped a2.");
tokenList.add("I like a2! b2, b2, b2!");
// etc.

      

I want to write a function that takes a vararg list of tokens and returns the subset of tokenList strings that contain all of the token arguments. For example:

public class TokenMatcher {
    List<String> tokenList; // Same tokenList as above

    List<String> findSentencesWith(String... tokens) {
        List<String> results = new ArrayList<String>();
        StringBuilder sb = new StringBuilder();

        // Build up the regex... (TODO: this is where I'm going wrong)
        for(String t : tokens) {
            sb.append(t);
            sb.append("|");
        }

        String regex = sb.toString();

        for(String sentence : tokenList) {
            if(sentence.matches(regex)) {
                results.add(sentence);
            }
        }

        return results;
    }
}

      

Again, the regex must be built so that every token passed to the function has to be present in the sentence for the match to be true. Hence:

TokenMatcher matcher = new TokenMatcher(tokenList);
List<String> results = matcher.findSentencesWith("a1");     // Returns 1 String ("How now a1 cow.")
List<String> results2 = matcher.findSentencesWith("b1");    // Returns 1 String ("The b1 has oddly-shaped a2.")
List<String> results3 = matcher.findSentencesWith("a2");    // Returns the 2 Strings with a2 in them since "a2" is all we care about...
List<String> results4 = matcher.findSentencesWith("a2", "b2");  // Returns 1 String ("I like a2! b2, b2, b2!") because we care about BOTH tokens

      

The last example (results4) is important because, although the token "a2" is present in several sentences, with results4 we ask the method to give us matches for sentences containing both tokens. This is an n-ary conjunction, which means that if we specified 50 tokens as parameters, we would only want sentences with all 50 tokens.

The findSentencesWith example above is my best attempt. Any ideas?


1 answer


Given your stated requirements that neither order nor frequency matters, I see no need to use regular expressions at all in this case.

Instead, you can check each string against all of the supplied tokens and see whether every one of them is contained in the string. If so, the string stays in the result set. As soon as a missing token is found, that string is removed from the result set.

This kind of code would look something like this:

TokenMatcher.java

package so_token;

import java.util.*;    

public class TokenMatcher {

    public TokenMatcher(List<String> tokenList) {
        this.tokenList = tokenList;
    }

    List<String> tokenList;

    List<String> findSentencesWith(String... tokens) {
        List<String> results = new ArrayList<String>();

        // start by assuming they're all good...
        results.addAll(tokenList);

        for (String str : tokenList) {
            for(String t : tokens) {
                // ... and remove it from the result set if we fail to find a token
                if (!str.contains(t)) {
                    results.remove(str);

                    // no point in checking the remaining tokens for this string
                    break;
                }
            }
        }

        return results;
    }

    public static void main (String[] args) throws java.lang.Exception
    {
        List<String> tokenList = new ArrayList<String>();
        tokenList.add("How now a1 cow.");
        tokenList.add("The b1 has oddly-shaped a2.");
        tokenList.add("I like a2! b2, b2, b2!");

        TokenMatcher matcher = new TokenMatcher(tokenList);

        List<String> results = matcher.findSentencesWith("a1");     // Returns 1 String ("How now a1 cow.")

        for (String r : results) {
            System.out.println("1 - result: " + r);
        }

        List<String> results2 = matcher.findSentencesWith("b1");    // Returns 1 String ("The b1 has oddly-shaped a2.")

        for (String r : results2) {
            System.out.println("2 - result: " + r);
        }

        List<String> results3 = matcher.findSentencesWith("a2");    // Returns the 2 Strings with a2 in them since "a2" is all we care about...

        for (String r : results3) {
            System.out.println("3 - result: " + r);
        }       

        List<String> results4 = matcher.findSentencesWith("a2", "b2");  // Returns 1 String ("I like a2! b2, b2, b2!") because we care about BOTH tokens

        for (String r : results4) {
            System.out.println("4 - result: " + r);
        }
    }
}

      



This produces the following result:

1 - result: How now a1 cow.
2 - result: The b1 has oddly-shaped a2.
3 - result: The b1 has oddly-shaped a2.
3 - result: I like a2! b2, b2, b2!
4 - result: I like a2! b2, b2, b2!

      

A slightly modified, runnable version (essentially without the package declaration and with a non-public class, so that it runs on the site) is available on ideone.
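
The same containment check can also be written as a filter that only adds sentences passing the test, rather than copying the whole list and removing failures. The following is a minimal sketch of that variant, not the code from the answer; the class name FilteringTokenMatcher and the helper containsAll are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical variant of TokenMatcher; the name is an assumption for this sketch.
class FilteringTokenMatcher {
    private final List<String> tokenList;

    FilteringTokenMatcher(List<String> tokenList) {
        this.tokenList = tokenList;
    }

    // Returns the sentences that contain every one of the given tokens.
    List<String> findSentencesWith(String... tokens) {
        List<String> results = new ArrayList<>();
        for (String sentence : tokenList) {
            if (containsAll(sentence, tokens)) {
                results.add(sentence);
            }
        }
        return results;
    }

    // True only if no token is missing from the sentence (hypothetical helper).
    private static boolean containsAll(String sentence, String... tokens) {
        for (String t : tokens) {
            if (!sentence.contains(t)) {
                return false; // first missing token rules the sentence out
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList(
                "How now a1 cow.",
                "The b1 has oddly-shaped a2.",
                "I like a2! b2, b2, b2!");
        FilteringTokenMatcher matcher = new FilteringTokenMatcher(sentences);
        for (String r : matcher.findSentencesWith("a2", "b2")) {
            System.out.println(r);
        }
    }
}
```

Building the result directly avoids the repeated results.remove(str) calls, each of which scans the list, and like the original it returns every sentence when no tokens are passed.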

Note: Based on the information you provided, and since the function accepts a list of tokens, it seems that contains would be sufficient to determine whether a token is present. However, if it turns out that there are additional restrictions, for example that a token must be followed by a space or one of a set of punctuation marks in order to count as a token, then I would recommend using regular expressions on a per-token basis, replacing contains with matches and passing in a regex that specifies what you want surrounding the token.
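
As a rough sketch of that per-token regex idea, the code below assumes (this rule is not stated in the question) that a token only counts when it is not directly adjacent to another letter or digit, so "a1" is not found inside "a12". It uses Matcher.find() rather than String.matches, because matches has to match the entire string and would need ".*" on both sides of the token. The names BoundaryTokenMatcher and tokenPattern are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper class; names and the boundary rule are assumptions for this sketch.
class BoundaryTokenMatcher {

    // Builds a pattern that finds the literal token only when it is not
    // preceded or followed by a letter or digit (assumed delimiter rule).
    static Pattern tokenPattern(String token) {
        return Pattern.compile(
                "(?<![A-Za-z0-9])" + Pattern.quote(token) + "(?![A-Za-z0-9])");
    }

    // Returns the sentences in which every token is found as a delimited token.
    static List<String> findSentencesWith(List<String> sentences, String... tokens) {
        // Compile one pattern per token up front so each is built only once.
        List<Pattern> patterns = new ArrayList<>();
        for (String t : tokens) {
            patterns.add(tokenPattern(t));
        }

        List<String> results = new ArrayList<>();
        for (String sentence : sentences) {
            boolean allFound = true;
            for (Pattern p : patterns) {
                if (!p.matcher(sentence).find()) {
                    allFound = false; // one missing token is enough to reject
                    break;
                }
            }
            if (allFound) {
                results.add(sentence);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList(
                "How now a1 cow.",
                "Stray a12 here.");
        // Only the first sentence has "a1" as a standalone token.
        System.out.println(findSentencesWith(sentences, "a1"));
    }
}
```

Pattern.quote keeps the token literal, so a token containing regex metacharacters would not change the meaning of the pattern.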

It might also be desirable to have a function that validates the token list that is passed to the findSentencesWith function.
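
One possible shape for such a validation step is sketched below. It assumes the token format described in the question ("a" or "b" followed by a positive integer) and additionally assumes no leading zeros; the class and method names (TokenValidator, isValidToken, validateTokens) are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical validator; the exact token grammar is an assumption based on the question.
class TokenValidator {

    // "a" or "b" followed by a positive integer without leading zeros (assumed format).
    private static final Pattern TOKEN = Pattern.compile("[ab][1-9][0-9]*");

    // True if the whole string is a well-formed token.
    static boolean isValidToken(String token) {
        return token != null && TOKEN.matcher(token).matches();
    }

    // Throws IllegalArgumentException on the first malformed token.
    static void validateTokens(String... tokens) {
        for (String t : tokens) {
            if (!isValidToken(t)) {
                throw new IllegalArgumentException("Not a valid token: " + t);
            }
        }
    }

    public static void main(String[] args) {
        validateTokens("a1", "b22"); // passes silently
        System.out.println(isValidToken("c1"));
    }
}
```

Here matches is the right call, since the entire argument has to be a token. Calling validateTokens(tokens) at the top of findSentencesWith would reject bad input before any sentence is scanned.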
