Spell checker - noisy channel model with Java streams

I have a list of query logs with entries that look like this:

Session ID Query
01 Movie atcor
01 Movie actor
02 Award winning axtor
02 Award winning actor
03 Soap opera axtor
03 Soap opera actor
...

      

I need to determine the probability that a spelling correction is right. For example, to determine the probability that "actor" is the correct spelling of "axtor", I would divide the number of sessions in which "axtor" was replaced by "actor" by the number of sessions in which "actor" was the correct spelling of any misspelled word.

In this case the probability is 2/3, since there are two sessions in which "actor" replaces "axtor" and three sessions in which "actor" replaces a misspelling ("atcor" or "axtor").
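Written as a formula, the estimate looks like the following. This is only a restatement of the definition above; the notation $N(a \to b)$ for "number of sessions in which $a$ was replaced by $b$" is shorthand assumed here, not something from the logs.

```latex
P(\text{axtor} \to \text{actor})
  = \frac{N(\text{axtor} \to \text{actor})}{\sum_{w} N(w \to \text{actor})}
  = \frac{2}{1 + 2}
  = \frac{2}{3}
```

The sum in the denominator runs over every misspelling $w$ that was corrected to "actor"; in the sample log that is "atcor" (one session) and "axtor" (two sessions).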

I'm trying to get more familiar with Java 8 streams, so I'd like a solution that uses streams.

Here's what I was able to come up with. This is a step in the right direction, but I am still missing some parts.

public int numberOfCorrections(String misspelledWord, String suggestedWord)
{
    return (int) sessionIdsWithWord(misspelledWord)
            .stream()
            .map(sessionId -> getLogsWithSameSessionId(sessionId)
                    .stream()
                    .filter(queryLog -> queryLog.queryContainsWord(suggestedWord))
                    .count()
            ).count();
}

public Set<String> sessionIdsWithWord(String word)
{
    return getQueryLogsThatContainWord(word)
            .stream()
            .map(QueryLog::getSessionId)
            .collect(Collectors.toSet());
}

public List<QueryLog> getQueryLogsThatContainWord(String word)
{
    return logs
            .stream()
            .filter(queryLog -> queryLog.queryContainsWord(word))
            .collect(Collectors.toList());
}

public Map<String, List<QueryLog>> getSessionIdMapping()
{
    return logs
            .stream()
            .collect(Collectors.groupingBy(QueryLog::getSessionId));
}

public List<QueryLog> getLogsWithSameSessionId(String sessionId)
{
    return getSessionIdMapping()
            .get(sessionId);
}

      

What I am doing is not entirely correct. I only filter on whether suggestedWord appears at all in the query log. I need to check that it appears in the right place (the fix in the same position as the misspelled word).

I need a way, in the .map part of the stream in numberOfCorrections, to check whether the query log has suggestedWord in the same position that misspelledWord had in the earlier query. This is where I am stuck. How can I do this?

I think it could be something like this:

.map(sessionId -> getLogsWithSameSessionId(sessionId)
        .stream()
        .filter(queryLog -> //queryLog.getQuery().equals(some other queryLog in the same session)
        .count()
).count();

      

But I don't know if there is a way to compare with another queryLog in the same session.

I can't move on to the second half of my probability until I can figure out how to filter based on whether a given query is similar to another query in the same session.



1 answer


It's hard to go through your methods one by one. Here's a simple solution:

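// Assumes static imports of collectingAndThen, groupingBy and counting from
// java.util.stream.Collectors, and a java.nio.file.Path field named logFilePath.
public double countProbability(String misspelledWord, String suggestedWord) throws IOException {
    try (Stream<String> stream = Files.lines(logFilePath)) {
        return stream.skip(1) // skip the header line
                // label each line with the word it contains, or "" if it contains neither
                .map(line -> line.contains(misspelledWord) ? misspelledWord : (line.contains(suggestedWord) ? suggestedWord : ""))
                .filter(w -> !w.isEmpty())
                // count lines per word; return 0 unless both words were seen
                .collect(collectingAndThen(groupingBy(Function.identity(), counting()),
                        m -> m.size() < 2 ? 0d : m.get(misspelledWord).doubleValue() / m.get(suggestedWord)));
    }
}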

      



I may be misunderstanding your question.
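If the position of the word matters, as the question asks, one possible approach is to group the queries by session and compare consecutive queries word by word. The following is only a sketch under assumptions: the `Entry` class is a hypothetical stand-in for `QueryLog`, the method names are made up for illustration, and two queries are treated as comparable only when they have the same number of words.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

final class CorrectionSketch {

    // Hypothetical stand-in for the question's QueryLog class.
    static final class Entry {
        final String sessionId;
        final String query;

        Entry(String sessionId, String query) {
            this.sessionId = sessionId;
            this.query = query;
        }
    }

    // (wrong, right) word pairs that differ at the same index in two queries.
    // Assumption: queries with different word counts are not comparable.
    static Stream<List<String>> alignedDiffs(String before, String after) {
        String[] a = before.trim().toLowerCase().split("\\s+");
        String[] b = after.trim().toLowerCase().split("\\s+");
        if (a.length != b.length) {
            return Stream.empty();
        }
        return IntStream.range(0, a.length)
                .filter(i -> !a[i].equals(b[i]))
                .mapToObj(i -> Arrays.asList(a[i], b[i]));
    }

    // Distinct (wrong, right) pairs between consecutive queries of one session.
    static Set<List<String>> sessionCorrections(List<String> queries) {
        return IntStream.range(0, queries.size() - 1)
                .boxed()
                .flatMap(i -> alignedDiffs(queries.get(i), queries.get(i + 1)))
                .collect(Collectors.toSet());
    }

    // Sessions where misspelled -> suggested, divided by sessions where anything -> suggested.
    static double probability(List<Entry> logs, String misspelled, String suggested) {
        List<Set<List<String>>> perSession = logs.stream()
                .collect(Collectors.groupingBy(e -> e.sessionId, LinkedHashMap::new,
                        Collectors.mapping(e -> e.query, Collectors.toList())))
                .values().stream()
                .map(CorrectionSketch::sessionCorrections)
                .collect(Collectors.toList());

        List<String> target = Arrays.asList(misspelled.toLowerCase(), suggested.toLowerCase());
        long replacedByTarget = perSession.stream()
                .filter(pairs -> pairs.contains(target))
                .count();
        long correctedToSuggested = perSession.stream()
                .filter(pairs -> pairs.stream().anyMatch(p -> p.get(1).equals(target.get(1))))
                .count();
        return correctedToSuggested == 0 ? 0d : (double) replacedByTarget / correctedToSuggested;
    }

    public static void main(String[] args) {
        List<Entry> logs = Arrays.asList(
                new Entry("01", "Movie atcor"), new Entry("01", "Movie actor"),
                new Entry("02", "Award winning axtor"), new Entry("02", "Award winning actor"),
                new Entry("03", "Soap opera axtor"), new Entry("03", "Soap opera actor"));
        System.out.println(probability(logs, "axtor", "actor"));
    }
}
```

On the six sample log lines from the question this counts two sessions for "axtor" replaced by "actor" and three sessions for anything replaced by "actor", which matches the 2/3 worked out by hand. Each session is counted at most once per pair because the pairs are collected into a set.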
