Spell checker - noisy channel model with Java streams
I have a list of query logs with entries that look like this:
Session ID    Query
01            Movie atcor
01            Movie actor
02            Award winning axtor
02            Award winning actor
03            Soap opera axtor
03            Soap opera actor
...
I need to determine the probability of a spelling correction being correct. For example, to determine the probability that "actor" is the correct spelling for "axtor", I would divide the number of sessions in which "axtor" was replaced by "actor" by the number of sessions in which "actor" was the correct spelling of any misspelled word.
In this case the probability is 2/3, since there are two sessions in which "actor" replaces "axtor" and three sessions in which "actor" replaces a misspelling ("atcor" or "axtor").
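Written out as a formula, the estimate described above looks roughly like this. The notation is only shorthand for the two session counts, read as the channel-model probability of typing "axtor" when "actor" was intended:

```latex
P(\text{axtor} \mid \text{actor})
  = \frac{\#\{\text{sessions in which ``axtor'' was replaced by ``actor''}\}}
         {\#\{\text{sessions in which ``actor'' replaced any misspelled word}\}}
  = \frac{2}{3}
```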
I'm trying to get more familiar with Java 8 streams, so I'd like a solution that uses them.
Here's what I was able to come up with. This is a step in the right direction, but I am still missing some parts.
public int numberOfCorrections(String misspelledWord, String suggestedWord)
{
    return (int) sessionIdsWithWord(misspelledWord)
        .stream()
        .map(sessionId -> getLogsWithSameSessionId(sessionId)
            .stream()
            .filter(queryLog -> queryLog.queryContainsWord(suggestedWord))
            .count()
        ).count();
}

public Set<String> sessionIdsWithWord(String word)
{
    return getQueryLogsThatContainWord(word)
        .stream()
        .map(QueryLog::getSessionId)
        .collect(Collectors.toSet());
}

public List<QueryLog> getQueryLogsThatContainWord(String word)
{
    return logs
        .stream()
        .filter(queryLog -> queryLog.queryContainsWord(word))
        .collect(Collectors.toList());
}

public Map<String, List<QueryLog>> getSessionIdMapping()
{
    return logs
        .stream()
        .collect(Collectors.groupingBy(QueryLog::getSessionId));
}

public List<QueryLog> getLogsWithSameSessionId(String sessionId)
{
    return getSessionIdMapping()
        .get(sessionId);
}
What I am doing is not entirely correct. I only filter on whether suggestedWord appears at all in the query log. I need to check that it appears in the right place (the same position in the query that the misspelled word occupied).
I need a way, in the .map part of the stream in numberOfCorrections, to check whether a query log has suggestedWord in the same position where misspelledWord was in the other query. This is where I am stuck. How can I do this?
I think it could be something like this:
.map(sessionId -> getLogsWithSameSessionId(sessionId)
    .stream()
    .filter(queryLog -> //queryLog.getQuery().equals(some other queryLog in the same session)
    .count()
).count();
But I don't know if there is a way to compare with another queryLog in the same session.
I can't move on to the second half of my probability until I figure out how to filter on whether a given query is similar to another query in the same session.
I couldn't follow your methods one by one, so here's a simple solution instead:
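// Needs: import static java.util.stream.Collectors.*; java.io.IOException; java.nio.file.Files; java.util.function.Function; java.util.stream.Stream
public double countProbability(String misspelledWord, String suggestedWord) throws IOException {
    try (Stream<String> stream = Files.lines(logFilePath)) {
        return stream.skip(1) // skip the header line
                // tag each line with the word it contains, or "" if it contains neither
                .map(line -> line.contains(misspelledWord) ? misspelledWord
                        : (line.contains(suggestedWord) ? suggestedWord : ""))
                .filter(w -> !w.isEmpty())
                // count the lines per word, then divide the two counts
                .collect(collectingAndThen(groupingBy(Function.identity(), counting()),
                        m -> m.size() < 2 ? 0d : m.get(misspelledWord).doubleValue() / m.get(suggestedWord)));
    }
}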
I may be misunderstanding your question.
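If the position check from the question does matter, here is a minimal sketch of one way to do it with streams. It is not a drop-in for your classes: QueryLog below is a hypothetical stand-in with only getSessionId() and getQuery(), and the names isCorrection, sessionsWithCorrection and probability are made up for illustration. It assumes a correction is a pair of queries in the same session that have the same number of words and differ in exactly one position, and it does not look at the order of the two queries within the session.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class CorrectionCounter {
    // Hypothetical minimal stand-in for the question's QueryLog class.
    static final class QueryLog {
        private final String sessionId;
        private final String query;
        QueryLog(String sessionId, String query) { this.sessionId = sessionId; this.query = query; }
        String getSessionId() { return sessionId; }
        String getQuery() { return query; }
    }

    // True if the two queries differ in exactly one word position and the word
    // at that position in 'after' is 'suggested'. When 'misspelled' is non-null,
    // the word it replaced in 'before' must equal 'misspelled'.
    static boolean isCorrection(String before, String after, String misspelled, String suggested) {
        String[] a = before.trim().split("\\s+");
        String[] b = after.trim().split("\\s+");
        if (a.length != b.length) return false;
        List<Integer> diffs = IntStream.range(0, a.length)
                .filter(i -> !a[i].equals(b[i]))
                .boxed()
                .collect(Collectors.toList());
        if (diffs.size() != 1) return false;
        int pos = diffs.get(0);
        return b[pos].equals(suggested) && (misspelled == null || a[pos].equals(misspelled));
    }

    // Number of sessions containing at least one (before, after) pair that
    // qualifies as a correction. Pass null as 'misspelled' to accept any replaced word.
    static long sessionsWithCorrection(List<QueryLog> logs, String misspelled, String suggested) {
        Map<String, List<String>> bySession = logs.stream()
                .collect(Collectors.groupingBy(QueryLog::getSessionId,
                        Collectors.mapping(QueryLog::getQuery, Collectors.toList())));
        return bySession.values().stream()
                .filter(queries -> queries.stream().anyMatch(before ->
                        queries.stream().anyMatch(after ->
                                isCorrection(before, after, misspelled, suggested))))
                .count();
    }

    // Sessions where 'misspelled' became 'suggested', divided by sessions
    // where any word became 'suggested'; 0 if there are no such sessions.
    static double probability(List<QueryLog> logs, String misspelled, String suggested) {
        long denominator = sessionsWithCorrection(logs, null, suggested);
        return denominator == 0
                ? 0d
                : (double) sessionsWithCorrection(logs, misspelled, suggested) / denominator;
    }
}
```

With the six sample rows from the question, probability(logs, "axtor", "actor") gives 2/3: sessions 02 and 03 count toward the numerator, and all three sessions count toward the denominator. Comparing every pair of queries is quadratic per session, which should be fine for short sessions but may need rethinking for long ones.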