Counting occurrences of each unique string from a List<string[]> into a dictionary
I want to take a List<string[]> as input. The output should be a dictionary whose keys are the unique strings (used as the index) and whose values are arrays of floats, where each position in the array holds the count of that key in the corresponding string[] of the List<string[]>.
So far I have tried:
static class CT
{
    //Counts all terms in array
    public static Dictionary<string, float[]> Termfreq(List<string[]> text)
    {
        List<string> unique = new List<string>();
        foreach (string[] s in text)
        {
            List<string> groups = s.Distinct().ToList();
            unique.AddRange(groups);
        }
        string[] index = unique.Distinct().ToArray();
        Dictionary<string, float[]> countset = new Dictionary<string, float[]>();
        return countset;
    }
}
static void Main()
{
    /* local variable definition */
    List<string[]> doc = new List<string[]>();
    string[] a = { "That", "is", "a", "cat" };
    string[] b = { "That", "bat", "flew", "over", "the", "cat" };
    doc.Add(a);
    doc.Add(b);
    // Console.WriteLine(doc);
    Dictionary<string, float[]> ret = CT.Termfreq(doc);
    foreach (KeyValuePair<string, float[]> kvp in ret)
    {
        Console.WriteLine("Key = {0}, Value = {1}", kvp.Key, kvp.Value);
    }
    Console.ReadLine();
}
I am stuck on the dictionary part. What's the most efficient way to do this?
It looks like you could use something like:
var dictionary = doc
    .SelectMany(array => array)
    .Distinct()
    .ToDictionary(word => word,
                  word => doc.Select(array => array.Count(x => x == word))
                             .ToArray());
In other words, first find the distinct set of words, then create a mapping for each word.
To create the mapping, look at each array in the original document and count the occurrences of the word in that array. (Thus, each array is mapped to an int.) LINQ performs this mapping across the whole document, and ToArray produces an int[] for that specific word, which becomes the value of the word's dictionary entry.
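To make the mapping concrete, here is a minimal sketch that wraps the query above in a method and runs it on the sample data from the question. The names TermCounts and PerDocumentCounts are made up for this illustration. The values are printed with string.Join because passing an array straight to Console.WriteLine prints only its type name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class TermCounts
{
    // Hypothetical helper name; the body is the query from the answer.
    public static Dictionary<string, int[]> PerDocumentCounts(List<string[]> doc)
    {
        return doc
            .SelectMany(array => array)   // flatten all arrays into one word sequence
            .Distinct()                   // keep each word once
            .ToDictionary(word => word,
                          // one count per input array, in input order
                          word => doc.Select(array => array.Count(x => x == word))
                                     .ToArray());
    }
}

static class Program
{
    static void Main()
    {
        var doc = new List<string[]>
        {
            new[] { "That", "is", "a", "cat" },
            new[] { "That", "bat", "flew", "over", "the", "cat" },
        };

        Dictionary<string, int[]> counts = TermCounts.PerDocumentCounts(doc);
        foreach (KeyValuePair<string, int[]> kvp in counts)
        {
            // Join the array elements so the actual counts are shown.
            Console.WriteLine("Key = {0}, Value = [{1}]",
                              kvp.Key, string.Join(", ", kvp.Value));
        }
    }
}
```

With this input, "That" and "cat" map to [1, 1], "is" maps to [1, 0], and "bat" maps to [0, 1].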
Note that this creates a Dictionary<string, int[]>, not a Dictionary<string, float[]>. That seems more sensible to me, but you can always cast the result of Count to float if you really want to.
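If float[] values are required, a sketch of the cast could look like the WithCast method below. The SinglePass method is an alternative that is not part of the answer above: since the question asks about efficiency, it walks the input once instead of rescanning every array for every distinct word. All class and method names here are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class FloatTermCounts
{
    // Same query as above, with each count cast to float.
    public static Dictionary<string, float[]> WithCast(List<string[]> doc)
    {
        return doc
            .SelectMany(array => array)
            .Distinct()
            .ToDictionary(word => word,
                          word => doc.Select(array => (float)array.Count(x => x == word))
                                     .ToArray());
    }

    // Alternative sketch (an assumption, not the answer's method):
    // visit every word once and increment its slot for the current array.
    public static Dictionary<string, float[]> SinglePass(List<string[]> doc)
    {
        var counts = new Dictionary<string, float[]>();
        for (int i = 0; i < doc.Count; i++)
        {
            foreach (string word in doc[i])
            {
                float[] row;
                if (!counts.TryGetValue(word, out row))
                {
                    // First time this word is seen: one slot per input array, all zero.
                    row = new float[doc.Count];
                    counts[word] = row;
                }
                row[i] += 1f;
            }
        }
        return counts;
    }
}

static class FloatDemo
{
    static void Main()
    {
        var doc = new List<string[]>
        {
            new[] { "That", "is", "a", "cat" },
            new[] { "That", "bat", "flew", "over", "the", "cat" },
        };

        // Both methods should report one occurrence of "cat" in each array.
        Console.WriteLine(string.Join(", ", FloatTermCounts.WithCast(doc)["cat"]));
        Console.WriteLine(string.Join(", ", FloatTermCounts.SinglePass(doc)["cat"]));
    }
}
```

Both methods produce the same counts. The LINQ version is shorter, while the single-pass version does work proportional to the total number of words, which may matter for large inputs.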