What algorithms count the frequencies of common elements in a set of sets?

I need information on algorithms that can help bring out the commonality and differences between sets of overlapping data.

Using stackoverflow tag system as an example:

Let's say this question is given 5 tags. Let's say there are 1000 more questions that contain at least one of these tags. Out of these 1000 questions, how many of these questions are tags common to my original post?

Another simpler way to describe this is with an automatic labeling system:

"You tagged your question with [5 tags that I chose]. Other similar questions have been tagged with [list of tags that might be of interest]. Where [list of tags that might be of interest]" are often tags that arenโ€™t on my initial list.

Code examples in C # if possible :)

+1


source to share


2 answers


I don't know of any specific algorithms or data structures, but I could suggest a basic way to solve this problem:

Assumption: Each entry has five unique tags.

  • Collect all records containing any of the five tags (no duplicates).
  • For each entry in the list, use an associative array (hash table) for each tag, incrementing the value.
  • For each entry in the array, add the tag name to that array entry index.

In (sloppy) pseudocode, use two loops (if possible):



for each entry
    if any tag in original_tags
        tag_list[tag]++
end

for next in tag_list
    tag_count[tag_list[next]] += next
end  

      

This should result in a sparse array of concatenated tag names (ok, I didn't include the delimiter, but hey its pseudocode :-). Keep the largest number and then iterate backward for the best offers.

(Cache for optimization, but stay tuned)

Paul.

0


source


Look at the Wager-Hamming distance. This is the Hamming distance, defined on lines, as the number of edits it takes to convert one line to another.



You can also use partial ordering of equivalence classes and ask for inclusion: when questions A and B have the same set of tags before reordering, they are equal, set union, set difference and set intersection, then define partial order for <and> comparison.

+1


source







All Articles