Remove all duplicates generalization

Question

Remove all duplicates generalization

Let be related

a binary predicate. Let us define a "subset of relations" as the set of all elements, so that any element in the subset is related to at least one other element in the subset through a “related” other than it (hence, whether an element is related to itself or has nothing to do with the formation of subsets relations). Note. A subset of relationships is NOT necessarily a tightly coupled component . For example, suppose A is related to B, and B and C are related to each other. Then {A, B, C} is a subset of relations by definition, but not a tightly coupled component because there is no path from B to A or C to A via 'related'.

Note that "linked" is not necessarily symmetrical; related (a, b) == true does not necessarily mean that related (b, a) == true and is not transitive, i.e. related (a, b) == true and related (b, c) == true do not necessarily imply related (a, c) == true. As far as I can see, no constraint on the binary predicate related

is required to split the set into relationship subsets. If an element x is not associated with any element at all (other than itself), then {x} is itself its own subset of relations.

One good problem is to define

template <typename Container, typename BinaryPredicate>
void partition (Container& container, BinaryPredicate related);

which will sort the elements of the container so that the container is split into its relationship subsets. This question came up quickly after trying the original problem:

template <typename Container, typename BinaryPredicate>
void removeRelated (Container& container, BinaryPredicate related);

which must remove (from container

) all elements from every subset of the relationship except the first from every subset found in the container.

Of course, equality is just a special case of "sibling" and hence this generalization of "remove all duplicates" (this is how I came up with this question, trying to generalize this well-known problem). We want to keep one representative of each subset of relationships, namely the first according to the ordering of the container elements.

Below is the code I was trying to implement with the following relationships:

Bob knows Mary
Mary knows Jim
Jim knows Bob
Sally knows Fred
Fred knows no one.

In this example, A refers to B if and only if A knows B. {Bob, Mary, Jim} is a subset of relationships because each person is related to someone else (Bob is related to Mary, Mary is related to Jim, Jim connected with Bob). Note that {Sally, Fred} is NOT a subset of relationships because although Sally is related to Fred, Fred is not related to Sally. Thus, we are left with {Sally}, {Fred} as the other two subsets of the relationship.

So the final answer should be: Bob, Sally, Fred. Note that if a function were defined partition

, it would simply leave (Bob, Mary, Jim, Sally, Fred) unchanged, as Bob, Sally, Fred were the first item in each section. Therefore, even if we had a function partition

(not to be confused with std :: partition, of course, but I'm not thinking about a good name right now), it is still not immediately clear what is needed removeRelated

.

#include <iostream>
#include <algorithm>
#include <list>
#include <set>

template<typename Container, typename BinaryPredicate>
void removeRelated (Container& container, BinaryPredicate related) {
    using element = typename Container::value_type;
    std::set<element> s;
    for (typename Container::iterator it = std::begin(container);  it != std::end(container); ) {
        if (std::find_if(s.begin(), s.end(),
                [=](const element& x)->bool {return related(*it,x);}) != s.end())
            it = container.erase(it);  // *it is related to an element in s.
        else {
            s.insert(*it);
            it++;
        }
    }
}

struct Person {  // Used for testing removeRelated.
    std::string name;
    std::list<Person*> peopleKnown;
    Person (const std::string& n) : name(n) {}
    void learnsAbout (Person* p) {peopleKnown.push_back(p);}
    bool knows (Person* p) const {
        return std::find(peopleKnown.begin(), peopleKnown.end(), p) != peopleKnown.end();
    }
};

int main() {
    Person *bob = new Person("Bob"), *mary = new Person("Mary"), *jim = new Person("Jim"),
        *sally = new Person("Sally"), *fred = new Person("Fred");
    bob->learnsAbout(mary);
    mary->learnsAbout(jim);
    jim->learnsAbout(bob);
    sally->learnsAbout(fred);
    std::list<Person*> everybody {bob, mary, jim, sally, fred};
    removeRelated (everybody, [](Person* a, Person* b)->bool {return a->knows(b);});
    for (const Person* x : everybody) std::cout << x->name << ' ';  // Bob Mary Sally Fred
//  Should be 'Bob Sally Fred' since {Bob, Mary, Jim}, {Sally}, {Fred} are the relation subsets.
}

The error from the above code is that Mary is inserted in 's' because at that moment she is not associated with anyone in s (via "knows"). After all, she is related to Jim, who is related to Bob (and Bob to Maria) through "knows", and therefore {Bob, Mary, Jim} is a subset of relationships, so bob must be the only one of these three people in Russia ".

But during iteration, the function doesn't know this. How can I fix the algorithm? One idea is to perhaps first define the function mentioned above partition

, i.e. sort the container so that the container is split into its relationship subsets (which is a very nice problem in itself and could very well be the main question at hand), and then just grab the first element of each section.

Another idea is to replace the lambda function

[=](const element& x)->bool {return related(*it,x);}

from

[=](const element& x)->bool {return relatedIndirectly(*it,x);}

and I think this might solve the problem where the related helper function looks directly at the chain of relationship with x.

Here's another example, reviewed by Cimbali (see below). Suppose A is related to B, A is related to C, B is related to C. Then {A, B} cannot be a subset of relations, because B is not related to A. Similarly, {A, C} and {B, C } cannot be subset relations (C is not related to A, and C is not related to B), and {A, B, C} is definitely not, since C is not related to anyone. {A}, {B}, {C} is the only partition that satisfies my definition of relationship subsets.

Hypothesis: A subset of relations is always a union of strongly connected components, so that each strongly connected component in a union has at least one element that is related to some element in another strongly connected component in the union. But this would require a mathematical proof.

Update: I have fortified the above definition of a related subset (significantly) so that: BinaryPredicate 'related' should be reflective (related (x, x) == true for any x), symmetric (related (x, y) implies related (y, x)) and transitive (connected (x, y) and connected (y, z) implies connected (x, z)). This breaks any set into equivalence classes. removeRelated removes all elements from each equivalence class except for the first element in the container. This generalizes the classic "remove all duplicates" problem, since equality is a special case of an equivalence relation. The following code now gives the correct results, but I am wondering if there is a way to relax the "sibling" condition and get the same results.

#include <iostream>
#include <algorithm>
#include <list>
#include <set>

template<typename Container, typename BinaryPredicate>
void removeRelated (Container& container, BinaryPredicate related) {
    using element = typename Container::value_type;
    std::set<element> s;
    for (typename Container::iterator it = std::begin(container);  it != std::end(container); ) {
        if (std::find_if(s.begin(), s.end(),
                [=](const element& x)->bool {return related(*it,x);}) != s.end())
            it = container.erase(it);  // *it is related to an element in s.
        else {
            s.insert(*it);
            it++;
        }
    }
}

// Used for testing removeRelated.  Person::isRelativeOf is an equivalence relation.
struct Person {
    std::string name;
    std::list<Person*> relatives;
    Person (const std::string& n) : name(n) {relatives.push_back(this);}  // Forcing reflexivity
    void addRelative (Person* p) {
        for (Person* relatives_of_p : p->relatives)
            relatives.push_back(relatives_of_p);  // Forcing transitivity ('relatives.push_back(p);' included in this)
        p->relatives.push_back(this);  // Forcing symmetry
    }
    bool isRelativeOf (Person* p) const {
        return std::find(relatives.begin(), relatives.end(), p) != relatives.end();
    }
};

int main() {
    Person *bob = new Person("Bob"), *mary = new Person("Mary"), *jim = new Person("Jim"),
        *sally = new Person("Sally"), *fred = new Person("Fred");
    bob->addRelative(mary);  // crashes
    mary->addRelative(jim);
    jim->addRelative(bob);
    sally->addRelative(fred);
    std::list<Person*> everybody {bob, mary, jim, sally, fred};
    removeRelated (everybody, [](Person* a, Person* b)->bool {return a->isRelativeOf(b);});
    for (const Person* x : everybody) std::cout << x->name << ' ';  // Bob Sally (correct)
}

If "linked" "knows" or "is a friend" that doesn't have to be symmetric or transitive, we still have partitioning, and thus removeRelated can still work?

And another question I'm asking about is what is the fastest sorting algorithm to sort above so that the equivalence classes are made up of sequential elements? This is what I came up with:

template<typename Container, typename BinaryPredicate>
void sortByEquivalenceClasses (Container& container, BinaryPredicate related) {
    for (auto it = container.begin();  it != container.end();  ++it)
        for (auto jt = std::next(it);  jt != container.end();  ++jt)
            if (related(*it, *jt)) {
                std::iter_swap(jt, std::next(it));
                break;
            }
}

But sorting does not preserve the original relative order of the elements. How to keep it?

+3

c ++ containers elements

prestokeys 01 dec. 14 at 16:59

source to share

1 answer

Cimbali · Accepted Answer · 2014-12-02T00:47:21+0000

Ok, you define a subset as follows

Let's define a "subset of relations" as the set of all elements, so that any element in the subset is associated with at least one other element in the subset through a "sibling" other than it

This sound is a bit recursive, but as I understand it,

A node n

with no outgoing relationship is in a subset that is a singleton{n}
A node n

that has an outgoing relationship for single numbers only is in a subset that is a singleton{n}
Any cycle (any length), and generally any strongly linked component forms a subset , including all of their predecessors. for a relationship.

Example. The following ratios:

A -> B
B -> A
C -> B
D -> C
E -> F

Define the following subsets: {A, B, C, D}

, {E}

and{F}

Assuming this is correct, we can develop the following algorithm (pseudocode):

int visit(Node n, int s) { // returns 0 iff there is no cycle anywhere beneath

    if(length(n.relations) = 0 || n.singleton == true)
    // leaves, or only leaves below
    {
        n.singleton = true;
        return false;
    }
    else if(n.visited == true || n.bigger_subset > 0)
    // part of a subcycle, or predecessor of one
    {
        n.bigger_subset = s;
        return true;
    }
    else
    // searching for the nature of what is below
    {
        n.visited = true;
        bool found_cycle = 0;

        for each node m such that n is related to m (n -> m)
            found_cycle = max(Visit(m), found_cycle);

        if( found_cycle > 0 )
            n.bigger_subset = found_cycle;
        else
            n.singleton = true; // parent of only singletons

        n.visited = false;
        return found_cycle;
    }
}

s = length(node_set) + 1; // clearly bigger than the maximal number of subsets
for n in node_set:
{
    if( n.singleton == false && n.big_subcycle == 0 )
    {
        // state unknown, it is thus the first of its subset, unless it is a predecessor (or part) of a known subset
        int set = visit(n, s);

        // not a singleton, and not the current set : a previous one
        if( set > s )
            node_set.remove(n);

        s--;
    }
    else
        node_set.remove(n);
}

What this does is basically a depth-first search from each element, marking the nodes that are being visited to detect loops. Keeping in mind the state of each node, any predecessor of the subset can be added to the subset without going back to loop again.

Here is the C code for this algorithm in the example above: http://ideone.com/VNumcN

Remove all duplicates generalization

More articles: