Random selection and removal in constant time

I am trying to implement an edge list for a MultiGraph in Python.

What I've tried so far:

>>> from collections import Counter
>>> l1 = Counter({(1, 2): 2, (1, 3): 1})
>>> l2 = [(1, 2), (1, 2), (1, 3)]

      

l1 supports removing all edges between two vertices in constant time (for example, del l1[(1, 2)]), but random selection over those edges takes linear time (for example, random.choice(list(l1.elements()))). Note that the selection has to be made over l1.elements() (as opposed to l1 itself).

l2 supports random selection in constant time (random.choice(l2)), but removing all elements equal to a given edge takes linear time ([i for i in l2 if i != (1, 2)]).
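
Putting the two tradeoffs side by side (a quick sketch using the same data as above):

import random
from collections import Counter

l1 = Counter({(1, 2): 2, (1, 3): 1})
l2 = [(1, 2), (1, 2), (1, 3)]

# Counter: random selection over the multiset is O(n)...
edge = random.choice(list(l1.elements()))
# ...but removing every copy of an edge is O(1).
del l1[(1, 2)]

# List: random selection is O(1)...
edge = random.choice(l2)
# ...but removing every copy of an edge is O(n).
l2 = [e for e in l2 if e != (1, 2)]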

Question: is there a Python data structure that gives me both random selection and removal in constant time?



1 answer


I don't think what you're trying to do is possible, even in theory.

If you are using weighted values to represent duplicates, you can't get constant-time random selection. The best you could do is some kind of skip-list-type structure that lets you binary-search for an element by weight index, which is logarithmic.
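
For illustration, here is a minimal sketch (my own, not part of the answer) of the select-by-weight-index idea, using bisect over a prefix sum of the counts. The selection itself is logarithmic; a real skip-list-style structure would also keep the weight updates logarithmic instead of rebuilding the prefix sums:

import bisect
import random
from collections import Counter

def weighted_pick(counter):
    # Build cumulative counts; this is O(n), so it only shows the lookup half.
    items = list(counter.items())
    cumulative = []
    total = 0
    for _, count in items:
        total += count
        cumulative.append(total)
    # Draw a "weight index" in [0, total) and binary-search for its bucket.
    r = random.randrange(total)
    return items[bisect.bisect_right(cumulative, r)][0]

l1 = Counter({(1, 2): 2, (1, 3): 1})
print(weighted_pick(l1))   # (1, 2) about 2/3 of the time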

If you are not using weighted values to represent duplicates, you need a structure that can store multiple copies. A hash table won't do this on its own: the duplicates have to be independent objects (for example (edge, autoincrement)), which means there is no way to delete everything that matches some criterion in constant time.
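
To make that last point concrete (again my own sketch, not part of the answer): if each copy is stored under its own (edge, autoincrement) key, removing one known copy is O(1), but removing every copy of an edge is a criterion match and still means scanning all the keys:

from itertools import count

seq = count()
edges = set()

def add(edge):
    # Give each duplicate its own identity: (edge, autoincrement).
    edges.add((edge, next(seq)))

def remove_all(edge):
    # Constant-time hashing doesn't help here: we have to scan for every match.
    for item in [item for item in edges if item[0] == edge]:
        edges.remove(item)

add((1, 2)); add((1, 2)); add((1, 3))
remove_all((1, 2))
print(edges)   # only the (1, 3) copy is left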

If you can accept logarithmic time, a tree is the obvious choice. For example, using blist:

>>> import blist
>>> l3 = blist.sortedlist(l2)

      

To pick one at random:

>>> edge = random.choice(l3)

      

The documentation doesn't seem to guarantee that this won't do something O(n). But fortunately, the source for both 3.3 and 2.7 shows that it does the right thing. If you don't trust it, just write l3[random.randrange(len(l3))].

To remove all copies of an edge, you can do it like this:

>>> del l3[l3.bisect_left(edge):l3.bisect_right(edge)]

      

Or:

>>> try:
...     while True:
...         l3.remove(edge)
... except ValueError:
...     pass

      



The documentation explains the exact performance guarantees for each operation. In particular, len is constant, while indexing, slicing, deleting by index or slice, bisecting, and deleting by value are all logarithmic, so both operations end up logarithmic.

(It's worth noting that blist is a B+tree; you might get better performance out of a red-black tree, or a treap, or something else. You can find good implementations of most data structures on PyPI.)


As senderle pointed out, if the maximum number of copies of an edge is much smaller than the size of the collection, you can create a data structure that does this in time quadratic in the maximum number of copies. Translating his suggestion into code:

import random
from collections import defaultdict

class MGraph(object):
    def __init__(self):
        self.edgelist = []                 # flat list of edge copies, for O(1) random choice
        self.edgedict = defaultdict(list)  # edge -> indices of its copies in edgelist
    def add(self, edge):
        self.edgedict[edge].append(len(self.edgelist))
        self.edgelist.append(edge)
    def remove(self, edge):
        # Swap each copy with the current last element and pop it off the end.
        # Going through the indices in descending order keeps earlier swaps
        # from invalidating the indices we haven't handled yet.
        for index in sorted(self.edgedict.get(edge, []), reverse=True):
            maxedge = len(self.edgelist) - 1
            lastedge = self.edgelist[maxedge]
            self.edgelist[index], self.edgelist[maxedge] = self.edgelist[maxedge], self.edgelist[index]
            if lastedge != edge:
                # The element that lived at maxedge is now at index.
                self.edgedict[lastedge] = [index if i == maxedge else i
                                           for i in self.edgedict[lastedge]]
            del self.edgelist[-1]
        self.edgedict.pop(edge, None)
    def choice(self):
        return random.choice(self.edgelist)

      

(You could, of course, replace the list-comprehension line with a three-line search-and-update in place, but that is still linear in the number of duplicates.)
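
A quick usage sketch (my addition, assuming the class above):

g = MGraph()
for e in [(1, 2), (1, 2), (1, 3)]:
    g.add(e)

print(g.choice())   # (1, 2) with probability 2/3, (1, 3) with probability 1/3
g.remove((1, 2))    # removes both copies in one call
print(g.edgelist)   # [(1, 3)]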

Obviously, if you plan on using this for real, you may want to beef up the class a bit. You can make it look like a list of edges, a set of tuples holding multiple copies of each edge, a Counter, etc., by implementing a few methods and letting the appropriate collections.abc.Foo / collections.Foo mixins fill in the rest.


So, which is better? Well, in your sample example, the average number of duplicates is half the size of the list, and the maximum is 2/3 of it. If that were true of your actual data, the tree would be much, much better, because log N obviously blows away (N/2)**2. On the other hand, if duplicates were rare, senderle's solution would obviously be better, because W**2 is still 1 if W is 1.

Of course, for a three-element sample, constant overhead and multipliers are going to dominate everything. But presumably your real collection isn't that tiny. (If it is, just use a list...)

If you don't know how to characterize your real data, write both implementations and time them with different realistic inputs.
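
For example, a rough benchmarking sketch along those lines (my own, not from the answer; it assumes blist is installed, the MGraph class above is defined, and the synthetic edge distribution is arbitrary):

import random
import timeit

import blist

def bench(edges, repeat=1000):
    # Tree-based version: remove all copies of a random edge, then put them back.
    l3 = blist.sortedlist(edges)
    def tree_ops():
        e = random.choice(l3)
        lo, hi = l3.bisect_left(e), l3.bisect_right(e)
        copies = hi - lo
        del l3[lo:hi]
        for _ in range(copies):
            l3.add(e)
    # Swap-with-last version: the same workload on MGraph.
    g = MGraph()
    for e in edges:
        g.add(e)
    def mgraph_ops():
        e = g.choice()
        copies = len(g.edgedict[e])
        g.remove(e)
        for _ in range(copies):
            g.add(e)
    print('sortedlist:', timeit.timeit(tree_ops, number=repeat))
    print('MGraph:    ', timeit.timeit(mgraph_ops, number=repeat))

# 10000 edges over 100 vertices, so each edge has only a handful of duplicates.
edges = [(random.randrange(100), random.randrange(100)) for _ in range(10000)]
bench(edges)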
