RDD.sortByKey using a function in Python?
Let's say my key is not a simple data type but a class, and I need to sort the keys using a compare function. In Scala, I can do this using new Ordering. How can I achieve the same functionality in Python? For example, what would be the Python equivalent of the following code?
implicit val someClassOrdering = new Ordering[SomeClass] {
override def compare(a: SomeClass, b: SomeClass) = a.compare(b)
}
In Python, you can define how instances of a class compare by implementing the rich comparison methods: __lt__, __le__, __eq__, __ne__, __gt__, __ge__.
You can implement these methods to compare instances of your class in whatever way you need, even in unusual ways, but it's a good idea to keep them consistent with each other if you want sorting to behave sensibly. :)
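As a rough sketch of the idea, here is a small class that defines __eq__ and __lt__ and lets functools.total_ordering derive the remaining comparisons. The Version class and its fields are made up for illustration and are not from the question; the example only uses Python's built-in sorted, not Spark.

```python
from functools import total_ordering


@total_ordering
class Version:
    """Hypothetical key class, ordered by (major, minor)."""

    def __init__(self, major, minor):
        self.major = major
        self.minor = minor

    def _as_tuple(self):
        return (self.major, self.minor)

    def __eq__(self, other):
        if not isinstance(other, Version):
            return NotImplemented
        return self._as_tuple() == other._as_tuple()

    def __lt__(self, other):
        if not isinstance(other, Version):
            return NotImplemented
        return self._as_tuple() < other._as_tuple()

    def __hash__(self):
        # Keep hashing consistent with __eq__.
        return hash(self._as_tuple())

    def __repr__(self):
        return "Version(%d, %d)" % (self.major, self.minor)


versions = [Version(1, 10), Version(1, 2), Version(0, 9)]
sorted_versions = sorted(versions)
print(sorted_versions)  # [Version(0, 9), Version(1, 2), Version(1, 10)]
```

Because the ordering lives on the class itself, anything that sorts by the natural order of the keys can use it without an extra argument.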
Here's a fairly simple example of how they are used, in this answer I wrote a month ago: Sorting a list to form the largest possible number.
Here's another nice example from Finding a Partial Match in a List of Tuples that creates a lookup object.
You can pass a keyfunc argument:
from numpy.random import seed, randint
from collections import namedtuple
Point = namedtuple('Point', ['x', 'y'])
seed(1)
# sc is an existing SparkContext, for example the one provided by the pyspark shell
rdd = sc.parallelize(
    (Point(randint(10), randint(10)), randint(100)) for _ in range(5))
Now, let's say you want to sort the points by the y coordinate:
rdd.sortByKey(keyfunc=lambda p: p.y).collect()
and the result is:
[(Point(x=5, y=0), 16),
(Point(x=9, y=2), 20),
(Point(x=5, y=2), 84),
(Point(x=1, y=7), 6),
(Point(x=5, y=8), 9)]
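To see what a key function does without a cluster, the same kind of ordering can be sketched locally with Python's built-in sorted. The pairs below reuse the values from the output above in a shuffled order, and by_y_then_x is a hypothetical helper that adds x as a tie-breaker so the result is deterministic. This only illustrates the key-function idea; it does not exercise Spark's partitioning.

```python
from collections import namedtuple

Point = namedtuple('Point', ['x', 'y'])

# Same (key, value) pairs as in the output above, in arbitrary order.
pairs = [
    (Point(5, 8), 9),
    (Point(9, 2), 20),
    (Point(1, 7), 6),
    (Point(5, 0), 16),
    (Point(5, 2), 84),
]


def by_y_then_x(p):
    """Hypothetical key function: order by y, then by x to break ties."""
    return (p.y, p.x)


# The key function is applied to the key of each pair, not to the whole pair.
local_sorted = sorted(pairs, key=lambda kv: by_y_then_x(kv[0]))
for pair in local_sorted:
    print(pair)
```

The key function returns a plain tuple, so Python's normal tuple comparison does the work and no compare function is needed.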