PySpark 1.6.2 | collect () after orderBy / sort
I don't understand the behavior of this simple piece of PySpark code:
# Create simple test dataframe
l = [('Alice', 1),('Pierre', 3),('Jack', 5), ('Paul', 2)]
df_test = sqlcontext.createDataFrame(l, ['name', 'age'])
# Perform filter then Take 2 oldest
df_test = df_test.sort('age', ascending=False)\
.filter('age < 4') \
.limit(2)
df_test.show(2)
# This outputs as expected :
# +------+---+
# | name|age|
# +------+---+
# |Pierre| 3|
# | Paul| 2|
# +------+---+
df_test.collect()
# This outputs unexpectedly :
# [Row(name=u'Pierre', age=3), Row(name=u'Alice', age=1)]
Is this the expected behavior of collect ()? How can I get my column as a list that preserves the correct order?
thank
+3
source to share