Apache Spark Dataset API: head(n: Int) vs take(n: Int)
The Apache Spark Dataset API has two methods, head(n: Int) and take(n: Int).
The Dataset.scala source contains:
def take(n: Int): Array[T] = head(n)
I couldn't find any difference in the executed code between the two functions. Why does the API have two different methods that return the same result?
In my opinion, the reason is that the Apache Spark Dataset API tries to mimic the pandas DataFrame API, which also has a head method:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html
I experimented in PySpark and found that head(n) and take(n) return exactly the same result. Both return a list of Row objects.
DF.head(2)
[Row(Transaction_date=u'1/2/2009 6:17', Product=u'Product1', Price=u'1200', Payment_Type=u'Mastercard', Name=u'carolina', City=u'Basildon', State=u'England', Country=u'United Kingdom'), Row(Transaction_date=u'1/2/2009 4:53', Product=u'Product2', Price=u'1200', Payment_Type=u'Visa', Name=u'Betina', City=u'Parkville', State=u'MO', Country=u'United States')]
DF.take(2)
[Row(Transaction_date=u'1/2/2009 6:17', Product=u'Product1', Price=u'1200', Payment_Type=u'Mastercard', Name=u'carolina', City=u'Basildon', State=u'England', Country=u'United Kingdom'), Row(Transaction_date=u'1/2/2009 4:53', Product=u'Product2', Price=u'1200', Payment_Type=u'Visa', Name=u'Betina', City=u'Parkville', State=u'MO', Country=u'United States')]
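The identical output is what you would expect when one method simply delegates to the other, as in the Scala line quoted in the question. Here is a minimal sketch of that aliasing pattern in plain Python. It does not use Spark at all: TinyDataset and its rows are hypothetical names invented for illustration, not Spark classes.

```python
from typing import Generic, List, Sequence, TypeVar

T = TypeVar("T")


class TinyDataset(Generic[T]):
    """Hypothetical stand-in for a Dataset; not part of Spark."""

    def __init__(self, rows: Sequence[T]) -> None:
        self._rows = list(rows)

    def head(self, n: int) -> List[T]:
        # Return the first n rows for n >= 0 (fewer if the dataset is shorter).
        return self._rows[:n]

    def take(self, n: int) -> List[T]:
        # Pure alias: delegate to head, mirroring
        # `def take(n: Int): Array[T] = head(n)` from the question.
        return self.head(n)


ds = TinyDataset(["row1", "row2", "row3"])
same = ds.head(2) == ds.take(2)
print(same)
```

Because take only forwards its argument to head, the two calls cannot diverge in this sketch; any difference would have to come from the method being delegated to.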