Apache Spark Dataset API: head(n: Int) vs take(n: Int)

The Apache Spark Dataset API has two methods, head(n: Int) and take(n: Int).

The Dataset.scala source contains:

    def take(n: Int): Array[T] = head(n)

I couldn't find any difference in execution between the two functions. Why does the API have two different methods to get the same result?
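
For example, in a quick local spark-shell check (the tiny Dataset below is only an illustration), both calls come back with the same array:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Small Dataset[String] just to compare the two methods.
    val ds = Seq("a", "b", "c", "d").toDS()

    val byHead = ds.head(2) // Array("a", "b")
    val byTake = ds.take(2) // Array("a", "b")

    byHead.sameElements(byTake) // true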



3 answers


I think this is because the Spark developers tend to provide a rich API; there are also two methods, where and filter, that do exactly the same thing (see the sketch below).
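
For illustration, here is a minimal spark-shell sketch (the toy Dataset and the default column name value are just assumptions for the example). In Dataset.scala, where(condition) is defined as a plain call to filter(condition), so both expressions build the same plan and return the same rows:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Toy Dataset[Int]; the encoder names its single column "value".
    val nums = Seq(1, 2, 3, 4, 5).toDS()

    // where(condition) delegates to filter(condition) in Dataset.scala,
    // so these two calls are interchangeable.
    val viaFilter = nums.filter($"value" > 2).collect() // Array(3, 4, 5)
    val viaWhere  = nums.where($"value" > 2).collect()  // Array(3, 4, 5)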





The reason, in my opinion, is that the Apache Spark Dataset API is trying to mimic the pandas DataFrame API, which also has a head method: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html





I experimented and found that head(n) and take(n) give exactly the same result. Both return a list of Row objects.

    DF.head(2)

    [Row(Transaction_date=u'1/2/2009 6:17', Product=u'Product1', Price=u'1200', Payment_Type=u'Mastercard', Name=u'carolina', City=u'Basildon', State=u'England', Country=u'United Kingdom'), Row(Transaction_date=u'1/2/2009 4:53', Product=u'Product2', Price=u'1200', Payment_Type=u'Visa', Name=u'Betina', City=u'Parkville', State=u'MO', Country=u'United States')]

    DF.take(2)

    [Row(Transaction_date=u'1/2/2009 6:17', Product=u'Product1', Price=u'1200', Payment_Type=u'Mastercard', Name=u'carolina', City=u'Basildon', State=u'England', Country=u'United Kingdom'), Row(Transaction_date=u'1/2/2009 4:53', Product=u'Product2', Price=u'1200', Payment_Type=u'Visa', Name=u'Betina', City=u'Parkville', State=u'MO', Country=u'United States')]
