How do I get the length of the lists in a column in PySpark?

I have a DataFrame whose "products" column contains lists, as shown below:

+----------+---------+--------------------+
|member_srl|click_day|            products|
+----------+---------+--------------------+
|        12| 20161223|  [2407, 5400021771]|
|        12| 20161226|        [7320, 2407]|
|        12| 20170104|              [2407]|
|        12| 20170106|              [2407]|
|        27| 20170104|        [2405, 2407]|
|        28| 20161212|              [2407]|
|        28| 20161213|      [2407, 100093]|
|        28| 20161215|           [1956119]|
|        28| 20161219|      [2407, 100093]|
|        28| 20161229|           [7905970]|
|       124| 20161011|        [5400021771]|
|      6963| 20160101|         [103825645]|
|      6963| 20160104|[3000014912, 6626...|
|      6963| 20160111|[99643224, 106032...|
+----------+---------+--------------------+


How do I add a new column product_cnt that holds the length of the products list? And how do I filter the DataFrame to keep only the rows whose products list has a given length? Thanks.



2 answers


PySpark has a built-in function that does exactly what you want, called size: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.size . To add it as a column, you can simply call it in your select:

from pyspark.sql.functions import size

# add a product_cnt column holding the length of each products list
countdf = df.select('*', size('products').alias('product_cnt'))
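
On the sample data above, the first rows of countdf.show() would look like this (a sketch worked out by hand from the lists shown, not actual program output):

+----------+---------+--------------------+-----------+
|member_srl|click_day|            products|product_cnt|
+----------+---------+--------------------+-----------+
|        12| 20161223|  [2407, 5400021771]|          2|
|        12| 20161226|        [7320, 2407]|          2|
|        12| 20170104|              [2407]|          1|
+----------+---------+--------------------+-----------+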




Filtering works exactly as @titiro89 described. Alternatively, you can use size directly in the filter, which lets you skip adding the extra column (if you don't need it):

filterdf = df.filter(size('products') == given_products_length)
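
If you want both the count column and the filtered rows in one pass, the two approaches chain naturally (a minimal sketch, assuming given_products_length is defined as above):

from pyspark.sql.functions import size

# build the count column, then keep only rows with the desired list length
countdf = df.select('*', size('products').alias('product_cnt'))
filterdf = countdf.filter(countdf.product_cnt == given_products_length)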




First question:

How do I add a new column product_cnt that is the length of the products list?

>>> a = [(12, 20161223, [2407, 5400021771]),
...      (12, 20161226, [7320, 2407, 4344])]
>>> df = spark.createDataFrame(a,
...     ["member_srl", "click_day", "products"])
>>> df.show()
+----------+---------+------------------+
|member_srl|click_day|          products|
+----------+---------+------------------+
|        12| 20161223|[2407, 5400021771]|
|        12| 20161226|[7320, 2407, 4344]|
+----------+---------+------------------+


You can find a similar example in the pyspark.sql.functions.udf documentation:

>>> from pyspark.sql.types import IntegerType
>>> from pyspark.sql.functions import udf

>>> # UDF that returns the length of its list argument
>>> slen = udf(lambda s: len(s), IntegerType())

>>> df2 = df.withColumn("product_cnt", slen(df.products))
>>> df2.show()
+----------+---------+------------------+-----------+
|member_srl|click_day|          products|product_cnt|
+----------+---------+------------------+-----------+
|        12| 20161223|[2407, 5400021771]|          2|
|        12| 20161226|[7320, 2407, 4344]|          3|
+----------+---------+------------------+-----------+
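
Note that a Python UDF incurs per-row serialization overhead; the built-in size function from the first answer produces the same column natively (a sketch of the equivalent call):

>>> from pyspark.sql.functions import size
>>> df2 = df.withColumn("product_cnt", size(df.products))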




Second question:

And how do I filter the DataFrame to keep the rows whose products list has a given length?

You can use the filter function (documented under pyspark.sql.DataFrame.filter):

>>> givenLength = 2
>>> df3 = df2.filter(df2.product_cnt == givenLength)
>>> df3.show()
+----------+---------+------------------+-----------+
|member_srl|click_day|          products|product_cnt|
+----------+---------+------------------+-----------+
|        12| 20161223|[2407, 5400021771]|          2|
+----------+---------+------------------+-----------+



