How do I get the length of the lists in a column in PySpark?

I have a DataFrame whose "products" column contains lists, as shown below:

+----------+---------+--------------------+
|member_srl|click_day|            products|
+----------+---------+--------------------+
|        12| 20161223|  [2407, 5400021771]|
|        12| 20161226|        [7320, 2407]|
|        12| 20170104|              [2407]|
|        12| 20170106|              [2407]|
|        27| 20170104|        [2405, 2407]|
|        28| 20161212|              [2407]|
|        28| 20161213|      [2407, 100093]|
|        28| 20161215|           [1956119]|
|        28| 20161219|      [2407, 100093]|
|        28| 20161229|           [7905970]|
|       124| 20161011|        [5400021771]|
|      6963| 20160101|         [103825645]|
|      6963| 20160104|[3000014912, 6626...|
|      6963| 20160111|[99643224, 106032...|
+----------+---------+--------------------+


How do I add a new column product_cnt that holds the length of the products list? And how do I filter the DataFrame to keep only the rows whose products list has a given length? Thanks.



2 answers


PySpark has a built-in function that does exactly what you want, called size: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.size . To add it as a column, you can simply call it in your select:

from pyspark.sql.functions import size

# add a product_cnt column holding the length of each products list
countdf = df.select('*', size('products').alias('product_cnt'))
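
On the sample data above, the first rows of countdf.show() would look like this (a sketch worked out by hand from the lists shown, not actual program output):

+----------+---------+--------------------+-----------+
|member_srl|click_day|            products|product_cnt|
+----------+---------+--------------------+-----------+
|        12| 20161223|  [2407, 5400021771]|          2|
|        12| 20161226|        [7320, 2407]|          2|
|        12| 20170104|              [2407]|          1|
+----------+---------+--------------------+-----------+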




Filtering works exactly as @titiro89 described. Alternatively, you can use size directly in the filter, which lets you skip adding the extra column (if you don't need it):

filterdf = df.filter(size('products') == given_products_length)
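
If you want both the count column and the filtered rows in one pass, the two approaches chain naturally (a minimal sketch, assuming given_products_length is defined as above):

from pyspark.sql.functions import size

# build the count column, then keep only rows with the desired list length
countdf = df.select('*', size('products').alias('product_cnt'))
filterdf = countdf.filter(countdf.product_cnt == given_products_length)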




First question:

How do I add a new column product_cnt that is the length of the products list?

>>> a = [(12, 20161223, [2407, 5400021771]),
...      (12, 20161226, [7320, 2407, 4344])]
>>> df = spark.createDataFrame(a,
...     ["member_srl", "click_day", "products"])
>>> df.show()
+----------+---------+------------------+
|member_srl|click_day|          products|
+----------+---------+------------------+
|        12| 20161223|[2407, 5400021771]|
|        12| 20161226|[7320, 2407, 4344]|
+----------+---------+------------------+


You can find a similar example in the pyspark.sql.functions.udf documentation:

>>> from pyspark.sql.types import IntegerType
>>> from pyspark.sql.functions import udf

>>> # UDF that returns the length of its list argument
>>> slen = udf(lambda s: len(s), IntegerType())

>>> df2 = df.withColumn("product_cnt", slen(df.products))
>>> df2.show()
+----------+---------+------------------+-----------+
|member_srl|click_day|          products|product_cnt|
+----------+---------+------------------+-----------+
|        12| 20161223|[2407, 5400021771]|          2|
|        12| 20161226|[7320, 2407, 4344]|          3|
+----------+---------+------------------+-----------+
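
Note that a Python UDF incurs per-row serialization overhead; the built-in size function from the first answer produces the same column natively (a sketch of the equivalent call):

>>> from pyspark.sql.functions import size
>>> df2 = df.withColumn("product_cnt", size(df.products))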




Second question:

And how do I filter the DataFrame to keep the rows whose products list has a given length?

You can use the filter function (documented under pyspark.sql.DataFrame.filter):

>>> givenLength = 2
>>> df3 = df2.filter(df2.product_cnt == givenLength)
>>> df3.show()
+----------+---------+------------------+-----------+
|member_srl|click_day|          products|product_cnt|
+----------+---------+------------------+-----------+
|        12| 20161223|[2407, 5400021771]|          2|
+----------+---------+------------------+-----------+



