Applying a mapping function to a DataFrame

I just started using Databricks / PySpark (Python, Spark 2.1). I have loaded data into a table that consists of a single column of rows, and I want to apply a mapping function to every item in that column. I load the table into a DataFrame:

df = spark.table("mynewtable")

The only approach I could find was from others talking about converting the DataFrame to an RDD, applying the mapping function, and then converting back to a DataFrame to show the data. But this attempt fails with a failed stage:

df2 = df.select("_c0").rdd.flatMap(lambda x: x.append("anything")).toDF()

All I want to do is apply some function to my data in the table, for example add something to each row in a column, or do a character split, and then put the result back in a DataFrame so I can .show() or display it.



1 answer


You cannot:

  • Use flatMap, because it will flatten the Row (see the sketch after this list).

  • Use append, because:

    • tuple and Row have no append method.
    • append (where it does exist, e.g. on list) is executed for its side effect and returns None.

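A minimal sketch illustrating both points, assuming an active SparkSession named spark and a single string column _c0 as in the question:

rows = spark.createDataFrame([("a",), ("b",)], ["_c0"]).rdd

# flatMap unpacks each Row into its individual fields, so the row structure is lost:
rows.flatMap(lambda x: x).collect()    # ['a', 'b']

# Row and tuple have no append at all; list.append mutates in place and returns None:
[1, 2].append(3)
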
I would use withColumn:

df.withColumn("foo", lit("anything"))

      
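
For the "char split" case from the question, the built-in split function works the same way (the comma delimiter is just an illustrative assumption):

from pyspark.sql.functions import split

# Split each value of _c0 on a comma and keep the result as a new array column.
df.withColumn("parts", split("_c0", ",")).show()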

but map should also work:



df.select("_c0").rdd.flatMap(lambda x: x + ("anything", )).toDF()

      
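
Note that toDF() without arguments gives auto-generated column names (_1, _2); you can pass names explicitly if you want something readable (the names below are just placeholders):

df.select("_c0").rdd.map(lambda x: x + ("anything", )).toDF(["_c0", "foo"]).show()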

Edit (in response to a comment):

You probably want a udf:

from pyspark.sql.functions import udf

def iplookup(s):
    return ... # Some lookup logic

iplookup_udf = udf(iplookup)

df.withColumn("foo", iplookup_udf("c0"))

      

The default return type is StringType, so if you want something else, you should adjust it.
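
A minimal sketch of declaring a different return type, assuming the lookup returns an integer (the type is just an example):

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Pass the return type as the second argument when the udf does not return a string.
iplookup_udf = udf(iplookup, IntegerType())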
