Applying a mapping function to a DataFrame
I just started using Databricks / PySpark. I'm using Python / Spark 2.1. I have loaded data into a table; the table consists of a single column of string rows. I want to apply a mapping function to every item in that column. I load the table into a DataFrame:
df = spark.table("mynewtable")
The only approach I could find in other discussions is converting it to an RDD, applying the mapping function, and then converting back to a DataFrame to show the data. But this results in a failed stage:
df2 = df.select("_c0").rdd.flatMap(lambda x: x.append("anything")).toDF()
All I want to do is apply some transformation to the data in the table, for example add something to each value in a column, or do a character split, and then put it back in a DataFrame so I can .show() or display it.
You cannot:

- use flatMap, because it will flatten the Row
- use append, because:
  - tuple and Row have no append method
  - append (where it does exist, e.g. on list) is executed for its side effect and returns None

(Both points are demonstrated in the short sketch below.)
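A minimal demonstration of both points (plain Python / PySpark; the value is just a placeholder):

from pyspark.sql import Row

row = Row(_c0="1.2.3.4")

# Iterating a Row yields its individual fields, which is exactly what
# flatMap does with whatever the lambda returns.
print(list(row))                       # ['1.2.3.4']

# Row, like tuple, has no append method at all.
print(hasattr(tuple(row), "append"))   # False

# On a plain list, append mutates in place and returns None, so the
# lambda in the question would emit None for every record.
values = list(row)
print(values.append("anything"))       # None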
I would use withColumn (lit comes from pyspark.sql.functions):

from pyspark.sql.functions import lit

df.withColumn("foo", lit("anything"))
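As a usage sketch (assuming the single string column is named _c0, as in the question):

from pyspark.sql.functions import lit

df = spark.table("mynewtable")

# Adds a constant column "foo" alongside the existing "_c0" column,
# without going through the RDD API at all.
df.withColumn("foo", lit("anything")).show()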
but map should work as well:

df.select("_c0").rdd.map(lambda x: x + ("anything", )).toDF()
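Note that toDF() without arguments infers generic column names (_1, _2, ...). A sketch of the full round trip, passing the names explicitly (the name "foo" is only an example):

# Each Row behaves like a tuple, so concatenating a one-element tuple
# yields (original_value, "anything") per record.
rdd2 = df.select("_c0").rdd.map(lambda x: x + ("anything",))

# Supplying names avoids the auto-generated _1/_2 columns.
df2 = rdd2.toDF(["_c0", "foo"])
df2.show()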
Edit (in response to a comment):
You probably want udf:

from pyspark.sql.functions import udf

def iplookup(s):
    return ...  # Some lookup logic

iplookup_udf = udf(iplookup)

df.withColumn("foo", iplookup_udf("_c0"))
The default return type is StringType, so if you want something else you should adjust it.
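For instance (a hedged sketch; the char-split lambda and the "chars" column name are only illustrations), a UDF returning something other than a string needs the return type passed explicitly:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# Hypothetical example: split each value into characters, so the UDF
# must declare that it returns an array of strings.
charsplit_udf = udf(lambda s: list(s) if s is not None else None,
                    ArrayType(StringType()))

df.withColumn("chars", charsplit_udf("_c0")).show()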