Applying a mapping function to a DataFrame
I just started using Databricks / PySpark. I'm using Python / Spark 2.1. I have loaded data into a table; the table consists of a single column of string rows. I want to apply a mapping function to every item in that column. I load the table into a DataFrame:
df = spark.table("mynewtable")
The only approach I could find in other discussions is converting it to an RDD, applying the mapping function, and then converting back to a DataFrame to show the data. But this results in a failed stage:
df2 = df.select("_c0").rdd.flatMap(lambda x: x.append("anything")).toDF()
All I want to do is apply some transformation to the data in the table, for example add something to each value in a column, or do a character split, and then put it back in a DataFrame so I can .show() or display it.
You cannot:

- use flatMap, because it will flatten the Row
- use append, because:
  - tuple and Row have no append method
  - append (where it does exist, e.g. on list) is executed for its side effect and returns None

(Both points are demonstrated in the short sketch below.)
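A minimal demonstration of both points (plain Python / PySpark; the value is just a placeholder):

from pyspark.sql import Row

row = Row(_c0="1.2.3.4")

# Iterating a Row yields its individual fields, which is exactly what
# flatMap does with whatever the lambda returns.
print(list(row))                       # ['1.2.3.4']

# Row, like tuple, has no append method at all.
print(hasattr(tuple(row), "append"))   # False

# On a plain list, append mutates in place and returns None, so the
# lambda in the question would emit None for every record.
values = list(row)
print(values.append("anything"))       # None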
I would use withColumn (lit comes from pyspark.sql.functions):

from pyspark.sql.functions import lit

df.withColumn("foo", lit("anything"))
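As a usage sketch (assuming the single string column is named _c0, as in the question):

from pyspark.sql.functions import lit

df = spark.table("mynewtable")

# Adds a constant column "foo" alongside the existing "_c0" column,
# without going through the RDD API at all.
df.withColumn("foo", lit("anything")).show()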
but map should work as well:

df.select("_c0").rdd.map(lambda x: x + ("anything", )).toDF()
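Note that toDF() without arguments infers generic column names (_1, _2, ...). A sketch of the full round trip, passing the names explicitly (the name "foo" is only an example):

# Each Row behaves like a tuple, so concatenating a one-element tuple
# yields (original_value, "anything") per record.
rdd2 = df.select("_c0").rdd.map(lambda x: x + ("anything",))

# Supplying names avoids the auto-generated _1/_2 columns.
df2 = rdd2.toDF(["_c0", "foo"])
df2.show()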
Edit (in response to a comment):
You probably want udf:

from pyspark.sql.functions import udf

def iplookup(s):
    return ...  # Some lookup logic

iplookup_udf = udf(iplookup)

df.withColumn("foo", iplookup_udf("_c0"))
The default return type is StringType, so if you want something else you should adjust it.
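For instance (a hedged sketch; the char-split lambda and the "chars" column name are only illustrations), a UDF returning something other than a string needs the return type passed explicitly:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# Hypothetical example: split each value into characters, so the UDF
# must declare that it returns an array of strings.
charsplit_udf = udf(lambda s: list(s) if s is not None else None,
                    ArrayType(StringType()))

df.withColumn("chars", charsplit_udf("_c0")).show()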