What's the best way to do row-by-row computations in Spark? Details below

Ok guys, I have a situation where my DataFrame has the following schema:

Customer_Id  Role_Code  Start_TimeStamp  End_Timestamp
Ray123       1          2015             2017
Kate123      --         2016             2017

I want to decide the Role_Code of a given client (say "Ray123") based on multiple conditions. Say his Role_Code starts out as 1. When I process the next row, the next client (say "Kate123") has a time period overlapping Ray123's, so she can challenge Ray123 and win Role_Code 1 from him (based on some other conditions). If she wins, then for the overlapping period I need to set Ray123's Role_Code to 2, so that the data looks like this:

Customer_Id  Role_Code  Start_TimeStamp  End_Timestamp
Ray123       1          2015             2016
Ray123       2          2016             2017
Kate123      1          2016             2017
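To make the splitting rule concrete, here is roughly what I mean in plain Python (not Spark code; the field names and the win condition are simplified placeholders, and I only handle the case shown above):

```python
# Illustrative sketch of the interval-splitting rule (plain Python, not Spark).
# "incumbent" currently holds Role_Code 1; "challenger" has already won the
# overlap based on my other conditions (not shown here).
def resolve_overlap(incumbent, challenger):
    rows = []
    # The part of the incumbent's interval before the overlap keeps Role_Code 1.
    if incumbent["start"] < challenger["start"]:
        rows.append({**incumbent, "end": challenger["start"]})
    # The overlapped part of the incumbent's interval is demoted to Role_Code 2.
    rows.append({**incumbent, "role": 2,
                 "start": max(incumbent["start"], challenger["start"])})
    # The challenger takes Role_Code 1 over her whole interval.
    rows.append({**challenger, "role": 1})
    return rows

ray = {"customer": "Ray123", "role": 1, "start": 2015, "end": 2017}
kate = {"customer": "Kate123", "role": None, "start": 2016, "end": 2017}
# resolve_overlap(ray, kate) yields the three rows shown in the table above.
```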

      

There are more cases like this where I need to go back and forth between rows, compare timestamps and some other fields, take unions, and so on, to get the final DataFrame with the correct set of clients and the correct role codes. The problem is that the solution works fine for 5-6 rows, but with e.g. 70 rows the YARN container kills the job; it always runs out of memory. I don't see how to solve this without lots of actions like head(), first(), etc. to process each row and then split rows, and that seems inefficient. It looks like some other framework might be better suited for this. I'd be grateful for any suggestions!
