What's the best way to do row-by-row computations in Spark? Details below
Ok guys, I have a situation where my DataFrame has the following schema:
Customer_Id  Role_Code  Start_TimeStamp  End_Timestamp
Ray123       1          2015             2017
Kate123      --         2016             2017
I want to decide the Role_Code of a given client (say "Ray123") based on multiple conditions. Suppose his Role_Code comes out as 1. Then I process the next row, and the next client (say "Kate123") has a time period overlapping with Ray123's. She can challenge Ray123 and win Role_Code 1 from him (based on some other conditions). If she wins, then for the overlapping time period I need to set Ray123's Role_Code to 2, so that the data looks like this:
Customer_Id  Role_Code  Start_TimeStamp  End_Timestamp
Ray123       1          2015             2016
Ray123       2          2016             2017
Kate123      1          2016             2017
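To make the splitting concrete, here is a minimal plain-Python sketch (not Spark code) of what one "challenge" does to the rows; the function name `resolve_challenge` is made up, and it simply assumes the challenger wins, since the actual win conditions depend on other fields:

```python
def resolve_challenge(rows, challenger):
    """rows: list of (customer_id, role_code, start, end) tuples.
    challenger: (customer_id, start, end); assumed here to win the challenge."""
    out = []
    c_id, c_start, c_end = challenger
    for cust, role, start, end in rows:
        # A current Role_Code 1 holder whose period overlaps the challenger's
        if role == 1 and start < c_end and c_start < end:
            # Keep the non-overlapping prefix at Role_Code 1...
            if start < c_start:
                out.append((cust, 1, start, c_start))
            # ...and demote the overlapping part to Role_Code 2.
            out.append((cust, 2, max(start, c_start), end))
        else:
            out.append((cust, role, start, end))
    # The winning challenger takes Role_Code 1 for her whole period.
    out.append((c_id, 1, c_start, c_end))
    return out

print(resolve_challenge([("Ray123", 1, 2015, 2017)], ("Kate123", 2016, 2017)))
# [('Ray123', 1, 2015, 2016), ('Ray123', 2, 2016, 2017), ('Kate123', 1, 2016, 2017)]
```

In Spark I currently express each such step with filters, comparisons and unions over the DataFrame.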
There are several similar steps where I need to go back and forth, select rows, compare timestamps and some other fields, then take unions and so on, to get the final DataFrame with the correct set of clients and the correct role codes. The problem is that the solution works fine for 5-6 rows, but if I run it on, say, 70 rows, YARN kills the container: the job always runs out of memory. I don't see how to solve this without actions like head(), first(), etc. to process each row in turn and then split rows as needed. It seems like some other framework might be better suited for this. I'd be grateful for any suggestions!