How to generate a UUID in MapReduce?

I want to write a MapReduce Java program where I need to generate a UUID for each record of a dataset in a CSV/TXT file. The data is customer data with a set of rows and columns. The CSV input is in an HDFS directory.

I just need to generate the UUIDs using MapReduce. I have an input file that has columns a, b and c and 5 lines. I want a column d with a UUID in each of the 5 rows, i.e. 5 different UUIDs.

How can I do this?

Here is the code for the Mapper class:

import java.io.IOException;
import java.util.UUID;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Key/value types assume KeyValueTextInputFormat (Text key, Text value).
public class MapRed_Mapper extends Mapper<Text, Text, Text, Text> {

    @Override
    public void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit a freshly generated UUID for every input record.
        Text uuid = new Text(UUID.randomUUID().toString());
        context.write(key, uuid);
    }
}

2 Answers


  • Approach using MapReduce (Java)

1) Read the lines from the text file in the map method of your Mapper class.

2) Add the UUID as an extra column in the reduce method, as shown below (use a single reducer so your CSV gets the UUID appended as an extra column).

3) Emit it through context.write (a sketch of all three steps follows).
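
A minimal sketch of these three steps, assuming plain TextInputFormat (byte offset as key); the class names and the pass-through key scheme are illustrative choices, not part of the original answer:

import java.io.IOException;
import java.util.UUID;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Step 1: read each line; keep the byte offset as key so duplicate lines stay distinct.
public class PassThroughMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    public void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        context.write(offset, line);
    }
}

// Steps 2 and 3: append a fresh UUID as column d and emit it via context.write.
// TextOutputFormat joins key and value with the configured separator.
public class UuidReducer extends Reducer<LongWritable, Text, Text, Text> {
    @Override
    public void reduce(LongWritable offset, Iterable<Text> lines, Context context)
            throws IOException, InterruptedException {
        for (Text line : lines) {
            context.write(new Text(line), new Text(UUID.randomUUID().toString()));
        }
    }
}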

java.util.UUID has been available since JDK 5. It can generate a random UUID (universally unique identifier). To get the string value of the generated UUID, call uuid.toString():

    UUID uuid = UUID.randomUUID();
    String randomUUIDString = uuid.toString();

    System.out.println("Random UUID String = " + randomUUIDString);
    // System.out.println("UUID version       = " + uuid.version());
    // System.out.println("UUID variant       = " + uuid.variant());

For CSV generation, use TextOutputFormat. The default key/value separator is a tab character. Change the delimiter by setting the property mapreduce.output.textoutputformat.separator (mapred.textoutputformat.separator in the old API) in your driver:

conf.set("mapreduce.output.textoutputformat.separator", ",");
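
Putting it together, a driver could look like the sketch below (it assumes the PassThroughMapper and UuidReducer classes sketched above; input and output paths come from the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class UuidDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Emit comma-separated output instead of the default tab.
        conf.set("mapreduce.output.textoutputformat.separator", ",");

        Job job = Job.getInstance(conf, "append-uuid-column");
        job.setJarByClass(UuidDriver.class);
        job.setMapperClass(PassThroughMapper.class);
        job.setReducerClass(UuidReducer.class);
        job.setNumReduceTasks(1); // a single reducer, as suggested above
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}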


  • Alternative approach (since you added the spark tag, I thought of pointing out the pointer below):

There is an existing answer on SO already; please see:

add-a-new-column-to-a-dataframe-new-column-i-want-it-to-be-a-uuid-generator

Then you can do the following to convert to CSV format:

df.write.format("com.databricks.spark.csv").save(filepath)
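
For reference, here is a rough sketch of the linked approach in Spark's Java API (this assumes Spark 2.3+, where UDF0 and the built-in CSV writer are available; the UDF name and the file paths are made up for illustration):

import java.util.UUID;

import static org.apache.spark.sql.functions.callUDF;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF0;
import org.apache.spark.sql.types.DataTypes;

public class UuidColumn {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("uuid-column").getOrCreate();

        // Register a zero-argument UDF that returns a fresh UUID string for each row.
        spark.udf().register("makeUuid",
                (UDF0<String>) () -> UUID.randomUUID().toString(),
                DataTypes.StringType);

        Dataset<Row> df = spark.read().option("header", "true").csv("/input/customers.csv");
        Dataset<Row> withUuid = df.withColumn("d", callUDF("makeUuid"));

        // Spark 2+ writes CSV natively, so com.databricks.spark.csv is not needed.
        withUuid.write().option("header", "true").csv("/output/customers_with_uuid");
    }
}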




Maybe I'm misreading the question, but you can just create a UUID for each call to map by doing:



@Override
public void map(Text key, Text value, Context context) throws IOException, InterruptedException
{
    context.write(key, new Text(UUID.randomUUID().toString()));
}
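
Since map is called once per input record, five input lines will produce five different UUIDs, which is exactly what the question asks for.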

