Doesn't Spark support ArrayList when writing to Elasticsearch?

I have the following structure:

mylist = [{"key1": "val1"}, {"key2": "val2"}]
myrdd = value_counts.map(lambda item: ('key', {
    'field': mylist
}))

      

I am getting this error:

15/02/10 15:54:08 INFO scheduler.TaskSetManager: Lost task 1.0 in stage 2.0 (TID 6) on executor ip-10-80-15-145.ec2.internal: org.apache.spark.SparkException (java.util.ArrayList data cannot be used) [duplicate 1]

This is how I am writing it to ES:

myrdd.saveAsNewAPIHadoopFile(
    path='-',
    outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf={
        "es.nodes" : "localhost",
        "es.port" : "9200",
        "es.resource" : "mboyd/mboydtype"
    })

      

This is what I would like the document to look like when it is written to ES:

{
    "field": [{"key1": "val1"}, {"key2": "val2"}]
}

      

+3




2 answers


A bit late to the game, but this is the solution we came up with after running into this yesterday: add 'es.input.json': 'true' to your conf, and then call json.dumps() on your data.

Modifying your example, it looks like this:



import json

rdd = sc.parallelize([{"key1": ["val1", "val2"]}])
# saveAsNewAPIHadoopFile expects a pair RDD; the key is ignored here, and the
# JSON string value is indexed as-is because of es.input.json
json_rdd = rdd.map(lambda doc: ('key', json.dumps(doc)))
json_rdd.saveAsNewAPIHadoopFile( 
    path='-', 
    outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat", 
    keyClass="org.apache.hadoop.io.NullWritable", 
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable", 
    conf={ 
        "es.nodes" : "localhost", 
        "es.port" : "9200", 
        "es.resource" : "mboyd/mboydtype",
        "es.input.json": "true"
    }
) 
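
Applied back to the value_counts RDD from the question, the same idea would look roughly like this (a sketch, assuming mylist is the list of dicts shown there):

import json

# JSON-encode the whole document so elasticsearch-hadoop parses it itself,
# instead of being handed a java.util.ArrayList it cannot convert.
mylist = [{"key1": "val1"}, {"key2": "val2"}]
myrdd = value_counts.map(lambda item: ('key', json.dumps({'field': mylist})))
myrdd.saveAsNewAPIHadoopFile(
    path='-',
    outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf={
        "es.nodes" : "localhost",
        "es.port" : "9200",
        "es.resource" : "mboyd/mboydtype",
        "es.input.json": "true"
    })

Each document should then end up in ES as {"field": [{"key1": "val1"}, {"key2": "val2"}]}, which is the shape asked for in the question.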

      

+2




I had the same problem, and the solution was to convert all the lists to tuples before writing. Converting the data to JSON does the same thing.
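
As a minimal sketch of that idea (not the original poster's code, and assuming the same myrdd pair RDD from the question), the conversion could be done before the save call like this:

def lists_to_tuples(obj):
    # Recursively swap lists for tuples, walking into nested dicts, so that
    # elasticsearch-hadoop never receives a java.util.ArrayList
    if isinstance(obj, list):
        return tuple(lists_to_tuples(x) for x in obj)
    if isinstance(obj, dict):
        return {k: lists_to_tuples(v) for k, v in obj.items()}
    return obj

# Apply to the document side of each (key, value) pair, then save tuple_rdd
# with the same saveAsNewAPIHadoopFile call as in the question.
tuple_rdd = myrdd.mapValues(lists_to_tuples)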



+2








