Spark option - csv read

I am using Spark 2.1 and trying to read a CSV file. My dependencies:

compile group: 'org.scala-lang', name: 'scala-library', version: '2.11.1'
compile group: 'org.apache.spark', name: 'spark-core_2.11', version: '2.1.0'


Here is my code.

import java.io.{BufferedWriter, File, FileWriter}
import java.sql.{Connection, DriverManager}
import net.sf.log4jdbc.sql.jdbcapi.ConnectionSpy
import org.apache.spark.sql.{DataFrame, SparkSession, Column, SQLContext}
import org.apache.spark.sql.functions._
import org.postgresql.jdbc.PgConnection

val spark = SparkSession.builder().master("local[*]").getOrCreate()

spark.read
    .option("charset", "utf-8")
    .option("header", "true")
    .option("quote", "\"")
    .option("delimiter", ",")
    .csv(...)


It works well. The problem is that the option key I am using ("charset") does not match the DataFrameReader documentation, which says I should use "encoding" to set the encoding. Yet "charset" works fine for me. Is the documentation wrong?





1 answer


You can see in Spark's CSV source (CSVOptions):

val charset = parameters.getOrElse("encoding",
  parameters.getOrElse("charset", StandardCharsets.UTF_8.name()))


Both "encoding" and "charset" are valid option keys, so you shouldn't have any problems using either one to configure the encoding.
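To see how the fallback behaves, here is a minimal plain-Scala sketch of that lookup (the helper name `resolveCharset` is mine, for illustration; only the `getOrElse` chain mirrors the Spark snippet above): "encoding" wins if both keys are set, "charset" is used otherwise, and UTF-8 is the default.

```scala
import java.nio.charset.StandardCharsets

// Sketch of the lookup Spark performs on the reader options map.
// "encoding" takes precedence, then "charset", then the UTF-8 default.
def resolveCharset(parameters: Map[String, String]): String =
  parameters.getOrElse("encoding",
    parameters.getOrElse("charset", StandardCharsets.UTF_8.name()))

println(resolveCharset(Map("charset" -> "utf-8")))       // utf-8
println(resolveCharset(Map("encoding" -> "ISO-8859-1"))) // ISO-8859-1
println(resolveCharset(Map.empty))                       // UTF-8 (the default)
```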



"charset" exists only for backward compatibility: Spark's CSV code originated in the Databricks spark-csv project, which was merged into Spark as of 2.x. The same applies to "delimiter" (now also available as "sep").

Pay attention to the default values for reading CSV: you can remove the encoding, quote, and delimiter options from your code, since you are only setting their defaults anyway. That leaves just:

spark.read.option("header", "true").csv(...)



