Project_Bank.csv is not a Parquet file. expected magic number in the tail [80, 65, 82, 49], but found [110, 111, 13, 10]

I tried to load the CSV file using a custom schema, but every time I get the following error:

Project_Bank.csv is not a Parquet file. expected magic number in the tail [80, 65, 82, 49], but found [110, 111, 13, 10]

This is what my program and the entries in my CSV file look like:

age;job;marital;education;default;balance;housing;loan;contact;day;month;duration;campaign;pdays;previous;poutcome;y
58;management;married;tertiary;no;2143;yes;no;unknown;5;may;261;1;-1;0;unknown;no
44;technician;single;secondary;no;29;yes;no;unknown;5;may;151;1;-1;0;unknown;no
33;entrepreneur;married;secondary;no;2;yes;yes;unknown;5;may;76;1;-1;0;unknown;no

My code:

$ spark-shell --packages com.databricks:spark-csv_2.10:1.5.0

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import org.apache.spark.sql.types._
import org.apache.spark.sql.SQLContext   
import sqlContext.implicits._    
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val bankSchema = StructType(Array(
  StructField("age", IntegerType, true),
  StructField("job", StringType, true),
  StructField("marital", StringType, true),
  StructField("education", StringType, true),
  StructField("default", StringType, true),
  StructField("balance", IntegerType, true),
  StructField("housing", StringType, true),
  StructField("loan", StringType, true),
  StructField("contact", StringType, true),
  StructField("day", IntegerType, true),
  StructField("month", StringType, true),
  StructField("duration", IntegerType, true),
  StructField("campaign", IntegerType, true),
  StructField("pdays", IntegerType, true),
  StructField("previous", IntegerType, true),
  StructField("poutcome", StringType, true),
  StructField("y", StringType, true)))

 val df = sqlContext.
  option("header", "true").
  option("delimiter", ";").

  val distinctage = sqlContext.sql("select distinct age from people")


Any suggestions as to why I can't seem to work with the CSV file here, even after providing the correct schema? Thanks in advance for your advice.

Thanks Amit K



1 answer

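The problem here is that the DataFrame reader expects a Parquet file: when no format is specified, it defaults to Parquet, so it checks the end of the file for the Parquet magic number and fails on a CSV file. To process data in CSV, here's what you can do.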
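To see what the error message is comparing, the two byte lists it prints can be decoded as ASCII. This is only a small illustrative sketch in plain Scala (no Spark needed); the variable names are made up for the example.

```scala
// Byte values copied from the error message.
val expectedTail = Array[Byte](80, 65, 82, 49)   // what a Parquet file ends with
val foundTail    = Array[Byte](110, 111, 13, 10) // what Project_Bank.csv ends with

// Decode both as ASCII text.
val expected = new String(expectedTail, "US-ASCII") // "PAR1"
val found    = new String(foundTail, "US-ASCII")    // "no" followed by CR LF

println(expected)
// Make the line-ending characters visible when printing.
println(found.replace("\r", "\\r").replace("\n", "\\n"))
```

In other words, the reader looked for the Parquet marker `PAR1` and instead found `no` plus a Windows line ending, which is simply the end of the last CSV row.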

First of all, remove the header row from the data, because the parsing code below calls toInt on the first field and would fail on the column names.



Next, we will write the following code to read data.

Create a case class:

case class BankSchema(age: Int, job: String, marital: String, education: String, default: String, balance: Int, housing: String, loan: String, contact: String, day: Int, month: String, duration: Int, campaign: Int, pdays: Int, previous: Int, poutcome: String, y: String)


Read the data from HDFS and convert it to a DataFrame:

val bankData = sc.textFile("/user/myuser/Project_Bank.csv").map(_.split(";")).map(p => BankSchema(p(0).toInt, p(1), p(2), p(3), p(4), p(5).toInt, p(6), p(7), p(8), p(9).toInt, p(10), p(11).toInt, p(12).toInt, p(13).toInt, p(14).toInt, p(15), p(16))).toDF()
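If editing the file to remove the header is inconvenient, the per-line parsing can be made tolerant instead. The following is a minimal sketch in plain Scala, not part of the original answer: `BankRow` and `parseBankLine` are hypothetical names (the case class just repeats the fields above so the snippet stands alone), and the sample lines are illustrative rows shaped like the semicolon-separated data in the question.

```scala
import scala.util.Try

// Same fields as the case class above, redeclared so the snippet is self-contained.
case class BankRow(age: Int, job: String, marital: String, education: String,
                   default: String, balance: Int, housing: String, loan: String,
                   contact: String, day: Int, month: String, duration: Int,
                   campaign: Int, pdays: Int, previous: Int, poutcome: String, y: String)

// Hypothetical helper: returns None for the header row or any malformed line
// instead of throwing a NumberFormatException.
def parseBankLine(line: String): Option[BankRow] = {
  val p = line.trim.split(";", -1).map(_.trim)
  if (p.length != 17) None
  else Try(BankRow(p(0).toInt, p(1), p(2), p(3), p(4), p(5).toInt, p(6), p(7), p(8),
    p(9).toInt, p(10), p(11).toInt, p(12).toInt, p(13).toInt, p(14).toInt,
    p(15), p(16))).toOption
}

// Illustrative input: one header line followed by two data lines.
val sample = Seq(
  "age;job;marital;education;default;balance;housing;loan;contact;day;month;duration;campaign;pdays;previous;poutcome;y",
  "58;management;married;tertiary;no;2143;yes;no;unknown;5;may;261;1;-1;0;unknown;no",
  "44;technician;single;secondary;no;29;yes;no;unknown;5;may;151;1;-1;0;unknown;no"
)

// The header fails the toInt conversion and is dropped.
val rows = sample.flatMap(line => parseBankLine(line))
println(rows.map(_.age)) // List(58, 44)
```

In a Spark session the same helper could presumably be applied with `sc.textFile(...).flatMap(line => parseBankLine(line)).toDF()`, at the cost of silently skipping bad rows.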


Then register the DataFrame as a temporary table and run queries against it.

bankData.registerTempTable("bankData")
val distinctage = sqlContext.sql("select distinct age from bankData")


The output of distinctage.show() will look like this:

| 33|
| 44|
| 58|
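As an alternative that keeps the custom schema from the question, the spark-csv data source can be named explicitly so the reader does not fall back to Parquet. This is a sketch under a few assumptions: the shell was started with `--packages com.databricks:spark-csv_2.10:1.5.0` as in the question, `sqlContext` is the one provided by spark-shell, and the path is the placeholder used earlier in this answer. The schema is rebuilt here from the question's column names and types only so the snippet stands alone.

```scala
import org.apache.spark.sql.types._

// Column names and types mirror the bankSchema defined in the question.
val intColumns = Set("age", "balance", "day", "duration", "campaign", "pdays", "previous")
val columnNames = Seq("age", "job", "marital", "education", "default", "balance",
  "housing", "loan", "contact", "day", "month", "duration", "campaign", "pdays",
  "previous", "poutcome", "y")
val bankSchema = StructType(columnNames.map { name =>
  StructField(name, if (intColumns(name)) IntegerType else StringType, nullable = true)
})

// Trailing dots keep the chained call in one statement in the Scala 2.10 REPL.
val df = sqlContext.read.
  format("com.databricks.spark.csv"). // name the CSV source explicitly
  schema(bankSchema).
  option("header", "true").           // first line holds the column names
  option("delimiter", ";").
  load("/user/myuser/Project_Bank.csv") // placeholder path; adjust to your file

// Register under the table name used in the question's query.
df.registerTempTable("people")
val distinctAge = sqlContext.sql("select distinct age from people")
distinctAge.show()
```

Because `header` is set to `true`, the header row can stay in the file with this approach.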



