Project_Bank.csv is not a Parquet file. expected magic number in the tail [80, 65, 82, 49], but found [110, 111, 13, 10]

So, I tried to load the csv file with a custom schema, but every time I get the following error:

Project_Bank.csv is not a Parquet file. expected magic number in the tail [80, 65, 82, 49], but found [110, 111, 13, 10]

This is what my program and the entries in the csv file look like:

age;job;marital;education;default;balance;housing;loan;contact;day;month;duration;campaign;pdays;previous;poutcome;y
58;management;married;tertiary;no;2143;yes;no;unknown;5;may;261;1;-1;0;unknown;no
44;technician;single;secondary;no;29;yes;no;unknown;5;may;151;1;-1;0;unknown;no
33;entrepreneur;married;secondary;no;2;yes;yes;unknown;5;may;76;1;-1;0;unknown;no

My code:

$ spark-shell --packages com.databricks:spark-csv_2.10:1.5.0

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import org.apache.spark.sql.types._
import org.apache.spark.sql.SQLContext   
import sqlContext.implicits._    
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val bankSchema = StructType(Array(
  StructField("age", IntegerType, true),
  StructField("job", StringType, true),
  StructField("marital", StringType, true),
  StructField("education", StringType, true),
  StructField("default", StringType, true),
  StructField("balance", IntegerType, true),
  StructField("housing", StringType, true),
  StructField("loan", StringType, true),
  StructField("contact", StringType, true),
  StructField("day", IntegerType, true),
  StructField("month", StringType, true),
  StructField("duration", IntegerType, true),
  StructField("campaign", IntegerType, true),
  StructField("pdays", IntegerType, true),
  StructField("previous", IntegerType, true),
  StructField("poutcome", StringType, true),
  StructField("y", StringType, true)))


 val df = sqlContext.
  read.
  schema(bankSchema).
  option("header", "true").
  option("delimiter", ";").
  load("/user/amit.kudnaver_gmail/hadoop/project_bank/Project_Bank.csv").toDF()

  df.registerTempTable("people")
  df.printSchema()
  val distinctage = sqlContext.sql("select distinct age from people")


Any suggestions as to why I can't work with the csv file here, even after specifying the correct schema? Thanks in advance for your advice.

Thanks Amit K



1 answer


The problem here is that the DataFrame reader expects a Parquet file by default when no format is specified, so it tries to parse your CSV as Parquet. To process the data as CSV, here's what you can do.
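Before switching approaches entirely, note that the schema-based code from the question can be kept almost as-is; a minimal sketch, assuming the spark-csv package from the question's `spark-shell --packages` invocation is on the classpath, is to name the source format explicitly so Spark does not fall back to its Parquet default:

```scala
// Same as the question's code, plus an explicit format() call.
// Without it, Spark 1.x defaults to Parquet, which causes the
// "expected magic number in the tail [80, 65, 82, 49]" error.
val df = sqlContext.read
  .format("com.databricks.spark.csv")  // read as CSV, not Parquet
  .schema(bankSchema)
  .option("header", "true")
  .option("delimiter", ";")
  .load("/user/amit.kudnaver_gmail/hadoop/project_bank/Project_Bank.csv")
```

The case-class approach below works as well and avoids the external package, at the cost of parsing the lines manually.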

First of all, remove the header row from the data.

58;management;married;tertiary;no;2143;yes;no;unknown;5;may;261;1;-1;0;unknown;no
44;technician;single;secondary;no;29;yes;no;unknown;5;may;151;1;-1;0;unknown;no
33;entrepreneur;married;secondary;no;2;yes;yes;unknown;5;may;76;1;-1;0;unknown;no
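If the file is local (or copied out of HDFS), dropping the header can be scripted; a small sketch, assuming a local copy named Project_Bank.csv:

```shell
# Drop the first (header) line, writing a new file so the original is kept.
tail -n +2 Project_Bank.csv > Project_Bank_noheader.csv

# The edited file can then be pushed back to HDFS, e.g.:
# hdfs dfs -put -f Project_Bank_noheader.csv /user/myuser/Project_Bank.csv
```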


Next, write the following code to read the data.

Create a case class:

case class BankSchema(age: Int, job: String, marital:String, education:String, default:String, balance:Int, housing:String, loan:String, contact:String, day:Int, month:String, duration:Int, campaign:Int, pdays:Int, previous:Int, poutcome:String, y:String)




Read the data from HDFS and parse it:

val bankData = sc.textFile("/user/myuser/Project_Bank.csv")
  .map(_.split(";"))
  .map(p => BankSchema(p(0).toInt, p(1), p(2), p(3), p(4), p(5).toInt,
    p(6), p(7), p(8), p(9).toInt, p(10), p(11).toInt, p(12).toInt,
    p(13).toInt, p(14).toInt, p(15), p(16)))
  .toDF()
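If editing the file on HDFS is inconvenient, the header can also be skipped inside Spark itself; a sketch, assuming the header row starts with the literal column name "age":

```scala
// Alternative: drop the header row in the RDD instead of editing the file.
val bankData = sc.textFile("/user/myuser/Project_Bank.csv")
  .filter(line => !line.startsWith("age"))  // skip the header row
  .map(_.split(";"))
  .map(p => BankSchema(p(0).toInt, p(1), p(2), p(3), p(4), p(5).toInt,
    p(6), p(7), p(8), p(9).toInt, p(10), p(11).toInt, p(12).toInt,
    p(13).toInt, p(14).toInt, p(15), p(16)))
  .toDF()
```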


And then register the table and run queries.

bankData.registerTempTable("bankData")
val distinctage = sqlContext.sql("select distinct age from bankData")
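Equivalently, once the data is a DataFrame, the same query can be expressed with the DataFrame API instead of SQL; a small sketch:

```scala
// Same query via the DataFrame API rather than a SQL string.
val distinctAge = bankData.select("age").distinct()
distinctAge.show()
```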


This is what the output will look like:

+---+
|age|
+---+
| 33|
| 44|
| 58|
+---+








