
Posted to user@spark.apache.org by Kanagha Kumar <kp...@salesforce.com> on 2017/10/01 05:57:21 UTC

Error - Spark reading from HDFS via dataframes - Java

Hi,

I'm trying to read data from HDFS in Spark as DataFrames. When I print the
schema, I see that all columns are read as strings. I then convert the
result to an RDD and create another DataFrame by passing in the correct
schema (how the rows should finally be interpreted).

I'm getting the following error:

Caused by: java.lang.RuntimeException: java.lang.String is not a valid
external type for schema of bigint



Spark read API:

Dataset<Row> hdfs_dataset = new SQLContext(spark).read()
    .option("header", "false")
    .csv("hdfs:/inputpath/*");

Dataset<Row> ds = new SQLContext(spark)
    .createDataFrame(hdfs_dataset.toJavaRDD(), conversionSchema);

This is the schema to be converted to:
StructType(StructField(COL1,StringType,true),
StructField(COL2,StringType,true),
StructField(COL3,LongType,true),
StructField(COL4,StringType,true),
StructField(COL5,StringType,true),
StructField(COL6,LongType,true))

This is the original schema obtained after the read API was invoked:
StructType(StructField(_c1,StringType,true),
StructField(_c2,StringType,true),
StructField(_c3,StringType,true),
StructField(_c4,StringType,true),
StructField(_c5,StringType,true),
StructField(_c6,StringType,true))

My interpretation is that even when a JavaRDD is converted to a DataFrame by
passing in the new schema, the values are not type cast.
This occurs because the read API above reads the data from HDFS as string
types.
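
The mismatch can be illustrated without Spark: a schema only describes the
types the values are expected to have already, so each string field has to be
converted before the typed DataFrame is built. Below is a minimal sketch in
plain Java, assuming the six-column layout from the question (COL3 and COL6 as
long). The class name, method name, and sample values are hypothetical. In
Spark, a function like this could be applied inside JavaRDD.map, with its
result wrapped by RowFactory.create(...), before calling createDataFrame.

```java
import java.util.Arrays;

public class RowConversionSketch {
    // Zero-based positions of the columns the target schema declares as LongType
    // (COL3 and COL6 in the question). Assumption: all other columns stay strings.
    private static final int[] LONG_COLUMNS = {2, 5};

    /** Converts raw string fields into values whose Java types match the target schema. */
    public static Object[] convert(String[] fields) {
        // Copy into a real Object[] so that Long values can be stored in it.
        Object[] typed = Arrays.copyOf(fields, fields.length, Object[].class);
        for (int i : LONG_COLUMNS) {
            // The columns are nullable, so null is passed through instead of parsed.
            // A non-numeric string throws NumberFormatException.
            typed[i] = fields[i] == null ? null : Long.valueOf(fields[i].trim());
        }
        return typed;
    }

    public static void main(String[] args) {
        Object[] row = convert(new String[] {"a", "b", "42", "c", "d", "7"});
        // Columns 2 and 5 now hold java.lang.Long, the external type a bigint column expects.
        System.out.println(row[2].getClass().getName() + " " + row[5].getClass().getName());
    }
}
```

Parsing explicitly like this keeps control over bad input in the application,
at the cost of an extra pass through the RDD API.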

How can I convert an RDD to a DataFrame by passing in the correct schema
once it is read?
How can the values be type cast correctly during this RDD to DataFrame
conversion?

Or how can I read data from HDFS with an input schema in Java?
Any suggestions are helpful. Thanks!

RE: Error - Spark reading from HDFS via dataframes - Java

Posted by JG Perrin <jp...@lumeris.com>.

@Anastasios: just a word of caution, this is the Spark 1.x CSV parser. There are a few (minor) changes for Spark 2.x; you can have a look at http://jgp.net/2017/10/01/loading-csv-in-spark/.
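
For Spark 2.x, where the CSV source is built into DataFrameReader, the inferSchema and mode options mentioned in this thread might be applied roughly as in the sketch below. It assumes a Spark 2.x SparkSession; the class and method names are hypothetical, and the path is the one from the question. Note that inference picks the narrowest type that fits the data, so a column of small numbers may come back as integer rather than bigint.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class InferSchemaSketch {
    /** Reads headerless CSV files and lets Spark infer the column types from the data. */
    public static Dataset<Row> readWithInferredTypes(SparkSession spark, String path) {
        return spark.read()
            .option("header", "false")
            .option("inferSchema", "true")  // requires an extra pass over the input
            .option("mode", "PERMISSIVE")   // the default; DROPMALFORMED and FAILFAST are the alternatives
            .csv(path);
    }

    public static void main(String[] args) {
        // Assumes the master is supplied externally, for example by spark-submit.
        SparkSession spark = SparkSession.builder()
            .appName("infer-schema-sketch")
            .getOrCreate();
        Dataset<Row> ds = readWithInferredTypes(spark, "hdfs:/inputpath/*");
        ds.printSchema();
        spark.stop();
    }
}
```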

From: Anastasios Zouzias [mailto:zouzias@gmail.com]
Sent: Sunday, October 01, 2017 2:05 AM
To: Kanagha Kumar <kp...@salesforce.com>
Cc: user @spark <us...@spark.apache.org>
Subject: Re: Error - Spark reading from HDFS via dataframes - Java

Hi,

Set the inferSchema option to true in spark-csv. You may also want to set the mode option. See the README below:

https://github.com/databricks/spark-csv/blob/master/README.md

Best,
Anastasios


Re: Error - Spark reading from HDFS via dataframes - Java

Posted by Anastasios Zouzias <zo...@gmail.com>.

Hi,

Set the inferSchema option to true in spark-csv. You may also want to set
the mode option. See the README below:

https://github.com/databricks/spark-csv/blob/master/README.md

Best,
Anastasios
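
As an alternative to inference, the target schema from the question can
probably be handed to the reader directly, or the all-string DataFrame can be
cast column by column after reading. The sketch below assumes Spark 2.x with
the built-in CSV source; the class and method names are hypothetical, and the
schema mirrors the conversionSchema described in the question.

```java
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class ExplicitSchemaSketch {
    /** The target schema from the question: COL3 and COL6 are longs, the rest strings. */
    public static StructType conversionSchema() {
        return DataTypes.createStructType(new StructField[] {
            DataTypes.createStructField("COL1", DataTypes.StringType, true),
            DataTypes.createStructField("COL2", DataTypes.StringType, true),
            DataTypes.createStructField("COL3", DataTypes.LongType, true),
            DataTypes.createStructField("COL4", DataTypes.StringType, true),
            DataTypes.createStructField("COL5", DataTypes.StringType, true),
            DataTypes.createStructField("COL6", DataTypes.LongType, true)
        });
    }

    /** Option 1: give the schema to the reader so values are parsed while reading. */
    public static Dataset<Row> readWithSchema(SparkSession spark, String path) {
        return spark.read()
            .option("header", "false")
            .schema(conversionSchema())
            .csv(path);
    }

    /** Option 2: keep the all-string DataFrame and cast the third and sixth columns. */
    public static Dataset<Row> castLongColumns(Dataset<Row> strings) {
        // Look the names up by position so the code does not depend on how
        // the reader named the headerless columns.
        String[] names = strings.columns();
        return strings
            .withColumn(names[2], col(names[2]).cast("long"))
            .withColumn(names[5], col(names[5]).cast("long"));
    }
}
```

Either option avoids the round trip through toJavaRDD() and createDataFrame,
which is where the "not a valid external type" error was raised.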
