Posted to user@spark.apache.org by "Chen, Mingrui" <mi...@mail.smu.edu> on 2017/04/23 16:13:47 UTC

Cannot convert from JavaRDD to Dataframe

Hello everyone!


I am a new Spark learner trying to do a task that seems very simple. I want to read a text file, save the content to a JavaRDD, and convert it to a DataFrame so I can use it with the Word2Vec model later. The code looks pretty simple, but I cannot make it work:


SparkSession spark = SparkSession.builder().appName("Word2Vec").getOrCreate();
JavaRDD<String> lines = spark.sparkContext().textFile("input.txt", 10).toJavaRDD();
JavaRDD<Row> rows = lines.map(new Function<String, Row>() {
    public Row call(String line) {
        return RowFactory.create(Arrays.asList(line.split(" ")));
    }
});
StructType schema = new StructType(new StructField[] {
    new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty())
});
Dataset<Row> input = spark.createDataFrame(rows, schema);
input.show(3);


It throws an exception at input.show(3):


Caused by: java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD


It seems to have a problem converting the JavaRDD<Row> to a DataFrame, but I cannot figure out what mistake I am making, and the exception message is hard to understand. Can anyone help? Thanks!


Re: Cannot convert from JavaRDD to Dataframe

Posted by Radhwane Chebaane <r....@mindlytix.com>.
Hi,

Spark's ArrayType corresponds in Java to a plain Java array, so the column value
must be a String[] rather than a java.util.List. However, since
RowFactory.create takes a varargs Object... where each argument becomes one
column, the String[] must itself be wrapped in an outer array so that it is
treated as a single column value:

   public Row call(String line){
      return RowFactory.create(new String[][]{line.split(" ")});
   }
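The underlying pitfall is how Java varargs interact with arrays: a bare String[] passed to an Object... parameter is spread into one argument per element. Here is a minimal, Spark-free sketch of that behavior; the create helper below is only a stand-in for illustration, not the real RowFactory API:

```java
import java.util.Arrays;

public class VarargsDemo {
    // Stand-in for RowFactory.create(Object... values):
    // each varargs argument becomes one "column".
    static Object[] create(Object... values) {
        return values;
    }

    public static void main(String[] args) {
        String[] words = "hello spark world".split(" ");

        // Passing the String[] directly: the compiler uses it as the
        // varargs array itself, producing one column per word.
        Object[] spread = create((Object[]) words);
        System.out.println(spread.length); // 3

        // Wrapping it in a String[][]: the inner array is a single
        // argument, i.e. one column whose value is the whole array.
        Object[] wrapped = create(new String[][]{ words });
        System.out.println(wrapped.length); // 1
    }
}
```

With the String[][] wrapper, the Row carries a single array-valued column, which is what the ArrayType "text" field in your schema expects.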

More details in this Stack Overflow question:
http://stackoverflow.com/questions/43411492/createdataframe-throws-exception-when-pass-javardd-that-contains-arraytype-col/43585039#43585039
Hope this works for you,

Cheers


-- 

Radhwane Chebaane
Distributed systems engineer, Mindlytix

Mail: radhwane@mindlytix.com
Mobile: +33 695 588 906
Skype: rad.cheb
LinkedIn: https://fr.linkedin.com/in/radhwane-chebaane-483b3a7b