Posted to user@spark.apache.org by "raj.kumar" <ra...@hooklogic.com> on 2016/02/26 03:49:10 UTC

Saving and Loading Dataframes

Hi,

I am using MLlib. I use the ml vectorization tools to create the vectorized input
DataFrame for the ml/mllib machine-learning models, with the following schema:
 
root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)

To avoid repeated vectorization, I am trying to save and load this DataFrame using:
   df.write.format("json").mode("overwrite").save( url )
   val data = Spark.sqlc.read.format("json").load( url )

However, when I load the DataFrame back, the newly loaded DataFrame has the
following schema:
root
 |-- features: struct (nullable = true)
 |    |-- indices: array (nullable = true)
 |    |    |-- element: long (containsNull = true)
 |    |-- size: long (nullable = true)
 |    |-- type: long (nullable = true)
 |    |-- values: array (nullable = true)
 |    |    |-- element: double (containsNull = true)
 |-- label: double (nullable = true)

which the machine-learning models do not recognize. 

Is there a way I can save and load this DataFrame without the schema changing?
I assume it has to do with the fact that Vector is not a basic type.

thanks
-Raj








Re: Saving and Loading Dataframes

Posted by Yanbo Liang <yb...@gmail.com>.
Hi Raj,

If you choose JSON as the storage format, Spark SQL writes the VectorUDT column
out as a plain JSON struct (the size/indices/type/values fields you see in the
reloaded schema), so when you load it back it cannot be recognized as a Vector.
One workaround is to store the DataFrame in Parquet format; it will then be
loaded and recognized as expected.

df.write.format("parquet").mode("overwrite").save(output)
> val data = sqlContext.read.format("parquet").load(output)
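
For reference, here is a minimal, self-contained sketch of that Parquet round trip
(the paths, app name, and object name are placeholders; it assumes the Spark 1.6-era
SQLContext API and the "libsvm" data source used elsewhere in this thread):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ParquetRoundTrip {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("parquet-roundtrip").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // Vectorized input: label (double) and features (vector), as in the question.
    val df = sqlContext.read.format("libsvm").load("/tmp/sample_libsvm_data.txt")
    df.printSchema()   // features: vector (nullable = true)

    // Write and re-read as Parquet; the VectorUDT column is preserved.
    df.write.format("parquet").mode("overwrite").save("/tmp/vectorized.parquet")
    val reloaded = sqlContext.read.format("parquet").load("/tmp/vectorized.parquet")
    reloaded.printSchema()   // still reports features: vector (nullable = true)

    sc.stop()
  }
}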


Thanks
Yanbo


Re: Saving and Loading Dataframes

Posted by Raj Kumar <ra...@hooklogic.com>.
Thanks for the response, Yanbo. Here is the source (it uses the sample_libsvm_data.txt file used in the
MLlib examples).

-Raj
------------- IOTest.scala -------------

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.DataFrame

object IOTest {
  val InputFile = "/tmp/sample_libsvm_data.txt"
  val OutputDir = "/tmp/out"

  val sconf = new SparkConf().setAppName("test").setMaster("local[*]")
  val sqlc  = new SQLContext(new SparkContext(sconf))

  // Load the libsvm file: the schema is label: double, features: vector.
  val df = sqlc.read.format("libsvm").load(InputFile)
  df.show; df.printSchema

  // Round-trip through JSON: on reload, the features column comes back as a struct.
  df.write.format("json").mode("overwrite").save(OutputDir)
  val data = sqlc.read.format("json").load(OutputDir)
  data.show; data.printSchema

  // The vals above run when the object is initialized, so main can stay empty.
  def main(args: Array[String]): Unit = {}
}


-----------------------
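
For comparison, a minimal sketch of the same round trip written through Parquet
instead of JSON (paths and the object name are placeholders; it assumes the
spark.ml LogisticRegression API from the 1.6 line). The reloaded DataFrame keeps
the vector column, so an ml estimator accepts it directly:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.ml.classification.LogisticRegression

object IOTestParquet {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("test-parquet").setMaster("local[*]"))
    val sqlc = new SQLContext(sc)

    val df = sqlc.read.format("libsvm").load("/tmp/sample_libsvm_data.txt")

    // Same round trip as IOTest, but through Parquet instead of JSON.
    df.write.format("parquet").mode("overwrite").save("/tmp/out-parquet")
    val data = sqlc.read.format("parquet").load("/tmp/out-parquet")
    data.printSchema   // features is still reported as vector

    // The reloaded frame still has the expected label/features columns,
    // so it can be fed straight into an ml estimator.
    val model = new LogisticRegression().setMaxIter(10).fit(data)
    println(s"coefficients: ${model.coefficients.size}")

    sc.stop()
  }
}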





Re: Saving and Loading Dataframes

Posted by Yanbo Liang <yb...@gmail.com>.
Hi Raj,

Could you share your code, which can help others diagnose this issue? Which version of Spark did you use?
I cannot reproduce this problem in my environment.

Thanks
Yanbo
