Posted to user@spark.apache.org by Gary Malouf <ma...@gmail.com> on 2014/07/07 21:36:31 UTC

SparkSQL with sequence file RDDs

Has anyone reported issues using SparkSQL with sequence files (all of our
data is in this format within HDFS)?  We are considering whether to burn
the time upgrading to Spark 1.0 from 0.9 now and this is a main decision
point for us.

Re: SparkSQL with sequence file RDDs

Posted by Michael Armbrust <mi...@databricks.com>.
Here is a simple example of registering an RDD of Products as a table.  It
is important that all of the fields are vals defined in the constructor and
that you implement canEqual, productArity, and productElement.

class Record(val x1: String) extends Product with Serializable {
  def canEqual(that: Any) = that.isInstanceOf[Record]
  def productArity = 1
  def productElement(n: Int) = n match {
    case 0 => x1
  }
}

// Assumes a SparkContext named sparkContext.  The implicit conversion that adds
// registerAsTable to RDDs of Products, and the bare sql(...) call below, come
// from importing the SQLContext's members.
val sqlContext = new org.apache.spark.sql.SQLContext(sparkContext)
import sqlContext._

sparkContext.parallelize(new Record("a") :: Nil).registerAsTable("records")

sql("SELECT x1 FROM records").collect()


On Mon, Jul 7, 2014 at 6:39 PM, Haoming Zhang <ha...@outlook.com>
wrote:

> Hi Michael,
>
> Thanks for the reply.
>
> Actually last week I tried to play with the Product interface, but I'm not
> really sure whether I did it correctly or not. Here is what I did:
>
> 1. Created an abstract class A with the Product interface, which has 20
> parameters,
> 2. Created a case class B that extends A, and B has 20 parameters.
>
> I can get all the parameters of A, and also B's parameters, via the
> productElement function. I'm just curious whether it is possible to convert
> this kind of case class to a schema, because I need to use the
> .registerAsTable function to insert the case classes into a table.
>
> Best,
> Haoming
>
> ------------------------------
> From: michael@databricks.com
> Date: Mon, 7 Jul 2014 17:52:34 -0700
>
> Subject: Re: SparkSQL with sequence file RDDs
> To: user@spark.apache.org
>
> We know Scala 2.11 has removed the limit on the number of parameters, but
> Spark 1.0 is not compatible with it. So now we are considering using Java
> beans instead of Scala case classes.
>
>
> You can also manually create a class that implements Scala's Product
> interface.  Finally, SPARK-2179
> <https://issues.apache.org/jira/browse/SPARK-2179> will give you a
> programmatic, non-class-based way to describe the schema.  Someone is
> working on this now.
>

RE: SparkSQL with sequence file RDDs

Posted by Haoming Zhang <ha...@outlook.com>.
Hi Michael,

Thanks for the reply.

Actually last week I tried to play with the Product interface, but I'm not really sure whether I did it correctly or not. Here is what I did:

1. Created an abstract class A with the Product interface, which has 20 parameters,
2. Created a case class B that extends A, and B has 20 parameters.

I can get all the parameters of A, and also B's parameters, via the productElement function. I'm just curious whether it is possible to convert this kind of case class to a schema, because I need to use the .registerAsTable function to insert the case classes into a table.
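To make the question concrete, here is a minimal, untested sketch of the shape I mean, run in the Spark 1.0 shell (sc in scope); the field names and counts are placeholders for the real 20 + 20 layout:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._   // implicit conversion from an RDD of Products to a SchemaRDD

// Stand-in for the A/B layering described above.
abstract class A(val a1: String) extends Product with Serializable
case class B(b1: String, b2: String) extends A(b1)

sc.parallelize(B("x", "y") :: Nil).registerAsTable("b_table")
// Does this pick up a1 from A as a column, or only B's own fields b1 and b2?
sql("SELECT b1, b2 FROM b_table").collect()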

Best,
Haoming

From: michael@databricks.com
Date: Mon, 7 Jul 2014 17:52:34 -0700
Subject: Re: SparkSQL with sequence file RDDs
To: user@spark.apache.org



We know Scala 2.11 has removed the limit on the number of parameters, but Spark 1.0 is not compatible with it. So now we are considering using Java beans instead of Scala case classes.



You can also manually create a class that implements Scala's Product interface.  Finally, SPARK-2179 will give you a programmatic, non-class-based way to describe the schema.  Someone is working on this now.


Re: SparkSQL with sequence file RDDs

Posted by Michael Armbrust <mi...@databricks.com>.
>
> We know Scala 2.11 has removed the limit on the number of parameters, but
> Spark 1.0 is not compatible with it. So now we are considering using Java
> beans instead of Scala case classes.
>

You can also manually create a class that implements Scala's Product
interface.  Finally, SPARK-2179
<https://issues.apache.org/jira/browse/SPARK-2179> will give you a
programmatic, non-class-based way to describe the schema.  Someone is
working on this now.
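As a rough sketch of the direction, the programmatic API will probably end up looking something like the snippet below; the applySchema / StructType names are an assumption about how SPARK-2179 will land, and the columns are made up, so treat it as illustration only:

import org.apache.spark.sql._

val sqlContext = new SQLContext(sc)   // assumes an existing SparkContext named sc

// Describe the schema as data instead of as a case class.
val schema = StructType(Seq(
  StructField("name", StringType, true),
  StructField("age", IntegerType, true)))

// Build an RDD[Row] by hand and attach the schema to it.
val rows = sc.parallelize(Seq(Row("alice", 30), Row("bob", 25)))
val people = sqlContext.applySchema(rows, schema)

people.registerAsTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 26").collect()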

RE: SparkSQL with sequence file RDDs

Posted by Haoming Zhang <ha...@outlook.com>.
Hi Gary,

As Michael mentioned, you need to take care of the Scala case classes or Java beans, because SparkSQL needs the schema.

Currently we are trying to insert our data into HBase with Scala 2.10.4 and Spark 1.0.

All of our data is in tables, and we created one case class to represent the rows of each table, which means the number of case class parameters has to match the number of columns. But Scala 2.10.4 has a limitation: the maximum number of parameters a case class can take is 22. So here the problem occurs. If the table is small and has fewer than 22 columns, everything is fine, but if we have a larger table with more than 22 columns, an error is reported.
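For illustration, this is the kind of definition that hits the limit (column names are made up); in the Scala 2.10 REPL it fails to compile as soon as the parameter count passes 22:

// error: Implementation restriction: case classes cannot have more than 22 parameters.
case class WideRow(
  c1: Int,  c2: Int,  c3: Int,  c4: Int,  c5: Int,  c6: Int,
  c7: Int,  c8: Int,  c9: Int,  c10: Int, c11: Int, c12: Int,
  c13: Int, c14: Int, c15: Int, c16: Int, c17: Int, c18: Int,
  c19: Int, c20: Int, c21: Int, c22: Int, c23: Int)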

We know Scala 2.11 has removed the limit on the number of parameters, but Spark 1.0 is not compatible with it. So now we are considering using Java beans instead of Scala case classes.

Best,
Haoming



From: michael@databricks.com
Date: Mon, 7 Jul 2014 17:12:42 -0700
Subject: Re: SparkSQL with sequence file RDDs
To: user@spark.apache.org

I haven't heard any reports of this yet, but I don't see any reason why it wouldn't work. You'll need to manually convert the objects that come out of the sequence file into something where SparkSQL can detect the schema (i.e. scala case classes or java beans) before you can register the RDD as a table.


If you run into any issues please let me know.

On Mon, Jul 7, 2014 at 12:36 PM, Gary Malouf <ma...@gmail.com> wrote:


Has anyone reported issues using SparkSQL with sequence files (all of our data is in this format within HDFS)?  We are considering whether to burn the time upgrading to Spark 1.0 from 0.9 now and this is a main decision point for us.  




Re: SparkSQL with sequence file RDDs

Posted by Michael Armbrust <mi...@databricks.com>.
I haven't heard any reports of this yet, but I don't see any reason why it
wouldn't work. You'll need to manually convert the objects that come out of
the sequence file into something where SparkSQL can detect the schema (i.e.
scala case classes or java beans) before you can register the RDD as a
table.

If you run into any issues please let me know.
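For example, something along these lines should work. This is an untested sketch that assumes the Spark 1.0 shell (sc in scope), a SequenceFile of (Text, Text) pairs, and made-up field names; adapt the Writable types and the parsing in the map to your actual data.

import org.apache.hadoop.io.Text

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._   // implicit conversion from an RDD of Products, plus sql(...)

case class Entry(id: String, payload: String)

// Hadoop reuses Writable instances, so copy the values out with toString right away.
val entries = sc.sequenceFile("hdfs:///path/to/data", classOf[Text], classOf[Text])
  .map { case (k, v) => Entry(k.toString, v.toString) }

entries.registerAsTable("entries")
sql("SELECT id, payload FROM entries").collect()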


On Mon, Jul 7, 2014 at 12:36 PM, Gary Malouf <ma...@gmail.com> wrote:

> Has anyone reported issues using SparkSQL with sequence files (all of our
> data is in this format within HDFS)?  We are considering whether to burn
> the time upgrading to Spark 1.0 from 0.9 now and this is a main decision
> point for us.
>