Posted to user@spark.apache.org by matthes <md...@sensenetworks.com> on 2014/09/26 02:05:43 UTC

Is it possible to use Parquet with Dremel encoding

Hi again!

At the moment I'm trying to use Parquet, and I want to keep the data in
memory in an efficient way so that I can run requests against it as fast
as possible.
I read that Parquet is able to encode nested columns; it uses the Dremel
encoding with definition and repetition levels.
Is it currently possible to use this in Spark as well, or is it not
implemented yet? If it is, I'm not sure how to do it. I saw some examples
that try to put arrays or case classes inside other case classes, but I
don't think that is the right way. The other thing I saw in this context
was SchemaRDDs.

Input:

Col1 | Col2 | Col3 | Col4
int  | long | long | int
-------------------------
14   | 1234 | 1422 | 3
14   | 3212 | 1542 | 2
14   | 8910 | 1422 | 8
15   | 1234 | 1542 | 9
15   | 8897 | 1422 | 13

Want this Parquet format (rows grouped by Col3; " marks a repeated value):

Col3 | Col1 | Col4 | Col2
long | int  | int  | long
-------------------------
1422 | 14   | 3    | 1234
"    | "    | 8    | 8910
"    | 15   | 13   | 8897
1542 | 14   | 2    | 3212
"    | 15   | 9    | 1234
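
Put differently, I want the rows grouped and sorted by Col3. If I
understand SchemaRDDs right, maybe something like this sketch expresses
the layout I want (the table name "myTable" is only a placeholder):

    sqlContext.sql("SELECT col3, col1, col4, col2 FROM myTable ORDER BY col3, col1")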

It would be awesome if somebody could give me a good hint on how to do
that, or maybe a better way.

Best,
Matthes






Re: Is it possible to use Parquet with Dremel encoding

Posted by matthes <md...@sensenetworks.com>.
Thank you so much, guys, for helping me, but I have some more questions
about it!

Do we have to presort the data to get the benefits of run-length encoding,
or do I have to group the data first and wrap it in a case class?

I tried sorting the data first before writing it out, and I get different
file sizes as a result:

65.191.222 bytes    unsorted
62.576.598 bytes    sorted
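
I sorted roughly like this before writing (a simplified sketch, not my
exact code; "records" and the column names are placeholders):

    // register the SchemaRDD as a table, sort it, then write it as Parquet
    records.registerTempTable("records")
    val sorted = sqlContext.sql(
      "SELECT col1, col2, col3, col4 FROM records ORDER BY col3, col4")
    sorted.saveAsParquetFile("sorted.parquet")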

I see no run-length encoding in the debug output:

14/09/29 11:20:59 INFO ColumnChunkPageWriteStore: written 4.572.354B for
[col1] INT64: 683.189 values, 5.465.512B raw, 4.572.211B comp, 6 pages,
encodings: [PLAIN, BIT_PACKED]
14/09/29 11:20:59 INFO ColumnChunkPageWriteStore: written 4.687.432B for
[col2] INT64: 683.189 values, 5.465.512B raw, 4.687.289B comp, 6 pages,
encodings: [PLAIN, BIT_PACKED]
14/09/29 11:20:59 INFO ColumnChunkPageWriteStore: written 847.267B for
[col3] INT32: 683.189 values, 852.104B raw, 847.198B comp, 3 pages,
encodings: [PLAIN_DICTIONARY, BIT_PACKED], dic { 713 entries, 2.852B raw,
713B comp}
14/09/29 11:20:59 INFO ColumnChunkPageWriteStore: written 796.082B for
[col4] INT32: 683.189 values, 907.744B raw, 796.013B comp, 3 pages,
encodings: [PLAIN_DICTIONARY, BIT_PACKED], dic { 1.262 entries, 5.048B raw,
1.262B comp}


By the way, why is the schema wrong? I do include repeated values there;
I'm very confused!

Thanks 
Matthes





Re: Is it possible to use Parquet with Dremel encoding

Posted by Michael Armbrust <mi...@databricks.com>.
Based on your first example, it looks like what you want is actually
run-length encoding (which Parquet does support:
<https://github.com/Parquet/parquet-format/blob/master/Encodings.md>).
Repetition and definition levels are used to reconstruct nested or repeated
(array) data that has been shredded so that each column can be stored
separately (allowing you to avoid reading bits for columns you don't care
about).

Spark SQL is likely the easiest way for you to achieve what you want, and
it does support nested and array data (though it does not look like your
schema has that). Given the original data, you could save it as Parquet as
follows:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

import sqlContext._

case class Data(col1: Int, col2: Long, col3: Long, col4: Int)

sc.parallelize(
  Data(14, 1234, 1422, 3)  ::
  Data(14, 3212, 1542, 2)  ::
  Data(14, 8910, 1422, 8)  ::
  Data(15, 1234, 1542, 9)  ::
  Data(15, 8897, 1422, 13) :: Nil).saveAsParquetFile(...)

Note that this is only an illustration of the API, and if the data is large
you will not want to construct it all as a static List on the driver and
parallelize.  Instead transform it into the case class representation using
a map or something similar and then saveAsParquetFile.
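
For example, something along these lines (a sketch only; the input path
and the tab-separated layout are assumptions about your data):

    // Sketch: parse each line into the case class, then write as Parquet.
    val records = sc.textFile("hdfs://.../input.tsv").map { line =>
      val f = line.split("\t")
      Data(f(0).toInt, f(1).toLong, f(2).toLong, f(3).toInt)
    }
    records.saveAsParquetFile("hdfs://.../data.parquet")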


Re: Is it possible to use Parquet with Dremel encoding

Posted by Frank Austin Nothaft <fn...@berkeley.edu>.
Matthes,

Ah, gotcha! Repeated items in Parquet seem to correspond to ArrayType in Spark SQL. I only use Spark, but it does look like that should be supported in Spark SQL 1.1.0. I'm not sure, though, whether you can apply predicates on repeated items from Spark SQL.

Regards,

Frank Austin Nothaft
fnothaft@berkeley.edu
fnothaft@eecs.berkeley.edu
202-340-0466



Re: Is it possible to use Parquet with Dremel encoding

Posted by matthes <md...@sensenetworks.com>.
Hi Frank,

thanks a lot for your response, this is very helpful!

Actually, I was trying to figure out whether the current Spark version
supports repetition levels
(https://blog.twitter.com/2013/dremel-made-simple-with-parquet), but now it
looks good to me.
It is very hard to find good information about that. Now I found this as
well:
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=blob;f=sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTestData.scala;h=1dc58633a2a68cd910c1bab01c3d5ee1eb4f8709;hb=f479cf37

I wasn't sure about that because nested data can mean many different things!
If it works with SQL to look up the firstRepeatedid or secoundRepeatedid,
that would be awesome. But if it only works with some kind of map/reduce
job, that is also fine. The most important thing is to filter on the first
or second repeated value as fast as possible, and on both in combination.
I'm starting to play with these things now to get the best search results!

My schema looks like this:

val nestedSchema =
    """message nestedRowSchema
       {
         int32 firstRepeatedid;
         repeated group level1
         {
           int64 secoundRepeatedid;
           repeated group level2
           {
             int64 value1;
             int32 value2;
           }
         }
       }
    """

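I guess the Spark SQL equivalent of that nesting would be case classes
with Seq fields (an untested sketch on my side):

    // Untested sketch: repeated groups become Seq fields (ArrayType in Spark SQL).
    case class Level2(value1: Long, value2: Int)
    case class Level1(secoundRepeatedid: Long, level2: Seq[Level2])
    case class NestedRow(firstRepeatedid: Int, level1: Seq[Level1])
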
Best,
Matthes





Re: Is it possible to use Parquet with Dremel encoding

Posted by Frank Austin Nothaft <fn...@berkeley.edu>.
Hi Matthes,

Can you post an example of your schema? When you refer to nesting, are you referring to optional columns, nested schemas, or tables with repeated values? Parquet uses run-length encoding to compress columns with repeated values, which is the case your example seems to refer to. The point Matt is making in his post is that if you have Parquet files that contain records with a nested schema, e.g.:

record MyNestedSchema {
  int nestedSchemaField;
}

record MySchema {
  int nonNestedField;
  MyNestedSchema nestedRecord;
}

Not all systems support queries against these schemas. If you want to load the data directly into Spark, it isn’t an issue. I’m not familiar with how Spark SQL handles this, but I believe the bit you quoted is saying that support for nested queries (e.g., select ... from … where nestedRecord.nestedSchemaField == 0) was added in Spark 1.0.1 (which is already available, BTW).
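
Concretely, I mean something like this sketch (untested; the path and
table name are placeholders, and registerTempTable assumes Spark 1.1; in
1.0.x the method was registerAsTable):

    // Untested sketch: read the Parquet file and filter on a nested field.
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val rows = sqlContext.parquetFile("hdfs://.../myschema.parquet")
    rows.registerTempTable("records")
    sqlContext.sql(
      "SELECT nonNestedField FROM records WHERE nestedRecord.nestedSchemaField = 0"
    ).collect().foreach(println)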

Regards,

Frank Austin Nothaft
fnothaft@berkeley.edu
fnothaft@eecs.berkeley.edu
202-340-0466



Re: Is it possible to use Parquet with Dremel encoding

Posted by matthes <md...@sensenetworks.com>.
Thank you Jey,

That is a nice introduction, but it may be a bit too old (Aug 21st, 2013):

"Note: If you keep the schema flat (without nesting), the Parquet files you
create can be read by systems like Shark and Impala. These systems allow you
to query Parquet files as tables using SQL-like syntax. The Parquet files
created by this sample application could easily be queried using Shark for
example."

But in this post
(http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-Nested-CaseClass-Parquet-failure-td8377.html)
I found this: Nested parquet is not supported in 1.0, but is part of the
upcoming 1.0.1 release.

So the question now is: can I get the benefits of nested Parquet files for
fast lookups with SQL, or do I have to write a special map/reduce job to
transform and search my data?





Re: Is it possible to use Parquet with Dremel encoding

Posted by Jey Kottalam <je...@cs.berkeley.edu>.
Hi Matthes,

You may find the following blog post relevant:
http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/

Hope that helps,
-Jey
