Posted to user@spark.apache.org by Vishnusaran Ramaswamy <vi...@gmail.com> on 2014/12/02 21:43:53 UTC

SchemaRDD + SQL, loading projection columns

Hi,

I have 16 GB of parquet files in /tmp/logs/ folder with the following schema 

request_id(String), module(String), payload(Array[Byte])

Most of my 16 GB of data is in the payload field; the request_id and module
fields take less than 200 MB.

I want to load the payload only when my filter condition matches. 

val sqlContext = new SQLContext(sc)
val files = sqlContext.parquetFile("/tmp/logs")
files.registerTempTable("logs")
val filteredLogs = sqlContext.sql("select request_id, payload from logs " +
  "where request_id = 'dd4455ee' and module = 'query'")

When I run filteredLogs.collect.foreach(println), I see all 16 GB of
data being loaded.

How do I load only the columns used in the filters first, and then load the
payload for the rows matching the filter criteria?

Let me know if this can be done in a different way.

Thank you,
Vishnu.




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-SQL-loading-projection-columns-tp20189.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: SchemaRDD + SQL, loading projection columns

Posted by Vishnusaran Ramaswamy <vi...@gmail.com>.
Thanks for the help. Let me find more info on how to enable statistics in
Parquet.

-Vishnu



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-SQL-loading-projection-columns-tp20189p20278.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: SchemaRDD + SQL, loading projection columns

Posted by Michael Armbrust <mi...@databricks.com>.
There is not a super easy way to do what you are asking, since in general
Parquet needs to read all the data in a column. As far as I understand, it
does not have indexes that would allow you to jump to a specific entry in a
column.  There is support for pushing down predicates, but unfortunately
this is turned off by default (in Spark 1.2) due to bugs in the Parquet
library.  Even with this feature, though, I believe you still read the data
and just skip the cost of materializing the row.
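[Editor's note: a minimal sketch of turning that pushdown on. The config key
spark.sql.parquet.filterPushdown is the Spark 1.2 name as I understand it;
treat it as an assumption and check it against your release's documentation.]

```scala
// Sketch: enable Parquet predicate pushdown in Spark 1.2 (off by default).
// Config key name is assumed from Spark 1.2; verify for your version.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

// Queries run after this point can push filters such as
// module = 'query' down into the Parquet reader.
```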

One thing that could speed up that particular query is to sort by request_id
before storing to Parquet.  Then (when filter pushdown is turned on),
Parquet will keep statistics on the min/max value for each column in a
given row group.  That would allow it to completely skip row groups that
cannot contain a given request_id.
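[Editor's note: a sketch of that rewrite using the Spark 1.2 SchemaRDD API;
the /tmp/logs_sorted path is made up for illustration.]

```scala
// Sketch: rewrite the data sorted by the filter column, so each Parquet
// row group covers a narrow range of request_id values. With pushdown on,
// the per-row-group min/max statistics can then exclude most row groups.
val logs = sqlContext.parquetFile("/tmp/logs")
logs.registerTempTable("raw_logs")

val sorted = sqlContext.sql(
  "SELECT request_id, module, payload FROM raw_logs ORDER BY request_id")

// Write the sorted copy; point subsequent queries at this path instead.
sorted.saveAsParquetFile("/tmp/logs_sorted")
```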

Michael
