Posted to dev@spark.apache.org by Mick Davies <mi...@gmail.com> on 2015/01/16 17:17:07 UTC

Optimize encoding/decoding strings when using Parquet

Hi, 

It seems that a reasonably large proportion of query time in Spark SQL is
spent decoding Parquet Binary objects to produce Java Strings.
Has anyone considered trying to optimize these conversions, as many of them
are duplicated?

Details are outlined in the conversation on the user mailing list
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-amp-Parquet-data-are-reading-very-very-slow-td21061.html
and I have copied a bit of that discussion here.

It seems that as Spark processes each row from Parquet it makes a call to
convert the Binary representation of each String column into a Java String.
However, in many (probably most) circumstances the underlying Binary instance
from Parquet will have come from a dictionary, for example when column
cardinality is low. Therefore Spark converts the same byte array to a
copy of the same Java String over and over again. This is bad because of the
extra CPU and extra memory used for these Strings, and it probably also makes
grouping comparisons more expensive.


I tested a simple hack to cache the last Binary->String conversion per
column in ParquetConverter and this led to a 25% performance improvement for
the queries I used. Admittedly this was over a data set with lots of runs of
the same Strings in the queried columns.
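
For illustration, here is a minimal sketch (Scala, hypothetical names, assuming
the pre-rename parquet.io.api.Binary API) of the sort of per-column cache I
tried; the actual change was a small patch inside Spark's ParquetConverter:

    import parquet.io.api.Binary

    // Hypothetical sketch: remember the last Binary -> String conversion for one
    // column, so runs of the same dictionary-backed value are decoded only once.
    class CachedStringDecoder {
      private var lastBinary: Binary = null
      private var lastString: String = null

      def decode(value: Binary): String = {
        if (lastBinary != null && lastBinary == value) {
          lastString                              // same bytes as last time: reuse the String
        } else {
          lastBinary = value
          lastString = value.toStringUsingUTF8()  // decode the UTF-8 bytes once
          lastString
        }
      }
    }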

These costs are quite significant for the type of data that I expect will be
stored in Parquet, which will often have denormalized tables and probably
lots of fairly low-cardinality String columns.

I think a good way to optimize this would be to change Parquet so that the
encoding/decoding of objects to Binary is handled on Parquet's side of the
fence. Parquet could deal with objects (Strings) as the client understands
them and only use encoding/decoding to store in and read from the underlying
storage medium. Doing this, I think Parquet could ensure that the
encoding/decoding of each object occurs only once.

Does anyone have an opinion on this? Has it been considered already?

Cheers Mick









Re: Optimize encoding/decoding strings when using Parquet

Posted by Michael Davies <mi...@gmail.com>.
Added PR https://github.com/apache/spark/pull/4139 - I think tests have been re-arranged so a merge is necessary.

Mick


> On 19 Jan 2015, at 18:31, Reynold Xin <rx...@databricks.com> wrote:
> 
> Definitely go for a pull request!


Re: Optimize encoding/decoding strings when using Parquet

Posted by Reynold Xin <rx...@databricks.com>.
Definitely go for a pull request!



Re: Optimize encoding/decoding strings when using Parquet

Posted by Mick Davies <mi...@gmail.com>.
Looking at the Parquet code, it looks like hooks are already in place to
support this.

In particular, PrimitiveConverter has the methods hasDictionarySupport and
addValueFromDictionary for this purpose. These are not used by
CatalystPrimitiveConverter.
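
For illustration, a hedged Scala sketch (not the actual Spark code) of what a
dictionary-aware String converter could look like, decoding each dictionary
entry exactly once:

    import parquet.column.Dictionary
    import parquet.io.api.{Binary, PrimitiveConverter}

    // Hypothetical converter: expands the dictionary to Strings up front, then
    // serves repeated values by dictionary id instead of re-decoding bytes.
    class DictionaryAwareStringConverter(update: String => Unit) extends PrimitiveConverter {
      private var expanded: Array[String] = _

      override def hasDictionarySupport(): Boolean = true

      override def setDictionary(dictionary: Dictionary): Unit = {
        // decode every dictionary entry exactly once
        expanded = Array.tabulate(dictionary.getMaxId + 1) { id =>
          dictionary.decodeToBinary(id).toStringUsingUTF8()
        }
      }

      // called for dictionary-encoded pages: just an array lookup
      override def addValueFromDictionary(dictionaryId: Int): Unit =
        update(expanded(dictionaryId))

      // fallback for pages that are not dictionary encoded
      override def addBinary(value: Binary): Unit =
        update(value.toStringUsingUTF8())
    }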

I think that it would be pretty straightforward to add this. Has anyone
considered it? Shall I put a pull request together for it?

Mick





Re: Optimize encoding/decoding strings when using Parquet

Posted by Mick Davies <mi...@gmail.com>.
I have put in a PR on Parquet to support dictionaries when filters are pushed
down, which should reduce Binary conversion overhead when Spark pushes down
String predicates on columns that are dictionary encoded.

https://github.com/apache/incubator-parquet-mr/pull/117
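
To illustrate the idea (this is not the PR's actual code): with a
dictionary-encoded page, a pushed-down String predicate only needs to be
evaluated once per dictionary entry, and each row then reduces to an id
lookup. A hedged Scala sketch with hypothetical names:

    import parquet.column.Dictionary

    // Hypothetical helper: evaluate a String predicate once per dictionary entry
    // and return the set of matching dictionary ids.
    def matchingIds(dictionary: Dictionary, predicate: String => Boolean): Set[Int] =
      (0 to dictionary.getMaxId)
        .filter(id => predicate(dictionary.decodeToBinary(id).toStringUsingUTF8()))
        .toSet

    // Per row, a check such as country == "GB" then becomes
    // matchingIds(dict, _ == "GB").contains(rowDictionaryId)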

It's blocked at the moment because part of my Parquet build fails on my Mac
due to an issue getting Thrift 0.7 installed. The installation instructions
available for Parquet do not seem to work, I think due to this issue:
https://issues.apache.org/jira/browse/THRIFT-2229.

This is not directly related to Spark, but I wondered if anyone has got
Thrift 0.7 working on OS X Yosemite (10.10), or can suggest a workaround.





Re: Optimize encoding/decoding strings when using Parquet

Posted by Mick Davies <mi...@gmail.com>.
Here are some timings showing the effect of caching the last Binary->String
conversion. Query times are reduced significantly, and the reduction in
garbage makes the variation in timings much smaller.

A set of sample queries selecting various columns, applying some filtering
and then aggregating:

Spark 1.2.0
Query 1 mean time 8353.3 millis, std deviation 480.91511147441025 millis
Query 2 mean time 8677.6 millis, std deviation 3193.345518417949 millis
Query 3 mean time 11302.5 millis, std deviation 2989.9406998950476 millis
Query 4 mean time 10537.0 millis, std deviation 5166.024024549462 millis
Query 5 mean time 9559.9 millis, std deviation 4141.487667493409 millis
Query 6 mean time 12638.1 millis, std deviation 3639.4505522430477 millis


Spark 1.2.0 - cache last Binary->String conversion
Query 1 mean time 5118.9 millis, std deviation 549.6670608448152 millis
Query 2 mean time 3761.3 millis, std deviation 202.57785883183013 millis
Query 3 mean time 7358.8 millis, std deviation 242.58918176850162 millis
Query 4 mean time 4173.5 millis, std deviation 179.802515122688 millis
Query 5 mean time 3857.0 millis, std deviation 140.71957930579526 millis
Query 6 mean time 7512.0 millis, std deviation 198.32633040858022 millis






Re: Optimize encoding/decoding strings when using Parquet

Posted by Mick Davies <mi...@gmail.com>.
Added a JIRA to track this:
https://issues.apache.org/jira/browse/SPARK-5309





Re: Optimize encoding/decoding strings when using Parquet

Posted by Michael Armbrust <mi...@databricks.com>.
+1 to adding such an optimization to Parquet. The bytes are tagged specially
as UTF8 in the Parquet schema, so it seems like it would be possible to add
this.
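
For reference, a small hedged Scala example (using the pre-rename
parquet.schema API; the message and column names are made up) showing the
UTF8 annotation on a binary column:

    import parquet.schema.MessageTypeParser

    // "country" is stored as binary but annotated as UTF8, which is the tag
    // that identifies the bytes as a String at the schema level.
    val schema = MessageTypeParser.parseMessageType(
      """message example {
        |  required binary country (UTF8);
        |  optional int64 amount;
        |}""".stripMargin)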
