Posted to reviews@spark.apache.org by davies <gi...@git.apache.org> on 2016/03/01 00:17:51 UTC
[GitHub] spark pull request: [SPARK-13582] [SQL] defer dictionary decoding ...
GitHub user davies opened a pull request:
https://github.com/apache/spark/pull/11437
[SPARK-13582] [SQL] defer dictionary decoding in parquet reader
## What changes were proposed in this pull request?
This PR defers resolving dictionary ids to values until the column is actually accessed (inside getInt/getLong). This avoids the decoding work for columns and rows that are filtered out. It is also useful for binary types, because we no longer need to copy all the byte arrays.
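As a rough illustration of the idea (not the code in this PR), the sketch below keeps only dictionary ids in a column and looks the value up inside `getLong`. The classes `LongDictionary` and `LazyLongColumn` and the helper `sumWhere` are hypothetical stand-ins, not Spark's `ColumnVector` or Parquet's `Dictionary`; only the names `setDictionary`, `getLong` and `decodeToLong` are borrowed from the diff.

```java
import java.util.function.IntPredicate;

/** Hypothetical stand-ins for illustration only; not Spark or Parquet classes. */
public class LazyDictionaryDemo {

    /** Maps a dictionary id to its value and counts how many lookups happened. */
    static final class LongDictionary {
        private final long[] values;
        int lookups = 0;

        LongDictionary(long[] values) {
            this.values = values;
        }

        long decodeToLong(int id) {
            lookups++;
            return values[id];
        }
    }

    /** A column that stores only dictionary ids and decodes a value on access. */
    static final class LazyLongColumn {
        private final int[] dictionaryIds;
        private LongDictionary dictionary;

        LazyLongColumn(int[] dictionaryIds) {
            this.dictionaryIds = dictionaryIds;
        }

        void setDictionary(LongDictionary dictionary) {
            this.dictionary = dictionary;
        }

        long getLong(int rowId) {
            // The id-to-value resolution is deferred to this point.
            return dictionary.decodeToLong(dictionaryIds[rowId]);
        }
    }

    /** Sums the column over the rows whose key passes the filter. */
    static long sumWhere(int[] keys, IntPredicate keep, LazyLongColumn column) {
        long sum = 0;
        for (int row = 0; row < keys.length; row++) {
            if (keep.test(keys[row])) {
                sum += column.getLong(row); // filtered-out rows are never decoded
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        LongDictionary dict = new LongDictionary(new long[] {100L, 200L, 300L});
        LazyLongColumn column = new LazyLongColumn(new int[] {0, 1, 2, 1, 0, 2});
        column.setDictionary(dict);

        int[] keys = {1, 2, 3, 4, 5, 6};
        long sum = sumWhere(keys, k -> k % 3 == 0, column);

        // Only rows 2 and 5 pass the filter, so only two lookups occur.
        System.out.println("sum = " + sum + ", lookups = " + dict.lookups);
        // prints: sum = 600, lookups = 2
    }
}
```

With eager decoding, all six rows would have been resolved up front; here the lookup count tracks the number of rows that survive the filter.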
## How was this patch tested?
Manually tested TPCDS Q7 with scale factor 10 and saw about a 30% improvement (after PR #11274).
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/davies/spark decode_dict
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/11437.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #11437
----
commit 6676e746b887730eadf9cca297ede4cff7a0de2f
Author: Davies Liu <da...@databricks.com>
Date: 2016-02-29T23:08:52Z
defer dictionary decoding
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-13582] [SQL] defer dictionary decoding ...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/11437#issuecomment-190473490
**[Test build #52205 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52205/consoleFull)** for PR 11437 at commit [`081e6fe`](https://github.com/apache/spark/commit/081e6fe81e2280e4b8041bf376066b9b1d82cc57).
* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-13582] [SQL] defer dictionary decoding ...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/11437#issuecomment-190446110
**[Test build #52202 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52202/consoleFull)** for PR 11437 at commit [`6676e74`](https://github.com/apache/spark/commit/6676e746b887730eadf9cca297ede4cff7a0de2f).
* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-13582] [SQL] defer dictionary decoding ...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/11437#issuecomment-190899490
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52251/
Test PASSed.
[GitHub] spark pull request: [SPARK-13582] [SQL] defer dictionary decoding ...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/11437#issuecomment-190507323
**[Test build #52207 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52207/consoleFull)** for PR 11437 at commit [`6fce801`](https://github.com/apache/spark/commit/6fce80141c76604167914a8cbb39847f1a4f457a).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-13582] [SQL] defer dictionary decoding ...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/11437#issuecomment-190855205
**[Test build #52251 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52251/consoleFull)** for PR 11437 at commit [`e539d8a`](https://github.com/apache/spark/commit/e539d8a94735668c370459ca8bf5a937ee22321d).
[GitHub] spark pull request: [SPARK-13582] [SQL] defer dictionary decoding ...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/11437#issuecomment-190446127
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52202/
Test FAILed.
[GitHub] spark pull request: [SPARK-13582] [SQL] defer dictionary decoding ...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/11437
[GitHub] spark pull request: [SPARK-13582] [SQL] defer dictionary decoding ...
Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on the pull request:
https://github.com/apache/spark/pull/11437#issuecomment-190812391
Can you run the ColumnarBatch/ParquetRead benchmark? Does this have perf problems if there is no dictionary or there is no filter?
[GitHub] spark pull request: [SPARK-13582] [SQL] defer dictionary decoding ...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/11437#issuecomment-190477587
Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-13582] [SQL] defer dictionary decoding ...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/11437#issuecomment-190473494
Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-13582] [SQL] defer dictionary decoding ...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/11437#issuecomment-190477384
**[Test build #52206 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52206/consoleFull)** for PR 11437 at commit [`5faa786`](https://github.com/apache/spark/commit/5faa786628f4b3d61774973f4351693015ba017c).
[GitHub] spark pull request: [SPARK-13582] [SQL] defer dictionary decoding ...
Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/11437#issuecomment-190904837
Merging this into master.
[GitHub] spark pull request: [SPARK-13582] [SQL] defer dictionary decoding ...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/11437#issuecomment-190899484
Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-13582] [SQL] defer dictionary decoding ...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/11437#issuecomment-190507431
Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-13582] [SQL] defer dictionary decoding ...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/11437#issuecomment-190507434
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52207/
Test FAILed.
[GitHub] spark pull request: [SPARK-13582] [SQL] defer dictionary decoding ...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/11437#issuecomment-190473496
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52205/
Test FAILed.
[GitHub] spark pull request: [SPARK-13582] [SQL] defer dictionary decoding ...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/11437#issuecomment-190479973
**[Test build #52207 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52207/consoleFull)** for PR 11437 at commit [`6fce801`](https://github.com/apache/spark/commit/6fce80141c76604167914a8cbb39847f1a4f457a).
[GitHub] spark pull request: [SPARK-13582] [SQL] defer dictionary decoding ...
Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/11437#issuecomment-190444065
cc @nongli
[GitHub] spark pull request: [SPARK-13582] [SQL] defer dictionary decoding ...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/11437#issuecomment-190445236
**[Test build #52202 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52202/consoleFull)** for PR 11437 at commit [`6676e74`](https://github.com/apache/spark/commit/6676e746b887730eadf9cca297ede4cff7a0de2f).
[GitHub] spark pull request: [SPARK-13582] [SQL] defer dictionary decoding ...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/11437#issuecomment-190446124
Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-13582] [SQL] defer dictionary decoding ...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/11437#issuecomment-190603255
**[Test build #2593 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2593/consoleFull)** for PR 11437 at commit [`6fce801`](https://github.com/apache/spark/commit/6fce80141c76604167914a8cbb39847f1a4f457a).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-13582] [SQL] defer dictionary decoding ...
Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/11437#issuecomment-190845510
@nongli There is no visible difference on any of the existing benchmarks (ColumnarBatch and ParquetRead), because they don't use dictionary encoding.
After changing intStringScan to use dictionary encoding (a small number of unique values), here are the results:
Before this patch
```
Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
Int and String Scan:       Best/Avg Time(ms)   Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
SQL Parquet Reader               1248 / 1281         8.4         119.0       1.0X
SQL Parquet MR                   1962 / 2093         5.3         187.1       0.6X
SQL Parquet Vectorized            876 / 1018        12.0          83.5       1.4X
ParquetReader                      741 / 755        14.1          70.7       1.7X
```
After the patch
```
Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
Int and String Scan:       Best/Avg Time(ms)   Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
SQL Parquet Reader               1247 / 1279         8.4         118.9       1.0X
SQL Parquet MR                   1809 / 1851         5.8         172.5       0.7X
SQL Parquet Vectorized             805 / 909        13.0          76.8       1.5X
ParquetReader                      742 / 756        14.1          70.7       1.7X
```
We can see about a 10% improvement on SQL Parquet Vectorized but no difference on ParquetReader; I don't know why. (I didn't include #11274.)
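For reference, the relative change can be recomputed from the Best/Avg times in the two tables above. The snippet below is only a small arithmetic check; the class `SpeedupCheck` and the helper `percentFaster` are hypothetical names and not part of the Spark benchmark code. For SQL Parquet Vectorized it works out to roughly 8% on best time and roughly 11% on average time.

```java
import java.util.Locale;

/** Hypothetical helper for checking the benchmark numbers quoted above. */
public class SpeedupCheck {

    /** Percentage reduction in elapsed time going from beforeMs to afterMs. */
    static double percentFaster(double beforeMs, double afterMs) {
        return (beforeMs - afterMs) / beforeMs * 100.0;
    }

    static void report(String name, double before, double after) {
        System.out.println(String.format(Locale.ROOT, "%-32s %5.1f%%",
                name, percentFaster(before, after)));
    }

    public static void main(String[] args) {
        // Times in ms, copied from the "before" and "after" tables.
        report("SQL Parquet Vectorized (best)", 876, 805);
        report("SQL Parquet Vectorized (avg)", 1018, 909);
        report("ParquetReader (best)", 741, 742);
        report("ParquetReader (avg)", 755, 756);
    }
}
```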
[GitHub] spark pull request: [SPARK-13582] [SQL] defer dictionary decoding ...
Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on the pull request:
https://github.com/apache/spark/pull/11437#issuecomment-190846489
Cool. Lgtm
[GitHub] spark pull request: [SPARK-13582] [SQL] defer dictionary decoding ...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/11437#issuecomment-190563231
**[Test build #2593 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2593/consoleFull)** for PR 11437 at commit [`6fce801`](https://github.com/apache/spark/commit/6fce80141c76604167914a8cbb39847f1a4f457a).
[GitHub] spark pull request: [SPARK-13582] [SQL] defer dictionary decoding ...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/11437#issuecomment-190477582
**[Test build #52206 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52206/consoleFull)** for PR 11437 at commit [`5faa786`](https://github.com/apache/spark/commit/5faa786628f4b3d61774973f4351693015ba017c).
* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-13582] [SQL] defer dictionary decoding ...
Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on a diff in the pull request:
https://github.com/apache/spark/pull/11437#discussion_r54597392
--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java ---
@@ -695,28 +684,28 @@ private void decodeDictionaryIds(int rowId, int num, ColumnVector column) {
case INT64:
if (column.dataType() == DataTypes.LongType ||
DecimalType.is64BitDecimalType(column.dataType())) {
- for (int i = rowId; i < rowId + num; ++i) {
- column.putLong(i, dictionary.decodeToLong(dictionaryIds.getInt(i)));
- }
+ column.setDictionary(dictionary);
} else {
throw new NotImplementedException("Unimplemented type: " + column.dataType());
}
break;
case FLOAT:
- for (int i = rowId; i < rowId + num; ++i) {
- column.putFloat(i, dictionary.decodeToFloat(dictionaryIds.getInt(i)));
- }
+ column.setDictionary(dictionary);
break;
case DOUBLE:
- for (int i = rowId; i < rowId + num; ++i) {
- column.putDouble(i, dictionary.decodeToDouble(dictionaryIds.getInt(i)));
- }
+ column.setDictionary(dictionary);
break;
case FIXED_LEN_BYTE_ARRAY:
- if (DecimalType.is64BitDecimalType(column.dataType())) {
+ // DecimalType written in the legacy mode
+ if (DecimalType.is32BitDecimalType(column.dataType())) {
+ for (int i = rowId; i < rowId + num; ++i) {
+ Binary v = dictionary.decodeToBinary(dictionaryIds.getInt(i));
+ column.putInt(i,(int) CatalystRowConverter.binaryToUnscaledLong(v));
--- End diff --
missing space after ,
[GitHub] spark pull request: [SPARK-13582] [SQL] defer dictionary decoding ...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/11437#issuecomment-190898706
**[Test build #52251 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52251/consoleFull)** for PR 11437 at commit [`e539d8a`](https://github.com/apache/spark/commit/e539d8a94735668c370459ca8bf5a937ee22321d).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-13582] [SQL] defer dictionary decoding ...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/11437#issuecomment-190473204
**[Test build #52205 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52205/consoleFull)** for PR 11437 at commit [`081e6fe`](https://github.com/apache/spark/commit/081e6fe81e2280e4b8041bf376066b9b1d82cc57).
[GitHub] spark pull request: [SPARK-13582] [SQL] defer dictionary decoding ...
Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on a diff in the pull request:
https://github.com/apache/spark/pull/11437#discussion_r54597312
--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java ---
@@ -620,13 +624,6 @@ private void readBatch(int total, ColumnVector column) throws IOException {
}
int num = Math.min(total, leftInPage);
if (useDictionary) {
- // Data is dictionary encoded. We will vector decode the ids and then resolve the values.
- if (dictionaryIds == null) {
--- End diff --
Remove dictionaryIds from this class.
[GitHub] spark pull request: [SPARK-13582] [SQL] defer dictionary decoding ...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/11437#issuecomment-190477588
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52206/
Test FAILed.