Posted to reviews@spark.apache.org by MaxGekk <gi...@git.apache.org> on 2018/05/10 12:59:32 UTC
[GitHub] spark pull request #21292: [SPARK-24068][BACKPORT-2.3] Propagating DataFrame...
GitHub user MaxGekk opened a pull request:
https://github.com/apache/spark/pull/21292
[SPARK-24068][BACKPORT-2.3] Propagating DataFrameReader's options to Text datasource on schema inferring
## What changes were proposed in this pull request?
When reading CSV or JSON files, DataFrameReader's options are converted to Hadoop parameters, for example here:
https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L302
However, the options are not propagated to the Text datasource during schema inference, for instance:
https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala#L184-L188
This PR proposes propagating the user's options to the Text datasource during schema inference, in the same way that they are converted to Hadoop parameters when a schema is specified.
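To make the gap concrete, here is a minimal, self-contained sketch of the idea. It is not Spark's code: the `Conf` type alias, the `withOptions` helper, and the rule of skipping `path`/`paths` keys are assumptions made for illustration, with a plain `Map` standing in for a Hadoop configuration. It only shows why a read path that never merges the user's options cannot see a setting such as `io.compression.codecs`.

```scala
// Hypothetical, simplified model of option propagation; not Spark's actual code.
object OptionPropagationSketch {
  // Stand-in for a Hadoop Configuration: a plain immutable key/value map.
  type Conf = Map[String, String]

  // Copy every user option into the configuration, skipping null values and
  // the reader's own "path"/"paths" keys (an assumption of this sketch).
  def withOptions(base: Conf, options: Map[String, String]): Conf =
    options.foldLeft(base) {
      case (conf, (k, v)) if v != null && k != "path" && k != "paths" =>
        conf + (k -> v)
      case (conf, _) =>
        conf
    }

  def main(args: Array[String]): Unit = {
    val base: Conf = Map("fs.defaultFS" -> "file:///")
    val userOptions = Map(
      "io.compression.codecs" -> "com.hadoop.compression.lzo.LzopCodec",
      "sep" -> "|",
      "path" -> "test.csv.lzo")

    // Read with an explicit schema: the options are merged into the configuration.
    val scanConf = withOptions(base, userOptions)
    // Schema inference without the fix: only the base configuration is used.
    val inferConfWithoutFix = base
    // Schema inference with the fix: the same merge is applied.
    val inferConfWithFix = withOptions(base, userOptions)

    println(scanConf.contains("io.compression.codecs"))
    println(inferConfWithoutFix.contains("io.compression.codecs"))
    println(inferConfWithFix.contains("io.compression.codecs"))
  }
}
```

In this model, the inference path without the fix has no codec entry, which corresponds to the compressed files in the manual test below not being decodable during schema inference.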
## How was this patch tested?
The changes were tested manually by using https://github.com/twitter/hadoop-lzo:
```
hadoop-lzo> mvn clean package
hadoop-lzo> ln -s ./target/hadoop-lzo-0.4.21-SNAPSHOT.jar ./hadoop-lzo.jar
```
Create 2 test files in JSON and CSV format and compress them:
```shell
$ cat test.csv
col1|col2
a|1
$ lzop test.csv
$ cat test.json
{"col1":"a","col2":1}
$ lzop test.json
```
Run `spark-shell` with hadoop-lzo:
```
bin/spark-shell --jars ~/hadoop-lzo/hadoop-lzo.jar
```
Read the compressed CSV and JSON files without specifying a schema:
```scala
spark.read.option("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec").option("inferSchema",true).option("header",true).option("sep","|").csv("test.csv.lzo").show()
+----+----+
|col1|col2|
+----+----+
|   a|   1|
+----+----+
```
```scala
spark.read.option("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec").option("multiLine", true).json("test.json.lzo").printSchema
root
 |-- col1: string (nullable = true)
 |-- col2: long (nullable = true)
```
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/MaxGekk/spark-1 text-options-backport-v2.3
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21292.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21292
----
commit 9092faa573cf39faa7171f1335d612309452b644
Author: Maxim Gekk <ma...@...>
Date: 2018-04-27T13:23:40Z
Propagating DataFrameReader's options to the text datasource on schema inferring
commit 7b4a6b40625028c7c367090f0fe48e0ec26bc79a
Author: Maxim Gekk <ma...@...>
Date: 2018-04-28T07:58:31Z
Make textOptions serializable
commit fe6c3c2cc9a113f7cd38185a0315484e3a3c99cc
Author: Maxim Gekk <ma...@...>
Date: 2018-05-05T09:16:44Z
Adding @transient to textOptions because they shouldn't be serialized
commit 831441b292c67c8de93eb25894df02579cbc0fd3
Author: Maxim Gekk <ma...@...>
Date: 2018-05-06T08:09:37Z
Removing the separate val for textOptions
commit f6ab928c1abcac239f9d857d86d2e2a966f8e091
Author: Maxim Gekk <ma...@...>
Date: 2018-05-06T08:53:25Z
Removing unused imports
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21292: [SPARK-24068][BACKPORT-2.3] Propagating DataFrameReader'...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21292
**[Test build #90458 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90458/testReport)** for PR 21292 at commit [`f6ab928`](https://github.com/apache/spark/commit/f6ab928c1abcac239f9d857d86d2e2a966f8e091).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
[GitHub] spark issue #21292: [SPARK-24068][BACKPORT-2.3] Propagating DataFrameReader'...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/21292
add to whitelist
---
[GitHub] spark issue #21292: [SPARK-24068][BACKPORT-2.3] Propagating DataFrameReader'...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21292
**[Test build #90458 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90458/testReport)** for PR 21292 at commit [`f6ab928`](https://github.com/apache/spark/commit/f6ab928c1abcac239f9d857d86d2e2a966f8e091).
---
[GitHub] spark issue #21292: [SPARK-24068][BACKPORT-2.3] Propagating DataFrameReader'...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21292
Can one of the admins verify this patch?
---
[GitHub] spark issue #21292: [SPARK-24068][BACKPORT-2.3] Propagating DataFrameReader'...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21292
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90458/
---
[GitHub] spark issue #21292: [SPARK-24068][BACKPORT-2.3] Propagating DataFrameReader'...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/21292
@MaxGekk, BTW, a backport PR is not closed automatically when it is merged into another branch. Would you mind closing this one manually, please?
---
[GitHub] spark pull request #21292: [SPARK-24068][BACKPORT-2.3] Propagating DataFrame...
Posted by MaxGekk <gi...@git.apache.org>.
Github user MaxGekk closed the pull request at:
https://github.com/apache/spark/pull/21292
---
[GitHub] spark issue #21292: [SPARK-24068][BACKPORT-2.3] Propagating DataFrameReader'...
Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/21292
Merged to branch-2.3.
---
[GitHub] spark issue #21292: [SPARK-24068][BACKPORT-2.3] Propagating DataFrameReader'...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21292
Merged build finished. Test PASSed.
---
[GitHub] spark issue #21292: [SPARK-24068][BACKPORT-2.3] Propagating DataFrameReader'...
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on the issue:
https://github.com/apache/spark/pull/21292
LGTM
---