Posted to reviews@spark.apache.org by ScrapCodes <gi...@git.apache.org> on 2016/07/12 10:45:47 UTC

[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as data source f...

GitHub user ScrapCodes opened a pull request:

    https://github.com/apache/spark/pull/14151

    [SPARK-16496][SQL] Add wholetext as data source for SQL.

    ## What changes were proposed in this pull request?
    
    In many text analysis problems, it is often not desirable for the rows to be split by "\n". A wholeText reader already exists for the RDD API, and this JIRA adds the same support to the Dataset API.
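
To make the difference between the two reading modes concrete, here is a minimal plain-Python sketch. It does not use Spark; `read_rows` and the sample file contents are hypothetical and only illustrate "one row per line" versus "one row per file".

```python
import os
import tempfile


def read_rows(path, wholetext=False):
    """Return the file as rows: one per line, or a single row for the whole file."""
    with open(path, encoding='utf-8') as handle:
        content = handle.read()
    if wholetext:
        # Whole-file mode: line breaks stay inside the single row.
        return [content]
    # Default mode: every line becomes its own row.
    return content.splitlines()


with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, 'text-test.txt')
    with open(sample, 'w', encoding='utf-8') as handle:
        handle.write('hello\nthis')
    per_line = read_rows(sample)
    whole = read_rows(sample, wholetext=True)

print(per_line)
print(whole)
```

In whole-file mode the row count equals the number of files, so any downstream logic that assumes one record per line has to split the text itself.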
    
    
    ## How was this patch tested?
    
    Added two new tests.
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ScrapCodes/spark SPARK-16496/wholetext

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14151.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14151
    
----
commit bd2936dfbecd9f65b2e3ccea088ebf13fa1acc5d
Author: Prashant Sharma <pr...@in.ibm.com>
Date:   2016-07-08T08:48:37Z

    [SPARK-16496][SQL] Add wholetext as data source for SQL.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by ScrapCodes <gi...@git.apache.org>.

Github user ScrapCodes commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Actually, what you said sounds like a nice idea. I was wondering whether it is possible to propagate this as an option to the other formats, like CSV and JSON, too?



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    **[Test build #80203 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80203/testReport)** for PR 14151 at commit [`2edc7fe`](https://github.com/apache/spark/commit/2edc7fe4d0278ec91f0ca6051c426aba185f0019).



[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14151#discussion_r137595716
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/text/TextSuite.scala ---
    @@ -185,8 +233,7 @@ class TextSuite extends QueryTest with SharedSQLContext {
         val data = df.collect()
         assert(data(0) == Row("This is a test file for the text data source"))
         assert(data(1) == Row("1+1"))
    -    // non ascii characters are not allowed in the code, so we disable the scalastyle here.
    -    // scalastyle:off
    +    // scalastyle:off nonascii
         assert(data(2) == Row("数据砖头"))
         // scalastyle:on
    --- End diff --
    
    `// scalastyle:on nonascii`



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Merged build finished. Test FAILed.



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    cc @viirya 



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Merged build finished. Test FAILed.



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80205/
    Test PASSed.



[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

Posted by frreiss <gi...@git.apache.org>.

Github user frreiss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14151#discussion_r74804217
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -533,6 +533,12 @@ object SQLConf {
           .timeConf(TimeUnit.MILLISECONDS)
           .createWithDefault(10L)
     
    +  val WHOLETEXT =
    +    SQLConfigBuilder("spark.sql.wholetext")
    --- End diff --
    
    Should this really be a session-global configuration? It seems like something that is specific to a particular input file and should only be set when opening a given file.



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    **[Test build #78870 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78870/testReport)** for PR 14151 at commit [`88ead2a`](https://github.com/apache/spark/commit/88ead2a988591ca3955d93f602c274ab7d09acd0).



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84896/
    Test FAILed.



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by ScrapCodes <gi...@git.apache.org>.

Github user ScrapCodes commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    @gatorsmile Ping !   



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by ScrapCodes <gi...@git.apache.org>.

Github user ScrapCodes commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    This python pydoc style is failing at `[Row(value=u'hello\nthis')]`. I could not find a way to fix it. Any help will be appreciated. It does not like the literal `'\n'`
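
The thread does not show the actual error, so the following is only one plausible reading: if the example is run as a doctest from a normal (non-raw) docstring, Python turns `\n` into a real line break before doctest parses the text, and the expected-output line is split in two. A minimal sketch using only the standard `doctest` module (the helper names `joined` and `run` are hypothetical, and plain strings stand in for `Row` objects):

```python
import doctest


def joined():
    return 'hello\nthis'


# Normal string literal: the backslash-n becomes a real line break, so the
# expected output spans two lines and the second one loses its indentation.
COOKED_DOC = """
    >>> [joined()]
    ['hello\nthis']
"""

# Raw string literal: the backslash survives, so the expected output matches
# the repr() that the interpreter prints.
RAW_DOC = r"""
    >>> [joined()]
    ['hello\nthis']
"""


def run(doc):
    """Parse and run one docstring; return (failed, attempted) or the parse error text."""
    parser = doctest.DocTestParser()
    try:
        test = parser.get_doctest(doc, {'joined': joined}, 'example', None, 0)
    except ValueError as err:
        return 'parse error: %s' % err
    runner = doctest.DocTestRunner(verbose=False)
    result = runner.run(test, out=lambda text: None)
    return (result.failed, result.attempted)


print(run(COOKED_DOC))
print(run(RAW_DOC))
```

Under this assumption, either making the docstring raw or doubling the backslash (`\\n`) in a normal docstring keeps the escape intact for doctest.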



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    **[Test build #84704 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84704/testReport)** for PR 14151 at commit [`66d5b45`](https://github.com/apache/spark/commit/66d5b453cd2aaaea08a3843f4966fc9036451b6c).



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14151#discussion_r155139949
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/text/TextSuite.scala ---
    @@ -39,6 +39,54 @@ class TextSuite extends QueryTest with SharedSQLContext {
         verifyFrame(spark.read.text(testFile))
       }
     
    +  test("reading text file with option wholetext=true") {
    +    val df = spark.read.option("wholetext", "true")
    +      .format("text").load(testFile)
    +    // schema
    +    assert(df.schema == new StructType().add("value", StringType))
    +
    +    // verify content
    +    val data = df.collect()
    +    assert(data(0) ==
    +      Row(
    +        // scalastyle:off nonascii
    +        """This is a test file for the text data source
    +          |1+1
    +          |数据砖头
    +          |"doh"
    +          |""".stripMargin))
    +    // scalastyle:on
    --- End diff --
    
    nit: // scalastyle:on nonascii



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84602/
    Test FAILed.



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84704/
    Test FAILed.



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Looks like the escaping is ok.



[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

Posted by ScrapCodes <gi...@git.apache.org>.

Github user ScrapCodes commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14151#discussion_r155706338
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -313,11 +313,16 @@ def text(self, paths):
             Each line in the text file is a new row in the resulting DataFrame.
     
             :param paths: string, or list of strings, for input path(s).
    +        :param wholetext: if true, read each file from input path(s) as a single row.
     
             >>> df = spark.read.text('python/test_support/sql/text-test.txt')
             >>> df.collect()
             [Row(value=u'hello'), Row(value=u'this')]
    +        >>> df = spark.read.text('python/test_support/sql/text-test.txt', wholetext=True)
    +        >>> df.collect()
    +        [Row(value=u'hello\nthis')]
    --- End diff --
    
    That would fail the test, I suppose. I can give that a try though.



[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14151#discussion_r137601004
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala ---
    @@ -99,8 +100,22 @@ class TextFileFormat extends TextBasedFileFormat with DataSourceRegister {
         val broadcastedHadoopConf =
           sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf))
     
    +    val wholeText: Boolean = options.getOrElse("wholetext", "false").toBoolean
    --- End diff --
    
    Move it to `TextOptions`?
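
The suggestion amounts to parsing the option once in a dedicated options holder instead of inline in the reader. A rough Python sketch of that idea follows; this `TextOptions` class is a hypothetical illustration, not Spark's Scala class, and the case-insensitive key lookup and the strict true/false check are assumptions of the sketch.

```python
class TextOptions:
    """Hypothetical holder for text data source options (illustration only)."""

    def __init__(self, parameters):
        # Assumption of this sketch: option keys are matched case-insensitively.
        self._parameters = {key.lower(): value for key, value in parameters.items()}

    @property
    def whole_text(self):
        # Defaults to False when the caller does not set the option.
        raw = self._parameters.get('wholetext', 'false')
        lowered = str(raw).strip().lower()
        if lowered not in ('true', 'false'):
            raise ValueError('wholetext must be true or false, got %r' % (raw,))
        return lowered == 'true'


print(TextOptions({}).whole_text)
print(TextOptions({'wholeText': 'TRUE'}).whole_text)
```

Keeping the parsing in one place means the default and the validation rule are defined once, and the reader only consumes a typed boolean.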



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by ScrapCodes <gi...@git.apache.org>.

Github user ScrapCodes commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    @rxin Ping !



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Thanks! Merged to master. 
    
    The code style issues will be addressed by my other PRs. 



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64773/
    Test PASSed.



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66164/
    Test PASSed.



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    **[Test build #62308 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62308/consoleFull)** for PR 14151 at commit [`82952e7`](https://github.com/apache/spark/commit/82952e7a1b504f61ec09dc50a31105e5cb405587).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    **[Test build #66164 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66164/consoleFull)** for PR 14151 at commit [`3f8a177`](https://github.com/apache/spark/commit/3f8a17764b79a35ea33e02454b0f0da2622116a6).



[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14151#discussion_r81289022
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala ---
    @@ -99,8 +100,22 @@ class TextFileFormat extends TextBasedFileFormat with DataSourceRegister {
         val broadcastedHadoopConf =
           sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf))
     
    +    val wholeText: Boolean = options.getOrElse("wholetext", "false").toBoolean
    --- End diff --
    
    Actually, we need to document this within `readwriter.py` too.



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by ScrapCodes <gi...@git.apache.org>.

Github user ScrapCodes commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    @rxin Do you think it looks okay now ?



[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

Posted by ScrapCodes <gi...@git.apache.org>.

Github user ScrapCodes commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14151#discussion_r74903349
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -533,6 +533,12 @@ object SQLConf {
           .timeConf(TimeUnit.MILLISECONDS)
           .createWithDefault(10L)
     
    +  val WHOLETEXT =
    +    SQLConfigBuilder("spark.sql.wholetext")
    --- End diff --
    
    They are removed.



[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14151#discussion_r137601363
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala ---
    @@ -101,10 +111,24 @@ class TextFileFormat extends TextBasedFileFormat with DataSourceRegister {
         val broadcastedHadoopConf =
           sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf))
     
    +    val wholeText: Boolean = options.getOrElse("wholetext", "false").toBoolean
    +
    +    readToUnsafeMem(broadcastedHadoopConf, requiredSchema,
    +      wholeText)
    +  }
    +
    +  private[datasources] def readToUnsafeMem(conf: Broadcast[SerializableConfiguration],
    +      requiredSchema: StructType, wholeTextMode: Boolean):
    --- End diff --
    
    ```Scala
      private def readToUnsafeMem(
          conf: Broadcast[SerializableConfiguration],
          requiredSchema: StructType,
          wholeTextMode: Boolean): (PartitionedFile) => Iterator[UnsafeRow]
    ```



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    **[Test build #78884 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78884/testReport)** for PR 14151 at commit [`3fb5889`](https://github.com/apache/spark/commit/3fb588962228ec58e8b76692ec7f0a767746f334).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/80203/
    Test FAILed.



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66375/
    Test PASSed.


---

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    **[Test build #82799 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82799/testReport)** for PR 14151 at commit [`c5cb735`](https://github.com/apache/spark/commit/c5cb7356e4ccf844298e4c0cc7a1dcaccde458b9).


---

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    **[Test build #66164 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66164/consoleFull)** for PR 14151 at commit [`3f8a177`](https://github.com/apache/spark/commit/3f8a17764b79a35ea33e02454b0f0da2622116a6).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class HadoopFileWholeTextReader(file: PartitionedFile, conf: Configuration)`


---

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by ScrapCodes <gi...@git.apache.org>.

Github user ScrapCodes commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Thanks @gatorsmile. I was actually wondering where I can document this option.


---

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84905/
    Test PASSed.


---

[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14151#discussion_r155527201
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -313,11 +313,16 @@ def text(self, paths):
             Each line in the text file is a new row in the resulting DataFrame.
     
             :param paths: string, or list of strings, for input path(s).
    +        :param wholetext: if true, read each file from input path(s) as a single row.
     
             >>> df = spark.read.text('python/test_support/sql/text-test.txt')
             >>> df.collect()
             [Row(value=u'hello'), Row(value=u'this')]
    +        >>> df = spark.read.text('python/test_support/sql/text-test.txt', wholetext=True)
    +        >>> df.collect()
    +        [Row(value=u'hello\nthis')]
    --- End diff --
    
    Hm, can't we just do `\\n`?
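
The escaping question above can be reproduced without Spark. The following is a minimal sketch, assuming Python 3 (so the `u''` prefix from the diff does not appear); `read_whole`, `BROKEN`, `FIXED` and `run` are hypothetical names used only for this illustration. In a non-raw docstring, `\n` in the expected output turns into a real line break and no longer matches the `repr()` of the returned string, while `\\n` keeps the backslash and matches.

```python
import doctest


def read_whole(text):
    """Hypothetical stand-in for a whole-file read: returns the text unchanged."""
    return text


# Non-raw string: "\n" below becomes a real newline, so the expected output
# spans two lines and cannot match repr('hello\nthis').
BROKEN = """
>>> read_whole('hello' + chr(10) + 'this')
'hello\nthis'
"""

# Escaped: "\\n" stays as backslash + n in the docstring text, which is
# exactly what repr() prints for a string containing a newline.
FIXED = """
>>> read_whole('hello' + chr(10) + 'this')
'hello\\nthis'
"""


def run(docstring):
    """Run one docstring as a doctest and return TestResults(failed, attempted)."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(docstring, {"read_whole": read_whole}, "example", None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    # Discard the failure report; only the counts are of interest here.
    return runner.run(test, out=lambda _msg: None)


broken_result = run(BROKEN)
fixed_result = run(FIXED)
print(broken_result.failed, fixed_result.failed)
```

Marking the docstring as a raw string (`r"""..."""`) is another common way to get the same effect; which one fits `readwriter.py` is a style decision for the PR.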


---

[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14151#discussion_r130267182
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFileWholeTextReader.scala ---
    @@ -0,0 +1,57 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources
    +
    +import java.io.Closeable
    +import java.net.URI
    +
    +import org.apache.hadoop.conf.Configuration
    +import org.apache.hadoop.fs.Path
    +import org.apache.hadoop.io.Text
    +import org.apache.hadoop.mapreduce._
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit
    +import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
    +
    +import org.apache.spark.input.WholeTextFileRecordReader
    +
    +/**
    + * An adaptor from a [[PartitionedFile]] to an [[Iterator]] of [[Text]], which is all of the lines
    + * in that file.
    + */
    +class HadoopFileWholeTextReader(file: PartitionedFile, conf: Configuration)
    --- End diff --
    
    We may want to override `isSplitable` of `TextFileFormat` and return false when `wholetext` option is enabled.
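
Why splitting matters for `wholetext` can be shown with a toy model. This is a simplified sketch, not Spark's actual file planner: `plan_partitions`, `max_split_bytes` and the file sizes are assumptions made up for the illustration. When files are treated as splitable, a large file is cut into several byte ranges, so a reader that emits one row per range would yield more than one row for that file. With splitting disabled, each file maps to exactly one range.

```python
def plan_partitions(file_sizes, max_split_bytes, splitable):
    """Return a list of (file name, start offset, length) byte ranges.

    Toy model only: real planners also pack small files together and
    take block locations into account.
    """
    partitions = []
    for name, size in file_sizes.items():
        if not splitable:
            # One range covering the whole file, regardless of its size.
            partitions.append((name, 0, size))
            continue
        offset = 0
        while offset < size:
            length = min(max_split_bytes, size - offset)
            partitions.append((name, offset, length))
            offset += length
    return partitions


# Hypothetical inputs: one file larger than the split size, one smaller.
files = {"text1.txt": 300, "text2.txt": 50}

split = plan_partitions(files, max_split_bytes=128, splitable=True)
whole = plan_partitions(files, max_split_bytes=128, splitable=False)

print(len(split))  # text1.txt is cut into three ranges, text2.txt stays whole
print(len(whole))  # one range per file
```

In this model the number of ranges equals the number of files only when `splitable` is false, which is the property the `wholetext` tests in this PR check through the row count.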


---

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by ScrapCodes <gi...@git.apache.org>.

Github user ScrapCodes commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    I have a question: should we keep a column with filenames? In the current approach we ignore the key column.


---

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62233/
    Test PASSed.


---

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as data source for SQL.

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    **[Test build #62158 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62158/consoleFull)** for PR 14151 at commit [`bd2936d`](https://github.com/apache/spark/commit/bd2936dfbecd9f65b2e3ccea088ebf13fa1acc5d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class HadoopFileWholeTextReader(file: PartitionedFile, conf: Configuration) extends Iterator[Text] `
      * `class WholeTextFileFormat extends TextFileFormat `


---

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as data source for SQL.

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62161/
    Test PASSed.


---

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Merged build finished. Test FAILed.


---

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    **[Test build #63839 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63839/consoleFull)** for PR 14151 at commit [`2540018`](https://github.com/apache/spark/commit/2540018c7c503a860c0376a8698718fbce284146).


---

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    **[Test build #80205 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80205/testReport)** for PR 14151 at commit [`a918ccc`](https://github.com/apache/spark/commit/a918ccc2d9034370823fc87b4db9470be1508d82).


---

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    **[Test build #84645 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84645/testReport)** for PR 14151 at commit [`7e91020`](https://github.com/apache/spark/commit/7e9102040ee9d6817288b5712e4c3c353390fba1).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/14151


---

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by ScrapCodes <gi...@git.apache.org>.

Github user ScrapCodes commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Jenkins, retest this please.


---

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Merged build finished. Test PASSed.


---

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    **[Test build #80205 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80205/testReport)** for PR 14151 at commit [`a918ccc`](https://github.com/apache/spark/commit/a918ccc2d9034370823fc87b4db9470be1508d82).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Merged build finished. Test PASSed.


---

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by ScrapCodes <gi...@git.apache.org>.

Github user ScrapCodes commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    @viirya Can you please take another look?


---

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Merged build finished. Test PASSed.


---

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    retest this please.


---

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    **[Test build #84704 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84704/testReport)** for PR 14151 at commit [`66d5b45`](https://github.com/apache/spark/commit/66d5b453cd2aaaea08a3843f4966fc9036451b6c).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class WholeTextFileSuite extends QueryTest with SharedSQLContext `


---

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Merged build finished. Test FAILed.


---

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    **[Test build #84602 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84602/testReport)** for PR 14151 at commit [`dd2ed3d`](https://github.com/apache/spark/commit/dd2ed3dce09dfd4c68dffe6fab831469273f7d5e).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class HadoopFileWholeTextReader(file: PartitionedFile, conf: Configuration)`


---

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as data source for SQL.

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    **[Test build #62161 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62161/consoleFull)** for PR 14151 at commit [`dafe981`](https://github.com/apache/spark/commit/dafe981176dac55a623c24dcd8f439ecdf27d72a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63839/
    Test PASSed.


---

[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14151#discussion_r155143739
  
    --- Diff: python/pyspark/sql/readwriter.py ---
    @@ -313,11 +313,13 @@ def text(self, paths):
             Each line in the text file is a new row in the resulting DataFrame.
     
             :param paths: string, or list of strings, for input path(s).
    +        :param wholetext: if true, read each file from input path(s) as a single row.
     
             >>> df = spark.read.text('python/test_support/sql/text-test.txt')
             >>> df.collect()
             [Row(value=u'hello'), Row(value=u'this')]
             """
    --- End diff --
    
    Can you add a doctest for `wholetext` too?


---

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    **[Test build #78884 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78884/testReport)** for PR 14151 at commit [`3fb5889`](https://github.com/apache/spark/commit/3fb588962228ec58e8b76692ec7f0a767746f334).


---

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Merged build finished. Test PASSed.


---

[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14151#discussion_r137601163
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala ---
    @@ -101,10 +111,24 @@ class TextFileFormat extends TextBasedFileFormat with DataSourceRegister {
         val broadcastedHadoopConf =
           sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf))
     
    +    val wholeText: Boolean = options.getOrElse("wholetext", "false").toBoolean
    +
    +    readToUnsafeMem(broadcastedHadoopConf, requiredSchema,
    +      wholeText)
    +  }
    +
    +  private[datasources] def readToUnsafeMem(conf: Broadcast[SerializableConfiguration],
    +      requiredSchema: StructType, wholeTextMode: Boolean):
    --- End diff --
    
    `private[datasources]` => `private`


---

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Merged build finished. Test FAILed.


---

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    **[Test build #78797 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78797/testReport)** for PR 14151 at commit [`88ead2a`](https://github.com/apache/spark/commit/88ead2a988591ca3955d93f602c274ab7d09acd0).
     * This patch **fails due to an unknown error code, -10**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    **[Test build #66375 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66375/consoleFull)** for PR 14151 at commit [`e263b15`](https://github.com/apache/spark/commit/e263b1508a77424b371a0796ea4f9c05bc1c0121).


---

[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

Posted by jiangxb1987 <gi...@git.apache.org>.

Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14151#discussion_r130859483
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/text/TextSuite.scala ---
    @@ -39,6 +39,52 @@ class TextSuite extends QueryTest with SharedSQLContext {
         verifyFrame(spark.read.text(testFile))
       }
     
    +  test("reading text file with option wholetext=true") {
    +    val df = spark.read.option("wholetext", "true")
    +      .format("text").load(testFile)
    +    // schema
    +    assert(df.schema == new StructType().add("value", StringType))
    +
    +    // verify content
    +    val data = df.collect()
    +    assert(data(0) ==
    +      Row(
    +        // scalastyle:off nonascii
    +        """This is a test file for the text data source
    +          |1+1
    +          |数据砖头
    +          |"doh"
    +          |""".stripMargin))
    +    // scalastyle:on
    +    assert(data.length == 1)
    +  }
    +
    +  test("reading multiple text files with option wholetext=true") {
    +    import org.apache.spark.sql.catalyst.util._
    +    withTempDir { dir =>
    +      val file1 = new File(dir, "text1.txt")
    +      stringToFile(file1,
    +        """text file 1 contents.
    +          |From: None to: ??
    +        """.stripMargin)
    +      val file2 = new File(dir, "text2.txt")
    +      stringToFile(file2, "text file 2 contents.")
    +      val file3 = new File(dir, "text3.txt")
    +      stringToFile(file3, "text file 3 contents.")
    +      val df = spark.read.option("wholetext", "true").text(dir.getAbsolutePath)
    +      // wholetext reads each file into a single row, so the row count equals the number of files.
    +      val data = df.sort("value").collect()
    +      assert(data.length == 3)
    +      // Each file should be represented by a single Row/element in the DataFrame/Dataset.
    +      assert(data(0) == Row(
    +        """text file 1 contents.
    +          |From: None to: ??
    +        """.stripMargin))
    +      assert(data(1) == Row(
    +        """text file 2 contents.""".stripMargin))
    --- End diff --
    
    nit: should we also check for `data(2)`?



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    **[Test build #84512 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84512/testReport)** for PR 14151 at commit [`da64f2d`](https://github.com/apache/spark/commit/da64f2da4030c1ac7ffec3070b1c57ae6b2e2a32).



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    **[Test build #80203 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/80203/testReport)** for PR 14151 at commit [`2edc7fe`](https://github.com/apache/spark/commit/2edc7fe4d0278ec91f0ca6051c426aba185f0019).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    **[Test build #84905 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84905/testReport)** for PR 14151 at commit [`021039b`](https://github.com/apache/spark/commit/021039bd1382392282faaec1e1f5c0d39e650a93).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    **[Test build #84896 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84896/testReport)** for PR 14151 at commit [`021039b`](https://github.com/apache/spark/commit/021039bd1382392282faaec1e1f5c0d39e650a93).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    **[Test build #84896 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84896/testReport)** for PR 14151 at commit [`021039b`](https://github.com/apache/spark/commit/021039bd1382392282faaec1e1f5c0d39e650a93).



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by ScrapCodes <gi...@git.apache.org>.

Github user ScrapCodes commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    @gatorsmile @jiangxb1987  ping!



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    **[Test build #78197 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78197/testReport)** for PR 14151 at commit [`e263b15`](https://github.com/apache/spark/commit/e263b1508a77424b371a0796ea4f9c05bc1c0121).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class HadoopFileWholeTextReader(file: PartitionedFile, conf: Configuration)`



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78197/



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as data source for SQL.

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62158/



[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

Posted by ScrapCodes <gi...@git.apache.org>.

Github user ScrapCodes commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14151#discussion_r131100257
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFileWholeTextReader.scala ---
    @@ -0,0 +1,57 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources
    +
    +import java.io.Closeable
    +import java.net.URI
    +
    +import org.apache.hadoop.conf.Configuration
    +import org.apache.hadoop.fs.Path
    +import org.apache.hadoop.io.Text
    +import org.apache.hadoop.mapreduce._
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit
    +import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
    +
    +import org.apache.spark.input.WholeTextFileRecordReader
    +
    +/**
    + * An adaptor from a [[PartitionedFile]] to an [[Iterator]] of [[Text]], which is all of the lines
    + * in that file.
    + */
    +class HadoopFileWholeTextReader(file: PartitionedFile, conf: Configuration)
    --- End diff --
    
    Thank you for catching this.



[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

Posted by frreiss <gi...@git.apache.org>.

Github user frreiss commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14151#discussion_r74805700
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/text/TextSuite.scala ---
    @@ -39,6 +39,11 @@ class TextSuite extends QueryTest with SharedSQLContext {
         verifyFrame(spark.read.text(testFile))
       }
     
    +  test("reading text file with wholetext option on") {
    --- End diff --
    
    As far as I'm aware, the most common use case for reading entire files is using a glob to read a directory or directory tree containing multiple files. For example, one might download the Enron corpus (see https://www.cs.cmu.edu/~./enron/), which comes packaged with one file per email message. With a large number of input files, it's important that the work of processing them be split among many cores. So the test for the `wholetext` option really should have multiple input files and verify that different files end up in different partitions of the resulting RDD or DataFrame.



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78797/



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Merged build finished. Test FAILed.



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Merged build finished. Test FAILed.



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Merged build finished. Test FAILed.



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82799/



[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

Posted by ScrapCodes <gi...@git.apache.org>.

Github user ScrapCodes commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14151#discussion_r81292587
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala ---
    @@ -99,8 +100,22 @@ class TextFileFormat extends TextBasedFileFormat with DataSourceRegister {
         val broadcastedHadoopConf =
           sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf))
     
    +    val wholeText: Boolean = options.getOrElse("wholetext", "false").toBoolean
    --- End diff --
    
    Good reminder, @HyukjinKwon!



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78870/



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    **[Test build #84905 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84905/testReport)** for PR 14151 at commit [`021039b`](https://github.com/apache/spark/commit/021039bd1382392282faaec1e1f5c0d39e650a93).



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84645/



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Merged build finished. Test PASSed.



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62308/



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    **[Test build #66161 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66161/consoleFull)** for PR 14151 at commit [`74a5f28`](https://github.com/apache/spark/commit/74a5f285de0314969f7f3959ccdb20dc57fb864a).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class HadoopFileWholeTextReader(file: PartitionedFile, conf: Configuration) extends Iterator[Text] `



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by ScrapCodes <gi...@git.apache.org>.

Github user ScrapCodes commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    @jiangxb1987 @viirya, your feedback has been incorporated; please take another look.



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Merged build finished. Test FAILed.



[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14151#discussion_r155143333
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala ---
    @@ -97,14 +109,26 @@ class TextFileFormat extends TextBasedFileFormat with DataSourceRegister {
         assert(
           requiredSchema.length <= 1,
           "Text data source only produces a single data column named \"value\".")
    -
    +    val textOptions = new TextOptions(options)
         val broadcastedHadoopConf =
           sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf))
     
    +    readToUnsafeMem(broadcastedHadoopConf, requiredSchema, textOptions.wholeText)
    +  }
    +
    +  private def readToUnsafeMem(conf: Broadcast[SerializableConfiguration],
    +      requiredSchema: StructType, wholeTextMode: Boolean):
    +  (PartitionedFile) => Iterator[UnsafeRow] = {
    +
         (file: PartitionedFile) => {
    -      val reader = new HadoopFileLinesReader(file, broadcastedHadoopConf.value.value)
    +      val confValue = conf.value.value
    +      var reader: Iterator[Text] with Closeable = null
    +      if (!wholeTextMode) {
    +        reader = new HadoopFileLinesReader(file, confValue)
    +      } else {
    +        reader = new HadoopFileWholeTextReader(file, confValue)
    +      }
    --- End diff --
    
    We can avoid using `var`:
    ```scala
    val reader = if (!wholeTextMode) {
      new HadoopFileLinesReader(file, confValue)
    } else {
      new HadoopFileWholeTextReader(file, confValue)
    }
    ```
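    
    A minimal, self-contained sketch of why this works, under stated assumptions: `LinesReader` and `WholeReader` are hypothetical stand-ins for `HadoopFileLinesReader` and `HadoopFileWholeTextReader`, and they iterate over an in-memory `String` rather than a Hadoop file. Because `if`/`else` is an expression in Scala, the compiler infers a common supertype of both branches, so the `val` can still be iterated and closed without an explicit type annotation or a `null` initializer.
    
    ```scala
    import java.io.Closeable
    
    object ReaderChoice {
      // Hypothetical stand-in for a line-by-line reader: one element per line.
      class LinesReader(content: String) extends Iterator[String] with Closeable {
        private val lines = content.split("\n").iterator
        def hasNext: Boolean = lines.hasNext
        def next(): String = lines.next()
        def close(): Unit = ()
      }
    
      // Hypothetical stand-in for a whole-text reader: the full content as one element.
      class WholeReader(content: String) extends Iterator[String] with Closeable {
        private var consumed = false
        def hasNext: Boolean = !consumed
        def next(): String = { consumed = true; content }
        def close(): Unit = ()
      }
    
      def readAll(content: String, wholeTextMode: Boolean): List[String] = {
        // No `var` and no `null`: the branch result is bound directly to a `val`.
        val reader = if (!wholeTextMode) {
          new LinesReader(content)
        } else {
          new WholeReader(content)
        }
        // Both Iterator and Closeable members are available on the inferred type.
        try reader.toList finally reader.close()
      }
    }
    ```
    
    With this sketch, `ReaderChoice.readAll("a\nb", wholeTextMode = false)` yields two elements, while `wholeTextMode = true` yields the whole string as a single element.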



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84512/



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    **[Test build #84648 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84648/testReport)** for PR 14151 at commit [`7e91020`](https://github.com/apache/spark/commit/7e9102040ee9d6817288b5712e4c3c353390fba1).



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84648/



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by HyukjinKwon <gi...@git.apache.org>.

Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    retest this please



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    LGTM



[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    **[Test build #82839 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82839/testReport)** for PR 14151 at commit [`da64f2d`](https://github.com/apache/spark/commit/da64f2da4030c1ac7ffec3070b1c57ae6b2e2a32).



[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14151#discussion_r130265982
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFileWholeTextReader.scala ---
    @@ -0,0 +1,57 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources
    +
    +import java.io.Closeable
    +import java.net.URI
    +
    +import org.apache.hadoop.conf.Configuration
    +import org.apache.hadoop.fs.Path
    +import org.apache.hadoop.io.Text
    +import org.apache.hadoop.mapreduce._
    +import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit
    +import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
    +
    +import org.apache.spark.input.WholeTextFileRecordReader
    +
    +/**
    + * An adaptor from a [[PartitionedFile]] to an [[Iterator]] of [[Text]], which is all of the lines
    + * in that file.
    + */
    +class HadoopFileWholeTextReader(file: PartitionedFile, conf: Configuration)
    --- End diff --
    
    I'd like to point out that a `PartitionedFile` can be just a part of an input file rather than a whole file, so in this case you cannot guarantee that the reader reads the entire content of a file.
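
    To illustrate the concern, here is a minimal, Spark-free sketch of how a file larger than the maximum split size can end up as several byte ranges, none of which covers the whole file. `FileSlice`, `sliceFile`, `coversWholeFile` and `maxSplitBytes` are hypothetical names used only for illustration; they are not Spark's actual `PartitionedFile` API.

    ```scala
    object SplitSketch {
      // Hypothetical stand-in for a byte range of a file; not Spark's PartitionedFile.
      final case class FileSlice(path: String, start: Long, length: Long)

      // Cut a file of `fileLength` bytes into slices of at most `maxSplitBytes` bytes.
      def sliceFile(path: String, fileLength: Long, maxSplitBytes: Long): Seq[FileSlice] = {
        require(maxSplitBytes > 0, "maxSplitBytes must be positive")
        (0L until fileLength by maxSplitBytes).map { offset =>
          FileSlice(path, offset, math.min(maxSplitBytes, fileLength - offset))
        }
      }

      // A slice covers the whole file only if it starts at 0 and spans every byte.
      def coversWholeFile(slice: FileSlice, fileLength: Long): Boolean =
        slice.start == 0L && slice.length == fileLength
    }
    ```

    Under these assumptions, a 250-byte file with a 100-byte limit yields three slices, so a reader handed any single slice would see only part of the file.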


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    **[Test build #82839 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82839/testReport)** for PR 14151 at commit [`da64f2d`](https://github.com/apache/spark/commit/da64f2da4030c1ac7ffec3070b1c57ae6b2e2a32).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    **[Test build #82799 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82799/testReport)** for PR 14151 at commit [`c5cb735`](https://github.com/apache/spark/commit/c5cb7356e4ccf844298e4c0cc7a1dcaccde458b9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by ScrapCodes <gi...@git.apache.org>.

Github user ScrapCodes commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by sameeragarwal <gi...@git.apache.org>.

Github user sameeragarwal commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    test this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

Posted by jiangxb1987 <gi...@git.apache.org>.

Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14151#discussion_r130859352
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/text/TextSuite.scala ---
    @@ -185,11 +231,11 @@ class TextSuite extends QueryTest with SharedSQLContext {
         val data = df.collect()
         assert(data(0) == Row("This is a test file for the text data source"))
         assert(data(1) == Row("1+1"))
    -    // non ascii characters are not allowed in the code, so we disable the scalastyle here.
    -    // scalastyle:off
    +    // scalastyle:off nonascii
         assert(data(2) == Row("数据砖头"))
         // scalastyle:on
         assert(data(3) == Row("\"doh\""))
         assert(data.length == 4)
       }
    +
    --- End diff --
    
    nit: remove this empty line.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82839/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    **[Test build #62233 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62233/consoleFull)** for PR 14151 at commit [`6e83f46`](https://github.com/apache/spark/commit/6e83f46b8658ce622b809a8aa08cf2f60c6b6021).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class HadoopFileWholeTextReader(file: PartitionedFile, conf: Configuration) extends Iterator[Text] `


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    **[Test build #62233 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62233/consoleFull)** for PR 14151 at commit [`6e83f46`](https://github.com/apache/spark/commit/6e83f46b8658ce622b809a8aa08cf2f60c6b6021).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    For now, let's just do it for text files. I took a look, and I guess it is OK to leave them in datasources for now.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82838/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

Posted by ScrapCodes <gi...@git.apache.org>.

Github user ScrapCodes commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14151#discussion_r74902549
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/text/TextSuite.scala ---
    @@ -39,6 +39,11 @@ class TextSuite extends QueryTest with SharedSQLContext {
         verifyFrame(spark.read.text(testFile))
       }
     
    +  test("reading text file with wholetext option on") {
    --- End diff --
    
    Thanks Fred, good point! I have added a test below.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    **[Test build #62308 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62308/consoleFull)** for PR 14151 at commit [`82952e7`](https://github.com/apache/spark/commit/82952e7a1b504f61ec09dc50a31105e5cb405587).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14151#discussion_r137595908
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala ---
    @@ -101,10 +111,24 @@ class TextFileFormat extends TextBasedFileFormat with DataSourceRegister {
         val broadcastedHadoopConf =
           sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf))
     
    +    val wholeText: Boolean = options.getOrElse("wholetext", "false").toBoolean
    +
    +    readToUnsafeMem(broadcastedHadoopConf, requiredSchema,
    +      wholeText)
    --- End diff --
    
    This fits on one line.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    **[Test build #64773 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64773/consoleFull)** for PR 14151 at commit [`8ac37c1`](https://github.com/apache/spark/commit/8ac37c1b774046efe39173e4e8fa91c0feb68f49).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78795/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by ScrapCodes <gi...@git.apache.org>.

Github user ScrapCodes commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Hey @rxin, do you have any further comments?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14151#discussion_r77118505
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala ---
    @@ -99,8 +100,22 @@ class TextFileFormat extends TextBasedFileFormat with DataSourceRegister {
         val broadcastedHadoopConf =
           sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf))
     
    +    val wholeText: Boolean = options.getOrElse("wholetext", "false").toBoolean
    --- End diff --
    
    As we did for CSV and JSON, could you document this new option in `DataFrameReader`?
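
    For readers following the diff, the option lookup can be tried in isolation. The sketch below reuses the expression from the diff on a plain Scala `Map`; the wrapper object `WholeTextOption` is a hypothetical name, and a plain `Map` is an assumption (its key lookup is case-sensitive, which may differ from how Spark passes options).

    ```scala
    object WholeTextOption {
      // Mirrors the lookup in the diff: a missing key falls back to "false".
      // String.toBoolean accepts "true"/"false" in any letter case and throws
      // IllegalArgumentException for any other value.
      def wholeText(options: Map[String, String]): Boolean =
        options.getOrElse("wholetext", "false").toBoolean
    }
    ```

    Assuming the option is exposed as proposed in this pull request, a caller would presumably enable it with something like `spark.read.option("wholetext", "true").text(path)`.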


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    **[Test build #78197 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78197/testReport)** for PR 14151 at commit [`e263b15`](https://github.com/apache/spark/commit/e263b1508a77424b371a0796ea4f9c05bc1c0121).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/78884/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    **[Test build #84602 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84602/testReport)** for PR 14151 at commit [`dd2ed3d`](https://github.com/apache/spark/commit/dd2ed3dce09dfd4c68dffe6fab831469273f7d5e).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    **[Test build #78795 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78795/testReport)** for PR 14151 at commit [`cab3323`](https://github.com/apache/spark/commit/cab33239f3f4be21b7f13678d28cef35030a30f6).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #14151: [SPARK-16496][SQL] Add wholetext as option for re...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14151#discussion_r137600781
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala ---
    @@ -53,6 +57,12 @@ class TextFileFormat extends TextBasedFileFormat with DataSourceRegister {
         }
       }
     
    +  override def isSplitable(sparkSession: SparkSession,
    +      options: Map[String, String], path: Path): Boolean = {
    --- End diff --
    
    ```
      override def isSplitable(
          sparkSession: SparkSession,
          options: Map[String, String],
          path: Path): Boolean = {
    ```
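
    The body of this override is not shown in the diff. As a rough sketch of one way such an override could behave, assuming the intent is to keep a file in one piece when `wholetext` is enabled, the following uses a hypothetical, simplified `SimpleFormat` trait rather than Spark's real `FileFormat` (no `SparkSession`, and a `String` path instead of a Hadoop `Path`):

    ```scala
    object SplitableSketch {
      // Hypothetical, simplified stand-in for a file format; not Spark's FileFormat trait.
      trait SimpleFormat {
        def isSplitable(options: Map[String, String], path: String): Boolean = true
      }

      // Assumed behaviour: refuse to split a file when "wholetext" is enabled,
      // otherwise defer to the parent's decision.
      object WholeTextAwareFormat extends SimpleFormat {
        override def isSplitable(
            options: Map[String, String],
            path: String): Boolean = {
          val wholeText = options.getOrElse("wholetext", "false").toBoolean
          !wholeText && super.isSplitable(options, path)
        }
      }
    }
    ```

    The parameter list follows the one-parameter-per-line layout suggested above.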


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by ScrapCodes <gi...@git.apache.org>.

Github user ScrapCodes commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    @sameeragarwal Do you think this change still makes sense? Can I improve it somehow?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    **[Test build #66161 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66161/consoleFull)** for PR 14151 at commit [`74a5f28`](https://github.com/apache/spark/commit/74a5f285de0314969f7f3959ccdb20dc57fb864a).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #14151: [SPARK-16496][SQL] Add wholetext as option for reading t...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14151
  
    **[Test build #78795 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78795/testReport)** for PR 14151 at commit [`cab3323`](https://github.com/apache/spark/commit/cab33239f3f4be21b7f13678d28cef35030a30f6).
     * This patch **fails due to an unknown error code, -10**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org