You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by WeichenXu123 <gi...@git.apache.org> on 2016/05/09 13:57:41 UTC

[GitHub] spark pull request: fix CSV file data-line with newline at first l...

GitHub user WeichenXu123 opened a pull request:

    https://github.com/apache/spark/pull/13007

    fix CSV file data-line with newline at first line load error

    ## What changes were proposed in this pull request?
    
    fix CSV file data-line with newline at first line load error.
    
    
    ## How was this patch tested?
    
    test as here describe;
    https://issues.apache.org/jira/browse/SPARK-15226
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/WeichenXu123/spark fix_firstline_csv

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13007.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13007
    
----
commit 049743be3672e5b04e5b02597add0ff49dd80de5
Author: WeichenXu <we...@outlook.com>
Date:   2016-05-09T15:25:54Z

    fix CSV file data-line with newline at first line load error

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15226][SQL]fix CSV file data-line with ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/13007#issuecomment-218040423
  
    @WeichenXu123 Also, I realised the test in `CSVSuite` is only for data. If we have a test for header, this will fail. I mean some might want a header
    
    ```
    ["r1","r2","\"r"]
    ```
    
    with the data below
    
    ```
    r1,r2,"r
    3",r4,r5
    v1,v2,v3,v4,v5
    ```
    
    while others might want a header below
    
    ```
    ["r1","r2","r3","r4","r5"]
    ```



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15226][SQL]fix CSV file data-line with ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13007#issuecomment-217872808
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15226][SQL]fix CSV file data-line with ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/13007#issuecomment-218018326
  
    In addition, I think a test is needed for this as well in `CSVSuite`. Also it would be nicer if we don't comment the original code. I blieve current change breaks the existimg tests.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15226][SQL]fix CSV file data-line with ...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on the pull request:

    https://github.com/apache/spark/pull/13007#issuecomment-218037385
  
    @HyukjinKwon I run existing test against this patch and all pass. If need I will add a new test in CSVSuit.
    And I think the only reason cause the bug is reading first line handling in the code. Because a row with multi-line in csv format is legal and the current csv datasource code using LIB `com.univocity.parsers.csv` also support it. Only when a data row with multi-line appear at file beginning will cause bug and I think this patch can fix it easily.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15226][SQL]fix CSV file data-line with ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13007#discussion_r62587043
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/DefaultSource.scala ---
    @@ -63,7 +63,9 @@ class DefaultSource extends FileFormat with DataSourceRegister {
         val header = if (csvOptions.headerFlag) {
           firstRow
         } else {
    -      firstRow.zipWithIndex.map { case (value, index) => s"C$index" }
    +      // firstRow.zipWithIndex.map { case (value, index) => s"C$index" }
    +      new BulkCsvReader(rdd.toLocalIterator, csvOptions, null).next()
    +        .zipWithIndex.map { case (value, index) => s"C$index" }
    --- End diff --
    
    Also, I think this is related with [`UnescspedQuoteHandling`](https://github.com/apache/spark/blob/ac12b35d31ef1d1663511bf6ae826a9cc0278f20/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVParser.scala#L50).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15226][SQL]fix CSV file data-line with ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/13007#issuecomment-218014048
  
    I think if we really need to solve this problem, we need a option for unescaped quote handling.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15226][SQL]fix CSV file data-line with ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/13007#issuecomment-218039838
  
    @WeichenXu123 [External CSV data source](https://github.com/databricks/spark-csv) supports this but has an issue for parsing unescaped quotes, here, https://issues.apache.org/jira/browse/SPARK-14103.
    
    In this JIRA, I introduced the usage of `UnescapedQuoteHandling` to deal with the problem. So, if we need to support the original behaviour like the external CSV data source, we need an option to deal with the unescaped quotes.
    
    Personally, I think we should not allow CSV parsing across multiple lines. CSV data source currently uses `TextInputFormat` which reads the data line by line. So, a record (across multiple lines) would mean a record across multiple HDFS blocks, which will end up with failing to read correctly. 
    
    So, I think we should not support this feature for now until we have a clear solution.
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15226][SQL]fix CSV file data-line with ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the pull request:

    https://github.com/apache/spark/pull/13007#issuecomment-218097561
  
    @WeichenXu123 Currently it is possible to support multiple lines because the lines read from `LineRecordReader` becomes `Reader` (a byte stream) by `StringIteratorReader` which converts `Iterator` to `Reader` with manually inserted `\n`.
    
    But supporting multiple lines has potential problems I said above and this is one of the reasons why I opened a PR here, https://github.com/apache/spark/pull/12268.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15226][SQL]fix CSV file data-line with ...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on the pull request:

    https://github.com/apache/spark/pull/13007#issuecomment-218094762
  
    @HyukjinKwon 
    En..current cvs load code use Hadoop `LineRecordReader`, so not allow a row split into mulit-lines, so I think the code should disable csv multi-line format, or should replace `LineRecordReader` with csv multiline supported reader.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15226][SQL]fix CSV file data-line with ...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 closed the pull request at:

    https://github.com/apache/spark/pull/13007


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-15226][SQL]fix CSV file data-line with ...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13007#discussion_r62586730
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/DefaultSource.scala ---
    @@ -63,7 +63,9 @@ class DefaultSource extends FileFormat with DataSourceRegister {
         val header = if (csvOptions.headerFlag) {
           firstRow
         } else {
    -      firstRow.zipWithIndex.map { case (value, index) => s"C$index" }
    +      // firstRow.zipWithIndex.map { case (value, index) => s"C$index" }
    +      new BulkCsvReader(rdd.toLocalIterator, csvOptions, null).next()
    +        .zipWithIndex.map { case (value, index) => s"C$index" }
    --- End diff --
    
    For the current implementation of CSV datasource, I believe we better use `LineCsvReader` since it will reads a single line (including `\n` character).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org