Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2017/01/27 02:53:25 UTC

[jira] [Commented] (SPARK-14480) Remove meaningless StringIteratorReader for CSV data source for better performance

    [ https://issues.apache.org/jira/browse/SPARK-14480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15840872#comment-15840872 ] 

Hyukjin Kwon commented on SPARK-14480:
--------------------------------------

The removed `StringIteratorReader` concatenated the lines from each iterator into a single reader per partition, IIRC.
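For illustration, a minimal sketch of what such an Iterator-to-Reader wrapper might look like (the class name and details here are hypothetical; this is not the actual removed Spark code):

```scala
import java.io.Reader

// Hypothetical sketch: a Reader that concatenates the lines of an
// Iterator[String], re-inserting '\n' between lines, similar in spirit
// to the removed StringIteratorReader (not the actual Spark class).
class LineIteratorReader(iter: Iterator[String]) extends Reader {
  private var current: String = ""
  private var pos = 0

  override def read(cbuf: Array[Char], off: Int, len: Int): Int = {
    if (pos >= current.length) {
      if (!iter.hasNext) return -1
      current = iter.next() + "\n" // restore the line separator
      pos = 0
    }
    val n = math.min(len, current.length - pos)
    current.getChars(pos, pos + n, cbuf, off)
    pos += n
    n
  }

  override def close(): Unit = ()
}
```

The extra buffering and position bookkeeping in `read` is exactly the kind of per-byte logic the issue below argues was not worth the cost compared to handing each line to the parser directly.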

Newlines within column values were not supported correctly, to my understanding, because rows can span multiple blocks. This is similar to how multi-line JSON records were not supported before, to my knowledge.

Currently, we have some open PRs for multi-line support, either by using something like `wholeTextFile` for text or by treating each file as a multi-line JSON record. I think we could solve this problem that way if any of them is merged.
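To illustrate why splitting on raw newlines is not enough (a toy sketch, not Spark's code or the open PRs' implementation): a record splitter has to see the whole content and track quoting, which is what a `wholeTextFile`-style approach makes possible.

```scala
// Toy illustration (not Spark's code): split CSV content into records
// while respecting double quotes, so that a newline inside a quoted
// field does not start a new record. This kind of handling needs the
// whole content at once, hence the wholeTextFile-style approach.
def splitRecords(content: String): Seq[String] = {
  val records = scala.collection.mutable.ArrayBuffer.empty[String]
  val sb = new StringBuilder
  var inQuotes = false
  for (c <- content) c match {
    case '"'               => inQuotes = !inQuotes; sb.append(c)
    case '\n' if !inQuotes => records += sb.toString; sb.clear()
    case _                 => sb.append(c)
  }
  if (sb.nonEmpty) records += sb.toString
  records.toSeq
}
```

For example, `splitRecords("a,\"x\ny\"\nb,c")` keeps the quoted `x\ny` inside the first record instead of breaking it in two, which is exactly what a per-line splitter gets wrong.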

I guess we introduced several regressions or behaviour changes during the porting (which I believe were properly communicated to committers ahead of time).

(Actually, _if I remember this correctly_, I mentioned this problem several times to a few committers/PMCs. I can try to find the JIRA or mailing thread if anyone wants to verify this.)

> Remove meaningless StringIteratorReader for CSV data source for better performance
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-14480
>                 URL: https://issues.apache.org/jira/browse/SPARK-14480
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Hyukjin Kwon
>            Assignee: Hyukjin Kwon
>             Fix For: 2.1.0
>
>
> Currently, the CSV data source reads and parses CSV data byte by byte (not line by line).
> In {{CSVParser.scala}}, there is a {{Reader}} wrapping an {{Iterator}}. I think it was made like this for better performance. However, there appear to be two problems.
> Firstly, it was actually not faster than processing line by line with an {{Iterator}}, due to the additional logic needed to wrap the {{Iterator}} in a {{Reader}}.
> Secondly, this brought a bit of complexity because it needs additional logic to allow every line to be read byte by byte. So, it was pretty difficult to figure out parsing issues (e.g. SPARK-14103). Actually, almost all the code in {{CSVParser}} might not be needed.
> I made a rough patch and tested this. The test results for the first problem are below:
> h4. Results
> - Original codes with {{Reader}} wrapping {{Iterator}}
> ||End-to-end (ns)||Parse Time (ns)||
> | 14116265034 | 2008277960 |
> - New codes with {{Iterator}}
> ||End-to-end (ns)||Parse Time (ns)||
> | 13451699644 | 1549050564 |
> In more details,
> h4. Method
> - The TPC-H lineitem table was tested.
> - Only the first 1,000,000 rows were collected.
> - End-to-end tests and parsing-time tests were each performed 10 times, and the averages were calculated.
> h4. Environment
> - Machine: MacBook Pro Retina
> - CPU: 4
> - Memory: 8GB
> h4. Dataset
> - [TPC-H|http://www.tpc.org/tpch/] Lineitem Table created with factor 1 ([generate data|https://github.com/rxin/TPC-H-Hive/tree/master/dbgen])
> - Size: 724.66 MB
> h4. Test Code
> - Function to measure time
> {code}
> def time[A](f: => A) = {
>   val s = System.nanoTime
>   val ret = f
>   println("time: "+(System.nanoTime-s)/1e6+"ms")
>   ret
> }
> {code}
> - End-to-end test
> {code}
> val path = "lineitem.tbl"
> val df = sqlContext
>       .read
>       .format("csv")
>       .option("header", "false")
>       .option("delimiter", "|")
>       .load(path)
> time(df.take(1000000))
> {code}
> - Parsing time test for original (in {{BulkCsvParser}})
> {code}
> ...
> // `reader` is a wrapper for an Iterator.
> private val reader = new StringIteratorReader(iter)
> parser.beginParsing(reader)
> ...
> time(parser.parseNext())
> ...
> {code}
> - Parsing time test for new (in {{BulkCsvParser}})
> {code}
> ...
> time(parser.parseLine(iter.next()))
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org