Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2016/04/09 05:38:25 UTC
[jira] [Assigned] (SPARK-14480) Simplify CSV parsing process with a better performance

     [ https://issues.apache.org/jira/browse/SPARK-14480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-14480:
------------------------------------

    Assignee:     (was: Apache Spark)

> Simplify CSV parsing process with a better performance 
> -------------------------------------------------------
>
>                 Key: SPARK-14480
>                 URL: https://issues.apache.org/jira/browse/SPARK-14480
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Hyukjin Kwon
>
> Currently, the CSV data source reads and parses CSV data byte by byte (not line by line).
> In {{CSVParser.scala}}, there is a {{Reader}} wrapping an {{Iterator}}. I think it was made this way for better performance. However, there appear to be two problems.
> Firstly, it was actually not faster than processing line by line with {{Iterator}}, due to the additional logic needed to wrap the {{Iterator}} in a {{Reader}}.
> Secondly, it added some complexity, because additional logic is needed to allow every line to be read byte by byte. This made parsing issues (e.g. SPARK-14103) quite difficult to track down. In fact, almost all of the code in {{CSVParser}} might not be needed.
> I made a rough patch and tested it. The test results for the first problem are below:
> h4. Results
> - Original code with {{Reader}} wrapping {{Iterator}}
> ||End-to-end (ns)||Parse Time (ns)||
> | 14116265034 | 2008277960 |
> - New code with {{Iterator}}
> ||End-to-end (ns)||Parse Time (ns)||
> | 13451699644 | 1549050564 |
> In more detail:
> h4. Method
> - The TPC-H lineitem table is tested.
> - Only the first 1000000 rows are collected.
> - The end-to-end test and the parsing time test are each performed 10 times, and the average is calculated for each.
> h4. Environment
> - Machine: MacBook Pro Retina
> - CPU: 4
> - Memory: 8GB
> h4. Dataset
> - [TPC-H|http://www.tpc.org/tpch/] lineitem table created with scale factor 1 ([generate data|https://github.com/rxin/TPC-H-Hive/tree/master/dbgen])
> - Size: 724.66 MB
> h4. Test Code
> - Function to measure time
> {code}
> // Evaluates `f` once, prints the elapsed time in milliseconds, and returns the result of `f`.
> def time[A](f: => A) = {
>   val s = System.nanoTime
>   val ret = f
>   println("time: "+(System.nanoTime-s)/1e6+"ms")
>   ret
> }
> {code}
> - End-to-end test
> {code}
> val path = "lineitem.tbl"
> val df = sqlContext
>       .read
>       .format("csv")
>       .option("header", "false")
>       .option("delimiter", "|")
>       .load(path)
> time(df.take(1000000))
> {code}
> - Parsing time test for original (in {{BulkCsvParser}})
> {code}
> ...
> // `reader` is a wrapper for an Iterator.
> private val reader = new StringIteratorReader(iter)
> parser.beginParsing(reader)
> ...
> time(parser.parseNext())
> ...
> {code}
> - Parsing time test for new (in {{BulkCsvParser}})
> {code}
> ...
> time(parser.parseLine(iter.next()))
> ...
> {code}
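
The two parsing paths compared in the description can be illustrated with a small, self-contained sketch. Everything here is hypothetical: {{LineIteratorReader}}, {{fieldsViaReader}} and {{fieldsPerLine}} are illustrative names, not Spark's {{StringIteratorReader}} or its actual parser, and the delimiter handling ignores quoting and escaping. It only shows the extra bookkeeping needed when an {{Iterator}} of lines is exposed as a character stream, compared with consuming each line directly.

```scala
import java.io.Reader
import java.util.regex.Pattern

object CsvPaths {
  // Hypothetical Reader that exposes an Iterator of lines as one character stream.
  // It has to track the current line and the position inside it.
  class LineIteratorReader(lines: Iterator[String]) extends Reader {
    private var current = ""
    private var pos = 0

    override def read(buf: Array[Char], off: Int, len: Int): Int = {
      if (len == 0) return 0
      if (pos >= current.length) {
        if (!lines.hasNext) return -1
        // Re-append the line terminator that the line iterator stripped.
        current = lines.next() + "\n"
        pos = 0
      }
      val n = math.min(len, current.length - pos)
      current.getChars(pos, pos + n, buf, off)
      pos += n
      n
    }

    override def close(): Unit = ()
  }

  // Path 1: rebuild rows and fields one character at a time from the Reader.
  def fieldsViaReader(lines: Iterator[String], delimiter: Char): Vector[Vector[String]] = {
    val reader = new LineIteratorReader(lines)
    val rows = Vector.newBuilder[Vector[String]]
    var row = Vector.newBuilder[String]
    val field = new StringBuilder
    var c = reader.read()
    while (c != -1) {
      val ch = c.toChar
      if (ch == delimiter || ch == '\n') {
        row += field.toString
        field.clear()
        if (ch == '\n') {
          rows += row.result()
          row = Vector.newBuilder[String]
        }
      } else {
        field.append(ch)
      }
      c = reader.read()
    }
    rows.result()
  }

  // Path 2: consume the Iterator directly, one line per record.
  def fieldsPerLine(lines: Iterator[String], delimiter: Char): Vector[Vector[String]] = {
    val sep = Pattern.quote(delimiter.toString)
    // limit = -1 keeps trailing empty fields, matching the Reader-based path.
    lines.map(_.split(sep, -1).toVector).toVector
  }
}
```

For input without quoted fields, both functions return the same rows, but the second needs no state beyond the iterator itself.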

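The method above runs each test 10 times and averages the results, and the tables report nanoseconds, whereas the {{time}} helper prints milliseconds for a single run. A minimal sketch of a harness that does the averaging is shown below. {{Bench}}, {{timeNs}} and {{averageNs}} are assumed names and are not part of the original patch; the sketch also does no JVM warm-up.

```scala
object Bench {
  // Evaluates `f` once and returns the elapsed time in nanoseconds with its result.
  def timeNs[A](f: => A): (Long, A) = {
    val start = System.nanoTime
    val ret = f
    (System.nanoTime - start, ret)
  }

  // Evaluates `f` `runs` times and returns the mean elapsed time in nanoseconds.
  def averageNs[A](runs: Int)(f: => A): Double = {
    require(runs > 0, "runs must be positive")
    val samples = (1 to runs).map(_ => timeNs(f)._1)
    samples.sum.toDouble / runs
  }
}
```

Under these assumptions, the end-to-end measurement could be taken with {{Bench.averageNs(10)(df.take(1000000))}}.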


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
