Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2016/04/08 07:37:25 UTC

[jira] [Created] (SPARK-14480) Simplify CSV parsing process with a better performance

Hyukjin Kwon created SPARK-14480:
------------------------------------

             Summary: Simplify CSV parsing process with a better performance 
                 Key: SPARK-14480
                 URL: https://issues.apache.org/jira/browse/SPARK-14480
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.0.0
            Reporter: Hyukjin Kwon


Currently, the CSV data source reads and parses CSV data byte by byte (not line by line).

In {{CSVParser.scala}}, there is a {{Reader}} wrapping an {{Iterator}}. I think it was made like this for better performance. However, it looks like there are two problems.
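
For illustration, a {{Reader}} backed by an {{Iterator[String]}} needs roughly this kind of buffering logic. This is only a minimal sketch, not the actual implementation in {{CSVParser.scala}}:

{code}
import java.io.Reader

// Minimal sketch (not Spark's actual code) of a Reader over an Iterator[String].
// It has to buffer the current line and re-insert line terminators so the
// downstream parser can consume the input character by character.
class IteratorReader(iter: Iterator[String]) extends Reader {
  private var buffer: String = ""
  private var pos: Int = 0

  override def read(cbuf: Array[Char], off: Int, len: Int): Int = {
    // Refill the buffer from the iterator when the current line is exhausted.
    while (pos >= buffer.length && iter.hasNext) {
      buffer = iter.next() + "\n"
      pos = 0
    }
    if (pos >= buffer.length) {
      -1  // end of input
    } else {
      val n = math.min(len, buffer.length - pos)
      buffer.getChars(pos, pos + n, cbuf, off)
      pos += n
      n
    }
  }

  override def close(): Unit = ()
}
{code}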

Firstly, it was actually not faster than processing line by line with an {{Iterator}}, due to the additional logic needed to wrap the {{Iterator}} in a {{Reader}}.

Secondly, this brought a bit of complexity because it needs additional logic to allow every line to be read byte by byte. So, it was pretty difficult to figure out parsing issues (e.g. SPARK-14103). Actually, almost all of the code in {{CSVParser}} might not be needed.
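
For comparison, the simplified approach just parses each element of the {{Iterator}} directly. A rough sketch with univocity-parsers (the names here are illustrative, not the exact patch):

{code}
import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}

// Rough sketch of line-by-line parsing, with no Reader wrapper in between.
val settings = new CsvParserSettings()
settings.getFormat.setDelimiter('|')
val parser = new CsvParser(settings)

// `lines` stands for an Iterator[String] over the input file.
def parseLines(lines: Iterator[String]): Iterator[Array[String]] =
  lines.map(line => parser.parseLine(line))
{code}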

I made a rough patch and tested this. The test results for the first problem are below:

h4. Results

- Original code with a {{Reader}} wrapping an {{Iterator}}

||End-to-end (ns)||Parse Time (ns)||
| 14116265034 | 2008277960 |

- New code with an {{Iterator}}

||End-to-end (ns)||Parse Time (ns)||
| 13451699644 | 1549050564 |

In more detail:

h4. Method

- The TPC-H {{lineitem}} table is used for the test.
- Only the first 1,000,000 rows are collected due to limited resources.
- End-to-end and parsing-time tests are each run 10 times and the averages are reported (see the averaging helper sketch under Test Code below).

h4. Environment

- Machine: MacBook Pro Retina
- CPU: 4
- Memory: 8GB


h4. Dataset

- [TPC-H|http://www.tpc.org/tpch/] Lineitem table created with scale factor 1 ([generate data|https://github.com/rxin/TPC-H-Hive/tree/master/dbgen])
- Size: 724.66 MB

h4. Test Code

- Function to measure time
{code}
def time[A](f: => A) = {
  val s = System.nanoTime
  val ret = f
  // Print the elapsed wall-clock time in milliseconds and return the result.
  println("time: " + (System.nanoTime - s) / 1e6 + "ms")
  ret
}
{code}
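
- Averaging helper (a sketch only; the reported numbers are averages over 10 runs, but this is not necessarily the exact harness used)

{code}
// Hypothetical helper: run the measured block `runs` times and return the
// average elapsed wall-clock time in nanoseconds.
def averageTime[A](runs: Int)(f: => A): Double = {
  val elapsed = (1 to runs).map { _ =>
    val s = System.nanoTime
    f
    System.nanoTime - s
  }
  elapsed.sum.toDouble / runs
}

// e.g. averageTime(10)(df.take(1000000))
{code}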

- End-to-end test

{code}
val path = "lineitem.tbl"
val df = sqlContext
      .read
      .format("csv")
      .option("header", "false")
      .option("delimiter", "|")
      .load(path)
time(df.take(1000000))
{code}

- Parsing time test for the original code (in {{BulkCsvParser}})

{code}
...
time(parser.parseNext())
...
{code}


- Parsing time test for the new code (in {{BulkCsvParser}})

{code}
...
time(parser.parseLine(filteredIter.next()))
...
{code}



