
Posted to dev@commons.apache.org by Henri Yandell <fl...@gmail.com> on 2006/01/15 06:31:01 UTC

[csv] feature and performance analysis

Spent a little time over the last week doing both performance and
feature set analysis of the 5 open-source CSV libraries that I'm aware
of.

First up, feature sets:

http://people.apache.org/~bayard/commons-csv/csv-features.xhtml

Secondly, performance. The code/data is sitting in:

http://people.apache.org/~bayard/commons-csv/csv-perf/

Run under Java 1.5 because the Ostermiller library is compiled for 1.5 [grumble].

Results

http://people.apache.org/~bayard/commons-csv/csv-perf/results.csv

Take these with a grain of salt; they aren't quite the pattern I was
seeing when running a few days ago on the plane. Ideally the benchmark
needs to run multiple times and take the median or some such.
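
A rough sketch of the "run multiple times and take the median" idea, in
Java. MedianTimer, medianNanos and median are made-up names for
illustration and are not part of the csv-perf code linked above. It also
does no JVM warm-up, so treat any numbers it gives as indicative only.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of the csv-perf code.
final class MedianTimer {

    // Runs the task `runs` times and returns the median elapsed time in nanoseconds.
    static long medianNanos(Runnable task, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            samples[i] = System.nanoTime() - start;
        }
        return median(samples);
    }

    // Median of the samples; for an even count, the midpoint of the two middle values.
    static long median(long[] samples) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        if (sorted.length % 2 == 1) {
            return sorted[mid];
        }
        return sorted[mid - 1] + (sorted[mid] - sorted[mid - 1]) / 2;
    }
}
```

Each library's parse or print call would be wrapped in a Runnable and
passed to medianNanos; the median is less sensitive than the mean to the
occasional slow run caused by GC or other activity on the machine.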

Generally, Ostermiller is fastest for parsing, with Skife and then Open
a chunk behind, GJ a little behind them, and Commons lagging by a lot.

On printing, Open edges out Skife, GJ is a bit behind, and Commons and
then Ostermiller lag a lot.

-=-=-=-=-=-=-=-=-=-=-=-=-=-

What did I learn from this?

* Lots of features out there; no library contains all of them. I don't
think that many of them are mutually exclusive.

* Ostermiller's parser is very quick. Possibly because he's built on
top of JFlex? The printer is very, very slow.

* The current Commons parser is very slow. As is the printer.

* I also did some bug checking. Given that I only had 7 quick lines of
pain, I don't think any parser managed to parse them with much
success: http://people.apache.org/~bayard/commons-csv/csv-perf/dependability.csv

Mostly, that despite the odd looks I get when I mention having a
commons-csv (people think they're dumb simple things), we all have
lots of room for improvement.

-----

So what next? Are the poor performance stats for Commons-CSV a worry?
Are they offset enough by having more features? Should we look into a
lexical tool approach as it seems to work for Ostermiller?

Class-wise, I'd like to see something like:

Csv         (instead of using String[][] or List of String[])
CsvPrinter
CsvParser
CsvException
CsvStrategy  (used by both printer and parser)
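
To make the list above a bit more concrete, here is one possible shape
for those classes in Java. Only the class names come from the list;
every field, constructor and method signature below is an assumption
made for illustration, and the parser is left as an interface.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Assumed shape: settings shared by both the parser and the printer.
final class CsvStrategy {
    final char delimiter;
    final char quote;

    CsvStrategy(char delimiter, char quote) {
        this.delimiter = delimiter;
        this.quote = quote;
    }

    static final CsvStrategy DEFAULT = new CsvStrategy(',', '"');
}

// Assumed shape: a checked exception for malformed input.
class CsvException extends Exception {
    CsvException(String message) {
        super(message);
    }
}

// Assumed shape: an in-memory table used instead of String[][] or a List of String[].
final class Csv {
    private final List<List<String>> rows = new ArrayList<List<String>>();

    void addRow(List<String> row) {
        rows.add(Collections.unmodifiableList(new ArrayList<String>(row)));
    }

    List<List<String>> rows() {
        return Collections.unmodifiableList(rows);
    }
}

// Assumed shape: the parser only needs to agree with the printer on the strategy.
interface CsvParser {
    Csv parse(String text, CsvStrategy strategy) throws CsvException;
}

// Minimal printer: quotes a field only when it contains the delimiter,
// the quote character, or a line break, and doubles embedded quotes.
final class CsvPrinter {
    private final CsvStrategy strategy;

    CsvPrinter(CsvStrategy strategy) {
        this.strategy = strategy;
    }

    String print(Csv csv) {
        StringBuilder out = new StringBuilder();
        for (List<String> row : csv.rows()) {
            for (int i = 0; i < row.size(); i++) {
                if (i > 0) {
                    out.append(strategy.delimiter);
                }
                out.append(escape(row.get(i)));
            }
            out.append('\n');
        }
        return out.toString();
    }

    private String escape(String field) {
        boolean needsQuotes = field.indexOf(strategy.delimiter) >= 0
                || field.indexOf(strategy.quote) >= 0
                || field.indexOf('\n') >= 0
                || field.indexOf('\r') >= 0;
        if (!needsQuotes) {
            return field;
        }
        String q = String.valueOf(strategy.quote);
        return q + field.replace(q, q + q) + q;
    }
}
```

Keeping the delimiter and quote character in CsvStrategy means a file
written by the printer can be read back by a parser given the same
strategy object, which is the point of sharing it between the two.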

I don't see any reason to not want every feature in the feature file.

That's it for the night. If you're not on commons-dev, mail
commons-dev-subscribe@jakarta.apache.org to join in. Cc's are unlikely
to last too long on a thread. Make sure you keep the [csv] prefix on
replies and future emails; it's a useful way of separating the
components out.

Hen

---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org

Re: [csv] feature and performance analysis

Posted by Stefan Rufer <st...@netcetera.ch>.

Nice compilation!

Just to let you know: I will try to reserve some Netcetera time to work on
commons-csv. Unfortunately the project schedule looks "bad" for the next
few months, but let's see what we can do.

cu
Stefan

On Sun, 15 Jan 2006, Henri Yandell wrote:

> From: Henri Yandell <fl...@gmail.com>
> To: Jakarta Commons Developers List <co...@jakarta.apache.org>
> Cc: Sean C. Sullivan <se...@seansullivan.com>,
>     Steven Caswell <st...@gmail.com>,
>     Brian McCallister <br...@apache.org>, Glen Smith <gl...@bytecode.com.au>,
>     Stefan Rufer <st...@netcetera.ch>,
>     Urs Hardegger <ur...@netcetera.ch>
> Subject: [csv] feature and performance analysis
> Date: Sun, 15 Jan 2006 00:31:01 -0500

--
Stefan Rufer | stefan.rufer@netcetera.ch
phone +41 (0)44 247 79 92 | fax +41 (0)44 247 70 75
Netcetera AG | 8040 Zürich | Switzerland | http://netcetera.ch

Re: [csv] feature and performance analysis

Posted by Rory Winston <rw...@eircom.net>.

Some good points there. A CSV lexer-based implementation may be a good 
approach. There are a couple of helpful pointers/references here:

http://www.ricebridge.com/products/csvman/reference.htm
http://www.boyet.com/Articles/CsvParser.html
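
As a small illustration of the lexer idea, here is a hand-written state
machine in Java that splits one CSV line into fields. It is not JFlex
output and not taken from any of the libraries discussed; CsvLineLexer
and split are made-up names, and it assumes a comma delimiter, double
quotes for encapsulation, a doubled quote as the escape, and no line
breaks inside fields.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical single-line CSV lexer, written as an explicit state machine.
final class CsvLineLexer {

    private enum State { FIELD_START, UNQUOTED, QUOTED, QUOTE_IN_QUOTED }

    static List<String> split(String line) {
        List<String> fields = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        State state = State.FIELD_START;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            switch (state) {
                case FIELD_START:
                    if (c == '"') {
                        state = State.QUOTED;
                    } else if (c == ',') {
                        fields.add("");            // empty field
                    } else {
                        current.append(c);
                        state = State.UNQUOTED;
                    }
                    break;
                case UNQUOTED:
                    if (c == ',') {
                        fields.add(current.toString());
                        current.setLength(0);
                        state = State.FIELD_START;
                    } else {
                        current.append(c);
                    }
                    break;
                case QUOTED:
                    if (c == '"') {
                        state = State.QUOTE_IN_QUOTED; // closing quote or first half of ""
                    } else {
                        current.append(c);
                    }
                    break;
                case QUOTE_IN_QUOTED:
                    if (c == '"') {
                        current.append('"');       // "" inside quotes is a literal quote
                        state = State.QUOTED;
                    } else if (c == ',') {
                        fields.add(current.toString());
                        current.setLength(0);
                        state = State.FIELD_START;
                    } else {
                        current.append(c);         // lenient: keep text after a closing quote
                        state = State.UNQUOTED;
                    }
                    break;
            }
        }
        if (state == State.QUOTED) {
            throw new IllegalArgumentException("unterminated quoted field");
        }
        fields.add(current.toString());
        return fields;
    }
}
```

For example, split("a,\"x,y\",b") yields the three fields a, x,y and b.
A generated lexer would express the same transitions as rules rather
than a switch, but the states involved are much the same.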

Did you run the Commons::CSV component through a profiling process?

Rory


