Posted to dev@commons.apache.org by Henri Yandell <fl...@gmail.com> on 2006/01/15 06:31:01 UTC
[csv] feature and performance analysis
Spent a little time over the last week doing both performance and
feature set analysis of the 5 open-source CSV libraries that I'm aware
of.
First up, feature sets:
http://people.apache.org/~bayard/commons-csv/csv-features.xhtml
Secondly, performance. The code/data is sitting in:
http://people.apache.org/~bayard/commons-csv/csv-perf/
Run under Java 1.5 because the Ostermiller library is compiled for 1.5 [grumble].
Results
http://people.apache.org/~bayard/commons-csv/csv-perf/results.csv
Take these with a grain of salt; they aren't quite the pattern I was
seeing when running a few days ago on the plane. Ideally this needs to
be run multiple times, taking the median or some such.
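For the multiple-runs-and-median idea, a minimal sketch in Java might look
like the following. It is not the harness sitting in csv-perf/; the class
name MedianTimer, the warm-up count and the placeholder workload are all
assumptions made for illustration, and it needs Java 8 or later for the
lambda.

```java
import java.util.Arrays;

final class MedianTimer {

    /** Median of the samples; for an even count, the midpoint of the two middle values. */
    static long median(long[] samples) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        if (sorted.length % 2 == 1) {
            return sorted[mid];
        }
        return sorted[mid - 1] + (sorted[mid] - sorted[mid - 1]) / 2;
    }

    /** Runs the task untimed a few times (warm-up), then times `runs` executions. */
    static long medianNanos(Runnable task, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            task.run();
        }
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            samples[i] = System.nanoTime() - start;
        }
        return median(samples);
    }

    public static void main(String[] args) {
        // Placeholder workload (assumption); swap in the parser or printer under test.
        Runnable work = () -> "a,b,c".split(",");
        System.out.println("median ns: " + medianNanos(work, 5, 11));
    }
}
```

The median is less sensitive than the mean to a single slow run (GC pause,
JIT compilation), which is the kind of noise that can make one-off numbers
shift between runs.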
Generally, Ostermiller is fastest for parsing, with Skife and then Open
a chunk behind. GJ is a little behind them, and Commons lags by a lot.
On printing, Open edges out Skife, GJ is a bit behind, and Commons and
then Ostermiller lag by a lot.
-=-=-=-=-=-=-=-=-=-=-=-=-=-
What did I learn from this?
* Lots of features out there; no library contains all of them. I don't
think many of them are mutually exclusive.
* Ostermiller's parser is very quick. Possibly because he's built on
top of JFlex? The printer is very, very slow.
* The current Commons parser is very slow. As is the printer.
* I also did some bug checking. Even with only 7 quick lines of
pain, I don't think any parser managed to parse them with much
success: http://people.apache.org/~bayard/commons-csv/csv-perf/dependability.csv
Mostly, that despite the odd looks I get when I mention having a
commons-csv (people think CSV libraries are dumb, simple things), we
all have lots of room for improvement.
-----
So what next? Are the poor performance stats for Commons-CSV a worry?
Are they offset enough by having more features? Should we look into a
lexical tool approach as it seems to work for Ostermiller?
Class-wise, I'd like to see something like:
Csv (instead of using String[][] or List of String[])
CsvPrinter
CsvParser
CsvException
CsvStrategy (used by both printer and parser)
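To make the shape of that list a little more concrete, here is a rough
sketch in Java of a strategy object shared by a printer. Only the class
names come from the list above; the fields, the constructor and the
printLine method are hypothetical, and this is not the existing Commons
code. A CsvParser would be handed the same CsvStrategy.

```java
/** Hypothetical: the dialect settings that printer and parser would share. */
final class CsvStrategy {
    static final CsvStrategy DEFAULT = new CsvStrategy(',', '"');

    final char delimiter;
    final char quote;

    CsvStrategy(char delimiter, char quote) {
        this.delimiter = delimiter;
        this.quote = quote;
    }
}

/** Hypothetical: formats one record according to a CsvStrategy. */
final class CsvPrinter {
    private final CsvStrategy strategy;

    CsvPrinter(CsvStrategy strategy) {
        this.strategy = strategy;
    }

    String printLine(String[] fields) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                out.append(strategy.delimiter);
            }
            out.append(escape(fields[i]));
        }
        return out.toString();
    }

    /** Quotes a field only if it holds the delimiter, the quote char or a line break. */
    private String escape(String field) {
        boolean needsQuotes = field.indexOf(strategy.delimiter) >= 0
                || field.indexOf(strategy.quote) >= 0
                || field.indexOf('\n') >= 0
                || field.indexOf('\r') >= 0;
        if (!needsQuotes) {
            return field;
        }
        String q = String.valueOf(strategy.quote);
        // Embedded quote characters are escaped by doubling them.
        return q + field.replace(q, q + q) + q;
    }
}
```

Something like new CsvPrinter(new CsvStrategy(';', '"')) would then switch
the dialect without touching the printer itself.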
I don't see any reason to not want every feature in the feature file.
That's it for the night. If you're not on commons-dev, mail
commons-dev-subscribe@jakarta.apache.org to join in. Cc's are unlikely
to last too long on a thread. Make sure you keep the [csv] prefix on
replies and future emails; it's a useful way of separating the
components out.
Hen
---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org
Re: [csv] feature and performance analysis
Posted by Stefan Rufer <st...@netcetera.ch>.
Nice compilation!
Just to let you know: I will try to reserve some Netcetera time to work on
commons-csv. Unfortunately the project schedule looks "bad" for the next
months, but let's see what we can do.
cu
Stefan
On Sun, 15 Jan 2006, Henri Yandell wrote:
> From: Henri Yandell <fl...@gmail.com>
> To: Jakarta Commons Developers List <co...@jakarta.apache.org>
> Cc: Sean C. Sullivan <se...@seansullivan.com>,
> Steven Caswell <st...@gmail.com>,
> Brian McCallister <br...@apache.org>, Glen Smith <gl...@bytecode.com.au>,
> Stefan Rufer <st...@netcetera.ch>,
> Urs Hardegger <ur...@netcetera.ch>
> Subject: [csv] feature and performance analysis
> Date: Sun, 15 Jan 2006 00:31:01 -0500
--
Stefan Rufer | stefan.rufer@netcetera.ch
phone +41 (0)44 247 79 92 | fax +41 (0)44 247 70 75
Netcetera AG | 8040 Zürich | Switzerland | http://netcetera.ch
Re: [csv] feature and performance analysis
Posted by Rory Winston <rw...@eircom.net>.
Some good points there. A CSV lexer-based implementation may be a good
approach. There are a couple of helpful pointers/references here:
http://www.ricebridge.com/products/csvman/reference.htm
http://www.boyet.com/Articles/CsvParser.html
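As a rough illustration of what a lexer-style parser does, here is a small
hand-written state machine in Java. It is neither JFlex output nor the code
of any of the libraries discussed; the class name CsvLexerSketch, the fixed
comma and double-quote characters, and the lenient handling of stray quotes
and carriage returns are assumptions for the sake of the example.

```java
import java.util.ArrayList;
import java.util.List;

final class CsvLexerSketch {

    private enum State { FIELD_START, UNQUOTED, QUOTED, QUOTE_IN_QUOTED }

    /** Splits the input into records and fields, one character at a time. */
    static List<List<String>> parse(String input) {
        List<List<String>> rows = new ArrayList<>();
        List<String> row = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        State state = State.FIELD_START;

        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (state) {
                case FIELD_START:
                case UNQUOTED:
                    if (c == '"' && state == State.FIELD_START) {
                        state = State.QUOTED;            // opening quote
                    } else if (c == ',') {
                        row.add(field.toString());       // end of field
                        field.setLength(0);
                        state = State.FIELD_START;
                    } else if (c == '\n') {
                        row.add(field.toString());       // end of record
                        rows.add(row);
                        row = new ArrayList<>();
                        field.setLength(0);
                        state = State.FIELD_START;
                    } else if (c == '\r') {
                        // Dropped outside quotes, so CRLF behaves like LF.
                    } else {
                        field.append(c);
                        state = State.UNQUOTED;
                    }
                    break;
                case QUOTED:
                    if (c == '"') {
                        state = State.QUOTE_IN_QUOTED;   // closing quote or first half of ""
                    } else {
                        field.append(c);                 // delimiters and newlines are literal here
                    }
                    break;
                case QUOTE_IN_QUOTED:
                    if (c == '"') {
                        field.append('"');               // "" is an escaped quote
                        state = State.QUOTED;
                    } else {
                        state = State.UNQUOTED;          // it was the closing quote
                        i--;                             // look at this character again
                    }
                    break;
            }
        }
        // Flush the last record when the input does not end with a newline.
        if (state != State.FIELD_START || field.length() > 0 || !row.isEmpty()) {
            row.add(field.toString());
            rows.add(row);
        }
        return rows;
    }
}
```

A generated lexer would express the same states as rules in a grammar file
instead of a switch; whether that accounts for the speed difference is
something only profiling would show.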
Did you run the Commons::CSV component through a profiling process?
Rory