Posted to dev@commons.apache.org by Emmanuel Bourg <eb...@apache.org> on 2012/03/11 15:05:00 UTC
[csv] Performance comparison
Hi,
I compared the performance of Commons CSV with the other CSV parsers
available. I took the world cities file from Maxmind as a test file [1];
it's a big 130 MB file with 2.8 million records.
Here are the results obtained on a Core 2 Duo E8400 after several
iterations to let the JIT compiler kick in:
Direct read 750 ms
Java CSV 3328 ms
Super CSV 3562 ms (+7%)
OpenCSV 3609 ms (+8.4%)
GenJava CSV 3844 ms (+15.5%)
Commons CSV 4656 ms (+39.9%)
Skife CSV 4813 ms (+44.6%)
I also tried Nuiton CSV and Esperio CSV but I couldn't figure out how to
use them.
I haven't analyzed why Commons CSV is slower yet, but there seems to be
room for improvement. The memory usage will have to be compared too;
I'm looking for a way to measure it.
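One coarse way to estimate memory usage is to diff used-heap snapshots around the workload. This is only a rough sketch under stated assumptions (the class name and the array stand-in are hypothetical; `System.gc()` is just a hint, so results are approximate), not what was used for the numbers above:

```java
// Coarse sketch: estimate heap held by a workload by differencing
// Runtime snapshots before and after. Treat the numbers as approximate.
public class MemoryEstimate {

    static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        System.gc(); // best-effort hint; the JVM may ignore it
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        long before = usedHeap();
        int[] workload = new int[1_000_000]; // stand-in for parsing the CSV file
        long after = usedHeap();
        System.out.println("Approx. extra bytes held: " + (after - before));
        System.out.println("Workload length: " + workload.length); // keep it reachable
    }
}
```

A heap profiler gives far more reliable numbers, but this is enough for a first-order comparison between parsers.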
Emmanuel Bourg
[1] http://www.maxmind.com/download/worldcities/worldcitiespop.txt.gz
Re: [csv] Performance comparison
Posted by Emmanuel Bourg <eb...@apache.org>.
On 12/03/2012 00:02, Benedikt Ritter wrote:
> I've started to dig my way through the source. I haven't done much
> performance measuring in my career yet. I would use VisualVM for
> profiling, unless you know something better.
Usually I work with JProfiler; it identifies the "hotspots" pretty well,
but I'm not sure it will produce relevant results on the complex
methods of CSVLexer.
> And how about some performance JUnit tests? They may not be as
> accurate as a profiler, but they can give you a feeling for whether
> you are on the right track.
I wrote a quick test locally, but that's not clean enough to be
committed. It looks like this:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

import junit.framework.TestCase;

import org.apache.commons.csv.CSVFormat;

public class PerformanceTest extends TestCase {

    private int max = 10;

    private BufferedReader getReader() throws IOException {
        return new BufferedReader(new FileReader("worldcitiespop.txt"));
    }

    public void testReadBigFile() throws Exception {
        for (int i = 0; i < max; i++) {
            BufferedReader in = getReader();
            long t0 = System.currentTimeMillis();
            int count = readAll(in);
            in.close();
            System.out.println("File read in "
                    + (System.currentTimeMillis() - t0) + "ms " + count + " lines");
        }
        System.out.println();
    }

    private int readAll(BufferedReader in) throws IOException {
        int count = 0;
        while (in.readLine() != null) {
            count++;
        }
        return count;
    }

    public void testParseBigFile() throws Exception {
        for (int i = 0; i < max; i++) {
            long t0 = System.currentTimeMillis();
            int count = parseCommonsCSV(getReader());
            System.out.println("File parsed in "
                    + (System.currentTimeMillis() - t0) + "ms with Commons CSV "
                    + count + " lines");
        }
        System.out.println();
    }

    private int parseCommonsCSV(Reader in) throws IOException {
        CSVFormat format = CSVFormat.DEFAULT.withSurroundingSpacesIgnored(false);
        int count = 0;
        for (String[] record : format.parse(in)) {
            count++;
        }
        return count;
    }
}
Emmanuel Bourg
Re: [csv] Performance comparison
Posted by Benedikt Ritter <be...@googlemail.com>.
On 11 March 2012 21:21, Emmanuel Bourg <eb...@apache.org> wrote:
> On 11/03/2012 16:53, Benedikt Ritter wrote:
>
>
>> I have some spare time to help you with this. I'll check out the
>> latest source tonight. Any suggestion where to start?
>
>
> Hi Benedikt, thank you for helping. You can start by looking at the source of
> CSVParser to see if anything catches your eye, and then run a profiler to try
> to identify the performance-critical parts that could be improved.
>
Hi Emmanuel,
I've started to dig my way through the source. I haven't done much
performance measuring in my career yet. I would use VisualVM for
profiling, unless you know something better.
And how about some performance JUnit tests? They may not be as
accurate as a profiler, but they can give you a feeling for whether
you are on the right track.
Benedikt
> Emmanuel Bourg
>
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org
Re: [csv] Performance comparison
Posted by Emmanuel Bourg <eb...@apache.org>.
On 11/03/2012 16:53, Benedikt Ritter wrote:
> I have some spare time to help you with this. I'll check out the
> latest source tonight. Any suggestion where to start?
Hi Benedikt, thank you for helping. You can start by looking at the
source of CSVParser to see if anything catches your eye, and then run a
profiler to try to identify the performance-critical parts that could be
improved.
Emmanuel Bourg
Re: [csv] Performance comparison
Posted by Benedikt Ritter <be...@googlemail.com>.
Hey Emmanuel,
I have some spare time to help you with this. I'll check out the
latest source tonight. Any suggestion where to start?
Regards,
Benedikt
Re: [csv] Performance comparison
Posted by Gary Gregory <ga...@gmail.com>.
On Mar 12, 2012, at 20:25, sebb <se...@gmail.com> wrote:
> On 13 March 2012 00:12, Emmanuel Bourg <eb...@apache.org> wrote:
>> I kept tickling ExtendedBufferedReader and I have some interesting results.
>>
>> First I tried to simplify it by extending java.io.LineNumberReader instead
>> of BufferedReader. The performance decreased by 20%, probably because the
>> class is synchronized internally.
>>
>> But wait, isn't BufferedReader also synchronized? I copied the code of
>> BufferedReader and removed the synchronized blocks. Now the time to parse
>> the file is down to 2652 ms, 28% faster than previously!
>>
>> Of course the code of BufferedReader can't be copied from the JDK due to the
>> license mismatch, so I took the version from Harmony. On my test it is about
>> 4% faster than the JDK counterpart, and the parsing time is now around 2553
>> ms.
>
> I'm concerned that the CSV code may grow and grow with private
> versions of code that could be provided by the JDK.
>
> By all means make sure the code is efficient in the way it uses the
> JDK classes, but I don't think we should be recoding standard classes.
+1
Gary
>
>> Now Commons CSV can start claiming to be the fastest CSV parser around :)
>>
>> Emmanuel Bourg
>>
>>
>> On 12/03/2012 11:31, Emmanuel Bourg wrote:
>>
>>> I have identified the performance killer: the ExtendedBufferedReader.
>>> It implements complex logic to fetch one character ahead, but this
>>> extra character is rarely used. I have implemented a simpler lookahead
>>> using mark/reset, as suggested by Bob Smith in CSV-42, and the
>>> performance improved by 30%.
>>>
>>> Now the parsing is down to 3406 ms, and that's almost without touching
>>> the parser yet.
>>>
>>> Emmanuel Bourg
>>>
>
Re: [csv] Performance comparison
Posted by Gary Gregory <ga...@gmail.com>.
On Mar 12, 2012, at 20:30, Emmanuel Bourg <eb...@apache.org> wrote:
> On 13/03/2012 01:25, sebb wrote:
>
>> I'm concerned that the CSV code may grow and grow with private
>> versions of code that could be provided by the JDK.
>>
>> By all means make sure the code is efficient in the way it uses the
>> JDK classes, but I don't think we should be recoding standard classes.
>
> I agree such a class should not live in [csv], but maybe in [io]?
That would be better but we need to think twice before adding code.
Gary
>
> Emmanuel Bourg
>
Re: [csv] Performance comparison
Posted by sebb <se...@gmail.com>.
On 13 March 2012 09:01, Emmanuel Bourg <eb...@apache.org> wrote:
> On 13/03/2012 01:44, sebb wrote:
>
>
>> I don't think we should be trying to recode JDK classes.
>
>
> I'd rather not, but we have done that in the past. FastDateFormat and
> StrBuilder come to mind.
And now Java has StringBuilder, which means StrBuilder is perhaps no
longer necessary...
> Emmanuel Bourg
>
Re: [csv] Performance comparison
Posted by Emmanuel Bourg <eb...@apache.org>.
On 13/03/2012 01:44, sebb wrote:
> I don't think we should be trying to recode JDK classes.
I'd rather not, but we have done that in the past. FastDateFormat and
StrBuilder come to mind.
Emmanuel Bourg
Re: [csv] Performance comparison
Posted by Christian Grobmeier <gr...@gmail.com>.
On Tue, Mar 13, 2012 at 4:33 AM, Ralph Goers <ra...@dslextreme.com> wrote:
>> I don't think we should be trying to recode JDK classes.
>
> If the implementations suck, why not?
+1
--
http://www.grobmeier.de
https://www.timeandbill.de
Re: [csv] Performance comparison
Posted by Ralph Goers <ra...@dslextreme.com>.
On Mar 12, 2012, at 5:44 PM, sebb wrote:
> On 13 March 2012 00:29, Emmanuel Bourg <eb...@apache.org> wrote:
>> On 13/03/2012 01:25, sebb wrote:
>>
>>
>>> I'm concerned that the CSV code may grow and grow with private
>>> versions of code that could be provided by the JDK.
>>>
>>> By all means make sure the code is efficient in the way it uses the
>>> JDK classes, but I don't think we should be recoding standard classes.
>>
>>
>> I agree such a class should not live in [csv], but maybe in [io]?
>
> I don't think we should be trying to recode JDK classes.
If the implementations suck, why not?
Ralph
Re: [csv] Performance comparison
Posted by sebb <se...@gmail.com>.
On 13 March 2012 00:29, Emmanuel Bourg <eb...@apache.org> wrote:
> On 13/03/2012 01:25, sebb wrote:
>
>
>> I'm concerned that the CSV code may grow and grow with private
>> versions of code that could be provided by the JDK.
>>
>> By all means make sure the code is efficient in the way it uses the
>> JDK classes, but I don't think we should be recoding standard classes.
>
>
> I agree such a class should not live in [csv], but maybe in [io]?
I don't think we should be trying to recode JDK classes.
> Emmanuel Bourg
>
Re: [csv] Performance comparison
Posted by Emmanuel Bourg <eb...@apache.org>.
After more experiments I'm less enthusiastic about providing an
optimized BufferedReader. The result of the performance test is
significantly different depending on whether the test runs alone or
after all the other unit tests (about 30% slower). When all the tests
are executed, removing the synchronized blocks in BufferedReader has no
visible effect (maybe less than 1%), and the Harmony implementation
becomes slower.
Emmanuel Bourg
On 13/03/2012 10:20, Emmanuel Bourg wrote:
> On 13/03/2012 02:47, Niall Pemberton wrote:
>
>> IMO performance should be taken out of the equation by using the
>> Readable interface[1]. That way the users can use whatever
>> implementation suits them (for example using an underlying buffered
>> InputStream) to change/improve performance.
>
> If you mean that the performance of BufferedReader should be taken out of
> the equation, then I agree. All CSV parsers should be compared with the
> same input source, otherwise the comparison isn't fair.
>
> Using Readable would be really nice, but it's very low level. We would
> have to build line reading and mark/reset on top of it, which is almost
> equivalent to reimplementing BufferedReader.
>
> If [io] could provide a BufferedReader implementation that:
> - takes a Readable in the constructor
> - does not synchronize reads
> - recognizes unicode line separators (and the classic ones)
>
> then I buy it right away!
>
> Emmanuel Bourg
>
Re: [csv] Performance comparison
Posted by Emmanuel Bourg <eb...@apache.org>.
On 13/03/2012 02:47, Niall Pemberton wrote:
> IMO performance should be taken out of the equation by using the
> Readable interface[1]. That way the users can use whatever
> implementation suits them (for example using an underlying buffered
> InputStream) to change/improve performance.
If you mean that the performance of BufferedReader should be taken out
of the equation, then I agree. All CSV parsers should be compared with
the same input source, otherwise the comparison isn't fair.
Using Readable would be really nice, but it's very low level. We would
have to build line reading and mark/reset on top of it, which is almost
equivalent to reimplementing BufferedReader.
If [io] could provide a BufferedReader implementation that:
- takes a Readable in the constructor
- does not synchronize reads
- recognizes unicode line separators (and the classic ones)
then I buy it right away!
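For illustration, the wished-for reader could be sketched along these lines. This is a minimal hypothetical sketch (the class name `ReadableLineReader` and its API are assumptions, not an actual [io] class): it wraps a Readable, buffers without synchronization, and treats \n, \r, \r\n and the Unicode separators U+2028/U+2029 as line ends:

```java
import java.io.IOException;
import java.nio.CharBuffer;

// Hypothetical sketch of an unsynchronized line reader over a Readable.
public class ReadableLineReader {
    private final Readable in;
    private final CharBuffer buf = CharBuffer.allocate(8192);
    private boolean eof;
    private int pushback = -1; // one char of lookahead for \r\n handling

    public ReadableLineReader(Readable in) {
        this.in = in;
        buf.limit(0); // start with an empty buffer
    }

    // Unsynchronized single-char read, refilling from the Readable as needed.
    private int read() throws IOException {
        if (pushback >= 0) { int c = pushback; pushback = -1; return c; }
        if (!buf.hasRemaining()) {
            if (eof) return -1;
            buf.clear();
            int n = in.read(buf);
            buf.flip();
            if (n < 0) { eof = true; return -1; }
        }
        return buf.get();
    }

    /** Next line without its terminator, or null at end of input. */
    public String readLine() throws IOException {
        int c = read();
        if (c < 0) return null;
        StringBuilder sb = new StringBuilder();
        while (c >= 0 && c != '\n' && c != '\u2028' && c != '\u2029') {
            if (c == '\r') { // \r or \r\n both end the line
                int next = read();
                if (next >= 0 && next != '\n') pushback = next;
                break;
            }
            sb.append((char) c);
            c = read();
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        ReadableLineReader r = new ReadableLineReader(
                new java.io.StringReader("a\r\nb\u2028c"));
        String line;
        while ((line = r.readLine()) != null) {
            System.out.println(line); // a, b, c on separate lines
        }
    }
}
```

A production version would need mark/reset support and more careful buffer management, but the shape shows that the three requirements fit in one small class.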
Emmanuel Bourg
Re: [csv] Performance comparison
Posted by sebb <se...@gmail.com>.
On 13 March 2012 01:47, Niall Pemberton <ni...@gmail.com> wrote:
> On Tue, Mar 13, 2012 at 12:29 AM, Emmanuel Bourg <eb...@apache.org> wrote:
>> On 13/03/2012 01:25, sebb wrote:
>>
>>
>>> I'm concerned that the CSV code may grow and grow with private
>>> versions of code that could be provided by the JDK.
>>>
>>> By all means make sure the code is efficient in the way it uses the
>>> JDK classes, but I don't think we should be recoding standard classes.
>>
>>
>> I agree such a class should not live in [csv], but maybe in [io]?
>
> IMO performance should be taken out of the equation by using the
> Readable interface[1]. That way the users can use whatever
> implementation suits them (for example using an underlying buffered
> InputStream) to change/improve performance.
+1, excellent suggestion.
> Niall
>
> [1] http://docs.oracle.com/javase/7/docs/api/java/lang/Readable.html
>
>> Emmanuel Bourg
>>
>
Re: [csv] Performance comparison
Posted by Niall Pemberton <ni...@gmail.com>.
On Tue, Mar 13, 2012 at 12:29 AM, Emmanuel Bourg <eb...@apache.org> wrote:
> On 13/03/2012 01:25, sebb wrote:
>
>
>> I'm concerned that the CSV code may grow and grow with private
>> versions of code that could be provided by the JDK.
>>
>> By all means make sure the code is efficient in the way it uses the
>> JDK classes, but I don't think we should be recoding standard classes.
>
>
> I agree such a class should not live in [csv], but maybe in [io]?
IMO performance should be taken out of the equation by using the
Readable interface[1]. That way the users can use whatever
implementation suits them (for example using an underlying buffered
InputStream) to change/improve performance.
Niall
[1] http://docs.oracle.com/javase/7/docs/api/java/lang/Readable.html
> Emmanuel Bourg
>
Re: [csv] Performance comparison
Posted by Emmanuel Bourg <eb...@apache.org>.
On 13/03/2012 01:25, sebb wrote:
> I'm concerned that the CSV code may grow and grow with private
> versions of code that could be provided by the JDK.
>
> By all means make sure the code is efficient in the way it uses the
> JDK classes, but I don't think we should be recoding standard classes.
I agree such a class should not live in [csv], but maybe in [io]?
Emmanuel Bourg
Re: [csv] Performance comparison
Posted by sebb <se...@gmail.com>.
On 13 March 2012 00:12, Emmanuel Bourg <eb...@apache.org> wrote:
> I kept tickling ExtendedBufferedReader and I have some interesting results.
>
> First I tried to simplify it by extending java.io.LineNumberReader instead
> of BufferedReader. The performance decreased by 20%, probably because the
> class is synchronized internally.
>
> But wait, isn't BufferedReader also synchronized? I copied the code of
> BufferedReader and removed the synchronized blocks. Now the time to parse
> the file is down to 2652 ms, 28% faster than previously!
>
> Of course the code of BufferedReader can't be copied from the JDK due to the
> license mismatch, so I took the version from Harmony. On my test it is about
> 4% faster than the JDK counterpart, and the parsing time is now around 2553
> ms.
I'm concerned that the CSV code may grow and grow with private
versions of code that could be provided by the JDK.
By all means make sure the code is efficient in the way it uses the
JDK classes, but I don't think we should be recoding standard classes.
> Now Commons CSV can start claiming to be the fastest CSV parser around :)
>
> Emmanuel Bourg
>
>
> On 12/03/2012 11:31, Emmanuel Bourg wrote:
>
>> I have identified the performance killer: the ExtendedBufferedReader.
>> It implements complex logic to fetch one character ahead, but this
>> extra character is rarely used. I have implemented a simpler lookahead
>> using mark/reset, as suggested by Bob Smith in CSV-42, and the
>> performance improved by 30%.
>>
>> Now the parsing is down to 3406 ms, and that's almost without touching
>> the parser yet.
>>
>> Emmanuel Bourg
>>
Re: [csv] Performance comparison
Posted by Emmanuel Bourg <eb...@apache.org>.
I kept tickling ExtendedBufferedReader and I have some interesting results.
First I tried to simplify it by extending java.io.LineNumberReader
instead of BufferedReader. The performance decreased by 20%, probably
because the class is synchronized internally.
But wait, isn't BufferedReader also synchronized? I copied the code of
BufferedReader and removed the synchronized blocks. Now the time to
parse the file is down to 2652 ms, 28% faster than previously!
Of course the code of BufferedReader can't be copied from the JDK due to
the license mismatch, so I took the version from Harmony. On my test it
is about 4% faster than the JDK counterpart, and the parsing time is now
around 2553 ms.
Now Commons CSV can start claiming to be the fastest CSV parser around :)
Emmanuel Bourg
On 12/03/2012 11:31, Emmanuel Bourg wrote:
> I have identified the performance killer: the ExtendedBufferedReader.
> It implements complex logic to fetch one character ahead, but this
> extra character is rarely used. I have implemented a simpler lookahead
> using mark/reset, as suggested by Bob Smith in CSV-42, and the
> performance improved by 30%.
>
> Now the parsing is down to 3406 ms, and that's almost without touching
> the parser yet.
>
> Emmanuel Bourg
>
Re: [csv] Performance comparison
Posted by James Carman <jc...@carmanconsulting.com>.
Yes, this is what I mean. It might be worth a shot. Folks who specialize
in parsing have spent a lot of time on these libraries, so it would make
sense that they are quite fast. And it gets us out of the parsing business.
On Mar 12, 2012 12:41 PM, "Emmanuel Bourg" <eb...@apache.org> wrote:
> On 12/03/2012 17:28, James Carman wrote:
>
>> Would one of the parser libraries not work here?
>>
>
> You're thinking of something like JavaCC or ANTLR? I'm not sure it'll be more
> efficient than a handcrafted parser. The CSV format is simple enough to do
> it manually.
>
> Emmanuel Bourg
>
>
Re: [csv] Performance comparison
Posted by Christian Grobmeier <gr...@gmail.com>.
On Mon, Mar 12, 2012 at 5:41 PM, Emmanuel Bourg <eb...@apache.org> wrote:
> On 12/03/2012 17:28, James Carman wrote:
>
>> Would one of the parser libraries not work here?
>
>
> You're thinking of something like JavaCC or ANTLR? I'm not sure it'll be more
> efficient than a handcrafted parser. The CSV format is simple enough to do
> it manually.
+1
I did the same for my JSON lib... JavaCC et al. are pretty complex. I
still struggle to understand everything around OGNL...
If it's not necessary, my preference is always to leave such tools out.
>
> Emmanuel Bourg
>
--
http://www.grobmeier.de
https://www.timeandbill.de
Re: [csv] Performance comparison
Posted by Emmanuel Bourg <eb...@apache.org>.
On 12/03/2012 17:28, James Carman wrote:
> Would one of the parser libraries not work here?
You're thinking of something like JavaCC or ANTLR? I'm not sure it'll be
more efficient than a handcrafted parser. The CSV format is simple
enough to do it manually.
Emmanuel Bourg
Re: [csv] Performance comparison
Posted by James Carman <jc...@carmanconsulting.com>.
Would one of the parser libraries not work here?
On Mar 12, 2012 12:22 PM, "Emmanuel Bourg" <eb...@apache.org> wrote:
> On 12/03/2012 17:03, Benedikt Ritter wrote:
>
> The whole logic behind CSVLexer.nextToken() is very hard to read
> (IMHO). Maybe some refactoring would help to make it easier to
> identify bottlenecks?
>>
>
> Yes I started investigating in this direction. I filed a few bugs
> regarding the behavior of the escaping that aim at clarifying the parser.
>
> I think the nextToken() method should be broken into smaller methods to
> help the JIT compiler.
>
> The JIT does some surprising things, I found that even unused code
> branches can have an impact on the performance. For example if
> simpleTokenLexer() is changed to not support escaped characters, the
> performance improves by 10% (the input has no escaped character). And
> that's not merely because an if statement was removed. If I add a
> System.out.println() in this if block that is never called, the performance
> improves as well.
>
> So any change to the parser will have to be carefully tested. Innocent
> changes can have a significant impact.
>
>
> Emmanuel Bourg
>
>
Re: [csv] Performance comparison
Posted by Benedikt Ritter <be...@googlemail.com>.
On 12 March 2012 17:22, Emmanuel Bourg <eb...@apache.org> wrote:
> On 12/03/2012 17:03, Benedikt Ritter wrote:
>
>
>> The whole logic behind CSVLexer.nextToken() is very hard to read
>> (IMHO). Maybe some refactoring would help to make it easier to
>> identify bottlenecks?
>
>
> Yes I started investigating in this direction. I filed a few bugs regarding
> the behavior of the escaping that aim at clarifying the parser.
>
> I think the nextToken() method should be broken into smaller methods to help
> the JIT compiler.
>
I would start by eliminating the Token parameter. You could either
create a new token on each method call and return that one instead of
reusing the one that gets passed in, or you could use a private
currentToken field in CSVLexer. But I think the object creation cost for
a data object like Token can be considered irrelevant (so creating one
on each method call will not hurt us).
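That suggestion can be sketched roughly as follows. The Token shape here is hypothetical (not the actual CSVLexer API), just enough to show a lexer step that returns a fresh token instead of mutating a caller-supplied one:

```java
// Sketch only: return a new Token per call instead of reusing one passed
// in by the caller. Token's fields are hypothetical stand-ins.
public class TokenSketch {

    enum Type { TOKEN, EOF }

    static final class Token {
        final Type type;
        final String content;
        Token(Type type, String content) {
            this.type = type;
            this.content = content;
        }
    }

    // Before (reuse): Token nextToken(Token reusable) { reusable.reset(); ... }
    // After (allocate): short-lived objects like this are cheap to create
    // on modern JVMs, and the API gets simpler.
    static Token nextToken(String input) {
        if (input.isEmpty()) {
            return new Token(Type.EOF, "");
        }
        return new Token(Type.TOKEN, input); // stand-in for the real lexing
    }
}
```

The trade-off is allocation rate versus API clarity; for a small immutable data holder the allocation is usually negligible next to the I/O cost.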
> The JIT does some surprising things, I found that even unused code branches
> can have an impact on the performance. For example if simpleTokenLexer() is
> changed to not support escaped characters, the performance improves by 10%
> (the input has no escaped character). And that's not merely because an if
> statement was removed. If I add a System.out.println() in this if block that
> is never called, the performance improves as well.
>
> So any change to the parser will have to be carefully tested. Innocent
> changes can have a significant impact.
>
>
> Emmanuel Bourg
>
Re: [csv] Performance comparison
Posted by Emmanuel Bourg <eb...@apache.org>.
On 12/03/2012 17:03, Benedikt Ritter wrote:
> The whole logic behind CSVLexer.nextToken() is very hard to read
> (IMHO). Maybe some refactoring would help to make it easier to
> identify bottlenecks?
Yes, I started investigating in this direction. I filed a few bugs
regarding the escaping behavior that aim to clarify the parser.
I think the nextToken() method should be broken into smaller methods to
help the JIT compiler.
The JIT does some surprising things; I found that even unused code
branches can have an impact on performance. For example, if
simpleTokenLexer() is changed to not support escaped characters, the
performance improves by 10% (the input has no escaped characters). And
that's not merely because an if statement was removed: if I add a
System.out.println() in this if block that is never called, the
performance improves as well.
So any change to the parser will have to be carefully tested. Innocent
changes can have a significant impact.
Emmanuel Bourg
Re: [csv] Performance comparison
Posted by Benedikt Ritter <be...@googlemail.com>.
On 12 March 2012 11:31, Emmanuel Bourg <eb...@apache.org> wrote:
> I have identified the performance killer: the ExtendedBufferedReader.
> It implements complex logic to fetch one character ahead, but this extra
> character is rarely used. I have implemented a simpler lookahead using
> mark/reset, as suggested by Bob Smith in CSV-42, and the performance
> improved by 30%.
>
> Now the parsing is down to 3406 ms, and that's almost without touching the
> parser yet.
>
Great work, Emmanuel!
Looking at my profiler, I can say that 70% of the time is spent in
ExtendedBufferedReader.read(). That's no wonder, since read() is the
method that does the actual work. However, we should try to minimize
calls to read(). For example, isEndOfLine() calls read() twice, and
isEndOfLine() gets called 5 times by CSVLexer.nextToken() and its
submethods.
The whole logic behind CSVLexer.nextToken() is very hard to read
(IMHO). Maybe some refactoring would help to make it easier to
identify bottlenecks?
Benedikt
> Emmanuel Bourg
>
Re: [csv] Performance comparison
Posted by Emmanuel Bourg <eb...@apache.org>.
On 12/03/2012 16:44, sebb wrote:
> Java has a PushbackReader class - could that not be used?
I considered it, but it doesn't mix well with line reading. The
mark/reset solution is really simple and efficient.
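The mark/reset lookahead idea can be sketched like this (a hypothetical helper for illustration, not the actual CSV-42 patch): peek at the next character without consuming it by marking the stream, reading one char, and rewinding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Sketch: single-character lookahead via BufferedReader.mark()/reset().
public class Lookahead {

    static int peek(BufferedReader in) throws IOException {
        in.mark(1);        // remember the current position; 1 char is enough
        int c = in.read(); // consume one char...
        in.reset();        // ...then rewind to the mark
        return c;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new StringReader("ab"));
        System.out.println((char) peek(in));  // prints a (not consumed)
        System.out.println((char) in.read()); // prints a (first real read)
        System.out.println((char) in.read()); // prints b
    }
}
```

Unlike PushbackReader, this needs no second wrapper class and plays nicely with a reader that already supports mark(), which is why it composes better with line reading.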
Emmanuel Bourg
Re: [csv] Performance comparison
Posted by sebb <se...@gmail.com>.
On 12 March 2012 10:31, Emmanuel Bourg <eb...@apache.org> wrote:
> I have identified the performance killer: the ExtendedBufferedReader.
> It implements complex logic to fetch one character ahead, but this extra
> character is rarely used. I have implemented a simpler lookahead using
> mark/reset, as suggested by Bob Smith in CSV-42, and the performance
> improved by 30%.
Java has a PushbackReader class - could that not be used?
> Now the parsing is down to 3406 ms, and that's almost without touching the
> parser yet.
>
> Emmanuel Bourg
>
Re: [csv] Performance comparison
Posted by Emmanuel Bourg <eb...@apache.org>.
I have identified the performance killer: the ExtendedBufferedReader.
It implements complex logic to fetch one character ahead, but this
extra character is rarely used. I have implemented a simpler lookahead
using mark/reset, as suggested by Bob Smith in CSV-42, and the
performance improved by 30%.
Now the parsing is down to 3406 ms, and that's almost without touching
the parser yet.
Emmanuel Bourg