Posted to dev@commons.apache.org by Emmanuel Bourg <eb...@apache.org> on 2012/03/11 15:05:00 UTC

[csv] Performance comparison

Hi,

I compared the performance of Commons CSV with the other available CSV 
parsers. I took the world cities file from Maxmind as a test file [1]; 
it's a big file of 130 MB with 2.8 million records.

Here are the results obtained on a Core 2 Duo E8400 after several 
iterations to let the JIT compiler kick in:

Direct read      750 ms
Java CSV        3328 ms
Super CSV       3562 ms  (+7%)
OpenCSV         3609 ms  (+8.4%)
GenJava CSV     3844 ms  (+15.5%)
Commons CSV     4656 ms  (+39.9%)
Skife CSV       4813 ms  (+44.6%)
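
The percentages in the table are relative to Java CSV, the fastest parser measured. A small sketch of that arithmetic, for anyone who wants to extend the table with other timings (the class and method names are illustrative only):

```java
final class RelativeSlowdown {

    /** Returns how much slower millis is than baselineMillis, in percent. */
    static double percentSlower(long millis, long baselineMillis) {
        return (millis - baselineMillis) * 100.0 / baselineMillis;
    }

    public static void main(String[] args) {
        long javaCsv = 3328;  // baseline taken from the table
        String[] names = {"Super CSV", "OpenCSV", "GenJava CSV", "Commons CSV", "Skife CSV"};
        long[] times = {3562, 3609, 3844, 4656, 4813};
        for (int i = 0; i < names.length; i++) {
            System.out.printf("%-12s +%.1f%%%n", names[i], percentSlower(times[i], javaCsv));
        }
    }
}
```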

I also tried Nuiton CSV and Esperio CSV, but I couldn't figure out how to 
use them.

I haven't analyzed why Commons CSV is slower yet, but there seems to be 
room for improvement. Memory usage will have to be compared too; 
I'm looking for a way to measure it.
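
One rough way to compare memory usage, offered here only as an assumption and not as the method that was eventually used, is to sample the JVM heap before and after a run through java.lang.Runtime. The figures are approximate: System.gc() is only a hint, and garbage that has not been collected yet is counted as used. HeapProbe is a hypothetical helper name.

```java
final class HeapProbe {

    /** Approximate number of heap bytes currently in use. */
    static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /**
     * Runs the task and returns the approximate growth in used heap, in bytes.
     * The result can be negative if a collection happens during the run.
     */
    static long measure(Runnable task) {
        System.gc();                 // only a hint; the JVM may ignore it
        long before = usedHeap();
        task.run();
        return usedHeap() - before;
    }
}
```

Wrapping each parser's read loop in a Runnable and passing it to measure() would give comparable, if noisy, numbers.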


Emmanuel Bourg

[1] http://www.maxmind.com/download/worldcities/worldcitiespop.txt.gz

Re: [csv] Performance comparison

Posted by Emmanuel Bourg <eb...@apache.org>.

On 12/03/2012 00:02, Benedikt Ritter wrote:

> I've started to dig my way through the source. I haven't done much
> performance measuring in my career yet. I would use VisualVM for
> profiling, unless you know of something better.

I usually work with JProfiler. It identifies the "hotspots" pretty well, 
but I'm not sure it will produce relevant results on the complex 
methods of CSVLexer.


> And how about some JUnit performance tests? They may not be as
> accurate as a profiler, but they can give you a sense of whether you
> are on the right track.

I wrote a quick test locally, but it's not clean enough to be 
committed. It looks like this:


import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

import junit.framework.TestCase;

import org.apache.commons.csv.CSVFormat;

public class PerformanceTest extends TestCase {

    /** Number of passes over the file; the first ones let the JIT warm up. */
    private int max = 10;

    private BufferedReader getReader() throws IOException {
        return new BufferedReader(new FileReader("worldcitiespop.txt"));
    }

    /** Baseline: reads the file line by line without parsing it. */
    public void testReadBigFile() throws Exception {
        for (int i = 0; i < max; i++) {
            BufferedReader in = getReader();
            long t0 = System.currentTimeMillis();
            int count = readAll(in);
            in.close();
            System.out.println("File read in " + (System.currentTimeMillis() - t0)
                    + "ms  " + count + " lines");
        }
        System.out.println();
    }

    private int readAll(BufferedReader in) throws IOException {
        int count = 0;
        while (in.readLine() != null) {
            count++;
        }

        return count;
    }

    /** Parses the same file with Commons CSV and counts the records. */
    public void testParseBigFile() throws Exception {
        for (int i = 0; i < max; i++) {
            long t0 = System.currentTimeMillis();
            Reader in = getReader();
            int count = parseCommonsCSV(in);
            in.close();  // the original version leaked the reader
            System.out.println("File parsed in " + (System.currentTimeMillis() - t0)
                    + "ms with Commons CSV  " + count + " lines");
        }
        System.out.println();
    }

    private int parseCommonsCSV(Reader in) {
        CSVFormat format = CSVFormat.DEFAULT.withSurroundingSpacesIgnored(false);

        int count = 0;
        for (String[] record : format.parse(in)) {
            count++;
        }

        return count;
    }
}


Emmanuel Bourg

Re: [csv] Performance comparison

Posted by Benedikt Ritter <be...@googlemail.com>.

On 11 March 2012 21:21, Emmanuel Bourg <eb...@apache.org> wrote:
> On 11/03/2012 16:53, Benedikt Ritter wrote:
>
>
>> I have some spare time to help you with this. I'll check out the
>> latest source tonight. Any suggestion where to start?
>
>
> Hi Benedikt, thank you for helping. You can start by looking at the source
> of CSVParser to see if anything catches your eye, and then run a profiler
> to identify the performance-critical parts that could be improved.
>

Hi Emmanuel,

I've started to dig my way through the source. I haven't done much
performance measuring in my career yet. I would use VisualVM for
profiling, unless you know of something better.
And how about some JUnit performance tests? They may not be as
accurate as a profiler, but they can give you a sense of whether you
are on the right track.

Benedikt

> Emmanuel Bourg
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

Re: [csv] Performance comparison

Posted by Emmanuel Bourg <eb...@apache.org>.

On 11/03/2012 16:53, Benedikt Ritter wrote:

> I have some spare time to help you with this. I'll check out the
> latest source tonight. Any suggestion where to start?

Hi Benedikt, thank you for helping. You can start by looking at the source 
of CSVParser to see if anything catches your eye, and then run a profiler 
to identify the performance-critical parts that could be improved.

Emmanuel Bourg

Re: [csv] Performance comparison

Posted by Benedikt Ritter <be...@googlemail.com>.

On 11 March 2012 15:05, Emmanuel Bourg <eb...@apache.org> wrote:
> Hi,
>
> I compared the performance of Commons CSV with the other CSV parsers
> available. I took the world cities file from Maxmind as a test file [1],
> it's a big file of 130M with 2.8 million records.
>
> Here are the results obtained on a Core 2 Duo E8400 after several iterations
> to let the JIT compiler kick in:
>
> Direct read      750 ms
> Java CSV        3328 ms
> Super CSV       3562 ms  (+7%)
> OpenCSV         3609 ms  (+8.4%)
> GenJava CSV     3844 ms  (+15.5%)
> Commons CSV     4656 ms  (+39.9%)
> Skife CSV       4813 ms  (+44.6%)
>
> I also tried Nuiton CSV and Esperio CSV but I couldn't figure how to use
> them.
>
> I haven't analyzed why Commons CSV is slower yet, but it seems there is room
> for improvements. The memory usage will have to be compared too, I'm looking
> for a way to measure it.
>

Hey Emmanuel,

I have some spare time to help you with this. I'll check out the
latest source tonight. Any suggestion where to start?

Regards,
Benedikt

>
> Emmanuel Bourg
>
> [1] http://www.maxmind.com/download/worldcities/worldcitiespop.txt.gz
>


Re: [csv] Performance comparison

Posted by Gary Gregory <ga...@gmail.com>.

On Mar 12, 2012, at 20:25, sebb <se...@gmail.com> wrote:

> On 13 March 2012 00:12, Emmanuel Bourg <eb...@apache.org> wrote:
>> I kept tickling ExtendedBufferedReader and I have some interesting results.
>>
>> First I tried to simplify it by extending java.io.LineNumberReader instead
>> of BufferedReader. The performance decreased by 20%, probably because the
>> class is synchronized internally.
>>
>> But wait, isn't BufferedReader also synchronized? I copied the code of
>> BufferedReader and removed the synchronized blocks. Now the time to parse
>> the file is down to 2652 ms, 28% faster than previously!
>>
>> Of course the code of BufferedReader can't be copied from the JDK due to the
>> license mismatch, so I took the version from Harmony. On my test it is about
>> 4% faster than the JDK counterpart, and the parsing time is now around 2553
>> ms.
>
> I'm concerned that the CSV code may grow and grow with private
> versions of code that could be provided by the JDK.
>
> By all means make sure the code is efficient in the way it uses the
> JDK classes, but I don't think we should be recoding standard classes.

+1

Gary


Re: [csv] Performance comparison

Posted by Gary Gregory <ga...@gmail.com>.

On Mar 12, 2012, at 20:30, Emmanuel Bourg <eb...@apache.org> wrote:

> On 13/03/2012 01:25, sebb wrote:
>
>> I'm concerned that the CSV code may grow and grow with private
>> versions of code that could be provided by the JDK.
>>
>> By all means make sure the code is efficient in the way it uses the
>> JDK classes, but I don't think we should be recoding standard classes.
>
> I agree such a class should not live in [csv], but maybe in [io]?

That would be better but we need to think twice before adding code.

Gary

>
> Emmanuel Bourg
>


Re: [csv] Performance comparison

Posted by sebb <se...@gmail.com>.

On 13 March 2012 09:01, Emmanuel Bourg <eb...@apache.org> wrote:
> On 13/03/2012 01:44, sebb wrote:
>
>
>> I don't think we should be trying to recode JDK classes.
>
>
> I'd rather not, but we have done that in the past. FastDateFormat and
> StrBuilder come to mind.

And now Java has StringBuilder, which means StrBuilder is perhaps no
longer necessary...

> Emmanuel Bourg
>


Re: [csv] Performance comparison

Posted by Emmanuel Bourg <eb...@apache.org>.

On 13/03/2012 01:44, sebb wrote:

> I don't think we should be trying to recode JDK classes.

I'd rather not, but we have done that in the past. FastDateFormat and 
StrBuilder come to mind.

Emmanuel Bourg

Re: [csv] Performance comparison

Posted by Christian Grobmeier <gr...@gmail.com>.

On Tue, Mar 13, 2012 at 4:33 AM, Ralph Goers <ra...@dslextreme.com> wrote:
>> I don't think we should be trying to recode JDK classes.
>
> If the implementations suck, why not?

+1


-- 
http://www.grobmeier.de
https://www.timeandbill.de


Re: [csv] Performance comparison

Posted by Ralph Goers <ra...@dslextreme.com>.

On Mar 12, 2012, at 5:44 PM, sebb wrote:

> On 13 March 2012 00:29, Emmanuel Bourg <eb...@apache.org> wrote:
>> On 13/03/2012 01:25, sebb wrote:
>> 
>> 
>>> I'm concerned that the CSV code may grow and grow with private
>>> versions of code that could be provided by the JDK.
>>> 
>>> By all means make sure the code is efficient in the way it uses the
>>> JDK classes, but I don't think we should be recoding standard classes.
>> 
>> 
>> I agree such a class should not live in [csv], but maybe in [io]?
> 
> I don't think we should be trying to recode JDK classes.

If the implementations suck, why not?

Ralph

Re: [csv] Performance comparison

Posted by sebb <se...@gmail.com>.

On 13 March 2012 00:29, Emmanuel Bourg <eb...@apache.org> wrote:
> On 13/03/2012 01:25, sebb wrote:
>
>
>> I'm concerned that the CSV code may grow and grow with private
>> versions of code that could be provided by the JDK.
>>
>> By all means make sure the code is efficient in the way it uses the
>> JDK classes, but I don't think we should be recoding standard classes.
>
>
> I agree such a class should not live in [csv], but maybe in [io]?

I don't think we should be trying to recode JDK classes.

> Emmanuel Bourg
>


Re: [csv] Performance comparison

Posted by Emmanuel Bourg <eb...@apache.org>.

After more experiments I'm less enthusiastic about providing an 
optimized BufferedReader. The result of the performance test differs 
significantly depending on whether the test is run alone or after all 
the other unit tests (about 30% slower). When all the tests are executed, 
the removal of the synchronized blocks in BufferedReader has no visible 
effect (maybe less than 1%), and the Harmony implementation becomes slower.

Emmanuel Bourg


On 13/03/2012 10:20, Emmanuel Bourg wrote:
> On 13/03/2012 02:47, Niall Pemberton wrote:
>
>> IMO performance should be taken out of the equation by using the
>> Readable interface[1]. That way the users can use whatever
>> implementation suits them (for example using an underlying buffered
>> InputStream) to change/improve performance.
>
> If you mean that the performance of BufferedReader should be taken out of
> the equation, then I agree. All CSV parsers should be compared with the
> same input source, otherwise the comparison isn't fair.
>
> Using Readable would be really nice, but it's very low level. We would
> have to build line reading and mark/reset on top of it, which is almost
> equivalent to reimplementing BufferedReader.
>
> If [io] could provide a BufferedReader implementation that:
> - takes a Readable in the constructor
> - does not synchronize reads
> - recognizes Unicode line separators (and the classic ones)
>
> then I'd buy it right away!
>
> Emmanuel Bourg
>

Re: [csv] Performance comparison

Posted by Emmanuel Bourg <eb...@apache.org>.

On 13/03/2012 02:47, Niall Pemberton wrote:

> IMO performance should be taken out of the equation by using the
> Readable interface[1]. That way the users can use whatever
> implementation suits them (for example using an underlying buffered
> InputStream) to change/improve performance.

If you mean that the performance of BufferedReader should be taken out of 
the equation, then I agree. All CSV parsers should be compared with the 
same input source, otherwise the comparison isn't fair.

Using Readable would be really nice, but it's very low level. We would 
have to build line reading and mark/reset on top of it, which is almost 
equivalent to reimplementing BufferedReader.

If [io] could provide a BufferedReader implementation that:
- takes a Readable in the constructor
- does not synchronize reads
- recognizes Unicode line separators (and the classic ones)

then I'd buy it right away!
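
To make the three requirements concrete, here is a minimal sketch of what such a reader could look like. It is not an existing [io] class: ReadableLineReader and its methods are hypothetical names, mark/reset is left out, and the set of separators (CR, LF, CRLF, NEL, U+2028, U+2029) is an assumption about what "unicode line separators" should cover.

```java
import java.io.IOException;
import java.nio.CharBuffer;

/** Hypothetical line reader over any Readable; unsynchronized, not thread-safe. */
final class ReadableLineReader {
    private final Readable in;
    private final CharBuffer buf = CharBuffer.allocate(8192);
    private boolean skipLf;   // true right after '\r', to swallow the '\n' of "\r\n"

    ReadableLineReader(Readable in) {
        this.in = in;
        buf.flip();           // start with an empty buffer in "drain" mode
    }

    /** Returns the next character, or -1 at end of input. */
    private int next() throws IOException {
        if (!buf.hasRemaining()) {
            buf.clear();
            int n;
            do {
                n = in.read(buf);
            } while (n == 0);
            buf.flip();
            if (n < 0) {
                return -1;
            }
        }
        return buf.get();
    }

    /** Returns the next line without its terminator, or null at end of input. */
    String readLine() throws IOException {
        StringBuilder line = new StringBuilder();
        int c;
        while ((c = next()) != -1) {
            if (skipLf) {
                skipLf = false;
                if (c == '\n') {
                    continue;             // second half of "\r\n"
                }
            }
            if (c == '\r') {
                skipLf = true;
                return line.toString();
            }
            if (c == '\n' || c == '\u0085' || c == '\u2028' || c == '\u2029') {
                return line.toString();
            }
            line.append((char) c);
        }
        return line.length() > 0 ? line.toString() : null;
    }
}
```

Any Reader can be passed in, since java.io.Reader implements Readable, and so can a CharBuffer.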

Emmanuel Bourg

Re: [csv] Performance comparison

Posted by sebb <se...@gmail.com>.

On 13 March 2012 01:47, Niall Pemberton <ni...@gmail.com> wrote:
> On Tue, Mar 13, 2012 at 12:29 AM, Emmanuel Bourg <eb...@apache.org> wrote:
>> On 13/03/2012 01:25, sebb wrote:
>>
>>
>>> I'm concerned that the CSV code may grow and grow with private
>>> versions of code that could be provided by the JDK.
>>>
>>> By all means make sure the code is efficient in the way it uses the
>>> JDK classes, but I don't think we should be recoding standard classes.
>>
>>
>> I agree such a class should not live in [csv], but maybe in [io]?
>
> IMO performance should be taken out of the equation by using the
> Readable interface[1]. That way the users can use whatever
> implementation suits them (for example using an underlying buffered
> InputStream) to change/improve performance.

+1, excellent suggestion.

> Niall
>
> [1] http://docs.oracle.com/javase/7/docs/api/java/lang/Readable.html
>
>> Emmanuel Bourg
>>
>
>


Re: [csv] Performance comparison

Posted by Niall Pemberton <ni...@gmail.com>.

On Tue, Mar 13, 2012 at 12:29 AM, Emmanuel Bourg <eb...@apache.org> wrote:
> On 13/03/2012 01:25, sebb wrote:
>
>
>> I'm concerned that the CSV code may grow and grow with private
>> versions of code that could be provided by the JDK.
>>
>> By all means make sure the code is efficient in the way it uses the
>> JDK classes, but I don't think we should be recoding standard classes.
>
>
> I agree such a class should not live in [csv], but maybe in [io]?

IMO performance should be taken out of the equation by using the
Readable interface[1]. That way the users can use whatever
implementation suits them (for example using an underlying buffered
InputStream) to change/improve performance.

Niall

[1] http://docs.oracle.com/javase/7/docs/api/java/lang/Readable.html

> Emmanuel Bourg
>


Re: [csv] Performance comparison

Posted by Emmanuel Bourg <eb...@apache.org>.

On 13/03/2012 01:25, sebb wrote:

> I'm concerned that the CSV code may grow and grow with private
> versions of code that could be provided by the JDK.
>
> By all means make sure the code is efficient in the way it uses the
> JDK classes, but I don't think we should be recoding standard classes.

I agree such a class should not live in [csv], but maybe in [io]?

Emmanuel Bourg

Re: [csv] Performance comparison

Posted by sebb <se...@gmail.com>.

On 13 March 2012 00:12, Emmanuel Bourg <eb...@apache.org> wrote:
> I kept tickling ExtendedBufferedReader and I have some interesting results.
>
> First I tried to simplify it by extending java.io.LineNumberReader instead
> of BufferedReader. The performance decreased by 20%, probably because the
> class is synchronized internally.
>
> But wait, isn't BufferedReader also synchronized? I copied the code of
> BufferedReader and removed the synchronized blocks. Now the time to parse
> the file is down to 2652 ms, 28% faster than previously!
>
> Of course the code of BufferedReader can't be copied from the JDK due to the
> license mismatch, so I took the version from Harmony. On my test it is about
> 4% faster than the JDK counterpart, and the parsing time is now around 2553
> ms.

I'm concerned that the CSV code may grow and grow with private
versions of code that could be provided by the JDK.

By all means make sure the code is efficient in the way it uses the
JDK classes, but I don't think we should be recoding standard classes.



Re: [csv] Performance comparison

Posted by Emmanuel Bourg <eb...@apache.org>.

I kept tinkering with ExtendedBufferedReader and I have some interesting results.

First I tried to simplify it by extending java.io.LineNumberReader 
instead of BufferedReader. The performance decreased by 20%, probably 
because the class is synchronized internally.

But wait, isn't BufferedReader also synchronized? I copied the code of 
BufferedReader and removed the synchronized blocks. Now the time to 
parse the file is down to 2652 ms, 28% faster than previously!

Of course the code of BufferedReader can't be copied from the JDK due to 
the license mismatch, so I took the version from Harmony. On my test it 
is about 4% faster than the JDK counterpart, and the parsing time is now 
around 2553 ms.
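
For illustration, this is roughly what an unsynchronized buffered reader boils down to. It is a minimal sketch written from scratch, not the Harmony code and not the class used in the measurements above; UnsyncBufferedReader is a hypothetical name, and line reading and mark/reset are left out.

```java
import java.io.IOException;
import java.io.Reader;

/** Hypothetical single-threaded buffered reader: no locking, not thread-safe. */
final class UnsyncBufferedReader {
    private final Reader in;
    private final char[] buf = new char[8192];
    private int pos;   // index of the next char to hand out
    private int end;   // number of valid chars in buf

    UnsyncBufferedReader(Reader in) {
        this.in = in;
    }

    /** Returns the next character, or -1 at end of stream. */
    int read() throws IOException {
        if (pos >= end && !fill()) {
            return -1;
        }
        return buf[pos++];
    }

    /** Refills the buffer; returns false once the underlying reader is exhausted. */
    private boolean fill() throws IOException {
        int n;
        do {
            n = in.read(buf, 0, buf.length);
        } while (n == 0);
        if (n < 0) {
            return false;
        }
        pos = 0;
        end = n;
        return true;
    }
}
```

Whether dropping the lock pays off depends on the JVM and on the workload, so any gain should be measured rather than assumed.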

Now Commons CSV can start claiming to be the fastest CSV parser around :)

Emmanuel Bourg



Re: [csv] Performance comparison

Posted by James Carman <jc...@carmanconsulting.com>.

Yes, this is what I mean.  It might be worth a shot.  Folks who specialize
in parsing have spent much time on these libraries.  It would make sense
that they are quite fast.  It gets us out of the parsing business.
On Mar 12, 2012 12:41 PM, "Emmanuel Bourg" <eb...@apache.org> wrote:

> On 12/03/2012 17:28, James Carman wrote:
>
>> Would one of the parser libraries not work here?
>>
>
> Are you thinking of something like JavaCC or ANTLR? I'm not sure it would
> be more efficient than a handcrafted parser. The CSV format is simple
> enough to do it manually.
>
> Emmanuel Bourg
>
>

Re: [csv] Performance comparison

Posted by Christian Grobmeier <gr...@gmail.com>.

On Mon, Mar 12, 2012 at 5:41 PM, Emmanuel Bourg <eb...@apache.org> wrote:
> On 12/03/2012 17:28, James Carman wrote:
>
>> Would one of the parser libraries not work here?
>
>
> Are you thinking of something like JavaCC or ANTLR? I'm not sure it would
> be more efficient than a handcrafted parser. The CSV format is simple
> enough to do it manually.

+1

I did the same for my JSON lib... JavaCC et al. are pretty complex. I
still struggle to understand everything around OGNL...
If not necessary, my preference is always to leave such tools out.

>
> Emmanuel Bourg
>



-- 
http://www.grobmeier.de
https://www.timeandbill.de


Re: [csv] Performance comparison

Posted by Emmanuel Bourg <eb...@apache.org>.

On 12/03/2012 17:28, James Carman wrote:
> Would one of the parser libraries not work here?

Are you thinking of something like JavaCC or ANTLR? I'm not sure it would 
be more efficient than a handcrafted parser. The CSV format is simple 
enough to do it manually.
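
As a rough illustration of how little is needed, here is a minimal handwritten splitter for a single CSV line. It is a sketch and not the Commons CSV lexer: TinyCsv is a hypothetical name, and it assumes a comma delimiter, double quotes with "" as the escaped quote, and no line breaks inside fields.

```java
import java.util.ArrayList;
import java.util.List;

final class TinyCsv {

    /** Splits one CSV line into fields; supports quoted fields and "" as an escaped quote. */
    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<String>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c != '"') {
                    field.append(c);
                } else if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    field.append('"');   // doubled quote inside a quoted field
                    i++;
                } else {
                    inQuotes = false;    // closing quote
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == ',') {
                fields.add(field.toString());
                field.setLength(0);
            } else {
                field.append(c);
            }
        }
        fields.add(field.toString());    // last field, possibly empty after a trailing comma
        return fields;
    }
}
```

The whole state machine is a boolean and a loop, which is the kind of code a generated parser would have to beat.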

Emmanuel Bourg

Re: [csv] Performance comparison

Posted by James Carman <jc...@carmanconsulting.com>.

Would one of the parser libraries not work here?
On Mar 12, 2012 12:22 PM, "Emmanuel Bourg" <eb...@apache.org> wrote:

> On 12/03/2012 17:03, Benedikt Ritter wrote:
>
>> The whole logic behind CSVLexer.nextToken() is very hard to read
>> (IMHO). Maybe some refactoring would help to make it easier to
>> identify bottlenecks?
>>
>
> Yes I started investigating in this direction. I filed a few bugs
> regarding the behavior of the escaping that aim at clarifying the parser.
>
> I think the nextToken() method should be broken into smaller methods to
> help the JIT compiler.
>
> The JIT does some surprising things, I found that even unused code
> branches can have an impact on the performance. For example if
> simpleTokenLexer() is changed to not support escaped characters, the
> performance improves by 10% (the input has no escaped character). And
> that's not merely because an if statement was removed. If I add a
> System.out.println() in this if block that is never called, the performance
> improves as well.
>
> So any change to the parser will have to be carefully tested. Innocent
> changes can have a significant impact.
>
>
> Emmanuel Bourg
>
>

Re: [csv] Performance comparison

Posted by Benedikt Ritter <be...@googlemail.com>.

On 12 March 2012 17:22, Emmanuel Bourg <eb...@apache.org> wrote:
> On 12/03/2012 17:03, Benedikt Ritter wrote:
>
>
>> The whole logic behind CSVLexer.nextToken() is very hard to read
>> (IMHO). Maybe some refactoring would help to make it easier to
>> identify bottlenecks?
>
>
> Yes I started investigating in this direction. I filed a few bugs regarding
> the behavior of the escaping that aim at clarifying the parser.
>
> I think the nextToken() method should be broken into smaller methods to help
> the JIT compiler.
>

I would start by eliminating the Token parameter. You could either
create a new token on each method call and return that one instead of
reusing the one that gets passed in, or you could use a private field
currentToken in CSVLexer. But I think that object creation costs for a
data object like Token can be considered irrelevant (so creating one
in each method call will not hurt us).
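
A sketch of the first option, with nextToken() returning a fresh token instead of filling one that is passed in. The classes below are simplified stand-ins and not the real CSVLexer or Token: the lexer only splits one in-memory record on commas, and all names are illustrative.

```java
final class TokenSketch {

    enum Type { TOKEN, EORECORD, EOF }

    /** Small immutable data object, cheap to allocate. */
    static final class Token {
        final Type type;
        final String content;

        Token(Type type, String content) {
            this.type = type;
            this.content = content;
        }
    }

    /** Hypothetical lexer over a single record, returning a new Token per call. */
    static final class Lexer {
        private final String input;
        private int pos;
        private boolean done;

        Lexer(String input) {
            this.input = input;
        }

        Token nextToken() {
            if (done) {
                return new Token(Type.EOF, "");
            }
            int comma = input.indexOf(',', pos);
            if (comma < 0) {
                done = true;             // last field of the record
                return new Token(Type.EORECORD, input.substring(pos));
            }
            Token t = new Token(Type.TOKEN, input.substring(pos, comma));
            pos = comma + 1;
            return t;
        }
    }
}
```

With this shape the caller never has to reset a shared token, at the price of one short-lived allocation per field.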

> The JIT does some surprising things, I found that even unused code branches
> can have an impact on the performance. For example if simpleTokenLexer() is
> changed to not support escaped characters, the performance improves by 10%
> (the input has no escaped character). And that's not merely because an if
> statement was removed. If I add a System.out.println() in this if block that
> is never called, the performance improves as well.
>
> So any change to the parser will have to be carefully tested. Innocent
> changes can have a significant impact.
>
>
> Emmanuel Bourg
>


Re: [csv] Performance comparison

Posted by Emmanuel Bourg <eb...@apache.org>.

On 12/03/2012 17:03, Benedikt Ritter wrote:

> The whole logic behind CSVLexer.nextToken() is very hard to read
> (IMHO). Maybe some refactoring would help to make it easier to
> identify bottlenecks?

Yes, I started investigating in this direction. I filed a few bugs 
regarding the behavior of the escaping that aim at clarifying the parser.

I think the nextToken() method should be broken into smaller methods to 
help the JIT compiler.

The JIT does some surprising things. I found that even unused code 
branches can have an impact on performance. For example, if 
simpleTokenLexer() is changed to not support escaped characters, the 
performance improves by 10% (the input has no escaped characters). And 
that's not merely because an if statement was removed: if I add a 
System.out.println() in this if block, which is never executed, the 
performance improves as well.

So any change to the parser will have to be carefully tested. Innocent 
changes can have a significant impact.


Emmanuel Bourg

Re: [csv] Performance comparison

Posted by Benedikt Ritter <be...@googlemail.com>.

On 12 March 2012 11:31, Emmanuel Bourg <eb...@apache.org> wrote:
> I have identified the performance killer, it's the ExtendedBufferedReader.
> It implements a complex logic to fetch one character ahead, but this extra
> character is rarely used. I have implemented a simpler look ahead using
> mark/reset as suggested by Bob Smith in CSV-42 and the performance improved
> by 30%.
>
> Now the parsing is down to 3406 ms, and that's almost without touching the
> parser yet.
>

Great work, Emmanuel!

Looking at my profiler, I can say that 70% of the time is spent in
ExtendedBufferedReader.read(). This is no wonder, since read() is the
method that does the actual work. However, we should try to minimize
calls to read(). For example, isEndOfLine() calls read() twice,
and isEndOfLine() gets called 5 times by CSVLexer.nextToken() and
its submethods.
The whole logic behind CSVLexer.nextToken() is very hard to read
(IMHO). Maybe some refactoring would help to make it easier to
identify bottlenecks?
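
One way to cut down on reads, sketched here under the assumption that the caller already holds the character it has just read: the end-of-line check only touches the reader when that character is '\r'. This is not the actual CSVLexer code; EndOfLine and its signature are illustrative.

```java
import java.io.BufferedReader;
import java.io.IOException;

final class EndOfLine {

    /**
     * Tells whether c, the character just read from in, ends a line.
     * For "\r\n" the '\n' is consumed too, so the pair counts as one line end.
     * The reader is only touched when c is '\r'.
     */
    static boolean isEndOfLine(int c, BufferedReader in) throws IOException {
        if (c == '\n') {
            return true;
        }
        if (c != '\r') {
            return false;          // common case: no extra read at all
        }
        in.mark(1);
        if (in.read() != '\n') {
            in.reset();            // lone '\r': put the peeked character back
        }
        return true;
    }
}
```

For ordinary characters the check costs two comparisons and no I/O, which matters when it runs several times per token.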

Benedikt



Re: [csv] Performance comparison

Posted by Emmanuel Bourg <eb...@apache.org>.

On 12/03/2012 16:44, sebb wrote:

> Java has a PushbackReader class - could that not be used?

I considered it, but it doesn't mix well with line reading. The 
mark/reset solution is really simple and efficient.
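
For comparison, a one-character peek with java.io.PushbackReader looks like the sketch below (PushbackPeek is an illustrative name). The drawback for a CSV parser is that PushbackReader offers no readLine(), and wrapping it in a BufferedReader to get one means the BufferedReader reads ahead on its own, so characters pushed back on the inner reader no longer line up with what the outer reader returns.

```java
import java.io.IOException;
import java.io.PushbackReader;

final class PushbackPeek {

    /** Peeks at the next character by reading it and pushing it back. */
    static int peek(PushbackReader in) throws IOException {
        int c = in.read();
        if (c != -1) {
            in.unread(c);   // end of stream (-1) must not be pushed back
        }
        return c;
    }
}
```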

Emmanuel Bourg

Re: [csv] Performance comparison

Posted by sebb <se...@gmail.com>.

On 12 March 2012 10:31, Emmanuel Bourg <eb...@apache.org> wrote:
> I have identified the performance killer, it's the ExtendedBufferedReader.
> It implements a complex logic to fetch one character ahead, but this extra
> character is rarely used. I have implemented a simpler look ahead using
> mark/reset as suggested by Bob Smith in CSV-42 and the performance improved
> by 30%.

Java has a PushbackReader class - could that not be used?

> Now the parsing is down to 3406 ms, and that's almost without touching the
> parser yet.
>


Re: [csv] Performance comparison

Posted by Emmanuel Bourg <eb...@apache.org>.

I have identified the performance killer: it's the 
ExtendedBufferedReader. It implements complex logic to fetch one 
character ahead, but this extra character is rarely used. I have 
implemented a simpler lookahead using mark/reset, as suggested by Bob 
Smith in CSV-42, and the performance improved by 30%.
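
The mark/reset idea can be sketched in a few lines on top of java.io.BufferedReader. The class name LookaheadReader is illustrative, and the sketch leaves out everything else the real reader does.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

/** Illustrative reader that adds a one-character lookahead via mark/reset. */
final class LookaheadReader extends BufferedReader {

    LookaheadReader(Reader in) {
        super(in);
    }

    /** Returns the next character without consuming it, or -1 at end of stream. */
    int lookAhead() throws IOException {
        mark(1);          // remember the current position, allow 1 char of read-ahead
        int c = read();
        reset();          // rewind so the next read() returns the same character
        return c;
    }
}
```

The lookahead cost is only paid when lookAhead() is actually called, and plain read() calls go straight to BufferedReader.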

Now the parsing is down to 3406 ms, and that's almost without touching 
the parser yet.

Emmanuel Bourg

