You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@commons.apache.org by Benedikt Ritter <be...@googlemail.com> on 2012/03/16 13:33:29 UTC

[csv] Improving readability in CSVLexer

Hey,

I'm thinking of ways to improve the readability of CSVLexer. I think
that it might be easier to improve performance if the code is easier
to understand. Here is, what I think can be improved:

1. eliminate Token input parameter on nextToken()
To me it looks like the token input parameter on nextToken() has the
purpose of sparing object creation. How about a private field
'currentToken' that can be reused. No method parameters are better
than one method parameter :)

2. add additional convenience methods
Right now we have some methods for char handling like isEndOfFile(c).
There are some methods missing like isDelimiter(c) or
isEncapsulator(c). There is not much to say about this. I just think
that isDelimiter(c) is slightly easier to understand than c ==
format.getDelimiter().

3. eliminate input parameter c on readEscape (and rename it ?)
Right now we have to pass an int to readEscape, but the method does
not use that parameter. So why do we keep it? Also the method does not
really "read" an escape. It assumes, that is is called after a "/" and
then returns the delimiter for a letter.

4. Get rid of those nasty while(true) loops!
There are several while true loops. It is really hard to see what is
going on, because you can not exactly see when a loop ends. The worst
example for this is encapsulatedTokenLexer. It has an outer
while(true) loop with a nested inner loop, that may return a token,
terminating both loops.
I've tried to eliminate those while true loops, but without success.

If no one objects, I'd like to create patches for 1. & 2. I leave 3.
and 4. for discussion...

Regards,
Benedikt

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

Re: [csv] Improving readability in CSVLexer

Posted by sebb <se...@gmail.com>.

On 16 March 2012 12:33, Benedikt Ritter <be...@googlemail.com> wrote:
> Hey,
>
> I'm thinking of ways to improve the readability of CSVLexer. I think
> that it might be easier to improve performance if the code is easier
> to understand. Here is, what I think can be improved:
>
> 1. eliminate Token input parameter on nextToken()
> To me it looks like the token input parameter on nextToken() has the
> purpose of sparing object creation. How about a private field
> 'currentToken' that can be reused.

As Ted Dunning said, it's probably not worth worrying about the
additional object creation.

As it stands, the "reusableToken" class field means that the CSVParser
class is not thread-safe.

If the "reusableToken" and "record" class fields were moved into the
getRecord() method, the class would then be thread-safe.
(Assuming that the Lexer is thread-safe).

These fields are only used by the getRecord() method, so (if kept) it
would be sensible to try and localise them so they are only visible to
it.

One way to do this would be to make the CSVParser class abstract, and
move the getRecord() method into an implementation class.

> No method parameters are better than one method parameter :)

Not always ...

> 2. add additional convenience methods
> Right now we have some methods for char handling like isEndOfFile(c).
> There are some methods missing like isDelimiter(c) or
> isEncapsulator(c). There is not much to say about this. I just think
> that isDelimiter(c) is slightly easier to understand than c ==
> format.getDelimiter().

Agreed; also such methods can check if the item is disabled.
And the server JVM will inline them as necessary.

> 3. eliminate input parameter c on readEscape (and rename it ?)
> Right now we have to pass an int to readEscape, but the method does
> not use that parameter. So why do we keep it? Also the method does not
> really "read" an escape. It assumes, that is is called after a "/" and
> then returns the delimiter for a letter.

In theory I agree, but I think there's an problem with escape
processing - see CVS-58.
To fix this, we would sometimes need to retain the escape character.
It might be necessary for the method to return a different value
depending on whether the escape is backslash or not.

But if the parameter turns out not to be needed, let's drop it.

> 4. Get rid of those nasty while(true) loops!
> There are several while true loops. It is really hard to see what is
> going on, because you can not exactly see when a loop ends. The worst
> example for this is encapsulatedTokenLexer. It has an outer
> while(true) loop with a nested inner loop, that may return a token,
> terminating both loops.
> I've tried to eliminate those while true loops, but without success.

The outer loop could probably be replaced by

while((c==i.read()) != -1) {
}

// now check if EOF detected partway through a sequence

> If no one objects, I'd like to create patches for 1. & 2. I leave 3.
> and 4. for discussion...
>
> Regards,
> Benedikt
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

Re: [csv] Improving readability in CSVLexer

Posted by sebb <se...@gmail.com>.

On 22 March 2012 10:40, sebb <se...@gmail.com> wrote:
> On 22 March 2012 07:33, Benedikt Ritter <be...@googlemail.com> wrote:
>> Am 22. März 2012 00:19 schrieb sebb <se...@gmail.com>:
>>> On 21 March 2012 19:16, Benedikt Ritter <be...@googlemail.com> wrote:
>>>> Hey,
>>>>
>>>> I've tried to remove the Token input parameter in CSVLexer.nextToken().
>>>> First by creating ne new Token on every invocation of nextToken().
>>>> That slowed execution of that method by about 100ms.
>>>
>>> 100ms per iteration? Or for the entire file?
>>>
>>
>> for parsing the hole file
>
> I see a bigger slowdown when testing just the lexer.
>
>>>> So I added a
>>>> private Token field to CSVLexer, that only get's initiated once. But
>>>> that solution was also slower than the one we have now.
>>>>
>>>> I'm not sure what to do now. Any suggestions? Shall I create a patch
>>>> for you to review?
>>>
>>> It's not necessary to change that call, so let's leave it.
>>>
>>
>> It was plan of our plan to split up the parsing logic in smaller
>> methods to make it easier for the compiler to optimize the code. I
>> thought it may be easier if we got rid of that input parameter.
>
> OK, I see.
>

I think the lexer is now somewhat simpler (and is much faster).

>>>> Regards,
>>>> Benedikt
>>>>
>>>> Am 16. März 2012 17:06 schrieb Emmanuel Bourg <eb...@apache.org>:
>>>>> Le 16/03/2012 17:01, Emmanuel Bourg a écrit :
>>>>>
>>>>>
>>>>>>> 2. add additional convenience methods
>>>>>>> Right now we have some methods for char handling like isEndOfFile(c).
>>>>>>> There are some methods missing like isDelimiter(c) or
>>>>>>> isEncapsulator(c). There is not much to say about this. I just think
>>>>>>> that isDelimiter(c) is slightly easier to understand than c ==
>>>>>>> format.getDelimiter().
>>>>>
>>>>>
>>>>> Sorry I misread your phrase. Actually removing these methods resulted in a
>>>>> slower parsing. So yes give it a try.
>>>>>
>>>>>
>>>>> Emmanuel Bourg
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>>>>> For additional commands, e-mail: dev-help@commons.apache.org
>>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>>>> For additional commands, e-mail: dev-help@commons.apache.org
>>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>>> For additional commands, e-mail: dev-help@commons.apache.org
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>> For additional commands, e-mail: dev-help@commons.apache.org
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

Re: [csv] Improving readability in CSVLexer

Posted by sebb <se...@gmail.com>.

On 22 March 2012 07:33, Benedikt Ritter <be...@googlemail.com> wrote:
> Am 22. März 2012 00:19 schrieb sebb <se...@gmail.com>:
>> On 21 March 2012 19:16, Benedikt Ritter <be...@googlemail.com> wrote:
>>> Hey,
>>>
>>> I've tried to remove the Token input parameter in CSVLexer.nextToken().
>>> First by creating ne new Token on every invocation of nextToken().
>>> That slowed execution of that method by about 100ms.
>>
>> 100ms per iteration? Or for the entire file?
>>
>
> for parsing the hole file

I see a bigger slowdown when testing just the lexer.

>>> So I added a
>>> private Token field to CSVLexer, that only get's initiated once. But
>>> that solution was also slower than the one we have now.
>>>
>>> I'm not sure what to do now. Any suggestions? Shall I create a patch
>>> for you to review?
>>
>> It's not necessary to change that call, so let's leave it.
>>
>
> It was plan of our plan to split up the parsing logic in smaller
> methods to make it easier for the compiler to optimize the code. I
> thought it may be easier if we got rid of that input parameter.

OK, I see.

>>> Regards,
>>> Benedikt
>>>
>>> Am 16. März 2012 17:06 schrieb Emmanuel Bourg <eb...@apache.org>:
>>>> Le 16/03/2012 17:01, Emmanuel Bourg a écrit :
>>>>
>>>>
>>>>>> 2. add additional convenience methods
>>>>>> Right now we have some methods for char handling like isEndOfFile(c).
>>>>>> There are some methods missing like isDelimiter(c) or
>>>>>> isEncapsulator(c). There is not much to say about this. I just think
>>>>>> that isDelimiter(c) is slightly easier to understand than c ==
>>>>>> format.getDelimiter().
>>>>
>>>>
>>>> Sorry I misread your phrase. Actually removing these methods resulted in a
>>>> slower parsing. So yes give it a try.
>>>>
>>>>
>>>> Emmanuel Bourg
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>>>> For additional commands, e-mail: dev-help@commons.apache.org
>>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>>> For additional commands, e-mail: dev-help@commons.apache.org
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>> For additional commands, e-mail: dev-help@commons.apache.org
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

Re: [csv] Improving readability in CSVLexer

Posted by Benedikt Ritter <be...@googlemail.com>.

Am 22. März 2012 00:19 schrieb sebb <se...@gmail.com>:
> On 21 March 2012 19:16, Benedikt Ritter <be...@googlemail.com> wrote:
>> Hey,
>>
>> I've tried to remove the Token input parameter in CSVLexer.nextToken().
>> First by creating ne new Token on every invocation of nextToken().
>> That slowed execution of that method by about 100ms.
>
> 100ms per iteration? Or for the entire file?
>

for parsing the hole file

>> So I added a
>> private Token field to CSVLexer, that only get's initiated once. But
>> that solution was also slower than the one we have now.
>>
>> I'm not sure what to do now. Any suggestions? Shall I create a patch
>> for you to review?
>
> It's not necessary to change that call, so let's leave it.
>

It was plan of our plan to split up the parsing logic in smaller
methods to make it easier for the compiler to optimize the code. I
thought it may be easier if we got rid of that input parameter.

>> Regards,
>> Benedikt
>>
>> Am 16. März 2012 17:06 schrieb Emmanuel Bourg <eb...@apache.org>:
>>> Le 16/03/2012 17:01, Emmanuel Bourg a écrit :
>>>
>>>
>>>>> 2. add additional convenience methods
>>>>> Right now we have some methods for char handling like isEndOfFile(c).
>>>>> There are some methods missing like isDelimiter(c) or
>>>>> isEncapsulator(c). There is not much to say about this. I just think
>>>>> that isDelimiter(c) is slightly easier to understand than c ==
>>>>> format.getDelimiter().
>>>
>>>
>>> Sorry I misread your phrase. Actually removing these methods resulted in a
>>> slower parsing. So yes give it a try.
>>>
>>>
>>> Emmanuel Bourg
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>>> For additional commands, e-mail: dev-help@commons.apache.org
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>> For additional commands, e-mail: dev-help@commons.apache.org
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

Re: [csv] Improving readability in CSVLexer

Posted by sebb <se...@gmail.com>.

On 21 March 2012 19:16, Benedikt Ritter <be...@googlemail.com> wrote:
> Hey,
>
> I've tried to remove the Token input parameter in CSVLexer.nextToken().
> First by creating ne new Token on every invocation of nextToken().
> That slowed execution of that method by about 100ms.

100ms per iteration? Or for the entire file?

> So I added a
> private Token field to CSVLexer, that only get's initiated once. But
> that solution was also slower than the one we have now.
>
> I'm not sure what to do now. Any suggestions? Shall I create a patch
> for you to review?

It's not necessary to change that call, so let's leave it.

> Regards,
> Benedikt
>
> Am 16. März 2012 17:06 schrieb Emmanuel Bourg <eb...@apache.org>:
>> Le 16/03/2012 17:01, Emmanuel Bourg a écrit :
>>
>>
>>>> 2. add additional convenience methods
>>>> Right now we have some methods for char handling like isEndOfFile(c).
>>>> There are some methods missing like isDelimiter(c) or
>>>> isEncapsulator(c). There is not much to say about this. I just think
>>>> that isDelimiter(c) is slightly easier to understand than c ==
>>>> format.getDelimiter().
>>
>>
>> Sorry I misread your phrase. Actually removing these methods resulted in a
>> slower parsing. So yes give it a try.
>>
>>
>> Emmanuel Bourg
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
>> For additional commands, e-mail: dev-help@commons.apache.org
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

Re: [csv] Improving readability in CSVLexer

Posted by Benedikt Ritter <be...@googlemail.com>.

Hey,

I've tried to remove the Token input parameter in CSVLexer.nextToken().
First by creating ne new Token on every invocation of nextToken().
That slowed execution of that method by about 100ms. So I added a
private Token field to CSVLexer, that only get's initiated once. But
that solution was also slower than the one we have now.

I'm not sure what to do now. Any suggestions? Shall I create a patch
for you to review?

Regards,
Benedikt

Am 16. März 2012 17:06 schrieb Emmanuel Bourg <eb...@apache.org>:
> Le 16/03/2012 17:01, Emmanuel Bourg a écrit :
>
>
>>> 2. add additional convenience methods
>>> Right now we have some methods for char handling like isEndOfFile(c).
>>> There are some methods missing like isDelimiter(c) or
>>> isEncapsulator(c). There is not much to say about this. I just think
>>> that isDelimiter(c) is slightly easier to understand than c ==
>>> format.getDelimiter().
>
>
> Sorry I misread your phrase. Actually removing these methods resulted in a
> slower parsing. So yes give it a try.
>
>
> Emmanuel Bourg
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
> For additional commands, e-mail: dev-help@commons.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

Re: [csv] Improving readability in CSVLexer

Posted by Emmanuel Bourg <eb...@apache.org>.

Le 16/03/2012 17:01, Emmanuel Bourg a écrit :

>> 2. add additional convenience methods
>> Right now we have some methods for char handling like isEndOfFile(c).
>> There are some methods missing like isDelimiter(c) or
>> isEncapsulator(c). There is not much to say about this. I just think
>> that isDelimiter(c) is slightly easier to understand than c ==
>> format.getDelimiter().

Sorry I misread your phrase. Actually removing these methods resulted in 
a slower parsing. So yes give it a try.

Emmanuel Bourg

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org

Re: [csv] Improving readability in CSVLexer

Posted by Emmanuel Bourg <eb...@apache.org>.

Le 16/03/2012 13:33, Benedikt Ritter a écrit :

> 1. eliminate Token input parameter on nextToken()
> To me it looks like the token input parameter on nextToken() has the
> purpose of sparing object creation. How about a private field
> 'currentToken' that can be reused. No method parameters are better
> than one method parameter :)

This was the purpose of a ticket in JIRA to reduce memory allocations. 
In the light of the recent comments by Ted I'm curious to see if we can 
improve the performance without it.


> 2. add additional convenience methods
> Right now we have some methods for char handling like isEndOfFile(c).
> There are some methods missing like isDelimiter(c) or
> isEncapsulator(c). There is not much to say about this. I just think
> that isDelimiter(c) is slightly easier to understand than c ==
> format.getDelimiter().

Be careful with the performance, I tried that and it was slower on my 
system.


> 3. eliminate input parameter c on readEscape (and rename it ?)
> Right now we have to pass an int to readEscape, but the method does
> not use that parameter. So why do we keep it? Also the method does not
> really "read" an escape. It assumes, that is is called after a "/" and
> then returns the delimiter for a letter.

I agree, and in general I find weird to give the current character to 
all the lexer methods.


> 4. Get rid of those nasty while(true) loops!
> There are several while true loops. It is really hard to see what is
> going on, because you can not exactly see when a loop ends. The worst
> example for this is encapsulatedTokenLexer. It has an outer
> while(true) loop with a nested inner loop, that may return a token,
> terminating both loops.
> I've tried to eliminate those while true loops, but without success.

I see nothing wrong with the loops, even the JDK in BufferedReader uses 
them. But if you find a faster alternative, go for it!

Emmanuel Bourg

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org