You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@commons.apache.org by sebb <se...@gmail.com> on 2012/10/13 13:55:55 UTC

[CSV] Lexer and Character conversion

Before r1397883, the Lexer operated only on char fields; it's now been
converted to use Character, which means that unboxing is needed.

Also, the Character fields need to be checked for null before use.

It has just occurred to me that there is a genuine illegal char value
for everything except the delimiter - that is, the delimiter itself.
It does not make sense for there to be no delimiter, nor does it make
sense for any other meta-character to be the same as the delimiter.

So rather than having

    Character escape;
...
    boolean isEscape(final int c) {
        return escape != null && c == escape.charValue();
    }

one could use the simpler (and more efficient)

    char escape;
...
    boolean isEscape(final int c) { // similarly for isEncapsulator etc.
        return escape != delimiter;
    }

This would have the added bonus of automatically disallowing delimiter
as the escape (or encapsulator etc.) because they would not be
recognised.
[At present the code does not check this]

The Lexer ctor would need to be changed to convert a null escape
Character (comment etc) to the delimiter.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [CSV] Lexer and Character conversion

Posted by sebb <se...@gmail.com>.
On 13 October 2012 14:31, sebb <se...@gmail.com> wrote:
> On 13 October 2012 12:55, sebb <se...@gmail.com> wrote:
>> Before r1397883, the Lexer operated only on char fields; it's now been
>> converted to use Character, which means that unboxing is needed.
>>
>> Also, the Character fields need to be checked for null before use.
>>
>> It has just occurred to me that there is a genuine illegal char value
>> for everything except the delimiter - that is, the delimiter itself.
>> It does not make sense for there to be no delimiter, nor does it make
>> sense for any other meta-character to be the same as the delimiter.
>>
>> So rather than having
>>
>>     Character escape;
>> ...
>>     boolean isEscape(final int c) {
>>         return escape != null && c == escape.charValue();
>>     }
>>
>> one could use the simpler (and more efficient)
>>
>>     char escape;
>> ...
>>     boolean isEscape(final int c) { // similarly for isEncapsulator etc.
>>         return escape != delimiter;
>>     }
>
> Sorry, that's rubbish; it needs to be:
>
>      boolean isEscape(final int c) { // similarly for isEncapsulator etc.
>          return escape != delimiter && c == escape;
>      }
>
> which is hardly better than before.
>
> However, if the Lexer ctor ensures that escape (etc) can never be the
> same as the delimiter, then the check can be simplified to:
>
>      boolean isEscape(final int c) { // similarly for isEncapsulator etc.
>          return c == escape;
>      }

Rats! (I've not had coffee yet today!)

That last simplification does not work; the check needs to be:

     boolean isEscape(final int c) { // similarly for isEncapsulator etc.
         return c != delimiter && c == escape;
     }

The check for c != delimiter could be eliminated by setting the escape
char to DISABLED if its Character is null.

That would effectively revert back to the original code, except that
the DISABLED char would only be used in the Lexer.

See patch attached to https://issues.apache.org/jira/browse/CSV-94

This passes the unit tests.

>
>> This would have the added bonus of automatically disallowing delimiter
>> as the escape (or encapsulator etc.) because they would not be
>> recognised.
>> [At present the code does not check this]
>>
>> The Lexer ctor would need to be changed to convert a null escape
>> Character (comment etc) to the delimiter.
>
> and it would need to throw IAE or similar if any of the meta chars
> match the delimiter.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [CSV] Lexer and Character conversion

Posted by sebb <se...@gmail.com>.
On 13 October 2012 12:55, sebb <se...@gmail.com> wrote:
> Before r1397883, the Lexer operated only on char fields; it's now been
> converted to use Character, which means that unboxing is needed.
>
> Also, the Character fields need to be checked for null before use.
>
> It has just occurred to me that there is a genuine illegal char value
> for everything except the delimiter - that is, the delimiter itself.
> It does not make sense for there to be no delimiter, nor does it make
> sense for any other meta-character to be the same as the delimiter.
>
> So rather than having
>
>     Character escape;
> ...
>     boolean isEscape(final int c) {
>         return escape != null && c == escape.charValue();
>     }
>
> one could use the simpler (and more efficient)
>
>     char escape;
> ...
>     boolean isEscape(final int c) { // similarly for isEncapsulator etc.
>         return escape != delimiter;
>     }

Sorry, that's rubbish; it needs to be:

     boolean isEscape(final int c) { // similarly for isEncapsulator etc.
         return escape != delimiter && c == escape;
     }

which is hardly better than before.

However, if the Lexer ctor ensures that escape (etc) can never be the
same as the delimiter, then the check can be simplified to:

     boolean isEscape(final int c) { // similarly for isEncapsulator etc.
         return c == escape;
     }

> This would have the added bonus of automatically disallowing delimiter
> as the escape (or encapsulator etc.) because they would not be
> recognised.
> [At present the code does not check this]
>
> The Lexer ctor would need to be changed to convert a null escape
> Character (comment etc) to the delimiter.

and it would need to throw IAE or similar if any of the meta chars
match the delimiter.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org