You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@commons.apache.org by Benedikt Ritter <be...@googlemail.com> on 2012/03/12 18:17:01 UTC

[cvs] CSVLexer.isEndOfLine(int c) makes assumptions on the line separator of a CSVFormat

Hi,

while looking for potential performance optimization I came across
CSVLexer.isEndOfLine(int c). Here is the source:

    private boolean isEndOfLine(int c) throws IOException {
        // check if we have \r\n...
        if (c == '\r' && in.lookAhead() == '\n') {
            // note: does not change c outside of this method !!
            c = in.read();
        }
        return (c == '\n' || c == '\r');
    }

this method assumes, that a line separator will always be "\r" or
"\r\n". This is true for the pre-configured CSVFormats EXCEL, TDF and
MYSQL. I'm not a pro when it comes to file encoding, but isn't there
the possibility that new encodings will have different line
separators?
If that is the case, isEndOfLine() should somehow use
format.getLineSeparator().
For example the lookAhead only has to be made, if
lineSeperator.length() > 1. This may have a positive impact on the
performance of parsing files with an encoding whose line separator is
only one char long.

Benedikt

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [cvs] CSVLexer.isEndOfLine(int c) makes assumptions on the line separator of a CSVFormat

Posted by Emmanuel Bourg <eb...@apache.org>.
Le 12/03/2012 18:52, sebb a écrit :

> The only possible ambiguity is whether the file uses CR, LF, or CRLF.

Yes that's what I mean by knowing in advance.

The line separator is not obvious when you open the file in an editor. 
That's not the case for the delimiter.

Emmanuel Bourg


Re: [cvs] CSVLexer.isEndOfLine(int c) makes assumptions on the line separator of a CSVFormat

Posted by sebb <se...@gmail.com>.
On 12 March 2012 17:38, Emmanuel Bourg <eb...@apache.org> wrote:
> Le 12/03/2012 18:31, Benedikt Ritter a écrit :
>
>
>> I'm not sure if I got you right. You have to pass a CSVFormat if you
>> want to construct a CSVLexer(), so we could use the lexer's internal
>> CSVformat.
>
>
> Yes that's what I understood, when I mention the parser it includes the
> lexer as well.
>
> I think the parser should look for any line separator and not only the one
> defined in the format. Otherwise the user will have to know in advance the
> line separator of the input files, and that's not always possible.

Not sure I follow that.
The user must know the field separator and quote character in advance;
surely they need to know the line-separator also?
The only possible ambiguity is whether the file uses CR, LF, or CRLF.

> Emmanuel Bourg
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [cvs] CSVLexer.isEndOfLine(int c) makes assumptions on the line separator of a CSVFormat

Posted by Emmanuel Bourg <eb...@apache.org>.
Le 12/03/2012 18:51, Benedikt Ritter a écrit :

> you wrote about the printer ;)

Yes, because the line separator of CSVFormat is only used there.

Usually you have to be permissive in what your read, but strict on what 
your write.

Emmanuel Bourg


Re: [cvs] CSVLexer.isEndOfLine(int c) makes assumptions on the line separator of a CSVFormat

Posted by Benedikt Ritter <be...@googlemail.com>.
Am 12. März 2012 18:38 schrieb Emmanuel Bourg <eb...@apache.org>:
> Le 12/03/2012 18:31, Benedikt Ritter a écrit :
>
>
>> I'm not sure if I got you right. You have to pass a CSVFormat if you
>> want to construct a CSVLexer(), so we could use the lexer's internal
>> CSVformat.
>
>
> Yes that's what I understood, when I mention the parser it includes the
> lexer as well.
>

you wrote about the printer ;) now I know what you mean.

> I think the parser should look for any line separator and not only the one
> defined in the format. Otherwise the user will have to know in advance the
> line separator of the input files, and that's not always possible.
>

Okay, then I guess the best would be to have some place where all
possible line separators are defined. That could be a custom enum,
that has a string value.
then we could do something like:

private isEndOfLine(int c){
    for(LineSeparator separator : LineSeparator.values()) {
        // do what is necessary to decide if we have a line separator
and return true
    }
    return false;
}

Since the issue is tagged 1.x, I guess it is not our priority,
although I see the potential for a performance improvements.

> Emmanuel Bourg
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [cvs] CSVLexer.isEndOfLine(int c) makes assumptions on the line separator of a CSVFormat

Posted by Emmanuel Bourg <eb...@apache.org>.
Le 12/03/2012 18:31, Benedikt Ritter a écrit :

> I'm not sure if I got you right. You have to pass a CSVFormat if you
> want to construct a CSVLexer(), so we could use the lexer's internal
> CSVformat.

Yes that's what I understood, when I mention the parser it includes the 
lexer as well.

I think the parser should look for any line separator and not only the 
one defined in the format. Otherwise the user will have to know in 
advance the line separator of the input files, and that's not always 
possible.

Emmanuel Bourg


Re: [cvs] CSVLexer.isEndOfLine(int c) makes assumptions on the line separator of a CSVFormat

Posted by Benedikt Ritter <be...@googlemail.com>.
Am 12. März 2012 18:24 schrieb Emmanuel Bourg <eb...@apache.org>:
> Le 12/03/2012 18:17, Benedikt Ritter a écrit :
>
>
>> this method assumes, that a line separator will always be "\r" or
>> "\r\n". This is true for the pre-configured CSVFormats EXCEL, TDF and
>> MYSQL. I'm not a pro when it comes to file encoding, but isn't there
>> the possibility that new encodings will have different line
>> separators?
>
>
> Indeed, there are unicode line separators, see:
>
> https://issues.apache.org/jira/browse/CSV-51
>
>
>> If that is the case, isEndOfLine() should somehow use
>> format.getLineSeparator().
>> For example the lookAhead only has to be made, if
>> lineSeperator.length()>  1. This may have a positive impact on the
>> performance of parsing files with an encoding whose line separator is
>> only one char long.
>
>
> CSVFormat defines a line separator, but it's only used by CSVPrinter. I'm
> not sure if we should restrict to this separator when parsing.
>

I'm not sure if I got you right. You have to pass a CSVFormat if you
want to construct a CSVLexer(), so we could use the lexer's internal
CSVformat.

> Emmanuel Bourg
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [cvs] CSVLexer.isEndOfLine(int c) makes assumptions on the line separator of a CSVFormat

Posted by Emmanuel Bourg <eb...@apache.org>.
Le 12/03/2012 18:17, Benedikt Ritter a écrit :

> this method assumes, that a line separator will always be "\r" or
> "\r\n". This is true for the pre-configured CSVFormats EXCEL, TDF and
> MYSQL. I'm not a pro when it comes to file encoding, but isn't there
> the possibility that new encodings will have different line
> separators?

Indeed, there are unicode line separators, see:

https://issues.apache.org/jira/browse/CSV-51

> If that is the case, isEndOfLine() should somehow use
> format.getLineSeparator().
> For example the lookAhead only has to be made, if
> lineSeperator.length()>  1. This may have a positive impact on the
> performance of parsing files with an encoding whose line separator is
> only one char long.

CSVFormat defines a line separator, but it's only used by CSVPrinter. 
I'm not sure if we should restrict to this separator when parsing.

Emmanuel Bourg