Posted to dev@commons.apache.org by Stuart Robertson <st...@gmail.com> on 2007/04/06 19:44:59 UTC

[CSV] A few questions and comments

I just looked over the codebase and have a few questions.

First, I'm wondering if some simple invalid format detection might be
added as a configuration option.  Something to detect whether a given
input might even be theoretically parseable.  I'd like to be able to
detect, for instance, that this is a binary file, or maybe if it
doesn't seem to contain a consistent separator pattern (line 1 has 10
columns, line 2 only 6).  Basically anything to detect an invalid file
up front rather than have garbage passed on to the code using
CSVParser.
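
A minimal sketch of the two checks mentioned above, written as a
standalone helper.  The class and method names are hypothetical and not
part of the commons-csv codebase, and the column check naively ignores
separators inside quoted fields:

```java
import java.util.List;

// Hypothetical helper for illustration; not part of commons-csv.
final class CsvSanityCheck {

    // Treats control characters other than tab, CR and LF as a sign of binary data.
    static boolean looksBinary(String sample) {
        for (int i = 0; i < sample.length(); i++) {
            char c = sample.charAt(i);
            if (c < 0x20 && c != '\t' && c != '\r' && c != '\n') {
                return true;
            }
        }
        return false;
    }

    // True if every line splits into the same number of columns on the separator.
    // Assumption: separators inside quoted fields are not handled.
    static boolean hasConsistentColumnCount(List<String> lines, char separator) {
        int expected = -1;
        for (String line : lines) {
            int columns = 1;
            for (int i = 0; i < line.length(); i++) {
                if (line.charAt(i) == separator) {
                    columns++;
                }
            }
            if (expected == -1) {
                expected = columns;
            } else if (columns != expected) {
                return false;
            }
        }
        return true;
    }
}
```

Either check could run on the same sample lines the guesser already
reads, so a bad file is rejected before parsing starts.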

Second, any thoughts on how guessFieldSeperator can infer if it's TDF
or CSV?  Or maybe what flavor of CSV format the file might be using
(Excel or otherwise).  I see the CSVConfigGuesser attempts to
determine whether the file is fixed width.  And the method
guessFieldSeperator() seems to have a placeholder for guessing the
field separator, but currently that portion is an empty for loop.

Thinking about how that might be implemented, what if a regex counted
the occurrences of common separators in each of the "guess input"
lines?  A reasonable heuristic might be to guess the separator that
occurs the same number of times in every line, and we could go with
that.  Does this sound reasonable?  Or maybe there's a better way to
do it?
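
To make the idea concrete, here is a rough sketch of that heuristic.
It is only an illustration under my own assumptions: the class name and
the candidate list are made up, it uses a plain character count instead
of a regex, and it does not account for separators inside quoted fields.

```java
import java.util.List;

// Hypothetical sketch of the "same count on every line" heuristic.
final class SeparatorGuess {

    // Candidate separators in order of preference; this list is an assumption.
    private static final char[] CANDIDATES = {',', '\t', ';', '|'};

    // Returns the first candidate that occurs the same non-zero number of
    // times in every line, or the NUL character if no candidate qualifies.
    static char guess(List<String> lines) {
        for (char candidate : CANDIDATES) {
            boolean consistent = !lines.isEmpty();
            int expected = -1;
            for (String line : lines) {
                int count = count(line, candidate);
                if (expected == -1) {
                    expected = count;
                }
                if (count == 0 || count != expected) {
                    consistent = false;
                    break;
                }
            }
            if (consistent) {
                return candidate;
            }
        }
        return '\0';
    }

    // Counts how often c occurs in line.
    private static int count(String line, char c) {
        int n = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == c) {
                n++;
            }
        }
        return n;
    }
}
```

Returning NUL when nothing matches leaves it to the caller to decide
whether to fall back to a default or reject the input.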

In general, I think it'd be a valuable feature for the guesser to be
as robust as possible for a range of input types.  Even if it weren't
possible to make it perfect, for uses where the application can't
completely control the format coming in, being fairly robust in the
face of a variety of types would be outstanding.

One last observation.  CSVConfigGuesser looks intended to use the
first 10 lines of input if available for inferring the right config.
But looking at the code, it looks to me like it will actually read in
the entire file.  Here's the code (from SVN) I'm writing about:

/**
 * Guess the config based on the first 10 (or less when less available)
 * records of a CSV file.
 *
 * @return the guessed config.
 */
public CSVConfig guess() {
    try {
        // tralalal
        BufferedReader bIn = new BufferedReader(new InputStreamReader((getInputStream())));
        String[] lines = new String[10];
        String line = null;
        int counter = 0;
        while ( (line = bIn.readLine()) != null || counter > 10) {  // <----- Typo?
            lines[counter] = line;
            counter++;
        }
        if (counter < 10) {
            // remove nulls from the array, so we can skip the null checking.
            String[] newLines = new String[counter];
            System.arraycopy(lines, 0, newLines, 0, counter);
            lines = newLines;
        }

Shouldn't the line I've marked "Typo?" stop reading when the file ends
or once 10 lines have been read?  In a while loop, that condition would
be "(line = bIn.readLine()) != null && counter < 10".
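
For illustration, a bounded read could look roughly like the sketch
below.  This is not the committed fix; the class and method names are
hypothetical, and it collects lines in a list so there are no trailing
nulls to strip afterwards.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not the code in SVN.
final class FirstLines {

    // Reads at most max lines and stops early at end of input.
    static List<String> read(Reader in, int max) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        List<String> lines = new ArrayList<String>();
        String line;
        // Check the limit first so no extra line is consumed once it is reached.
        while (lines.size() < max && (line = reader.readLine()) != null) {
            lines.add(line);
        }
        return lines;
    }
}
```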

Thanks,

Stu Robertson

---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org


Re: [CSV] A few questions and comments

Posted by Martin van den Bemt <ml...@mvdb.net>.
Hi Stuart,

Just a clarification up front: currently csv has 2 codebases: the one in the writer package is what
I have written (I don't remember whether someone else has worked on it, though) and the main o.a.c.csv
package is what the codebase started with. Because of lack of time and the fact that people seemed
to be more interested in what was in the main package, I just continued doing the writer package for
private use. My focus is configurability and simple patterns to be able to easily integrate csv in
web applications and front end applications.
Afaik there is no interaction between the 2 packages :)

Stuart Robertson wrote:
> I just looked over the codebase and have a few questions.
> 
> First, I'm wondering if some simple invalid format detection might be
> added as a configuration option.  Something to detect whether a given
> input might even be theoretically parseable.  I'd like to be able to
> detect, for instance, that this is a binary file, or maybe if it
> doesn't seem to contain a consistent separator pattern (line 1 has 10
> columns, line 2 only 6).  Basically anything to detect upfront an
> invalid file condition rather than have garbage be passed into the
> file using CSVParser.

The ConfigGuesser could be reused to achieve this. The main goal for ConfigGuesser is to limit user
configuration.

> 
> Second, any thoughts on how guessFieldSeperator can infer if it's TDF
> or CSV?  Or maybe what flavor of CSV format the file might be using
> (Excel or otherwise).  I see the CSVConfigGuesser attempts to
> determine whether the file is fixed width.  And the method
> guessFieldSeperator() seems to have a placeholder for guessing the
> file separator, but currently that portion is an empty for loop.

It's far from finished and very buggy :) It's the concept I wanted to draw attention to.

> 
> Thinking about how that might be implemented, what if a regex counted
> the occurrences of common separators in each of the "guess input"
> lines.  A reasonable heuristic might be that the separator guess is
> that separator that has a common occurrence count in each line, and we
> could go with that.  Does this sound reasonable?  Or maybe there's a
> better way to do it?

There are a lot of problems with guessing the format :)

> 
> In general, I think it'd be a valuable feature for the guesser to be
> as robust as possible for a range of input types.  Even if it weren't
> possible to make it perfect, for uses where the application can't
> completely control the format coming in, being fairly robust in the
> face of a variety of types would be outstanding.

Robust would be nice, but pretty hard to achieve. Maybe some way of setting the configguesser
strategy can make the thing more robust for the scenario you are using it for. My usage scenario is
that I don't have a clue what people want to use as their text format (and I don't care), so
guessing should be as flexible as possible. If, e.g., a wizard lets people change the behavior of the
configguesser, for instance by stating that this csv file has 10 fields, you can make your system more
robust. So, e.g., 1010101 probably means that the separator is 0 (1 is at the start and the end, so it
is most likely a value). If the user specifies that the csv file has only one field, we know 0 is not
the separator.

So I prefer not to limit the options out of the box, but to have some kind of strategy that can
limit the options (in my case, users who specify that the csv has 10 fields). We could also provide
standard strategies, like the default Excel export format.
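
As a rough sketch of what such a strategy could look like, using the 1010101 example from above: the
interface and class names here are hypothetical and do not exist in either csv package, and fields
are counted naively without any quote handling.

```java
import java.util.List;

// Hypothetical strategy hook; not an existing commons-csv interface.
interface GuessStrategy {
    // Returns true if the candidate separator is still plausible for these lines.
    boolean accepts(char candidate, List<String> lines);
}

// Accepts a candidate only if every line splits into the user-specified number of fields.
final class FieldCountStrategy implements GuessStrategy {

    private final int expectedFields;

    FieldCountStrategy(int expectedFields) {
        this.expectedFields = expectedFields;
    }

    public boolean accepts(char candidate, List<String> lines) {
        for (String line : lines) {
            int fields = 1;
            for (int i = 0; i < line.length(); i++) {
                if (line.charAt(i) == candidate) {
                    fields++;
                }
            }
            if (fields != expectedFields) {
                return false;
            }
        }
        return true;
    }
}
```

With the line 1010101, a strategy expecting 4 fields accepts 0 as the separator, while a strategy
expecting 1 field rejects it.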

> 
> One last observation.  CSVConfigGuesser looks intended to use the
> first 10 lines of input if available for inferring the right config.
> But looking at the code, it looks to me like it will actually read in
> the entire file.  Here's the code (from SVN) I'm writing about:

Yeah the code is bad, very bad :) I just committed the guesser as a concept. Almost every line is
bad to be honest :).

Fixed the while loop in subversion.

Mvgr,
Martin
