Posted to user@commons.apache.org by Tim Allison <ta...@apache.org> on 2019/02/25 15:23:38 UTC

[csv] csv format detector/sniffer?

Commons-CSV team,

  We recently integrated Commons-CSV into Apache Tika.  For now, we’re
relying strictly on the filename for csv detection, and we’re relying
on our AutodetectReader to identify the charset.  It would be really
useful for us to be able to detect:

1) A csv/tsv file vs a regular .txt file by content heuristics
2) The parameters: delimiter, escape and quote characters

  We realize that no detection will be perfect, but we have two questions:

1) Do you have any pointers for this kind of thing?
2) If we develop it, would you want to put it in commons-csv or should
we leave it in Tika?  I'm not sure, yet, if there'd be a clean/useful
way to integrate this without using a charset detector...but we can
hold off on that for now.

  Thank you for all of your fantastic work!

           Cheers,

                           Tim

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@commons.apache.org
For additional commands, e-mail: user-help@commons.apache.org

Re: [csv] csv format detector/sniffer?

Posted by sebb <se...@gmail.com>.

On Mon, 25 Feb 2019 at 18:38, Tim Allison <ta...@apache.org> wrote:
>
> Hi Gary,
>
> Our charset detector stuff is a combo of html-metaheader detection,
> juniversalchardet and a cut and paste of a small portion of icu4j...we
> could add that to commons-io, but I don't think you'd want to add
> juniversalchardet as a dependency or would you?  Happy to discuss...

I think the HTML stuff is out of scope for IO; not sure about the other bits.

> My main question to commons-csv was intended rather to focus on:
>
> 1) text vs csv detection (aside from filename glob)
> 2) detection of most likely: a) delimiter, b) quote character, c)
> escape character

That seems reasonable for CSV.

But it should probably be in its own package as it is somewhat outside
the rest of CSV.


>  More like:
>
> org.apache.commons.csv.CSVParser.parse(path, charset);
>
> or ideally:
>
> CSVFormat format = CSVDetector.detect(path);
>
> where format includes charset and one value is "probably straight
> text, not likely a csv"
>
> On Mon, Feb 25, 2019 at 10:39 AM Gary Gregory <ga...@gmail.com> wrote:
> >
> > Hi,
> >
> > A Charset detector sounds like something generally useful that belongs in
> > Commons IO.
> >
> > Path path = Paths.get(...);
> > Charset charset = org.apache.commons.io.CharsetDetector.detect(path);
> > org.apache.commons.csv.CSVParser.parse(path, charset, csvFormat);
> >
> > Thoughts?
> >
> > Gary
> >
> >
> > On Mon, Feb 25, 2019 at 10:23 AM Tim Allison <ta...@apache.org> wrote:
> >
> > > Commons-CSV team,
> > >
> > >   We recently integrated Commons-CSV into Apache Tika.  For now, we’re
> > > relying strictly on the filename for csv detection, and we’re relying
> > > on our AutodetectReader to identify the charset.  It would be really
> > > useful for us to be able to detect:
> > >
> > > 1) A csv/tsv file vs a regular .txt file by content heuristics
> > > 2) The parameters: delimiter, escape and quote characters
> > >
> > >   We realize that no detection will be perfect, but we have two questions:
> > >
> > > 1) Do you have any pointers for this kind of thing?
> > > 2) If we develop it, would you want to put it in commons-csv or should
> > > we leave it in Tika?  I'm not sure, yet, if there'd be a clean/useful
> > > way to integrate this without using a charset detector...but we can
> > > hold off on that for now.
> > >
> > >   Thank you for all of your fantastic work!
> > >
> > >            Cheers,
> > >
> > >                            Tim


Re: [csv] csv format detector/sniffer?

Posted by Tim Allison <ta...@apache.org>.

Hi Gary,

Our charset detector stuff is a combo of html-metaheader detection,
juniversalchardet and a cut and paste of a small portion of icu4j...we
could add that to commons-io, but I don't think you'd want to add
juniversalchardet as a dependency or would you?  Happy to discuss...

My main question to commons-csv was intended rather to focus on:

1) text vs csv detection (aside from filename glob)
2) detection of most likely: a) delimiter, b) quote character, c)
escape character

 More like:

org.apache.commons.csv.CSVParser.parse(path, charset);

or ideally:

CSVFormat format = CSVDetector.detect(path);

where format includes charset and one value is "probably straight
text, not likely a csv"
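
For illustration only, the delimiter half of that idea might be sketched as below. DelimiterSniffer, its candidate list, and the "same non-zero count on every line" rule are assumptions made up for this sketch; none of it is existing Commons CSV or Tika code. It ignores escape characters and quoted fields that span lines, and an empty result stands in for "probably straight text, not likely a csv".

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical helper for illustration; not part of Commons CSV or Tika. */
final class DelimiterSniffer {

    // Candidate delimiters to try; this list is an assumption.
    private static final char[] CANDIDATES = {',', '\t', ';', '|'};

    /** Counts occurrences of the delimiter in a line, skipping text inside double quotes. */
    static int countOutsideQuotes(String line, char delimiter) {
        int count = 0;
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                // A doubled quote ("") toggles twice, so the quoted state is preserved.
                inQuotes = !inQuotes;
            } else if (c == delimiter && !inQuotes) {
                count++;
            }
        }
        return count;
    }

    /**
     * Returns the candidate that occurs the same non-zero number of times on every
     * non-empty sample line, preferring the highest count. Empty means no candidate
     * was consistent, i.e. the sample is probably plain text.
     */
    static Optional<Character> sniff(List<String> sampleLines) {
        Character best = null;
        int bestCount = 0;
        for (char candidate : CANDIDATES) {
            int expected = -1; // -1 until the first non-empty line has been counted
            boolean consistent = true;
            for (String line : sampleLines) {
                if (line.isEmpty()) {
                    continue;
                }
                int n = countOutsideQuotes(line, candidate);
                if (expected == -1) {
                    expected = n;
                } else if (n != expected) {
                    consistent = false;
                    break;
                }
            }
            if (consistent && expected > bestCount) {
                best = candidate;
                bestCount = expected;
            }
        }
        return Optional.ofNullable(best);
    }
}
```

A caller would feed it the first few decoded lines of the file, so it still depends on the charset being known first. A real detector would likely need to score candidates rather than demand exact agreement, since ragged rows are common in practice.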

On Mon, Feb 25, 2019 at 10:39 AM Gary Gregory <ga...@gmail.com> wrote:
>
> Hi,
>
> A Charset detector sounds like something generally useful that belongs in
> Commons IO.
>
> Path path = Paths.get(...);
> Charset charset = org.apache.commons.io.CharsetDetector.detect(path);
> org.apache.commons.csv.CSVParser.parse(path, charset, csvFormat);
>
> Thoughts?
>
> Gary
>
>
> On Mon, Feb 25, 2019 at 10:23 AM Tim Allison <ta...@apache.org> wrote:
>
> > Commons-CSV team,
> >
> >   We recently integrated Commons-CSV into Apache Tika.  For now, we’re
> > relying strictly on the filename for csv detection, and we’re relying
> > on our AutodetectReader to identify the charset.  It would be really
> > useful for us to be able to detect:
> >
> > 1) A csv/tsv file vs a regular .txt file by content heuristics
> > 2) The parameters: delimiter, escape and quote characters
> >
> >   We realize that no detection will be perfect, but we have two questions:
> >
> > 1) Do you have any pointers for this kind of thing?
> > 2) If we develop it, would you want to put it in commons-csv or should
> > we leave it in Tika?  I'm not sure, yet, if there'd be a clean/useful
> > way to integrate this without using a charset detector...but we can
> > hold off on that for now.
> >
> >   Thank you for all of your fantastic work!
> >
> >            Cheers,
> >
> >                            Tim

Re: [csv] csv format detector/sniffer?

Posted by Gary Gregory <ga...@gmail.com>.

Hi,

A Charset detector sounds like something generally useful that belongs in
Commons IO.

Path path = Paths.get(...);
Charset charset = org.apache.commons.io.CharsetDetector.detect(path);
org.apache.commons.csv.CSVParser.parse(path, charset, csvFormat);
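
The CharsetDetector named above is a proposal, so as a rough sketch of the simplest thing such a class could do, here is a byte-order-mark check using only the JDK. BomCharsetSniffer and its detect signature are hypothetical names for this example. It recognizes only UTF-8 and UTF-16 BOMs, does no statistical detection of the kind juniversalchardet or icu4j perform, and skips UTF-32 because its little-endian BOM begins with the same two bytes as UTF-16LE.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not an existing Commons IO class. */
final class BomCharsetSniffer {

    /**
     * Returns the charset implied by a leading byte-order mark, or the
     * caller-supplied fallback when the bytes carry no recognized BOM.
     */
    static Charset detect(byte[] head, Charset fallback) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE;
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16LE;
        }
        return fallback;
    }

    /** True when data begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

The caller would read the first few bytes of the file, pass them in with a default such as ISO-8859-1, and hand the result to the parser as the charset argument.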

Thoughts?

Gary


On Mon, Feb 25, 2019 at 10:23 AM Tim Allison <ta...@apache.org> wrote:

> Commons-CSV team,
>
>   We recently integrated Commons-CSV into Apache Tika.  For now, we’re
> relying strictly on the filename for csv detection, and we’re relying
> on our AutodetectReader to identify the charset.  It would be really
> useful for us to be able to detect:
>
> 1) A csv/tsv file vs a regular .txt file by content heuristics
> 2) The parameters: delimiter, escape and quote characters
>
>   We realize that no detection will be perfect, but we have two questions:
>
> 1) Do you have any pointers for this kind of thing?
> 2) If we develop it, would you want to put it in commons-csv or should
> we leave it in Tika?  I'm not sure, yet, if there'd be a clean/useful
> way to integrate this without using a charset detector...but we can
> hold off on that for now.
>
>   Thank you for all of your fantastic work!
>
>            Cheers,
>
>                            Tim
