Posted to dev@commons.apache.org by Gary Gregory <ga...@gmail.com> on 2013/07/30 21:44:39 UTC

[CSV] Headers and the first record

Hi All:

I have Excel files with headers, so of course I use withHeader() to map
the headers.

When I call parser.iterator().next(), the first record is the header
record, not data.

I always have to skip this first line since it is not data.

I wonder if:

1) We should automatically skip the header line for next() and
parser.getRecords(), or
2) Add a skipHeader boolean setting to control the above behavior, where
the default is...?

(2) is the most flexible.
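
To make option (2) concrete, here is a minimal, self-contained sketch. It is
not the Commons CSV API: the class SkipHeaderSketch, the records() method, and
the skipHeader parameter are hypothetical names used only for illustration,
and the naive comma split ignores quoting.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not the Commons CSV API: shows what a skipHeader
// switch could mean for the records a caller sees.
class SkipHeaderSketch {

    // Splits each line on commas (no quoting support, to keep the sketch
    // short) and drops the first line when skipHeader is true.
    static List<String[]> records(List<String> lines, boolean skipHeader) {
        List<String[]> result = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            if (i == 0 && skipHeader) {
                continue; // the header line names the columns; it is not data
            }
            result.add(lines.get(i).split(",", -1));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("id,name", "1,Ada", "2,Grace");
        // Without the flag the header comes back as the first record.
        System.out.println(records(lines, false).size());
        // With the flag only the data rows are returned.
        System.out.println(records(lines, true).size());
    }
}
```

Whatever the default ends up being, an explicit flag keeps both behaviors
available to callers.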

Thoughts?

Gary
-- 
E-Mail: garydgregory@gmail.com | ggregory@apache.org
Java Persistence with Hibernate, Second Edition<http://www.manning.com/bauer3/>
JUnit in Action, Second Edition <http://www.manning.com/tahchiev/>
Spring Batch in Action <http://www.manning.com/templier/>
Blog: http://garygregory.wordpress.com
Home: http://garygregory.com/
Tweet! http://twitter.com/GaryGregory

Re: [CSV] Headers and the first record

Posted by Gary Gregory <ga...@gmail.com>.

On Wed, Jul 31, 2013 at 4:38 PM, Mark Fortner <ph...@gmail.com> wrote:

> Hi Gary,
>
>
> > This does not look like a classic CSV file.
>
>
> I guess it depends on what your definition of "classic" is. :-)  This is
> pretty typical for most drug discovery companies.
>
>
> > It sounds like your files contain different sections in different
> formats.
> >
>
> True.
>
>
> >
> > In its current state, commons-csv might not be right for you. What does
> the
> > rest of the file look like?
>
>
> The data section looks similar to this.
>
>                   Erlotinib - Run 1                      Erlotinib - Run 2
> Target       1uM 10 uM 100 uM 1nM         1uM 10 uM 100 uM 1nM
> BRCA1       0.01  0.001  0.0001 0.00001   0.01  0.001  0.0001 0.00001
> BRCA2       0.2    0.002  0.0002 0.00002   0.2    0.002  0.0002 0.00002
>
>
>
Hm... so it looks like you have a couple of rows that each have a different
format.

For some rows, the format has the header and its value on the same line:

Date: 12/10/13
Protocol: Selectivity Profile 1        Instrument Name: Gandalf
Scientist: John Smith

This is different from the 'usual' columnar layout we see. Your format is
more like a spreadsheet than a CSV file.

Nonetheless, we would need to extend our current feature set to accommodate
this format.

I could see the client code looking like this:

// row one is a key: value pair
format.addKeyValueRow(1, ":");

// row two is 2 key: value pairs, separated by a tab
format.addKeyValueRow(2, ":", "\t"); // 2 pairs

The arguments should probably also be a format object of some kind, much
like the CSVFormat object we have now.
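
As a rough illustration of what such a key: value row parser might do
internally, here is a self-contained sketch. Nothing in it exists in Commons
CSV: KeyValueRowSketch and parse() are hypothetical names, and the sketch
assumes pairs on one line are separated by a literal separator such as a tab.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch: parses one "key: value" row, optionally holding
// several pairs separated by pairSeparator (for example a tab).
class KeyValueRowSketch {

    static Map<String, String> parse(String line, String keyValueSeparator, String pairSeparator) {
        Map<String, String> pairs = new LinkedHashMap<>();
        // A null pairSeparator means the whole line is a single pair.
        String[] chunks = pairSeparator == null
                ? new String[] { line }
                : line.split(Pattern.quote(pairSeparator));
        for (String chunk : chunks) {
            int at = chunk.indexOf(keyValueSeparator);
            if (at < 0) {
                continue; // no separator in this chunk; ignored in this sketch
            }
            String key = chunk.substring(0, at).trim();
            String value = chunk.substring(at + keyValueSeparator.length()).trim();
            pairs.put(key, value);
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(parse("Date: 12/10/13", ":", null));
        System.out.println(parse("Protocol: Selectivity Profile 1\tInstrument Name: Gandalf", ":", "\t"));
    }
}
```

Splitting on the first key/value separator only means values that themselves
contain a colon, such as a time of day, stay intact.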

This seems out of scope for 1.0 if we are itching to get 1.0 out the door.

Gary

Regards,
>
> Mark
>




Re: [CSV] Headers and the first record

Posted by Mark Fortner <ph...@gmail.com>.

Hi Gary,


> This does not look like a classic CSV file.


I guess it depends on what your definition of "classic" is. :-)  This is
pretty typical for most drug discovery companies.


> It sounds like your files contain different sections in different formats.
>

True.


>
> In its current state, commons-csv might not be right for you. What does the
> rest of the file look like?


The data section looks similar to this.

                  Erlotinib - Run 1                      Erlotinib - Run 2
Target       1uM 10 uM 100 uM 1nM         1uM 10 uM 100 uM 1nM
BRCA1       0.01  0.001  0.0001 0.00001   0.01  0.001  0.0001 0.00001
BRCA2       0.2    0.002  0.0002 0.00002   0.2    0.002  0.0002 0.00002


Regards,

Mark

Re: [CSV] Headers and the first record

Posted by Gary Gregory <ga...@gmail.com>.

On Wed, Jul 31, 2013 at 3:44 PM, Mark Fortner <ph...@gmail.com> wrote:

> Hi Gary,
> One other complication I forgot to mention.  Compounds are usually run
> multiple times.  So the same compound will appear with the same set of
> concentrations.  In practice you would end up with column headers that have
> the same text in them, so this issue with using a Set vs String[] for the
> column names would complicate things.
>
>
> > CSVFormat implements Serializable, so you can use plain old Java
> > serialization, it's not human readable, but it's something.
> >
>
> A human readable configuration would probably be a high priority.
>
>
> >
> > If we moved to Java 6, we could annotate CSVFormat with JAXB so you can
> > have XML IO. Personally, I do not think we should do our own XML IO, so
> > JAXB is the best path IMO since it is built-in Java 6.
> >
>
> It would be best if there were a CSVFormat serializer so that the CSVFormat
> could be injected.  Using JAXB would be fine as a default implementation,
> but I imagine that the configuration format would change.  Or that a user
> might decide to store individual configuration items in a database.
>
>
> >
> > What do you currently use to parse your CSV files?
> >
>
> Most biotech companies have their own home grown tools for parsing
> instrument files.  There isn't a standard library.
>
>
> >
> > Would Commons-CSV work for you as well? If not, how so?
> >
>
> As I understand it, the code doesn't support "experiment condition"-type
> parameters, like this:
>
> Date: 12/10/13
> Protocol: Selectivity Profile 1        Instrument Name: Gandalf
> Scientist: John Smith
>

This does not look like a classic CSV file.

It sounds like your files contain different sections in different formats.

In its current state, commons-csv might not be right for you. What does the
rest of the file look like?

Gary


>
>
> > Would you be willing to experiment with the current code?
> >
> >
> Sure. If the previous issues were addressed.
>
> I'm curious if other industries have similar issues?  I assume that anyone
> that deals with instrument data might have similar needs.
>
> Mark
>




Re: [CSV] Headers and the first record

Posted by Mark Fortner <ph...@gmail.com>.

Hi Gary,
One other complication I forgot to mention.  Compounds are usually run
multiple times.  So the same compound will appear with the same set of
concentrations.  In practice you would end up with column headers that have
the same text in them, so this issue with using a Set vs String[] for the
column names would complicate things.
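
A small sketch of the problem, under the assumption that repeated column
names are legitimate in such files: instead of a Set, each name maps to every
position where it occurs. DuplicateHeaderSketch and indexByName() are
hypothetical names, not part of Commons CSV.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: keeps every position of a repeated column name
// instead of collapsing duplicates the way a Set would.
class DuplicateHeaderSketch {

    static Map<String, List<Integer>> indexByName(String[] header) {
        Map<String, List<Integer>> positions = new LinkedHashMap<>();
        for (int i = 0; i < header.length; i++) {
            List<Integer> indices = positions.get(header[i]);
            if (indices == null) {
                indices = new ArrayList<>();
                positions.put(header[i], indices);
            }
            indices.add(i); // remember each column where this name appears
        }
        return positions;
    }

    public static void main(String[] args) {
        String[] header = { "Target", "1uM", "10 uM", "1uM", "10 uM" };
        System.out.println(indexByName(header));
    }
}
```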


> CSVFormat implements Serializable, so you can use plain old Java
> serialization, it's not human readable, but it's something.
>

A human readable configuration would probably be a high priority.


>
> If we moved to Java 6, we could annotate CSVFormat with JAXB so you can
> have XML IO. Personally, I do not think we should do our own XML IO, so
> JAXB is the best path IMO since it is built-in Java 6.
>

It would be best if there were a CSVFormat serializer so that the CSVFormat
could be injected.  Using JAXB would be fine as a default implementation,
but I imagine that the configuration format would change.  Or that a user
might decide to store individual configuration items in a database.


>
> What do you currently use to parse your CSV files?
>

Most biotech companies have their own home grown tools for parsing
instrument files.  There isn't a standard library.


>
> Would Commons-CSV work for you as well? If not, how so?
>

As I understand it, the code doesn't support "experiment condition"-type
parameters, like this:

Date: 12/10/13
Protocol: Selectivity Profile 1        Instrument Name: Gandalf
Scientist: John Smith


> Would you be willing to experiment with the current code?
>
>
Sure. If the previous issues were addressed.

I'm curious if other industries have similar issues?  I assume that anyone
that deals with instrument data might have similar needs.

Mark

Re: [CSV] Headers and the first record

Posted by Gary Gregory <ga...@gmail.com>.

On Wed, Jul 31, 2013 at 11:14 AM, Mark Fortner <ph...@gmail.com> wrote:

> I took a brief look at the API for CSV, and thought I would share a typical
> use case from the biotech industry.  We deal with a lot of instruments that
> produce a multiline header.  The header usually contains "experiment
> conditions".  You can think of this as metadata for the columnar data.  The
> experiment conditions usually contain things like the name of the scientist
> using the instrument, the time of day the experiment was run, and some
> instrument configuration settings.  Usually when we parse CSV files, we
> have to parse the header first, extract all relevant data, and then parse
> the rows of data.
>
> In addition to the experiment conditions header, there are also column
> headers.  The column headers can be multi-lined as well.  For example, you
> might have a column header whose first line contains chemical compound IDs
> or names, and the second line of the column header contains the
> concentrations for those compounds. The data values represent the percent
> inhibition at those concentrations. Like this:
>
> Erlotinib
> 1uM 10 uM 100 uM 1nM
> 0.01  0.001  0.0001 0.00001
> ...
>
> Since the position and types of header and body data vary, we typically use
>  parse configuration files that describe "what data can be found where".
>  The parse configuration varies not only per instrument but also per
> experimental protocol. So there are usually numerous configuration files in
> your typical lab.  The configuration files can also be stored in a
> database.  This is usually part of a file-watching web app.  It allows
> scientists to add support for new experiments or instruments without having
> to get a developer to write more code.
>
> In the API I saw support for hard-coded configurations via the CSVFormat
> object, but I didn't see any support for creating and using persistable
> configurations.  You may want to consider that as you move forward.
>

Thank you for taking the time to offer your point of view here.

CSVFormat implements Serializable, so you can use plain old Java
serialization. It's not human readable, but it's something.

If we moved to Java 6, we could annotate CSVFormat with JAXB so you can
have XML IO. Personally, I do not think we should do our own XML IO, so
JAXB is the best path IMO since it is built into Java 6.

What do you currently use to parse your CSV files?

Would Commons-CSV work for you as well? If not, how so?

Would you be willing to experiment with the current code?

Thank you,
Gary


> Hope this helps,
>
> Mark
>

Re: [CSV] Headers and the first record

Posted by Mark Fortner <ph...@gmail.com>.

I took a brief look at the API for CSV, and thought I would share a typical
use case from the biotech industry.  We deal with a lot of instruments that
produce a multiline header.  The header usually contains "experiment
conditions".  You can think of this as metadata for the columnar data.  The
experiment conditions usually contain things like the name of the scientist
using the instrument, the time of day the experiment was run, and some
instrument configuration settings.  Usually when we parse CSV files, we
have to parse the header first, extract all relevant data, and then parse
the rows of data.

In addition to the experiment conditions header, there are also column
headers.  The column headers can be multi-lined as well.  For example, you
might have a column header whose first line contains chemical compound IDs
or names, and the second line of the column header contains the
concentrations for those compounds. The data values represent the percent
inhibition at those concentrations. Like this:

Erlotinib
1uM 10 uM 100 uM 1nM
0.01  0.001  0.0001 0.00001
...
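
One way to handle such a two-line column header is to fold both lines into a
single composite name per column. The sketch below assumes the cells have
already been split, and that a blank cell in the first line means "same
compound as the column to its left". MultiLineHeaderSketch and combine() are
hypothetical names.

```java
// Hypothetical sketch: merges a compound row and a concentration row into
// one composite column name per column.
class MultiLineHeaderSketch {

    static String[] combine(String[] groupRow, String[] nameRow) {
        String[] combined = new String[nameRow.length];
        String currentGroup = "";
        for (int i = 0; i < nameRow.length; i++) {
            // A non-blank cell starts a new group; blank cells inherit it.
            if (i < groupRow.length && !groupRow[i].trim().isEmpty()) {
                currentGroup = groupRow[i].trim();
            }
            String name = nameRow[i].trim();
            combined[i] = currentGroup.isEmpty() ? name : currentGroup + " / " + name;
        }
        return combined;
    }

    public static void main(String[] args) {
        String[] compounds = { "Erlotinib", "", "", "" };
        String[] concentrations = { "1uM", "10 uM", "100 uM", "1nM" };
        for (String column : combine(compounds, concentrations)) {
            System.out.println(column);
        }
    }
}
```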

Since the position and types of header and body data vary, we typically use
parse configuration files that describe "what data can be found where".
The parse configuration varies not only per instrument but also per
experimental protocol. So there are usually numerous configuration files in
your typical lab.  The configuration files can also be stored in a
database.  This is usually part of a file-watching web app.  It allows
scientists to add support for new experiments or instruments without having
to get a developer to write more code.

In the API I saw support for hard-coded configurations via the CSVFormat
object, but I didn't see any support for creating and using persistable
configurations.  You may want to consider that as you move forward.
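
As a sketch of what a human-readable, persistable configuration could look
like, the example below loads a few settings from java.util.Properties text.
The keys (delimiter, headerRow, firstDataRow) and the class ParseConfigSketch
are assumptions made for illustration; they are not part of CSVFormat.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical sketch: a tiny parse configuration read from properties
// text, which could equally come from a file or a database column.
class ParseConfigSketch {
    final char delimiter;
    final int headerRow;     // zero-based line index of the column header
    final int firstDataRow;  // zero-based line index of the first data row

    ParseConfigSketch(char delimiter, int headerRow, int firstDataRow) {
        this.delimiter = delimiter;
        this.headerRow = headerRow;
        this.firstDataRow = firstDataRow;
    }

    static ParseConfigSketch load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalArgumentException("Unreadable configuration", e);
        }
        String delimiter = props.getProperty("delimiter", ",");
        if (delimiter.length() != 1) {
            throw new IllegalArgumentException("delimiter must be a single character");
        }
        int headerRow = Integer.parseInt(props.getProperty("headerRow", "0").trim());
        int firstDataRow = Integer.parseInt(props.getProperty("firstDataRow", "1").trim());
        return new ParseConfigSketch(delimiter.charAt(0), headerRow, firstDataRow);
    }

    public static void main(String[] args) {
        ParseConfigSketch config = load("delimiter=;\nheaderRow=4\nfirstDataRow=6\n");
        System.out.println(config.delimiter + " " + config.headerRow + " " + config.firstDataRow);
    }
}
```

Because the settings are plain key/value text, a scientist could edit them
without a developer writing more code.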

Hope this helps,

Mark



On Wed, Jul 31, 2013 at 6:36 AM, Gary Gregory <ga...@gmail.com>wrote:

> On Wed, Jul 31, 2013 at 8:58 AM, Gary Gregory <garydgregory@gmail.com
> >wrote:
>
> > On Jul 31, 2013, at 3:38, Benedikt Ritter <br...@apache.org> wrote:
> >
> > > 2013/7/31 Gary Gregory <ga...@gmail.com>
> > >
> > >> On Tue, Jul 30, 2013 at 5:29 PM, Emmanuel Bourg <eb...@apache.org>
> > wrote:
> > >>
> > >>> Le 30/07/2013 23:26, Gary Gregory a écrit :
> > >>>> And another thing: internally, the header should be a Set<String>,
> not
> > >> a
> > >>>> String[]. I plan on fixing that later too.
> > >>>
> > >>> Why should it be a set? Is there an impact on the performance?
> > >>
> > >> Well, I did not finish my thought on that one, sorry about that, please
> > >> allow me to walk through my use cases. The issue is about the feature,
> > >> not performance.
> > >>
> > >> At first glance, using a set avoids an inherent problem with any
> non-set
> > >> data structure: defining duplicates. What does the following mean?
> > >>
> > >> withHeader("A", "B", "C", "A");
> > >>
> > >> It is a recipe for garbage results: record.get("A") returns what?
> > >>
> > >> Today, I added some CSVFormat validation code that checks for
> duplicate
> > >> column names. If you build a format with withHeader("A", "B", "C",
> "A");
> > >> you will get an ISE when validate() is called.
> > >>
> > >> If we had withHeader(Set) and document it as the 'main' way to specify
> > >> column names, then we can say that withHeader(String...) is just a
> > >> syntactical convenience and turn the String[] into a Set. But that
> will
> > not
> > >> work.
> > >>
> > >> The problem with a Java Set is that it is not ordered and the current
> > >> implementation relies on order of the String[]. But why? What the
> > current
> > >> implementation says is: ignore what the header line of the file is and
> > use
> > >> the given column names at the given positions. A perfectly good user
> > story.
> > >> So for withHeader("A", "B", "C"), "A" is column 0, "B" is column 1,
> and
> > so
> > >> on. Ok, that's one usage.
> > >>
> > >> Taking a step back, I want to talk about why should the column name
> > order
> > >> matter when you are calling withHeader(). I would like to be able to
> > tell
> > >> the parser that I want to use a Set of column names and have it figure
> > out,
> > >> based on the header line, the columns indices. This is quite different
> > than
> > >> what we have now.
> > >>
> > >> A use case I have now is a CSV file with a lot of columns (~90) but I
> > only
> > >> care about a small subset of the columns (~10). I'd like to be able to
> > say
> > >> withHeader(Set) where the Set may be a subset of the actual column
> > names in
> > >> the header line. This is different from withHeader(String[]) because
> the
> > >> names in the Set must match the names in the header record.
> > >
> > > I'm not sure if we should try to build in all this different cases
> > > (guessing headers, using the first record as headers, only use a subset
> > of
> > > the available headers) into one implementation.
> > >
> > > What you are talking about sounds more like a view or a projection of
> the
> > > actual content being parsed.
> > > Do we really need this for 1.0 or can it be postponed?
> >
> > This is a real scenario and a real need, not some imaginary complication
> ;)
> >
>
> But I could work with the current framework and use withHeader(new String[]{})
> and let the parser find the headers. Then I can just do record.get("A")
> with the columns I care about. It just feels a little more mysterious.
>
> I think the only wrinkle left for me is that I want validation that the
> columns I care about are there. Right now get(String) throws
> IllegalArgumentException if you give it an unknown column, which will fail
> fast enough on the first record.
>
> So I'll go down that road until the next speed bump...
>
> Gary
>
>
> >
> > Even if it is not implemented for 1.0, we should talk about how it
> > should be done such that it fits in and does not cause API problems
> > later. And if I can get it done by then, then that much the better.
> >
> > Gary
> >
> > >
> > >
> > >>
> > >> So I think it boils down to ignoring my comment about using a Set
> > >> internally and adding a feature where I can tell the parser that I
> want
> > to
> > >> use a set of column names and not worry about the order, because the
> > parser
> > >> will match up the column names when it reads the header line.
> > >>
> > >> Gary
> > >>
> > >>
> > >>>
> > >>>
> > >>> Emmanuel Bourg
> > >>>
> > >>>
> >
>
>
>
>

Re: [CSV] Headers and the first record

Posted by Gary Gregory <ga...@gmail.com>.

On Wed, Jul 31, 2013 at 8:58 AM, Gary Gregory <ga...@gmail.com>wrote:

> On Jul 31, 2013, at 3:38, Benedikt Ritter <br...@apache.org> wrote:
>
> > 2013/7/31 Gary Gregory <ga...@gmail.com>
> >
> >> On Tue, Jul 30, 2013 at 5:29 PM, Emmanuel Bourg <eb...@apache.org>
> wrote:
> >>
> >>> Le 30/07/2013 23:26, Gary Gregory a écrit :
> >>>> And another thing: internally, the header should be a Set<String>, not
> >> a
> >>>> String[]. I plan on fixing that later too.
> >>>
> >>> Why should it be a set? Is there an impact on the performance?
> >>
> >> Well, I did not finish my thought on that one, sorry about that, please
> >> allow me to walk through my use cases. The issue is about the feature,
> >> not performance.
> >>
> >> At first glance, using a set avoids an inherent problem with any non-set
> >> data structure: defining duplicates. What does the following mean?
> >>
> >> withHeader("A", "B", "C", "A");
> >>
> >> It is a recipe for garbage results: record.get("A") returns what?
> >>
> >> Today, I added some CSVFormat validation code that checks for duplicate
> >> column names. If you build a format with withHeader("A", "B", "C", "A");
> >> you will get an ISE when validate() is called.
> >>
> >> If we had withHeader(Set) and document it as the 'main' way to specify
> >> column names, then we can say that withHeader(String...) is just a
> >> syntactical convenience and turn the String[] into a Set. But that will
> not
> >> work.
> >>
> >> The problem with a Java Set is that it is not ordered and the current
> >> implementation relies on order of the String[]. But why? What the
> current
> >> implementation says is: ignore what the header line of the file is and
> use
> >> the given column names at the given positions. A perfectly good user
> story.
> >> So for withHeader("A", "B", "C"), "A" is column 0, "B" is column 1, and
> so
> >> on. Ok, that's one usage.
> >>
> >> Taking a step back, I want to talk about why should the column name
> order
> >> matter when you are calling withHeader(). I would like to be able to
> tell
> >> the parser that I want to use a Set of column names and have it figure
> out,
> >> based on the header line, the columns indices. This is quite different
> than
> >> what we have now.
> >>
> >> A use case I have now is a CSV file with a lot of columns (~90) but I
> only
> >> care about a small subset of the columns (~10). I'd like to be able to
> say
> >> withHeader(Set) where the Set may be a subset of the actual column
> names in
> >> the header line. This is different from withHeader(String[]) because the
> >> names in the Set must match the names in the header record.
> >
> > I'm not sure if we should try to build in all this different cases
> > (guessing headers, using the first record as headers, only use a subset
> of
> > the available headers) into one implementation.
> >
> > What you are talking about sounds more like a view or a projection of the
> > actual content being parsed.
> > Do we really need this for 1.0 or can it be postponed?
>
> This is a real scenario and a real need, not some imaginary complication ;)
>

But I could work with the current framework and use withHeader(new String[]{})
and let the parser find the headers. Then I can just do record.get("A")
with the columns I care about. It just feels a little more mysterious.

I think the only wrinkle left for me is that I want validation that the
columns I care about are there. Right now get(String) throws
IllegalArgumentException if you give it an unknown column, which will fail
fast enough on the first record.
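
The validation wrinkle can be sketched without touching the parser: resolve
the wanted names against the header record once, and fail before reading any
data if one is missing. RequiredColumnsSketch and resolve() are hypothetical
names, not Commons CSV methods, and the sketch assumes header names are
unique.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: maps a subset of wanted column names to their
// indices in the header record and rejects missing names up front.
class RequiredColumnsSketch {

    static Map<String, Integer> resolve(String[] headerRecord, Set<String> wanted) {
        Map<String, Integer> indices = new LinkedHashMap<>();
        for (int i = 0; i < headerRecord.length; i++) {
            if (wanted.contains(headerRecord[i])) {
                // With duplicate header names the last occurrence would win.
                indices.put(headerRecord[i], i);
            }
        }
        Set<String> missing = new TreeSet<>(wanted);
        missing.removeAll(indices.keySet());
        if (!missing.isEmpty()) {
            throw new IllegalArgumentException("Missing columns: " + missing);
        }
        return indices;
    }

    public static void main(String[] args) {
        String[] header = { "A", "B", "C", "D" };
        Set<String> wanted = new TreeSet<>();
        wanted.add("D");
        wanted.add("B");
        System.out.println(resolve(header, wanted));
    }
}
```

The order of the wanted names does not matter here; the header line alone
decides the indices.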

So I'll go down that road until the next speed bump...

Gary


>
> Even if it is not implemented for 1.0, we should talk about how it
> should be done such that it fits in and does not cause API problems
> later. And if I can get it done by then, then that much the better.
>
> Gary
>
> >
> >
> >>
> >> So I think it boils down to ignoring my comment about using a Set
> >> internally and adding a feature where I can tell the parser that I want
> to
> >> use a set of column names and not worry about the order, because the
> parser
> >> will match up the column names when it reads the header line.
> >>
> >> Gary
> >>
> >>
> >>>
> >>>
> >>> Emmanuel Bourg
> >>>
> >>>
>




Re: [CSV] Headers and the first record

Posted by Gary Gregory <ga...@gmail.com>.

On Jul 31, 2013, at 3:38, Benedikt Ritter <br...@apache.org> wrote:

> 2013/7/31 Gary Gregory <ga...@gmail.com>
>
>> On Tue, Jul 30, 2013 at 5:29 PM, Emmanuel Bourg <eb...@apache.org> wrote:
>>
>>> Le 30/07/2013 23:26, Gary Gregory a écrit :
>>>> And another thing: internally, the header should be a Set<String>, not
>> a
>>>> String[]. I plan on fixing that later too.
>>>
>>> Why should it be a set? Is there an impact on the performance?
>>
>> Well, I did not finish my thought on that one, sorry about that, please
>> allow me to walk through my use cases. The issue is about the feature, not
>> performance.
>>
>> At first glance, using a set avoids an inherent problem with any non-set
>> data structure: defining duplicates. What does the following mean?
>>
>> withHeader("A", "B", "C", "A");
>>
>> It is a recipe for garbage results: record.get("A") returns what?
>>
>> Today, I added some CSVFormat validation code that checks for duplicate
>> column names. If you build a format with withHeader("A", "B", "C", "A");
>> you will get an ISE when validate() is called.
>>
>> If we had withHeader(Set) and document it as the 'main' way to specify
>> column names, then we can say that withHeader(String...) is just a
>> syntactical convenience and turn the String[] into a Set. But that will not
>> work.
>>
>> The problem with a Java Set is that it is not ordered and the current
>> implementation relies on order of the String[]. But why? What the current
>> implementation says is: ignore what the header line of the file is and use
>> the given column names at the given positions. A perfectly good user story.
>> So for withHeader("A", "B", "C"), "A" is column 0, "B" is column 1, and so
>> on. Ok, that's one usage.
>>
>> Taking a step back, I want to talk about why should the column name order
>> matter when you are calling withHeader(). I would like to be able to tell
>> the parser that I want to use a Set of column names and have it figure out,
>> based on the header line, the column indices. This is quite different than
>> what we have now.
>>
>> A use case I have now is a CSV file with a lot of columns (~90) but I only
>> care about a small subset of the columns (~10). I'd like to be able to say
>> withHeader(Set) where the Set may be a subset of the actual column names in
>> the header line. This is different from withHeader(String[]) because the
>> names in the Set must match the names in the header record.
>
> I'm not sure if we should try to build in all these different cases
> (guessing headers, using the first record as headers, only use a subset of
> the available headers) into one implementation.
>
> What you are talking about sounds more like a view or a projection of the
> actual content being parsed.
> Do we really need this for 1.0 or can it be postponed?

This is a real scenario and a real need, not some imaginary complication ;)

Even if it is not implemented for 1.0, we should talk about how it
should be done such that it fits in and does not cause API problems
later. And if I can get it done by then, so much the better.

Gary

>
>
>>
>> So I think it boils down to ignoring my comment about using a Set
>> internally and adding a feature where I can tell the parser that I want to
>> use a set of column names and not worry about the order, because the parser
>> will match up the column names when it reads the header line.
>>
>> Gary
>>
>>
>>>
>>>
>>> Emmanuel Bourg
>>>
>>>

Re: [CSV] Headers and the first record

Posted by sebb <se...@gmail.com>.

On 31 July 2013 08:38, Benedikt Ritter <br...@apache.org> wrote:
> 2013/7/31 Gary Gregory <ga...@gmail.com>
>
>> On Tue, Jul 30, 2013 at 5:29 PM, Emmanuel Bourg <eb...@apache.org> wrote:
>>
>> > Le 30/07/2013 23:26, Gary Gregory a écrit :
>> > > And another thing: internally, the header should be a Set<String>, not
>> a
>> > > String[]. I plan on fixing that later too.
>> >
>> > Why should it be a set? Is there an impact on the performance?
>> >
>>
>> Well, I did not finish my thought on that one, sorry about that, please
>> allow me to walk through my use cases. The issue is about the feature, not
>> performance.
>>
>> At first glance, using a set avoids an inherent problem with any non-set
>> data structure: defining duplicates. What does the following mean?
>>
>> withHeader("A", "B", "C", "A");
>>
>> It is a recipe for garbage results: record.get("A") returns what?
>>
>> Today, I added some CSVFormat validation code that checks for duplicate
>> column names. If you build a format with withHeader("A", "B", "C", "A");
>> you will get an ISE when validate() is called.
>>
>> If we had withHeader(Set) and document it as the 'main' way to specify
>> column names, then we can say that withHeader(String...) is just a
>> syntactical convenience and turn the String[] into a Set. But that will not
>> work.
>>
>> The problem with a Java Set is that it is not ordered and the current
>> implementation relies on order of the String[]. But why? What the current
>> implementation says is: ignore what the header line of the file is and use
>> the given column names at the given positions. A perfectly good user story.
>> So for withHeader("A", "B", "C"), "A" is column 0, "B" is column 1, and so
>> on. Ok, that's one usage.
>>
>> Taking a step back, I want to talk about why should the column name order
>> matter when you are calling withHeader(). I would like to be able to tell
>> the parser that I want to use a Set of column names and have it figure out,
>> based on the header line, the column indices. This is quite different than
>> what we have now.
>>
>> A use case I have now is a CSV file with a lot of columns (~90) but I only
>> care about a small subset of the columns (~10). I'd like to be able to say
>> withHeader(Set) where the Set may be a subset of the actual column names in
>> the header line. This is different from withHeader(String[]) because the
>> names in the Set must match the names in the header record.
>>
>
> I'm not sure if we should try to build in all these different cases
> (guessing headers, using the first record as headers, only use a subset of
> the available headers) into one implementation.
>
> What you are talking about sounds more like a view or a projection of the
> actual content being parsed.
> Do we really need this for 1.0 or can it be postponed?

Agreed, this is something that needs more work before it could be included.

There will always be some extra item that would be nice to have; this
seems non-essential to me.

>
>>
>> So I think it boils down to ignoring my comment about using a Set
>> internally and adding a feature where I can tell the parser that I want to
>> use a set of column names and not worry about the order, because the parser
>> will match up the column names when it reads the header line.
>>
>> Gary
>>
>>
>> >
>> >
>> > Emmanuel Bourg
>> >
>> >

Re: [CSV] Headers and the first record

Posted by Benedikt Ritter <br...@apache.org>.

2013/7/31 Gary Gregory <ga...@gmail.com>

> On Tue, Jul 30, 2013 at 5:29 PM, Emmanuel Bourg <eb...@apache.org> wrote:
>
> > Le 30/07/2013 23:26, Gary Gregory a écrit :
> > > And another thing: internally, the header should be a Set<String>, not
> a
> > > String[]. I plan on fixing that later too.
> >
> > Why should it be a set? Is there an impact on the performance?
> >
>
> Well, I did not finish my thought on that one, sorry about that, please
> allow me to walk through my use cases. The issue is about the feature, not
> performance.
>
> At first glance, using a set avoids an inherent problem with any non-set
> data structure: defining duplicates. What does the following mean?
>
> withHeader("A", "B", "C", "A");
>
> It is a recipe for garbage results: record.get("A") returns what?
>
> Today, I added some CSVFormat validation code that checks for duplicate
> column names. If you build a format with withHeader("A", "B", "C", "A");
> you will get an ISE when validate() is called.
>
> If we had withHeader(Set) and document it as the 'main' way to specify
> column names, then we can say that withHeader(String...) is just a
> syntactical convenience and turn the String[] into a Set. But that will not
> work.
>
> The problem with a Java Set is that it is not ordered and the current
> implementation relies on order of the String[]. But why? What the current
> implementation says is: ignore what the header line of the file is and use
> the given column names at the given positions. A perfectly good user story.
> So for withHeader("A", "B", "C"), "A" is column 0, "B" is column 1, and so
> on. Ok, that's one usage.
>
> Taking a step back, I want to talk about why should the column name order
> matter when you are calling withHeader(). I would like to be able to tell
> the parser that I want to use a Set of column names and have it figure out,
> based on the header line, the column indices. This is quite different than
> what we have now.
>
> A use case I have now is a CSV file with a lot of columns (~90) but I only
> care about a small subset of the columns (~10). I'd like to be able to say
> withHeader(Set) where the Set may be a subset of the actual column names in
> the header line. This is different from withHeader(String[]) because the
> names in the Set must match the names in the header record.
>

I'm not sure if we should try to build in all these different cases
(guessing headers, using the first record as headers, only use a subset of
the available headers) into one implementation.

What you are talking about sounds more like a view or a projection of the
actual content being parsed.
Do we really need this for 1.0 or can it be postponed?


>
> So I think it boils down to ignoring my comment about using a Set
> internally and adding a feature where I can tell the parser that I want to
> use a set of column names and not worry about the order, because the parser
> will match up the column names when it reads the header line.
>
> Gary
>
>
> >
> >
> > Emmanuel Bourg
> >
> >

Re: [CSV] Headers and the first record

Posted by Gary Gregory <ga...@gmail.com>.

On Tue, Jul 30, 2013 at 5:29 PM, Emmanuel Bourg <eb...@apache.org> wrote:

> Le 30/07/2013 23:26, Gary Gregory a écrit :
> > And another thing: internally, the header should be a Set<String>, not a
> > String[]. I plan on fixing that later too.
>
> Why should it be a set? Is there an impact on the performance?
>

Well, I did not finish my thought on that one, sorry about that, please
allow me to walk through my use cases. The issue is about the feature, not
performance.

At first glance, using a set avoids an inherent problem with any non-set
data structure: defining duplicates. What does the following mean?

withHeader("A", "B", "C", "A");

It is a recipe for garbage results: record.get("A") returns what?

Today, I added some CSVFormat validation code that checks for duplicate
column names. If you build a format with withHeader("A", "B", "C", "A");
you will get an ISE when validate() is called.
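
A minimal sketch of such a duplicate check in plain Java, for illustration
only. This is not the actual CSVFormat code: the class HeaderValidation and
the method validateHeader are hypothetical names, and only the choice of
IllegalStateException follows what is described above.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Commons CSV.
final class HeaderValidation {

    private HeaderValidation() {
    }

    // Throws IllegalStateException if any column name occurs more than once.
    // A null header is treated here as "no header configured" and accepted.
    static void validateHeader(final String... header) {
        if (header == null) {
            return;
        }
        final Set<String> seen = new HashSet<>();
        for (final String name : header) {
            // Set.add returns false when the element is already present.
            if (!seen.add(name)) {
                throw new IllegalStateException("Duplicate column name in header: " + name);
            }
        }
    }
}
```

With this sketch, validateHeader("A", "B", "C") returns normally, while
validateHeader("A", "B", "C", "A") fails on the second "A".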

If we had withHeader(Set) and document it as the 'main' way to specify
column names, then we can say that withHeader(String...) is just a
syntactical convenience and turn the String[] into a Set. But that will not
work.

The problem with a Java Set is that it is not ordered and the current
implementation relies on order of the String[]. But why? What the current
implementation says is: ignore what the header line of the file is and use
the given column names at the given positions. A perfectly good user story.
So for withHeader("A", "B", "C"), "A" is column 0, "B" is column 1, and so
on. Ok, that's one usage.

Taking a step back, I want to talk about why should the column name order
matter when you are calling withHeader(). I would like to be able to tell
the parser that I want to use a Set of column names and have it figure out,
based on the header line, the column indices. This is quite different than
what we have now.

A use case I have now is a CSV file with a lot of columns (~90) but I only
care about a small subset of the columns (~10). I'd like to be able to say
withHeader(Set) where the Set may be a subset of the actual column names in
the header line. This is different from withHeader(String[]) because the
names in the Set must match the names in the header record.

So I think it boils down to ignoring my comment about using a Set
internally and adding a feature where I can tell the parser that I want to
use a set of column names and not worry about the order, because the parser
will match up the column names when it reads the header line.
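
To make the idea concrete, here is a rough sketch in plain Java of matching a
Set of wanted names against the header record. Everything in it is an
assumption for discussion: the class HeaderSubset, the method resolve, and the
choice to fail with IllegalArgumentException when a wanted name is missing are
not existing Commons CSV API.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Commons CSV.
final class HeaderSubset {

    private HeaderSubset() {
    }

    // Maps each wanted column name to its index in the header record read
    // from the file. The order of the wanted names does not matter.
    static Map<String, Integer> resolve(final Set<String> wanted, final String[] headerRecord) {
        final Map<String, Integer> indices = new LinkedHashMap<>();
        for (int i = 0; i < headerRecord.length; i++) {
            if (wanted.contains(headerRecord[i])) {
                indices.put(headerRecord[i], i);
            }
        }
        if (indices.size() != wanted.size()) {
            // At least one wanted name does not appear in the header record.
            final Set<String> missing = new HashSet<>(wanted);
            missing.removeAll(indices.keySet());
            throw new IllegalArgumentException("Columns not found in header record: " + missing);
        }
        return indices;
    }
}
```

For a header record A,B,C,D and a wanted set {C, A}, this yields A at index 0
and C at index 2, and the other columns are simply ignored.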

Gary

>
>
> Emmanuel Bourg
>
>

Re: [CSV] Headers and the first record

Posted by Emmanuel Bourg <eb...@apache.org>.

Le 30/07/2013 23:26, Gary Gregory a écrit :
> And another thing: internally, the header should be a Set<String>, not a
> String[]. I plan on fixing that later too.

Why should it be a set? Is there an impact on the performance?


Emmanuel Bourg



Re: [CSV] Headers and the first record

Posted by Gary Gregory <ga...@gmail.com>.

And another thing: internally, the header should be a Set<String>, not a
String[]. I plan on fixing that later too.

Gary


On Tue, Jul 30, 2013 at 5:24 PM, Gary Gregory <ga...@gmail.com> wrote:

> On Tue, Jul 30, 2013 at 5:15 PM, Emmanuel Bourg <eb...@apache.org> wrote:
>
>> I haven't checked the current code, but the intended behavior was:
>>
>> - no args: the first record defines the header and is not returned when
>> iterating
>>
>> - args: the header is defined independently of the data, all the records
>> are returned when iterating
>>
>
> Yeah, that's too clever IMO. I expected the same behavior WRT record
> reading with the only difference being if I let the parser guess or not.
>
> The current code now always reads the header line if you set any non-null
> header. If you call withHeader() with no args it is a non-null call with an
> empty String[].
>
> The idea being that if I use headers and I ask the parser to guess or give
> it the headers, I do not need to have the header line as a record.
>
> I plan on adding a setting that allows the header record to be saved for
> callers who care.
>
> Gary
>
>
>>
>> Emmanuel Bourg
>>
>>
>> Le 30/07/2013 22:23, Gary Gregory a écrit :
>> > Actually, if you use withHeader(), no args, you _cannot_ get back the
>> first
>> > record, so that makes skipHeader=false not possible without making the
>> > parser track the first record separately.
>> >
>> > In the interest of simplicity, I am going to make it simple: if you use
>> > withHeader of any kind, then the first record is read.
>>
>>

Re: [CSV] Headers and the first record

Posted by Gary Gregory <ga...@gmail.com>.

On Tue, Jul 30, 2013 at 5:47 PM, Emmanuel Bourg <eb...@apache.org> wrote:

> Le 30/07/2013 23:24, Gary Gregory a écrit :
>
> > Yeah, that's too clever IMO. I expected the same behavior WRT record
> > reading with the only difference being if I let the parser guess or not.
>
> Too clever? I didn't feel like I designed a rocket with this feature
> though :) That's an important feature to me and I'd like to preserve it.
>
> If the header is defined in the file I don't want to skip the first
> record manually, the parser should take care of it. That also means the
> user code can remain the same, whether the header is defined in the code
> or in the file.
>

Let me reply to this part tomorrow (it's late here ;)


>
>
> > The current code now always reads the header line if you set any non-null
> > header. If you call withHeader() with no args it is a non-null call with
> an
> > empty String[].
>
> I guess a null header or an empty header is just the same and means the
> first record must be used as the header.
>

It is not the same at all. A null header String[] is different from a
length 0 array. It's been like that for a while.
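
The difference shows up with plain Java varargs, independent of the library.
In the sketch below, the class HeaderArgs and its describe method are made up
for illustration; the null versus empty-array behavior is standard Java, and
the meaning attached to each case is only what this thread describes.

```java
// Hypothetical demo class, not part of Commons CSV.
final class HeaderArgs {

    private HeaderArgs() {
    }

    // Reports which of the three cases a varargs header argument falls into.
    static String describe(final String... header) {
        if (header == null) {
            // Only reachable with an explicit cast: describe((String[]) null)
            return "null: no header configured";
        }
        if (header.length == 0) {
            // describe() with no arguments passes a non-null, empty array
            return "empty: read names from first record";
        }
        return "explicit: " + header.length + " names given";
    }
}
```

So describe() and describe(new String[] {}) land in the same branch, while
describe((String[]) null) lands in a different one.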

Gary


>
> Emmanuel Bourg
>
>

Re: [CSV] Headers and the first record

Posted by Gary Gregory <ga...@gmail.com>.

On Wed, Jul 31, 2013 at 10:48 AM, Gary Gregory <ga...@gmail.com> wrote:

> On Wed, Jul 31, 2013 at 9:34 AM, Emmanuel Bourg <eb...@apache.org> wrote:
>
>> Le 31/07/2013 15:08, Gary Gregory a écrit :
>>
>> > But that is exactly what _was_ happening! ;)
>> >
>> > If I called withHeader("A", "B", "C") the header was not skipped.
>>
>> Sounds good. The header is defined in the code, we don't expect to see
>> the header in the file so nothing is skipped.
>>
>
> NOT good! ;) This is where we disagree. The parser used to behave
> differently depending on the contents of the String[].
> - From an API design standpoint, it's smelly to me.
> - The feature is hard to understand. If we want that, we need two APIs for
> two behaviors.
>
> Using the withHeader API, I can tell the parser to:
> - Ignore the fact that there is a header record, I am overriding it with
> my own names
> - There is no header record, so I am telling you what the header names are.
>
> These two features clash because in one case the file has a header line
> and in the other the file does not. This is why we need settings with
> different names.
>
> That or a setting that says 'skip the first record, it's the header, I do
> not want to see it as a data record'
>
> I see three scenarios:
>
> 1) I set the headers (the file does not have one), do not skip the first
> record
> 2) I override the existing header record, skip the first record
> 3) The parser guesses the headers based on reading the first record, which
> skips the first record as a data record
>
> This can be accommodated with a skipHeaderRecord boolean setting.
>
> I do not care what the default behavior is as long as I can say "this file
> has headers, guess them please, and skip record 0" and "this file has a
> header record, but I'm telling you to call them A, B, and C, so skip record
> 0"
>
> 1) withHeader("A", "B", "C").skipHeaderRecord(false);
> 2) withHeader("A", "B", "C").skipHeaderRecord(true);
> 3) withHeader()
>
> Is there a better name for skipHeaderRecord? Maybe:
>
> 1b) withHeader("A", "B", "C").firstRecordIsHeader(false);
> 2b) withHeader("A", "B", "C").firstRecordIsHeader(true);
>
> Here the difference is that the API does not describe behavior, instead it
> describes the data, and behavior is implied.
>
> There is also:
>
> 1c) withHeader("A", "B", "C")
> 2c) withHeaderOverride("A", "B", "C")
>
> Thoughts?
>

I reverted to NOT skipping a record when withHeader is called with a
non-empty array, and added a skipHeaderRecord setting to CSVFormat to use
when headers are initialized.

Gary


>
> Gary
>
>
>>
>> > If I called withHeader(new String[]{}) the header was skipped.
>>
>> Correct. The header is not defined in the code, the parser uses the
>> first record as header and doesn't return it when iterating.
>>
>> > If I called withHeader() the header was skipped (same as line above).
>>
>> Sounds good too.
>>
>>
>> What was the issue again ? ;)
>>
>>
>> > What I am asking is: should we have a saveHeader setting such that IF
>> you
>> > ask for headers, then we save that record in the parser, it is currently
>> > "lost", or, actually transformed into the header map.
>>
>> Keeping the header around might be useful, I wouldn't create a format
>> parameter for this though. It could be made available at the record
>> level, much like ResultSet.getMetaData().
>>
>> Emmanuel Bourg
>>
>>

Re: [CSV] Headers and the first record

Posted by Gary Gregory <ga...@gmail.com>.

On Wed, Jul 31, 2013 at 9:34 AM, Emmanuel Bourg <eb...@apache.org> wrote:

> Le 31/07/2013 15:08, Gary Gregory a écrit :
>
> > But that is exactly what _was_ happening! ;)
> >
> > If I called withHeader("A", "B", "C") the header was not skipped.
>
> Sounds good. The header is defined in the code, we don't expect to see
> the header in the file so nothing is skipped.
>

NOT good! ;) This is where we disagree. The parser used to behave
differently depending on the contents of the String[].
- From an API design standpoint, it's smelly to me.
- The feature is hard to understand. If we want that, we need two APIs for
two behaviors.

Using the withHeader API, I can tell the parser to:
- Ignore the fact that there is a header record, I am overriding it with my
own names
- There is no header record, so I am telling you what the header names are.

These two features clash because in one case the file has a header line and
in the other the file does not. This is why we need settings with different
names.

That or a setting that says 'skip the first record, it's the header, I do
not want to see it as a data record'

I see three scenarios:

1) I set the headers (the file does not have one), do not skip the first
record
2) I override the existing header record, skip the first record
3) The parser guesses the headers based on reading the first record, which
skips the first record as a data record

This can be accommodated with a skipHeaderRecord boolean setting.

I do not care what the default behavior is as long as I can say "this file
has headers, guess them please, and skip record 0" and "this file has a
header record, but I'm telling you to call them A, B, and C, so skip record
0"

1) withHeader("A", "B", "C").skipHeaderRecord(false);
2) withHeader("A", "B", "C").skipHeaderRecord(true);
3) withHeader()
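
As a rough model of these three scenarios, the following plain Java sketch
decides which records count as data. It is not parser code: the class
HeaderScenarios and the method dataRecords are hypothetical, and an empty
header array stands in for the no-arg withHeader() call.

```java
import java.util.List;

// Hypothetical model of the three scenarios, not Commons CSV code.
final class HeaderScenarios {

    private HeaderScenarios() {
    }

    // Returns the records that should be treated as data.
    // header.length == 0 means the names are taken from the first record.
    static List<String[]> dataRecords(final List<String[]> all, final String[] header,
            final boolean skipHeaderRecord) {
        final boolean namesFromFile = header.length == 0;
        // Scenario 1: names given, skip=false -> nothing is dropped.
        // Scenario 2: names given, skip=true  -> first record is dropped.
        // Scenario 3: names from the file     -> first record is consumed as header.
        final boolean dropFirst = (namesFromFile || skipHeaderRecord) && !all.isEmpty();
        return dropFirst ? all.subList(1, all.size()) : all;
    }
}
```

In this model the flag only matters when names are supplied in code; when the
names come from the file, the first record is never returned as data.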

Is there a better name for skipHeaderRecord? Maybe:

1b) withHeader("A", "B", "C").firstRecordIsHeader(false);
2b) withHeader("A", "B", "C").firstRecordIsHeader(true);

Here the difference is that the API does not describe behavior, instead it
describes the data, and behavior is implied.

There is also:

1c) withHeader("A", "B", "C")
2c) withHeaderOverride("A", "B", "C")

Thoughts?

Gary


>
> > If I called withHeader(new String[]{}) the header was skipped.
>
> Correct. The header is not defined in the code, the parser uses the
> first record as header and doesn't return it when iterating.
>
> > If I called withHeader() the header was skipped (same as line above).
>
> Sounds good too.
>
>
> What was the issue again ? ;)
>
>
> > What I am asking is: should we have a saveHeader setting such that IF you
> > ask for headers, then we save that record in the parser, it is currently
> > "lost", or, actually transformed into the header map.
>
> Keeping the header around might be useful, I wouldn't create a format
> parameter for this though. It could be made available at the record
> level, much like ResultSet.getMetaData().
>
> Emmanuel Bourg
>
>

Re: [CSV] Headers and the first record

Posted by Emmanuel Bourg <eb...@apache.org>.

Le 31/07/2013 15:08, Gary Gregory a écrit :

> But that is exactly what _was_ happening! ;)
> 
> If I called withHeader("A", "B", "C") the header was not skipped.

Sounds good. The header is defined in the code, we don't expect to see
the header in the file so nothing is skipped.

> If I called withHeader(new String[]{}) the header was skipped.

Correct. The header is not defined in the code, the parser uses the
first record as header and doesn't return it when iterating.

> If I called withHeader() the header was skipped (same as line above).

Sounds good too.


What was the issue again ? ;)


> What I am asking is: should we have a saveHeader setting such that IF you
> ask for headers, then we save that record in the parser, it is currently
> "lost", or, actually transformed into the header map.

Keeping the header around might be useful, I wouldn't create a format
parameter for this though. It could be made available at the record
level, much like ResultSet.getMetaData().
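
A minimal sketch of what record-level access to the header could look like,
assuming each record keeps a reference to a header map shared by the whole
parse. The class SimpleRecord and its methods are hypothetical and are not
the Commons CSV CSVRecord API.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;

// Hypothetical record type, not the Commons CSV CSVRecord class.
final class SimpleRecord {

    private final String[] values;
    // Shared by all records of one parse, much like the metadata of a ResultSet.
    private final Map<String, Integer> headerMap;

    SimpleRecord(final String[] values, final Map<String, Integer> headerMap) {
        this.values = values;
        this.headerMap = headerMap;
    }

    // Looks up a value by column name.
    String get(final String name) {
        final Integer index = headerMap.get(name);
        if (index == null) {
            throw new IllegalArgumentException("Unknown column: " + name);
        }
        return values[index];
    }

    // Read-only view of the column names, in the iteration order of the map.
    Set<String> headerNames() {
        return Collections.unmodifiableSet(headerMap.keySet());
    }
}
```

If the shared map is a LinkedHashMap filled in file order, headerNames()
returns the names in the order they appeared in the header record.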

Emmanuel Bourg



Re: [CSV] Headers and the first record

Posted by Gary Gregory <ga...@gmail.com>.

On Tue, Jul 30, 2013 at 5:47 PM, Emmanuel Bourg <eb...@apache.org> wrote:

> Le 30/07/2013 23:24, Gary Gregory a écrit :
>
> > Yeah, that's too clever IMO. I expected the same behavior WRT record
> > reading with the only difference being if I let the parser guess or not.
>
> Too clever? I didn't feel like I designed a rocket with this feature
> though :) That's an important feature to me and I'd like to preserve it.
>
> If the header is defined in the file I don't want to skip the first
> record manually, the parser should take care of it.

But that is exactly what _was_ happening! ;)

If I called withHeader("A", "B", "C") the header was not skipped.
If I called withHeader(new String[]{}) the header was skipped.
If I called withHeader() the header was skipped (same as line above).

In all three calls, I am telling the parser that there is a header, but the
header record is skipped in only two of them. That's the inconsistency I fixed.

What I am asking is: should we have a saveHeader setting such that IF you
ask for headers, then we save that record in the parser, it is currently
"lost", or, actually transformed into the header map.

Gary

> That also means the
> user code can remain the same, whether the header is defined in the code
> or in the file.
>
>
> > The current code now always reads the header line if you set any non-null
> > header. If you call withHeader() with no args it is a non-null call with
> an
> > empty String[].
>
> I guess a null header or an empty header is just the same and means the
> first record must be used as the header.
>
> Emmanuel Bourg
>
>

Re: [CSV] Headers and the first record

Posted by Emmanuel Bourg <eb...@apache.org>.

Le 30/07/2013 23:24, Gary Gregory a écrit :

> Yeah, that's too clever IMO. I expected the same behavior WRT record
> reading with the only difference being if I let the parser guess or not.

Too clever? I didn't feel like I designed a rocket with this feature
though :) That's an important feature to me and I'd like to preserve it.

If the header is defined in the file I don't want to skip the first
record manually, the parser should take care of it. That also means the
user code can remain the same, whether the header is defined in the code
or in the file.


> The current code now always reads the header line if you set any non-null
> header. If you call withHeader() with no args it is a non-null call with an
> empty String[].

I guess a null header or an empty header is just the same and means the
first record must be used as the header.

Emmanuel Bourg



Re: [CSV] Headers and the first record

Posted by Gary Gregory <ga...@gmail.com>.

On Tue, Jul 30, 2013 at 5:15 PM, Emmanuel Bourg <eb...@apache.org> wrote:

> I haven't checked the current code, but the intended behavior was:
>
> - no args: the first record defines the header and is not returned when
> iterating
>
> - args: the header is defined independently of the data, all the records
> are returned when iterating
>

Yeah, that's too clever IMO. I expected the same behavior WRT record
reading with the only difference being if I let the parser guess or not.

The current code now always reads the header line if you set any non-null
header. If you call withHeader() with no args it is a non-null call with an
empty String[].

The idea being that if I use headers and I ask the parser to guess or give
it the headers, I do not need to have the header line as a record.
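
A minimal sketch of that rule, assuming a hypothetical helper class
(HeaderRuleDemo, records, and this stand-in withHeader are illustrative
names, not the commons-csv implementation). It shows why a no-argument
varargs call counts as a non-null header:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of the rule described above; not the commons-csv API. */
final class HeaderRuleDemo {

    /**
     * header == null     : no header handling, every line is a record.
     * header.length == 0 : column names are guessed from the first line, which is consumed.
     * otherwise          : column names come from the caller; the first line is still consumed.
     */
    static List<String> records(List<String> lines, String[] header) {
        boolean consumeFirstLine = header != null && !lines.isEmpty();
        int start = consumeFirstLine ? 1 : 0;
        return new ArrayList<>(lines.subList(start, lines.size()));
    }

    /** Mimics a varargs setter: a call with no arguments yields an empty, non-null array. */
    static String[] withHeader(String... names) {
        return names;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("id,name", "1,Ann", "2,Bob");
        System.out.println(records(lines, null).size());                     // header disabled
        System.out.println(records(lines, withHeader()).size());             // names guessed from the file
        System.out.println(records(lines, withHeader("id", "name")).size()); // names supplied by the caller
    }
}
```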

I plan on adding a setting that allows the header record to be saved for
callers who care.

Gary

>
> Emmanuel Bourg
>
>
> On 30/07/2013 22:23, Gary Gregory wrote:
> > Actually, if you use withHeader(), no args, you _cannot_ get back the first
> > record, so that makes skipHeader=false not possible without making the
> > parser track the first record separately.
> >
> > In the interest of simplicity, I am going to make it simple: if you use
> > withHeader of any kind, then the first record is read.

Re: [CSV] Headers and the first record

Posted by Emmanuel Bourg <eb...@apache.org>.

I haven't checked the current code, but the intended behavior was:

- no args: the first record defines the header and is not returned when
iterating

- args: the header is defined independently of the data, all the records
are returned when iterating
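
The two modes listed above can be sketched as follows. This is a
stand-alone illustration under stated assumptions: IntendedHeaderDemo,
Parsed, and parse are hypothetical names, and the comma split ignores
quoting. It is not the commons-csv code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a CSV parser; not the commons-csv API. */
final class IntendedHeaderDemo {

    /** Column-name-to-index map plus the records handed back to the caller. */
    static final class Parsed {
        final Map<String, Integer> headerMap = new LinkedHashMap<>();
        final List<String[]> records = new ArrayList<>();
    }

    /**
     * No header args: the first line defines the header and is not returned.
     * Header args   : the header is defined by the caller; every line is returned.
     */
    static Parsed parse(List<String> lines, String... header) {
        Parsed out = new Parsed();
        String[] names = header;
        int start = 0;
        if (header.length == 0 && !lines.isEmpty()) {
            names = lines.get(0).split(",");
            start = 1; // the header line is consumed, not iterated
        }
        for (int i = 0; i < names.length; i++) {
            out.headerMap.put(names[i], i);
        }
        for (int i = start; i < lines.size(); i++) {
            out.records.add(lines.get(i).split(","));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("id,name", "1,Ann", "2,Bob");
        Parsed fromFile = parse(lines);               // header taken from the file
        Parsed fromCode = parse(lines, "id", "name"); // header defined in code
        System.out.println(fromFile.records.size());
        System.out.println(fromCode.records.size());
    }
}
```

In the second call the input still begins with a header line, so that line
comes back as an ordinary record. This is the case the thread is about.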

Emmanuel Bourg


On 30/07/2013 22:23, Gary Gregory wrote:
> Actually, if you use withHeader(), no args, you _cannot_ get back the first
> record, so that makes skipHeader=false not possible without making the
> parser track the first record separately.
> 
> In the interest of simplicity, I am going to make it simple: if you use
> withHeader of any kind, then the first record is read.



Re: [CSV] Headers and the first record

Posted by Gary Gregory <ga...@gmail.com>.

Actually, if you use withHeader(), no args, you _cannot_ get back the first
record, so that makes skipHeader=false not possible without making the
parser track the first record separately.

In the interest of simplicity, I am going to make it simple: if you use
withHeader of any kind, then the first record is read.

Gary


On Tue, Jul 30, 2013 at 4:15 PM, Gary Gregory <ga...@gmail.com> wrote:

> Hi All:
>
> I see now, the behavior is different depending on what you pass to
> withHeader()! Confusing indeed.
>
> If you call withHeader with Strings, the first line is not read and it is
> returned as a record.
>
> If you call withHeader with no arguments, the first line _is_ read and it
> is NOT returned as a record.
>
> I think I'll change it so that withHeader causes the first line to be
> skipped, always, and add an option skipHeaders with a default of true. So
> if you really want to set the headers AND see what they are, you can do
> that.
>
> Gary
>
>
> On Tue, Jul 30, 2013 at 3:44 PM, Gary Gregory <ga...@gmail.com> wrote:
>
>> Hi All:
>>
>> I have Excel files with headers. So I use withHeaders() of course to map
>> the headers.
>>
>> When I call parser.iterator().next(), the first record is the header
>> record, not data.
>>
>> I always have to skip this first line since it is not data.
>>
>> I wonder if:
>>
>> 1) We should automatically skip the header line for next() and
>> parser.getRecords(), or
>> 2) Add a skipHeader boolean setting to control the above behavior, where
>> the default is...?
>>
>> (2) is the most flexible.
>>
>> Thoughts?
>>
>> Gary

Re: [CSV] Headers and the first record

Posted by Gary Gregory <ga...@gmail.com>.

Hi All:

I see now, the behavior is different depending on what you pass to
withHeader()! Confusing indeed.

If you call withHeader with Strings, the first line is not read and it is
returned as a record.

If you call withHeader with no arguments, the first line _is_ read and it
is NOT returned as a record.

I think I'll change it so that withHeader causes the first line to be
skipped, always, and add an option skipHeaders with a default of true. So
if you really want to set the headers AND see what they are, you can do
that.

Gary

On Tue, Jul 30, 2013 at 3:44 PM, Gary Gregory <ga...@gmail.com> wrote:

> Hi All:
>
> I have Excel files with headers. So I use withHeaders() of course to map
> the headers.
>
> When I call parser.iterator().next(), the first record is the header
> record, not data.
>
> I always have to skip this first line since it is not data.
>
> I wonder if:
>
> 1) We should automatically skip the header line for next() and
> parser.getRecords(), or
> 2) Add a skipHeader boolean setting to control the above behavior, where
> the default is...?
>
> (2) is the most flexible.
>
> Thoughts?
>
> Gary