Posted to user@drill.apache.org by Charles Givre <cg...@gmail.com> on 2021/11/18 00:30:58 UTC

[DISCUSS] Refactoring Drill's CSV (Text) Reader

Hello Drill Community, 
I would like to put forward some thoughts I've had relating to the CSV reader in Drill.  I would like to propose a few changes which could actually be breaking changes, so I wanted to see if there are any strongly held opinions in the community.  Here goes:

The Problems:
1.  The default behavior for Drill is to leave the extractColumnHeaders option as false.  When a user queries a CSV file this way, the results are returned in a single list called columns.  Thus, if a user wants the first column, they would project columns[0].  I have never been a fan of this behavior.  Even though Drill ships with the csvh file extension, which enables header extraction, this is not a commonly used file format.  Furthermore, the returned results (the column list) do not work well with BI tools.

2.  The CSV reader does not attempt to do any kind of data type discovery.
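
To make problem 1 concrete, here is a minimal Python sketch (not Drill code) that mimics the two behaviors on a tiny in-memory file. The function names and the sample data are hypothetical; only the `columns` list and the header-extraction idea come from the message above.

```python
import csv
import io

# Hypothetical two-row CSV file with a header line.
SAMPLE = "id,name\n1,alice\n2,bob\n"


def read_as_columns_array(text):
    """Mimic header extraction OFF: each line becomes one row holding a single list."""
    return [{"columns": row} for row in csv.reader(io.StringIO(text))]


def read_with_header(text):
    """Mimic header extraction ON: the first line supplies the column names."""
    return list(csv.DictReader(io.StringIO(text)))


rows_array = read_as_columns_array(SAMPLE)
rows_named = read_with_header(SAMPLE)

# Positional access: the caller must know that index 1 means "name",
# and in this sketch the header line itself comes back as a data row.
print(rows_array[1]["columns"][1])

# Named access: the column name is usable directly.
print(rows_named[0]["name"])
```

The positional form is what makes BI tools awkward to use: they expect one named field per column, not a single list-valued field.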

Proposed Changes:
The overall goal is to make it easier to query CSV data and also to make the behavior more consistent across format plugins.
1.  Change the default behavior and set the extractHeaders to true. 
2.  Other formats, like the Excel reader, read tables directly into columns.  If the header is not known, Drill assigns a name of field_n.  I would propose replacing the `columns` array with a model similar to the Excel reader.
3.  Implement schema discovery (data types) with an allTextMode option similar to the JSON reader.  When the allTextMode is disabled, the CSV reader would attempt to infer data types. 
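
As a rough illustration of proposals 2 and 3, the following Python sketch names unnamed columns `field_n` and tries to infer a type per value unless an all-text switch is on. It is only a sketch under assumptions: the helper names are hypothetical, numbering from 0 is an arbitrary choice, and the int-then-float fallback is one simple inference rule, not the rule Drill would necessarily use.

```python
def default_names(width):
    """Generate placeholder names field_0, field_1, ... (start index is an assumption)."""
    return [f"field_{i}" for i in range(width)]


def infer_value(text, all_text_mode=False):
    """Return text unchanged in all-text mode; otherwise try int, then float."""
    if all_text_mode:
        return text
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text  # nothing matched, keep it as a string


def read_row(values, header=None, all_text_mode=False):
    """Build one named row from a list of raw string values."""
    names = header if header is not None else default_names(len(values))
    return {name: infer_value(v, all_text_mode) for name, v in zip(names, values)}


print(read_row(["1", "2.5", "x"]))
print(read_row(["1", "2.5", "x"], all_text_mode=True))
print(read_row(["7", "bob"], header=["id", "name"]))
```

A real reader would have to settle on one type per column rather than per value, for example by sampling rows or by falling back to text when values disagree; this sketch leaves that out.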

Since there are some breaking changes here, I'd like to ask if people have any strong feelings on this topic or suggestions. 
Thanks!,
-- C




AW: Re: [DISCUSS] Refactoring Drill's CSV (Text) Reader

Posted by Z0ltrix <z0...@pm.me.INVALID>.
I would appreciate such a change.

Each time I introduce Drill to users I start with a CSV example, and it's hard to explain why it has to be so difficult just to read a simple CSV file.

Data type discovery would be cool, but it is not the highest priority. Casting by users is fine as long as they have an intuitive way to query the strings.

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐

Ted Dunning <te...@gmail.com> wrote on Thursday, 18 November 2021 at 07:17:

> I think that these would be significant improvements.
>
> The current behavior is pretty painful on average. Better defaults and just
> a bit of deduction could pay off big. I even think that the presence of
> headers might be pretty reliably inferred.
>
> On Wed, Nov 17, 2021 at 4:31 PM Charles Givre cgivre@gmail.com wrote:
>
> > Hello Drill Community,
> > I would like to put forward some thoughts I've had relating to the CSV
> > reader in Drill. I would like to propose a few changes which could
> > actually be breaking changes, so I wanted to see if there are any strongly
> > held opinions in the community. Here goes:
> >
> > The Problems:
> > 1.  The default behavior for Drill is to leave the extractColumnHeaders
> >     option as false. When a user queries a CSV file this way, the results are
> >     returned in a list of columns called columns. Thus if a user wants the
> >     first column, they would project columns[0]. I have never been a fan of
> >     this behavior. Even though Drill ships with the csvh file extension which
> >     enables the header extraction, this is not a commonly used file format.
> >     Furthermore, the returned results (the column list) does not work well with
> >     BI tools.
> > 2.  The CSV reader does not attempt to do any kind of data type discovery.
> >
> > Proposed Changes:
> > The overall goal is to make it easier to query CSV data and also to make
> > the behavior more consistent across format plugins.
> > 1.  Change the default behavior and set the extractHeaders to true.
> > 2.  Other formats, like the excel reader, read tables directly into
> >     columns. If the header is not known, Drill assigns a name of field_n. I
> >     would propose replacing the `columns` array with a model similar to the
> >     Excel reader.
> > 3.  Implement schema discovery (data types) with an allTextMode option
> >     similar to the JSON reader. When the allTextMode is disabled, the CSV
> >     reader would attempt to infer data types.
> >
> > Since there are some breaking changes here, I'd like to ask if people have
> > any strong feelings on this topic or suggestions.
> > Thanks!,
> > -- C

Re: [DISCUSS] Refactoring Drill's CSV (Text) Reader

Posted by Ted Dunning <te...@gmail.com>.
I think that these would be significant improvements.

The current behavior is pretty painful on average. Better defaults and just
a bit of deduction could pay off big. I even think that the presence of
headers might be pretty reliably inferred.
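
One way the header inference mentioned here could work is sketched below in Python. This is a hypothetical heuristic, not anything Drill implements: it guesses that a header is present when the first row contains no numeric cells while at least one later row does. Python's standard library offers a comparable heuristic in `csv.Sniffer().has_header`.

```python
def looks_numeric(text):
    """True if the cell parses as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def guess_has_header(rows):
    """Guess whether rows[0] is a header.

    rows is a list of lists of strings. The guess is True only when the
    first row has no numeric cells and some later row has at least one.
    """
    if len(rows) < 2:
        return False  # not enough evidence either way
    first, rest = rows[0], rows[1:]
    if any(looks_numeric(cell) for cell in first):
        return False
    return any(looks_numeric(cell) for row in rest for cell in row)


print(guess_has_header([["id", "name"], ["1", "alice"], ["2", "bob"]]))
print(guess_has_header([["1", "alice"], ["2", "bob"]]))
```

The heuristic is deliberately conservative: a file whose cells are all text gives no evidence, so the sketch reports no header rather than guessing.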



On Wed, Nov 17, 2021 at 4:31 PM Charles Givre <cg...@gmail.com> wrote:

> Hello Drill Community,
> I would like to put forward some thoughts I've had relating to the CSV
> reader in Drill.  I would like to propose a few changes which could
> actually be breaking changes, so I wanted to see if there are any strongly
> held opinions in the community.  Here goes:
>
> The Problems:
> 1.  The default behavior for Drill is to leave the extractColumnHeaders
> option as false.  When a user queries a CSV file this way, the results are
> returned in a list of columns called columns.  Thus if a user wants the
> first column, they would project columns[0].  I have never been a fan of
> this behavior.  Even though Drill ships with the csvh file extension which
> enables the header extraction, this is not a commonly used file format.
> Furthermore, the returned results (the column list) does not work well with
> BI tools.
>
> 2.  The CSV reader does not attempt to do any kind of data type discovery.
>
> Proposed Changes:
> The overall goal is to make it easier to query CSV data and also to make
> the behavior more consistent across format plugins.
> 1.  Change the default behavior and set the extractHeaders to true.
> 2.  Other formats, like the excel reader, read tables directly into
> columns.  If the header is not known, Drill assigns a name of field_n.  I
> would propose replacing the `columns` array with a model similar to the
> Excel reader.
> 3.  Implement schema discovery (data types) with an allTextMode option
> similar to the JSON reader.  When the allTextMode is disabled, the CSV
> reader would attempt to infer data types.
>
> Since there are some breaking changes here, I'd like to ask if people have
> any strong feelings on this topic or suggestions.
> Thanks!,
> -- C
>
>
>
>

Re: [DISCUSS] Refactoring Drill's CSV (Text) Reader

Posted by Дмитрий Владимирович <kd...@gmail.com>.
Please exclude me from this conversation.

On Thu, 18 Nov 2021, 13:30 Charles Givre <cg...@gmail.com> wrote:

> Hi James,
> I do think it might be time to start considering creating a wiki of
> breaking changes for a Drill 2.0.  I'd also concur that having tons of
> config options that don't really add value is not a good use of config
> options as it leads to the creation of a lot of technical debt. I'll start
> a wiki page and put this on there.
>
> In the meantime, I may submit a PR that changes the default value of
> extractHeaders for CSV to true.  I don't really see that as a breaking
> change in that a user can simply change that flag and the previous behavior
> is restored.
> Best,
> -- C
>
>
>
> > On Nov 18, 2021, at 2:34 AM, James Turton <dz...@apache.org> wrote:
> >
> > Definitely a +1 for this friendlier default behaviour and another +1 for
> the prospect of increased consistency across format plugins.
> >
> > My follow-up questions to the community.
> > Since these are examples of user-breaking changes, and not just in niche
> areas, are we approaching a point when we want to start working on Drill
> 2.x?
> > Do we have other user-breaking or significant refactoring ideas that
> we've been keeping stashed away in our heads, that would get their chance
> at life from the fact that a 2.x Drill can defensibly exhibit some
> incompatibilities with Drill 1.x?
> > Should we make a "Drill v2 Parking Lot" page in the Dev Wiki where we
> record such ideas?
> > Would we be fine in terms of dev resources with supporting both bug fix
> releases to a 1.x series and also pushing forward in a 2.x series?
> > My own feeling is that to get the most value from a good proposal such
> as the below, we don't want to conceal everything behind default-false
> options in order to avoid breaking Drill 1.x users, we want to embrace the
> breakage which (to me) points to Drill 2.x.
> >
> > On 2021/11/18 02:30, Charles Givre wrote:
> >> Hello Drill Community,
> >> I would like to put forward some thoughts I've had relating to the CSV
> reader in Drill.  I would like to propose a few changes which could
> actually be breaking changes, so I wanted to see if there are any strongly
> held opinions in the community.  Here goes:
> >>
> >> The Problems:
> >> 1.  The default behavior for Drill is to leave the extractColumnHeaders
> option as false.  When a user queries a CSV file this way, the results are
> returned in a list of columns called columns.  Thus if a user wants the
> first column, they would project columns[0].  I have never been a fan of
> this behavior.  Even though Drill ships with the csvh file extension which
> enables the header extraction, this is not a commonly used file format.
> Furthermore, the returned results (the column list) does not work well with
> BI tools.
> >>
> >> 2.  The CSV reader does not attempt to do any kind of data type
> discovery.
> >>
> >> Proposed Changes:
> >> The overall goal is to make it easier to query CSV data and also to
> make the behavior more consistent across format plugins.
> >> 1.  Change the default behavior and set the extractHeaders to true.
> >> 2.  Other formats, like the excel reader, read tables directly into
> columns.  If the header is not known, Drill assigns a name of field_n.  I
> would propose replacing the `columns` array with a model similar to the
> Excel reader.
> >> 3.  Implement schema discovery (data types) with an allTextMode option
> similar to the JSON reader.  When the allTextMode is disabled, the CSV
> reader would attempt to infer data types.
> >>
> >> Since there are some breaking changes here, I'd like to ask if people
> have any strong feelings on this topic or suggestions.
> >> Thanks!,
> >> -- C
> >>
> >>
> >>
> >
> > <dzamo.vcf>
>
>

Re: [DISCUSS] Refactoring Drill's CSV (Text) Reader

Posted by Charles Givre <cg...@gmail.com>.
Hi James,
I do think it might be time to start considering creating a wiki of breaking changes for a Drill 2.0.  I'd also concur that piling up config options that don't really add value is a poor practice, as it creates a lot of technical debt.  I'll start a wiki page and put this on there.

In the meantime, I may submit a PR that changes the default value of extractHeaders for CSV to true.  I don't really see that as a breaking change, in that a user can simply change that flag and the previous behavior is restored.
Best,
-- C
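
For readers who want to see what flipping that flag looks like, here is a small Python sketch that builds the `csv` entry of a storage plugin's format section. It rests on assumptions worth checking against the Drill documentation: the key names (`type`, `extensions`, `delimiter`, `extractHeader`) reflect the text format configuration as I understand it, and the key is spelled `extractHeader` there rather than `extractHeaders` as written in this thread. The helper function name is hypothetical.

```python
import json


def csv_format_config(extract_header):
    """Build a text-format entry for .csv files (key names are assumptions, see above)."""
    return {
        "type": "text",
        "extensions": ["csv"],
        "delimiter": ",",
        "extractHeader": extract_header,
    }


proposed = csv_format_config(True)   # the default proposed in this thread
legacy = csv_format_config(False)    # restores the columns[n] behavior

print(json.dumps({"csv": proposed}, indent=2))
```

Under this reading, restoring the previous behavior is a one-key change in the plugin configuration, which is the basis for not treating the new default as a hard break.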



> On Nov 18, 2021, at 2:34 AM, James Turton <dz...@apache.org> wrote:
> 
> Definitely a +1 for this friendlier default behaviour and another +1 for the prospect of increased consistency across format plugins.
> 
> My follow-up questions to the community.
> Since these are examples of user-breaking changes, and not just in niche areas, are we approaching a point when we want to start working on Drill 2.x?
> Do we have other user-breaking or significant refactoring ideas that we've been keeping stashed away in our heads, that would get their chance at life from the fact that a 2.x Drill can defensibly exhibit some incompatibilities with Drill 1.x?
> Should we make a "Drill v2 Parking Lot" page in the Dev Wiki where we record such ideas?
> Would we be fine in terms of dev resources with supporting both bug fix releases to a 1.x series and also pushing forward in a 2.x series?
> My own feeling is that to get the most value from a good proposal such as the below, we don't want to conceal everything behind default-false options in order to avoid breaking Drill 1.x users, we want to embrace the breakage which (to me) points to Drill 2.x.
> 
> On 2021/11/18 02:30, Charles Givre wrote:
>> Hello Drill Community, 
>> I would like to put forward some thoughts I've had relating to the CSV reader in Drill.  I would like to propose a few changes which could actually be breaking changes, so I wanted to see if there are any strongly held opinions in the community.  Here goes:
>> 
>> The Problems:
>> 1.  The default behavior for Drill is to leave the extractColumnHeaders option as false.  When a user queries a CSV file this way, the results are returned in a list of columns called columns.  Thus if a user wants the first column, they would project columns[0].  I have never been a fan of this behavior.  Even though Drill ships with the csvh file extension which enables the header extraction, this is not a commonly used file format.  Furthermore, the returned results (the column list) does not work well with BI tools. 
>> 
>> 2.  The CSV reader does not attempt to do any kind of data type discovery.
>> 
>> Proposed Changes:
>> The overall goal is to make it easier to query CSV data and also to make the behavior more consistent across format plugins.
>> 1.  Change the default behavior and set the extractHeaders to true. 
>> 2.  Other formats, like the excel reader, read tables directly into columns.  If the header is not known, Drill assigns a name of field_n.  I would propose replacing the `columns` array with a model similar to the Excel reader. 
>> 3.  Implement schema discovery (data types) with an allTextMode option similar to the JSON reader.  When the allTextMode is disabled, the CSV reader would attempt to infer data types. 
>> 
>> Since there are some breaking changes here, I'd like to ask if people have any strong feelings on this topic or suggestions. 
>> Thanks!,
>> -- C
>> 
>> 
>> 
> 
> <dzamo.vcf>


Re: [DISCUSS] Refactoring Drill's CSV (Text) Reader

Posted by James Turton <dz...@apache.org>.
Definitely a +1 for this friendlier default behaviour and another +1 for 
the prospect of increased consistency across format plugins.

My follow-up questions to the community.

 1. Since these are examples of user-breaking changes, and not just in
    niche areas, are we approaching a point when we want to start
    working on Drill 2.x?
 2. Do we have other user-breaking or significant refactoring ideas that
    we've been keeping stashed away in our heads, that would get their
    chance at life from the fact that a 2.x Drill can defensibly exhibit
    some incompatibilities with Drill 1.x?
 3. Should we make a "Drill v2 Parking Lot" page in the Dev Wiki where
    we record such ideas?
 4. Would we be fine in terms of dev resources with supporting both bug
    fix releases to a 1.x series and also pushing forward in a 2.x series?

My own feeling is that to get the most value from a good proposal such 
as the below, we don't want to conceal everything behind default-false 
options in order to avoid breaking Drill 1.x users; we want to embrace 
the breakage, which (to me) points to Drill 2.x.

On 2021/11/18 02:30, Charles Givre wrote:
> Hello Drill Community,
> I would like to put forward some thoughts I've had relating to the CSV reader in Drill.  I would like to propose a few changes which could actually be breaking changes, so I wanted to see if there are any strongly held opinions in the community.  Here goes:
>
> The Problems:
> 1.  The default behavior for Drill is to leave the extractColumnHeaders option as false.  When a user queries a CSV file this way, the results are returned in a list of columns called columns.  Thus if a user wants the first column, they would project columns[0].  I have never been a fan of this behavior.  Even though Drill ships with the csvh file extension which enables the header extraction, this is not a commonly used file format.  Furthermore, the returned results (the column list) does not work well with BI tools.
>
> 2.  The CSV reader does not attempt to do any kind of data type discovery.
>
> Proposed Changes:
> The overall goal is to make it easier to query CSV data and also to make the behavior more consistent across format plugins.
> 1.  Change the default behavior and set the extractHeaders to true.
> 2.  Other formats, like the excel reader, read tables directly into columns.  If the header is not known, Drill assigns a name of field_n.  I would propose replacing the `columns` array with a model similar to the Excel reader.
> 3.  Implement schema discovery (data types) with an allTextMode option similar to the JSON reader.  When the allTextMode is disabled, the CSV reader would attempt to infer data types.
>
> Since there are some breaking changes here, I'd like to ask if people have any strong feelings on this topic or suggestions.
> Thanks!,
> -- C
>
>
>
