Posted to user@drill.apache.org by Larry White <lj...@gmail.com> on 2015/07/02 21:48:59 UTC

(noob) performance of queries against csv files

hi,

i'm using drill to provide a queryable wrapper around some csv files. when
i load a csv datasource, is the data transformed in some way (beyond what
Calcite does) to improve performance?  Specifically, is it transformed into
a columnar format, rewritten as parquet, or otherwise optimized?

thanks.

larry

Re: (noob) performance of queries against csv files

Posted by Jason Altekruse <al...@gmail.com>.
I would recommend a select with specific column aliases assigned and casts
where appropriate, for example:

create table parquet_users as
select cast(columns[0] as int)       as user_id,
       columns[1]                    as username,
       cast(columns[2] as timestamp) as registration_date
from `users.csv`;
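
As a quick sanity check after the conversion, you can query the new table
directly (a sketch; the table and column names follow the example above):

select user_id, username, registration_date
from parquet_users
limit 10;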


On Thu, Jul 2, 2015 at 2:46 PM, Larry White <lj...@gmail.com> wrote:

> so the solution is to use select, but with columns specifically defined. is
> that right?

Re: (noob) performance of queries against csv files

Posted by Larry White <lj...@gmail.com>.
so the solution is to use select, but with columns specifically defined. is
that right?

On Thu, Jul 2, 2015 at 4:48 PM, Jason Altekruse <al...@gmail.com>
wrote:


Re: (noob) performance of queries against csv files

Posted by Larry White <lj...@gmail.com>.
ok, thanks for that tip. i will be using the local file system, as i'm
looking at embedding drill in an analytics program. the performance needn't
be brilliant for a while, but i will keep an eye on development and use
smaller files if needed.

On Thu, Jul 2, 2015 at 4:57 PM, Jacques Nadeau <ja...@apache.org> wrote:


Re: (noob) performance of queries against csv files

Posted by Jacques Nadeau <ja...@apache.org>.
And one more note.  Your question didn't say whether you were running on
HDFS or the local file system.  There is one weakness in the local file
system capability: we don't split files on block boundaries.  We had a
patch out for that, but it hasn't been merged.  If you're in that
situation, you may want to split your files into smaller ones manually
until such a patch is merged.

On Thu, Jul 2, 2015 at 1:48 PM, Jason Altekruse <al...@gmail.com>
wrote:


Re: (noob) performance of queries against csv files

Posted by Ted Dunning <te...@gmail.com>.
Jason's advice here is sage.



On Thu, Jul 2, 2015 at 1:48 PM, Jason Altekruse <al...@gmail.com>
wrote:


Re: (noob) performance of queries against csv files

Posted by Jason Altekruse <al...@gmail.com>.
Just one additional note here: I would strongly advise against converting
csv files using a select * query.

The reason is two-fold. First, we currently read each csv line into a
single column called `columns`, which holds a list of varchars, rather than
into individual columns. While parquet supports lists and we will read
them, the read path for complex data has not been optimized as much as our
read path for flat data out of parquet. You will get the best performance
by selecting individual fields out of the `columns` array with our array
syntax and assigning them meaningful names, for example: select columns[0]
as user_id, columns[1] as user_name, ... from `foo.csv`. Second, for any
columns with known types like int, float, timestamp, etc., I would also
recommend inserting casts where appropriate. Reading fixed-width data is
faster than reading a file full of varchars, and on top of that read
overhead, storing everything as varchars adds cost later: your future
queries would need a cast anyway to actually analyze the data.
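
To make that concrete, here is a sketch of the two approaches (the paths
and the writable `dfs.tmp` workspace are placeholders; adjust them to your
storage configuration):

-- discouraged: copies the whole `columns` varchar array into parquet as a list
create table dfs.tmp.`users_wide` as
select * from dfs.`/data/users.csv`;

-- recommended: project named, typed columns so parquet stores flat data
create table dfs.tmp.`users` as
select cast(columns[0] as int)       as user_id,
       columns[1]                    as user_name,
       cast(columns[2] as timestamp) as registration_date
from dfs.`/data/users.csv`;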



On Thu, Jul 2, 2015 at 1:27 PM, Larry White <lj...@gmail.com> wrote:


Re: (noob) performance of queries against csv files

Posted by Larry White <lj...@gmail.com>.
Great.  Thanks much

On Thursday, July 2, 2015, Ted Dunning <te...@gmail.com> wrote:


Re: (noob) performance of queries against csv files

Posted by Ted Dunning <te...@gmail.com>.
Hey Larry,

Drill transforms your CSV data into an internal memory-resident format for
processing, but does not change the structure of your original data.

If you want to convert your file to parquet, you can do this:

create table `foo.parquet` as select * from `foo.csv`


This will, however, not leave you with meaningful column names.  You can
add names inside the select, or by putting a parenthesized list of fields
after the word 'table'.  Often you will want to add casts in the select to
indicate what type of data you want to use.
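
For example, either form Ted describes might look like this (a sketch with
made-up column names; exact CTAS column-list support can vary by Drill
version):

create table `foo.parquet` as
select columns[0] as user_id, cast(columns[1] as int) as age
from `foo.csv`;

create table `foo.parquet` (user_id, age) as
select columns[0], cast(columns[1] as int)
from `foo.csv`;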




On Thu, Jul 2, 2015 at 12:48 PM, Larry White <lj...@gmail.com> wrote:
