You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Avram Aelony <Av...@eharmony.com> on 2009/09/16 19:48:33 UTC

adding filenames as new columns via Hive

Dear Hive list,

I am processing a large volume of files (many files, roughly 500M compressed ) with Hive that reside in an S3 bucket.  Although the files share the same schema,  they have individual filenames that provide useful information that does not get captured and does not exist separately as a column within each file's data.  As a general problem, I'd like to be able to add a new column via Hive that contains the filename of the files read in that were present in the bucket.

My Hive CREATE EXTERNAL TABLE command points to the S3 container bucket, and I am thinking that at some point Hadoop or Hive must have a file handle with the filenames that perhaps could be of use.  My hope is that this information could be added in (upon request) via Hive.   Perhaps as this could be a new Hive feature request (if it does not currently exist) ??

Ideally, the syntax would look something like this:

create external table FOO (  <list of fields and types> )
row format delimited fields terminated by ','
add_filename as 'filename'
stored as textfile location 's3:/somebucket/';


Has anyone thought of this?  Is there a way to add a new column within Hive that contains the filename?



Many thanks in advance!!
-Avram



Avram Aelony
Senior Analyst, Matching
eHarmony.com


RE: adding filenames as new columns via Hive

Posted by Avram Aelony <Av...@eharmony.com>.
Clarification, the line below should have been "many files, roughly 500MB compressed" instead of "many files, roughly 500M compressed"..
-A


From: Avram Aelony [mailto:AvramAelony@eharmony.com]
Sent: Wednesday, September 16, 2009 10:49 AM
To: hive-user@hadoop.apache.org
Subject: adding filenames as new columns via Hive

Dear Hive list,

I am processing a large volume of files (many files, roughly 500M compressed ) with Hive that reside in an S3 bucket.  Although the files share the same schema,  they have individual filenames that provide useful information that does not get captured and does not exist separately as a column within each file's data.  As a general problem, I'd like to be able to add a new column via Hive that contains the filename of the files read in that were present in the bucket.

My Hive CREATE EXTERNAL TABLE command points to the S3 container bucket, and I am thinking that at some point Hadoop or Hive must have a file handle with the filenames that perhaps could be of use.  My hope is that this information could be added in (upon request) via Hive.   Perhaps as this could be a new Hive feature request (if it does not currently exist) ??

Ideally, the syntax would look something like this:

create external table FOO (  <list of fields and types> )
row format delimited fields terminated by ','
add_filename as 'filename'
stored as textfile location 's3:/somebucket/';


Has anyone thought of this?  Is there a way to add a new column within Hive that contains the filename?



Many thanks in advance!!
-Avram



Avram Aelony
Senior Analyst, Matching
eHarmony.com


RE: adding filenames as new columns via Hive

Posted by Avram Aelony <Av...@eharmony.com>.
I defer to your judgment, but this appears to me to be a case where distinctions between meta-data and row-data are blurry.  In my case, the filename is useful to have only because it contains information that should have been included in the row-data.  Thus the filename adds information that is not truly meta-data, especially if  there are hundreds of filenames, each with various substrings that really should have been new grouping variables within each file.

That said, the important point is that  it is clear that the filename info is present and available for further parsing.  I would think the intent of a describe command would be to describe what is available, but if there is a way to describe meta-data that wouldn't be overlooked by an analyst poking around for the data being looked for, that works for me...

Cheers,
-A




From: Prasad Chakka [mailto:pchakka@facebook.com]
Sent: Wednesday, September 16, 2009 12:22 PM
To: hive-user@hadoop.apache.org
Subject: Re: adding filenames as new columns via Hive

I think all hive tables will have this metadata (relpath and other hdfs file properties potentially) available by default. So there is no need for this to be shown in describe.

________________________________
From: Avram Aelony <Av...@eharmony.com>
Reply-To: <hi...@hadoop.apache.org>
Date: Wed, 16 Sep 2009 12:19:21 -0700
To: <hi...@hadoop.apache.org>
Subject: RE: adding filenames as new columns via Hive

Actually, I think it would be great to see filename in the describe command (if the table specified it be included at table create time).

Part of the issue is that necessary information is only available in the filename and not in the row data.  Making the filename available (for further parsing) in the describe command would alert the analyst that the sought-after information is actually present and need only be further massaged...

-A


-----Original Message-----
From: Ashish Thusoo [mailto:athusoo@facebook.com]
Sent: Wednesday, September 16, 2009 12:11 PM
To: hive-user@hadoop.apache.org
Subject: RE: adding filenames as new columns via Hive

You could also do this as a simple udf instead of a virtual column. Virtual columns do get shown in the describe command and I don't think it would make sense to show this in the describe command. So instead of


Select FILENAME, xyz from T

We could just do

Select Filename(), xyz from T

Thoughts?

Ashish

-----Original Message-----
From: Edward Capriolo [mailto:edlinuxguru@gmail.com]
Sent: Wednesday, September 16, 2009 12:05 PM
To: hive-user@hadoop.apache.org
Subject: Re: adding filenames as new columns via Hive

I just put in a related thread about this. This would be really nice.
It is just a virtual column, we dont need it in the metadata if we also have a command like 'show files in partition' so we can inspect what is there as well.


On Wed, Sep 16, 2009 at 3:02 PM, Namit Jain <nj...@facebook.com> wrote:
> I don't think it is a good idea to make it a part of table metadata in
> any way.
>
> What happens if the filename changes ? It will be very difficult to
> maintain.
>
> But, we can definitely add some virtual columns (FILENAME can be one
> of them
>
> to start with - it should not show up in describe, select * etc.
>
>
>
> But, the user can query based on them - this is mostly for advanced
> users and
>
> can be used for pruning etc. also
>
>
>
>
>
> I will open a new jira, and we can continue the discussion there.
>
>
>
>
>
> -namit
>
>
>
>
>
>
>
>
>
> From: Avram Aelony [mailto:AvramAelony@eharmony.com]
> Sent: Wednesday, September 16, 2009 11:39 AM
> To: hive-user@hadoop.apache.org
> Subject: RE: adding filenames as new columns via Hive
>
>
>
>
>
> Very cool.  Looking forward to seeing this feature in action. J
>
>
>
> Thanks,
>
> -A
>
>
>
>
>
> From: Prasad Chakka [mailto:pchakka@facebook.com]
> Sent: Wednesday, September 16, 2009 11:33 AM
> To: hive-user@hadoop.apache.org
> Subject: Re: adding filenames as new columns via Hive
>
>
>
> FYI, all partition columns can be used as any regular columns select
> queries. So it should be fine.
>
> ________________________________
>
> From: Avram Aelony <Av...@eharmony.com>
> Reply-To: <hi...@hadoop.apache.org>
> Date: Wed, 16 Sep 2009 11:23:45 -0700
> To: <hi...@hadoop.apache.org>
> Subject: RE: adding filenames as new columns via Hive
>
> Sounds great, Prasad.
>
> As long as I can further parse the filename field to piece out (new)
> derived fields, I will be happy. J For example, in a later query I'd
> like to be able to do something like:
>
> select
> substr(filename, 4, 7) as  class_A,
> substr(filename,  8, 10) as class_B
> count( x ) as cnt
> from FOO
> group by
> substr(filename, 4, 7),
> substr(filename,  8, 10) ;
>
>
> thanks,
> -A
>
>
>
> From: Prasad Chakka [mailto:pchakka@facebook.com]
> Sent: Wednesday, September 16, 2009 11:10 AM
> To: hive-user@hadoop.apache.org
> Subject: Re: adding filenames as new columns via Hive
>
> I think this can be a good feature though I would like the filename to
> be a partition column (one of such) instead of a separate type of
> column. Would that work?
>
> create external table FOO (  <list of fields and types> ) row format
> delimited fields terminated by ','
> partitioned by (file_name FILENAME)
> stored as textfile location 's3:/somebucket/';
>
> Or table partitioned by datestamp and filename
>
> create external table FOO (  <list of fields and types> ) row format
> delimited fields terminated by ','
> Partitioned by (ds STRING, file_name FILENAME) stored as textfile
> location 's3:/somebucket/';
>
>
> So FILENAME becomes a new type. I like this because partition columns
> are virtual columns just like the filename column and do not exist
> along with data on the disk.
>
> Prasad
>
> ________________________________
>
> From: Avram Aelony <Av...@eharmony.com>
> Reply-To: <hi...@hadoop.apache.org>
> Date: Wed, 16 Sep 2009 10:48:33 -0700
> To: <hi...@hadoop.apache.org>
> Subject: adding filenames as new columns via Hive
>
> Dear Hive list,
>
> I am processing a large volume of files (many files, roughly 500M
> compressed
> ) with Hive that reside in an S3 bucket.  Although the files share the
> same schema,  they have individual filenames that provide useful
> information that does not get captured and does not exist separately
> as a column within each file's data.  As a general problem, I'd like
> to be able to add a new column via Hive that contains the filename of
> the files read in that were present in the bucket.
>
> My Hive CREATE EXTERNAL TABLE command points to the S3 container
> bucket, and I am thinking that at some point Hadoop or Hive must have
> a file handle with the filenames that perhaps could be of use.  My
> hope is that this information could be added in (upon request) via
> Hive.   Perhaps as this could be a new Hive feature request (if it does not currently exist) ??
>
> Ideally, the syntax would look something like this:
>
> create external table FOO (  <list of fields and types> ) row format
> delimited fields terminated by ','
> add_filename as 'filename'
> stored as textfile location 's3:/somebucket/';
>
>
> Has anyone thought of this?  Is there a way to add a new column within
> Hive that contains the filename?
>
>
>
> Many thanks in advance!!
> -Avram
>
>
>
> Avram Aelony
> Senior Analyst, Matching
> eHarmony.com

Re: adding filenames as new columns via Hive

Posted by Prasad Chakka <pc...@facebook.com>.
I think all hive tables will have this metadata (relpath and other hdfs file properties potentially) available by default. So there is no need for this to be shown in describe.


________________________________
From: Avram Aelony <Av...@eharmony.com>
Reply-To: <hi...@hadoop.apache.org>
Date: Wed, 16 Sep 2009 12:19:21 -0700
To: <hi...@hadoop.apache.org>
Subject: RE: adding filenames as new columns via Hive

Actually, I think it would be great to see filename in the describe command (if the table specified it be included at table create time).

Part of the issue is that necessary information is only available in the filename and not in the row data.  Making the filename available (for further parsing) in the describe command would alert the analyst that the sought-after information is actually present and need only be further massaged...

-A


-----Original Message-----
From: Ashish Thusoo [mailto:athusoo@facebook.com]
Sent: Wednesday, September 16, 2009 12:11 PM
To: hive-user@hadoop.apache.org
Subject: RE: adding filenames as new columns via Hive

You could also do this as a simple udf instead of a virtual column. Virtual columns do get shown in the describe command and I don't think it would make sense to show this in the describe command. So instead of


Select FILENAME, xyz from T

We could just do

Select Filename(), xyz from T

Thoughts?

Ashish

-----Original Message-----
From: Edward Capriolo [mailto:edlinuxguru@gmail.com]
Sent: Wednesday, September 16, 2009 12:05 PM
To: hive-user@hadoop.apache.org
Subject: Re: adding filenames as new columns via Hive

I just put in a related thread about this. This would be really nice.
It is just a virtual column, we dont need it in the metadata if we also have a command like 'show files in partition' so we can inspect what is there as well.


On Wed, Sep 16, 2009 at 3:02 PM, Namit Jain <nj...@facebook.com> wrote:
> I don't think it is a good idea to make it a part of table metadata in
> any way.
>
> What happens if the filename changes ? It will be very difficult to
> maintain.
>
> But, we can definitely add some virtual columns (FILENAME can be one
> of them
>
> to start with - it should not show up in describe, select * etc.
>
>
>
> But, the user can query based on them - this is mostly for advanced
> users and
>
> can be used for pruning etc. also
>
>
>
>
>
> I will open a new jira, and we can continue the discussion there.
>
>
>
>
>
> -namit
>
>
>
>
>
>
>
>
>
> From: Avram Aelony [mailto:AvramAelony@eharmony.com]
> Sent: Wednesday, September 16, 2009 11:39 AM
> To: hive-user@hadoop.apache.org
> Subject: RE: adding filenames as new columns via Hive
>
>
>
>
>
> Very cool.  Looking forward to seeing this feature in action. J
>
>
>
> Thanks,
>
> -A
>
>
>
>
>
> From: Prasad Chakka [mailto:pchakka@facebook.com]
> Sent: Wednesday, September 16, 2009 11:33 AM
> To: hive-user@hadoop.apache.org
> Subject: Re: adding filenames as new columns via Hive
>
>
>
> FYI, all partition columns can be used as any regular columns select
> queries. So it should be fine.
>
> ________________________________
>
> From: Avram Aelony <Av...@eharmony.com>
> Reply-To: <hi...@hadoop.apache.org>
> Date: Wed, 16 Sep 2009 11:23:45 -0700
> To: <hi...@hadoop.apache.org>
> Subject: RE: adding filenames as new columns via Hive
>
> Sounds great, Prasad.
>
> As long as I can further parse the filename field to piece out (new)
> derived fields, I will be happy. J For example, in a later query I'd
> like to be able to do something like:
>
> select
> substr(filename, 4, 7) as  class_A,
> substr(filename,  8, 10) as class_B
> count( x ) as cnt
> from FOO
> group by
> substr(filename, 4, 7),
> substr(filename,  8, 10) ;
>
>
> thanks,
> -A
>
>
>
> From: Prasad Chakka [mailto:pchakka@facebook.com]
> Sent: Wednesday, September 16, 2009 11:10 AM
> To: hive-user@hadoop.apache.org
> Subject: Re: adding filenames as new columns via Hive
>
> I think this can be a good feature though I would like the filename to
> be a partition column (one of such) instead of a separate type of
> column. Would that work?
>
> create external table FOO (  <list of fields and types> ) row format
> delimited fields terminated by ','
> partitioned by (file_name FILENAME)
> stored as textfile location 's3:/somebucket/';
>
> Or table partitioned by datestamp and filename
>
> create external table FOO (  <list of fields and types> ) row format
> delimited fields terminated by ','
> Partitioned by (ds STRING, file_name FILENAME) stored as textfile
> location 's3:/somebucket/';
>
>
> So FILENAME becomes a new type. I like this because partition columns
> are virtual columns just like the filename column and do not exist
> along with data on the disk.
>
> Prasad
>
> ________________________________
>
> From: Avram Aelony <Av...@eharmony.com>
> Reply-To: <hi...@hadoop.apache.org>
> Date: Wed, 16 Sep 2009 10:48:33 -0700
> To: <hi...@hadoop.apache.org>
> Subject: adding filenames as new columns via Hive
>
> Dear Hive list,
>
> I am processing a large volume of files (many files, roughly 500M
> compressed
> ) with Hive that reside in an S3 bucket.  Although the files share the
> same schema,  they have individual filenames that provide useful
> information that does not get captured and does not exist separately
> as a column within each file's data.  As a general problem, I'd like
> to be able to add a new column via Hive that contains the filename of
> the files read in that were present in the bucket.
>
> My Hive CREATE EXTERNAL TABLE command points to the S3 container
> bucket, and I am thinking that at some point Hadoop or Hive must have
> a file handle with the filenames that perhaps could be of use.  My
> hope is that this information could be added in (upon request) via
> Hive.   Perhaps as this could be a new Hive feature request (if it does not currently exist) ??
>
> Ideally, the syntax would look something like this:
>
> create external table FOO (  <list of fields and types> ) row format
> delimited fields terminated by ','
> add_filename as 'filename'
> stored as textfile location 's3:/somebucket/';
>
>
> Has anyone thought of this?  Is there a way to add a new column within
> Hive that contains the filename?
>
>
>
> Many thanks in advance!!
> -Avram
>
>
>
> Avram Aelony
> Senior Analyst, Matching
> eHarmony.com


RE: adding filenames as new columns via Hive

Posted by Avram Aelony <Av...@eharmony.com>.
Actually, I think it would be great to see filename in the describe command (if the table specified it be included at table create time).  

Part of the issue is that necessary information is only available in the filename and not in the row data.  Making the filename available (for further parsing) in the describe command would alert the analyst that the sought-after information is actually present and need only be further massaged...

-A


-----Original Message-----
From: Ashish Thusoo [mailto:athusoo@facebook.com] 
Sent: Wednesday, September 16, 2009 12:11 PM
To: hive-user@hadoop.apache.org
Subject: RE: adding filenames as new columns via Hive

You could also do this as a simple udf instead of a virtual column. Virtual columns do get shown in the describe command and I don't think it would make sense to show this in the describe command. So instead of


Select FILENAME, xyz from T

We could just do

Select Filename(), xyz from T

Thoughts?

Ashish 

-----Original Message-----
From: Edward Capriolo [mailto:edlinuxguru@gmail.com] 
Sent: Wednesday, September 16, 2009 12:05 PM
To: hive-user@hadoop.apache.org
Subject: Re: adding filenames as new columns via Hive

I just put in a related thread about this. This would be really nice.
It is just a virtual column, we dont need it in the metadata if we also have a command like 'show files in partition' so we can inspect what is there as well.


On Wed, Sep 16, 2009 at 3:02 PM, Namit Jain <nj...@facebook.com> wrote:
> I don't think it is a good idea to make it a part of table metadata in 
> any way.
>
> What happens if the filename changes ? It will be very difficult to 
> maintain.
>
> But, we can definitely add some virtual columns (FILENAME can be one 
> of them
>
> to start with - it should not show up in describe, select * etc.
>
>
>
> But, the user can query based on them - this is mostly for advanced 
> users and
>
> can be used for pruning etc. also
>
>
>
>
>
> I will open a new jira, and we can continue the discussion there.
>
>
>
>
>
> -namit
>
>
>
>
>
>
>
>
>
> From: Avram Aelony [mailto:AvramAelony@eharmony.com]
> Sent: Wednesday, September 16, 2009 11:39 AM
> To: hive-user@hadoop.apache.org
> Subject: RE: adding filenames as new columns via Hive
>
>
>
>
>
> Very cool.  Looking forward to seeing this feature in action. J
>
>
>
> Thanks,
>
> -A
>
>
>
>
>
> From: Prasad Chakka [mailto:pchakka@facebook.com]
> Sent: Wednesday, September 16, 2009 11:33 AM
> To: hive-user@hadoop.apache.org
> Subject: Re: adding filenames as new columns via Hive
>
>
>
> FYI, all partition columns can be used as any regular columns select 
> queries. So it should be fine.
>
> ________________________________
>
> From: Avram Aelony <Av...@eharmony.com>
> Reply-To: <hi...@hadoop.apache.org>
> Date: Wed, 16 Sep 2009 11:23:45 -0700
> To: <hi...@hadoop.apache.org>
> Subject: RE: adding filenames as new columns via Hive
>
> Sounds great, Prasad.
>
> As long as I can further parse the filename field to piece out (new) 
> derived fields, I will be happy. J For example, in a later query I'd 
> like to be able to do something like:
>
> select
> substr(filename, 4, 7) as  class_A,
> substr(filename,  8, 10) as class_B
> count( x ) as cnt
> from FOO
> group by
> substr(filename, 4, 7),
> substr(filename,  8, 10) ;
>
>
> thanks,
> -A
>
>
>
> From: Prasad Chakka [mailto:pchakka@facebook.com]
> Sent: Wednesday, September 16, 2009 11:10 AM
> To: hive-user@hadoop.apache.org
> Subject: Re: adding filenames as new columns via Hive
>
> I think this can be a good feature though I would like the filename to 
> be a partition column (one of such) instead of a separate type of 
> column. Would that work?
>
> create external table FOO (  <list of fields and types> ) row format 
> delimited fields terminated by ','
> partitioned by (file_name FILENAME)
> stored as textfile location 's3:/somebucket/';
>
> Or table partitioned by datestamp and filename
>
> create external table FOO (  <list of fields and types> ) row format 
> delimited fields terminated by ','
> Partitioned by (ds STRING, file_name FILENAME) stored as textfile 
> location 's3:/somebucket/';
>
>
> So FILENAME becomes a new type. I like this because partition columns 
> are virtual columns just like the filename column and do not exist 
> along with data on the disk.
>
> Prasad
>
> ________________________________
>
> From: Avram Aelony <Av...@eharmony.com>
> Reply-To: <hi...@hadoop.apache.org>
> Date: Wed, 16 Sep 2009 10:48:33 -0700
> To: <hi...@hadoop.apache.org>
> Subject: adding filenames as new columns via Hive
>
> Dear Hive list,
>
> I am processing a large volume of files (many files, roughly 500M 
> compressed
> ) with Hive that reside in an S3 bucket.  Although the files share the 
> same schema,  they have individual filenames that provide useful 
> information that does not get captured and does not exist separately 
> as a column within each file's data.  As a general problem, I'd like 
> to be able to add a new column via Hive that contains the filename of 
> the files read in that were present in the bucket.
>
> My Hive CREATE EXTERNAL TABLE command points to the S3 container 
> bucket, and I am thinking that at some point Hadoop or Hive must have 
> a file handle with the filenames that perhaps could be of use.  My 
> hope is that this information could be added in (upon request) via 
> Hive.   Perhaps as this could be a new Hive feature request (if it does not currently exist) ??
>
> Ideally, the syntax would look something like this:
>
> create external table FOO (  <list of fields and types> ) row format 
> delimited fields terminated by ','
> add_filename as 'filename'
> stored as textfile location 's3:/somebucket/';
>
>
> Has anyone thought of this?  Is there a way to add a new column within 
> Hive that contains the filename?
>
>
>
> Many thanks in advance!!
> -Avram
>
>
>
> Avram Aelony
> Senior Analyst, Matching
> eHarmony.com

Re: adding filenames as new columns via Hive

Posted by Prasad Chakka <pc...@facebook.com>.
Yeah, udfs are better semantically. Will this be a problem for partition pruning?


________________________________
From: Ashish Thusoo <at...@facebook.com>
Reply-To: <hi...@hadoop.apache.org>
Date: Wed, 16 Sep 2009 12:11:00 -0700
To: <hi...@hadoop.apache.org>
Subject: RE: adding filenames as new columns via Hive

You could also do this as a simple udf instead of a virtual column. Virtual columns do get shown in the describe command and I don't think it would make sense to show this in the describe command. So instead of


Select FILENAME, xyz from T

We could just do

Select Filename(), xyz from T

Thoughts?

Ashish

-----Original Message-----
From: Edward Capriolo [mailto:edlinuxguru@gmail.com]
Sent: Wednesday, September 16, 2009 12:05 PM
To: hive-user@hadoop.apache.org
Subject: Re: adding filenames as new columns via Hive

I just put in a related thread about this. This would be really nice.
It is just a virtual column, we dont need it in the metadata if we also have a command like 'show files in partition' so we can inspect what is there as well.


On Wed, Sep 16, 2009 at 3:02 PM, Namit Jain <nj...@facebook.com> wrote:
> I don't think it is a good idea to make it a part of table metadata in
> any way.
>
> What happens if the filename changes ? It will be very difficult to
> maintain.
>
> But, we can definitely add some virtual columns (FILENAME can be one
> of them
>
> to start with - it should not show up in describe, select * etc.
>
>
>
> But, the user can query based on them - this is mostly for advanced
> users and
>
> can be used for pruning etc. also
>
>
>
>
>
> I will open a new jira, and we can continue the discussion there.
>
>
>
>
>
> -namit
>
>
>
>
>
>
>
>
>
> From: Avram Aelony [mailto:AvramAelony@eharmony.com]
> Sent: Wednesday, September 16, 2009 11:39 AM
> To: hive-user@hadoop.apache.org
> Subject: RE: adding filenames as new columns via Hive
>
>
>
>
>
> Very cool.  Looking forward to seeing this feature in action. J
>
>
>
> Thanks,
>
> -A
>
>
>
>
>
> From: Prasad Chakka [mailto:pchakka@facebook.com]
> Sent: Wednesday, September 16, 2009 11:33 AM
> To: hive-user@hadoop.apache.org
> Subject: Re: adding filenames as new columns via Hive
>
>
>
> FYI, all partition columns can be used as any regular columns select
> queries. So it should be fine.
>
> ________________________________
>
> From: Avram Aelony <Av...@eharmony.com>
> Reply-To: <hi...@hadoop.apache.org>
> Date: Wed, 16 Sep 2009 11:23:45 -0700
> To: <hi...@hadoop.apache.org>
> Subject: RE: adding filenames as new columns via Hive
>
> Sounds great, Prasad.
>
> As long as I can further parse the filename field to piece out (new)
> derived fields, I will be happy. J For example, in a later query I'd
> like to be able to do something like:
>
> select
> substr(filename, 4, 7) as  class_A,
> substr(filename,  8, 10) as class_B
> count( x ) as cnt
> from FOO
> group by
> substr(filename, 4, 7),
> substr(filename,  8, 10) ;
>
>
> thanks,
> -A
>
>
>
> From: Prasad Chakka [mailto:pchakka@facebook.com]
> Sent: Wednesday, September 16, 2009 11:10 AM
> To: hive-user@hadoop.apache.org
> Subject: Re: adding filenames as new columns via Hive
>
> I think this can be a good feature though I would like the filename to
> be a partition column (one of such) instead of a separate type of
> column. Would that work?
>
> create external table FOO (  <list of fields and types> ) row format
> delimited fields terminated by ','
> partitioned by (file_name FILENAME)
> stored as textfile location 's3:/somebucket/';
>
> Or table partitioned by datestamp and filename
>
> create external table FOO (  <list of fields and types> ) row format
> delimited fields terminated by ','
> Partitioned by (ds STRING, file_name FILENAME) stored as textfile
> location 's3:/somebucket/';
>
>
> So FILENAME becomes a new type. I like this because partition columns
> are virtual columns just like the filename column and do not exist
> along with data on the disk.
>
> Prasad
>
> ________________________________
>
> From: Avram Aelony <Av...@eharmony.com>
> Reply-To: <hi...@hadoop.apache.org>
> Date: Wed, 16 Sep 2009 10:48:33 -0700
> To: <hi...@hadoop.apache.org>
> Subject: adding filenames as new columns via Hive
>
> Dear Hive list,
>
> I am processing a large volume of files (many files, roughly 500M
> compressed
> ) with Hive that reside in an S3 bucket.  Although the files share the
> same schema,  they have individual filenames that provide useful
> information that does not get captured and does not exist separately
> as a column within each file's data.  As a general problem, I'd like
> to be able to add a new column via Hive that contains the filename of
> the files read in that were present in the bucket.
>
> My Hive CREATE EXTERNAL TABLE command points to the S3 container
> bucket, and I am thinking that at some point Hadoop or Hive must have
> a file handle with the filenames that perhaps could be of use.  My
> hope is that this information could be added in (upon request) via
> Hive.   Perhaps as this could be a new Hive feature request (if it does not currently exist) ??
>
> Ideally, the syntax would look something like this:
>
> create external table FOO (  <list of fields and types> ) row format
> delimited fields terminated by ','
> add_filename as 'filename'
> stored as textfile location 's3:/somebucket/';
>
>
> Has anyone thought of this?  Is there a way to add a new column within
> Hive that contains the filename?
>
>
>
> Many thanks in advance!!
> -Avram
>
>
>
> Avram Aelony
> Senior Analyst, Matching
> eHarmony.com


RE: adding filenames as new columns via Hive

Posted by Ashish Thusoo <at...@facebook.com>.
You could also do this as a simple udf instead of a virtual column. Virtual columns do get shown in the describe command and I don't think it would make sense to show this in the describe command. So instead of


Select FILENAME, xyz from T

We could just do

Select Filename(), xyz from T

Thoughts?

Ashish 

-----Original Message-----
From: Edward Capriolo [mailto:edlinuxguru@gmail.com] 
Sent: Wednesday, September 16, 2009 12:05 PM
To: hive-user@hadoop.apache.org
Subject: Re: adding filenames as new columns via Hive

I just put in a related thread about this. This would be really nice.
It is just a virtual column, we dont need it in the metadata if we also have a command like 'show files in partition' so we can inspect what is there as well.


On Wed, Sep 16, 2009 at 3:02 PM, Namit Jain <nj...@facebook.com> wrote:
> I don't think it is a good idea to make it a part of table metadata in 
> any way.
>
> What happens if the filename changes ? It will be very difficult to 
> maintain.
>
> But, we can definitely add some virtual columns (FILENAME can be one 
> of them
>
> to start with - it should not show up in describe, select * etc.
>
>
>
> But, the user can query based on them - this is mostly for advanced 
> users and
>
> can be used for pruning etc. also
>
>
>
>
>
> I will open a new jira, and we can continue the discussion there.
>
>
>
>
>
> -namit
>
>
>
>
>
>
>
>
>
> From: Avram Aelony [mailto:AvramAelony@eharmony.com]
> Sent: Wednesday, September 16, 2009 11:39 AM
> To: hive-user@hadoop.apache.org
> Subject: RE: adding filenames as new columns via Hive
>
>
>
>
>
> Very cool.  Looking forward to seeing this feature in action. J
>
>
>
> Thanks,
>
> -A
>
>
>
>
>
> From: Prasad Chakka [mailto:pchakka@facebook.com]
> Sent: Wednesday, September 16, 2009 11:33 AM
> To: hive-user@hadoop.apache.org
> Subject: Re: adding filenames as new columns via Hive
>
>
>
> FYI, all partition columns can be used as any regular columns select 
> queries. So it should be fine.
>
> ________________________________
>
> From: Avram Aelony <Av...@eharmony.com>
> Reply-To: <hi...@hadoop.apache.org>
> Date: Wed, 16 Sep 2009 11:23:45 -0700
> To: <hi...@hadoop.apache.org>
> Subject: RE: adding filenames as new columns via Hive
>
> Sounds great, Prasad.
>
> As long as I can further parse the filename field to piece out (new) 
> derived fields, I will be happy. J For example, in a later query I'd 
> like to be able to do something like:
>
> select
> substr(filename, 4, 7) as  class_A,
> substr(filename,  8, 10) as class_B
> count( x ) as cnt
> from FOO
> group by
> substr(filename, 4, 7),
> substr(filename,  8, 10) ;
>
>
> thanks,
> -A
>
>
>
> From: Prasad Chakka [mailto:pchakka@facebook.com]
> Sent: Wednesday, September 16, 2009 11:10 AM
> To: hive-user@hadoop.apache.org
> Subject: Re: adding filenames as new columns via Hive
>
> I think this can be a good feature though I would like the filename to 
> be a partition column (one of such) instead of a separate type of 
> column. Would that work?
>
> create external table FOO (  <list of fields and types> ) row format 
> delimited fields terminated by ','
> partitioned by (file_name FILENAME)
> stored as textfile location 's3:/somebucket/';
>
> Or table partitioned by datestamp and filename
>
> create external table FOO (  <list of fields and types> ) row format 
> delimited fields terminated by ','
> Partitioned by (ds STRING, file_name FILENAME) stored as textfile 
> location 's3:/somebucket/';
>
>
> So FILENAME becomes a new type. I like this because partition columns 
> are virtual columns just like the filename column and do not exist 
> along with data on the disk.
>
> Prasad
>
> ________________________________
>
> From: Avram Aelony <Av...@eharmony.com>
> Reply-To: <hi...@hadoop.apache.org>
> Date: Wed, 16 Sep 2009 10:48:33 -0700
> To: <hi...@hadoop.apache.org>
> Subject: adding filenames as new columns via Hive
>
> Dear Hive list,
>
> I am processing a large volume of files (many files, roughly 500M 
> compressed
> ) with Hive that reside in an S3 bucket.  Although the files share the 
> same schema,  they have individual filenames that provide useful 
> information that does not get captured and does not exist separately 
> as a column within each file's data.  As a general problem, I'd like 
> to be able to add a new column via Hive that contains the filename of 
> the files read in that were present in the bucket.
>
> My Hive CREATE EXTERNAL TABLE command points to the S3 container 
> bucket, and I am thinking that at some point Hadoop or Hive must have 
> a file handle with the filenames that perhaps could be of use.  My 
> hope is that this information could be added in (upon request) via 
> Hive.   Perhaps as this could be a new Hive feature request (if it does not currently exist) ??
>
> Ideally, the syntax would look something like this:
>
> create external table FOO (  <list of fields and types> ) row format 
> delimited fields terminated by ','
> add_filename as 'filename'
> stored as textfile location 's3:/somebucket/';
>
>
> Has anyone thought of this?  Is there a way to add a new column within 
> Hive that contains the filename?
>
>
>
> Many thanks in advance!!
> -Avram
>
>
>
> Avram Aelony
> Senior Analyst, Matching
> eHarmony.com

Re: adding filenames as new columns via Hive

Posted by Edward Capriolo <ed...@gmail.com>.
I just put in a related thread about this. This would be really nice.
It is just a virtual column, we dont need it in the metadata if we
also have a command like 'show files in partition' so we can inspect
what is there as well.


On Wed, Sep 16, 2009 at 3:02 PM, Namit Jain <nj...@facebook.com> wrote:
> I don’t think it is a good idea to make it a part of table metadata in any
> way.
>
> What happens if the filename changes ? It will be very difficult to
> maintain.
>
> But, we can definitely add some virtual columns (FILENAME can be one of them
>
> to start with – it should not show up in describe, select * etc.
>
>
>
> But, the user can query based on them – this is mostly for advanced users
> and
>
> can be used for pruning etc. also
>
>
>
>
>
> I will open a new jira, and we can continue the discussion there.
>
>
>
>
>
> -namit
>
>
>
>
>
>
>
>
>
> From: Avram Aelony [mailto:AvramAelony@eharmony.com]
> Sent: Wednesday, September 16, 2009 11:39 AM
> To: hive-user@hadoop.apache.org
> Subject: RE: adding filenames as new columns via Hive
>
>
>
>
>
> Very cool.  Looking forward to seeing this feature in action… J
>
>
>
> Thanks,
>
> -A
>
>
>
>
>
> From: Prasad Chakka [mailto:pchakka@facebook.com]
> Sent: Wednesday, September 16, 2009 11:33 AM
> To: hive-user@hadoop.apache.org
> Subject: Re: adding filenames as new columns via Hive
>
>
>
> FYI, all partition columns can be used as any regular columns select
> queries. So it should be fine.
>
> ________________________________
>
> From: Avram Aelony <Av...@eharmony.com>
> Reply-To: <hi...@hadoop.apache.org>
> Date: Wed, 16 Sep 2009 11:23:45 -0700
> To: <hi...@hadoop.apache.org>
> Subject: RE: adding filenames as new columns via Hive
>
> Sounds great, Prasad.
>
> As long as I can further parse the filename field to piece out (new) derived
> fields, I will be happy… J
> For example, in a later query I’d like to be able to do something like:
>
> select
> substr(filename, 4, 7) as  class_A,
> substr(filename,  8, 10) as class_B
> count( x ) as cnt
> from FOO
> group by
> substr(filename, 4, 7),
> substr(filename,  8, 10) ;
>
>
> thanks,
> -A
>
>
>
> From: Prasad Chakka [mailto:pchakka@facebook.com]
> Sent: Wednesday, September 16, 2009 11:10 AM
> To: hive-user@hadoop.apache.org
> Subject: Re: adding filenames as new columns via Hive
>
> I think this can be a good feature though I would like the filename to be a
> partition column (one of such) instead of a separate type of column. Would
> that work?
>
> create external table FOO (  <list of fields and types> )
> row format delimited fields terminated by ','
> partitioned by (file_name FILENAME)
> stored as textfile location 's3:/somebucket/’;
>
> Or table partitioned by datestamp and filename
>
> create external table FOO (  <list of fields and types> )
> row format delimited fields terminated by ','
> Partitioned by (ds STRING, file_name FILENAME)
> stored as textfile location 's3:/somebucket/’;
>
>
> So FILENAME becomes a new type. I like this because partition columns are
> virtual columns just like the filename column and do not exist along with
> data on the disk.
>
> Prasad
>
> ________________________________
>
> From: Avram Aelony <Av...@eharmony.com>
> Reply-To: <hi...@hadoop.apache.org>
> Date: Wed, 16 Sep 2009 10:48:33 -0700
> To: <hi...@hadoop.apache.org>
> Subject: adding filenames as new columns via Hive
>
> Dear Hive list,
>
> I am processing a large volume of files (many files, roughly 500M compressed
> ) with Hive that reside in an S3 bucket.  Although the files share the same
> schema,  they have individual filenames that provide useful information that
> does not get captured and does not exist separately as a column within each
> file’s data.  As a general problem, I’d like to be able to add a new column
> via Hive that contains the filename of the files read in that were present
> in the bucket.
>
> My Hive CREATE EXTERNAL TABLE command points to the S3 container bucket, and
> I am thinking that at some point Hadoop or Hive must have a file handle with
> the filenames that perhaps could be of use.  My hope is that this
> information could be added in (upon request) via Hive.   Perhaps as this
> could be a new Hive feature request (if it does not currently exist) ??
>
> Ideally, the syntax would look something like this:
>
> create external table FOO (  <list of fields and types> )
> row format delimited fields terminated by ','
> add_filename as ‘filename’
> stored as textfile location 's3:/somebucket/’;
>
>
> Has anyone thought of this?  Is there a way to add a new column within Hive
> that contains the filename?
>
>
>
> Many thanks in advance!!
> -Avram
>
>
>
> Avram Aelony
> Senior Analyst, Matching
> eHarmony.com

RE: adding filenames as new columns via Hive

Posted by Namit Jain <nj...@facebook.com>.
I don't think it is a good idea to make it a part of table metadata in any way.
What happens if the filename changes ? It will be very difficult to maintain.
But, we can definitely add some virtual columns (FILENAME can be one of them
to start with - it should not show up in describe, select * etc.

But, the user can query based on them - this is mostly for advanced users and
can be used for pruning etc. also


I will open a new jira, and we can continue the discussion there.


-namit




From: Avram Aelony [mailto:AvramAelony@eharmony.com]
Sent: Wednesday, September 16, 2009 11:39 AM
To: hive-user@hadoop.apache.org
Subject: RE: adding filenames as new columns via Hive


Very cool.  Looking forward to seeing this feature in action... :)

Thanks,
-A


From: Prasad Chakka [mailto:pchakka@facebook.com]
Sent: Wednesday, September 16, 2009 11:33 AM
To: hive-user@hadoop.apache.org
Subject: Re: adding filenames as new columns via Hive

FYI, all partition columns can be used as any regular columns select queries. So it should be fine.
________________________________
From: Avram Aelony <Av...@eharmony.com>
Reply-To: <hi...@hadoop.apache.org>
Date: Wed, 16 Sep 2009 11:23:45 -0700
To: <hi...@hadoop.apache.org>
Subject: RE: adding filenames as new columns via Hive

Sounds great, Prasad.

As long as I can further parse the filename field to piece out (new) derived fields, I will be happy... :)
For example, in a later query I'd like to be able to do something like:

select
substr(filename, 4, 7) as  class_A,
substr(filename,  8, 10) as class_B
count( x ) as cnt
from FOO
group by
substr(filename, 4, 7),
substr(filename,  8, 10) ;


thanks,
-A



From: Prasad Chakka [mailto:pchakka@facebook.com]
Sent: Wednesday, September 16, 2009 11:10 AM
To: hive-user@hadoop.apache.org
Subject: Re: adding filenames as new columns via Hive

I think this can be a good feature though I would like the filename to be a partition column (one of such) instead of a separate type of column. Would that work?

create external table FOO (  <list of fields and types> )
row format delimited fields terminated by ','
partitioned by (file_name FILENAME)
stored as textfile location 's3:/somebucket/';

Or table partitioned by datestamp and filename

create external table FOO (  <list of fields and types> )
row format delimited fields terminated by ','
Partitioned by (ds STRING, file_name FILENAME)
stored as textfile location 's3:/somebucket/';


So FILENAME becomes a new type. I like this because partition columns are virtual columns just like the filename column and do not exist along with data on the disk.

Prasad
________________________________

From: Avram Aelony <Av...@eharmony.com>
Reply-To: <hi...@hadoop.apache.org>
Date: Wed, 16 Sep 2009 10:48:33 -0700
To: <hi...@hadoop.apache.org>
Subject: adding filenames as new columns via Hive

Dear Hive list,

I am processing a large volume of files (many files, roughly 500M compressed ) with Hive that reside in an S3 bucket.  Although the files share the same schema,  they have individual filenames that provide useful information that does not get captured and does not exist separately as a column within each file's data.  As a general problem, I'd like to be able to add a new column via Hive that contains the filename of the files read in that were present in the bucket.

My Hive CREATE EXTERNAL TABLE command points to the S3 container bucket, and I am thinking that at some point Hadoop or Hive must have a file handle with the filenames that perhaps could be of use.  My hope is that this information could be added in (upon request) via Hive.   Perhaps as this could be a new Hive feature request (if it does not currently exist) ??

Ideally, the syntax would look something like this:

create external table FOO (  <list of fields and types> )
row format delimited fields terminated by ','
add_filename as 'filename'
stored as textfile location 's3:/somebucket/';


Has anyone thought of this?  Is there a way to add a new column within Hive that contains the filename?



Many thanks in advance!!
-Avram



Avram Aelony
Senior Analyst, Matching
eHarmony.com

RE: adding filenames as new columns via Hive

Posted by Avram Aelony <Av...@eharmony.com>.
Very cool.  Looking forward to seeing this feature in action... :)

Thanks,
-A


From: Prasad Chakka [mailto:pchakka@facebook.com]
Sent: Wednesday, September 16, 2009 11:33 AM
To: hive-user@hadoop.apache.org
Subject: Re: adding filenames as new columns via Hive

FYI, all partition columns can be used as any regular columns select queries. So it should be fine.

________________________________
From: Avram Aelony <Av...@eharmony.com>
Reply-To: <hi...@hadoop.apache.org>
Date: Wed, 16 Sep 2009 11:23:45 -0700
To: <hi...@hadoop.apache.org>
Subject: RE: adding filenames as new columns via Hive

Sounds great, Prasad.

As long as I can further parse the filename field to piece out (new) derived fields, I will be happy... :)
For example, in a later query I'd like to be able to do something like:

select
substr(filename, 4, 7) as  class_A,
substr(filename,  8, 10) as class_B
count( x ) as cnt
from FOO
group by
substr(filename, 4, 7),
substr(filename,  8, 10) ;


thanks,
-A



From: Prasad Chakka [mailto:pchakka@facebook.com]
Sent: Wednesday, September 16, 2009 11:10 AM
To: hive-user@hadoop.apache.org
Subject: Re: adding filenames as new columns via Hive

I think this can be a good feature though I would like the filename to be a partition column (one of such) instead of a separate type of column. Would that work?

create external table FOO (  <list of fields and types> )
row format delimited fields terminated by ','
partitioned by (file_name FILENAME)
stored as textfile location 's3:/somebucket/';

Or table partitioned by datestamp and filename

create external table FOO (  <list of fields and types> )
row format delimited fields terminated by ','
Partitioned by (ds STRING, file_name FILENAME)
stored as textfile location 's3:/somebucket/';


So FILENAME becomes a new type. I like this because partition columns are virtual columns just like the filename column and do not exist along with data on the disk.

Prasad
________________________________

From: Avram Aelony <Av...@eharmony.com>
Reply-To: <hi...@hadoop.apache.org>
Date: Wed, 16 Sep 2009 10:48:33 -0700
To: <hi...@hadoop.apache.org>
Subject: adding filenames as new columns via Hive

Dear Hive list,

I am processing a large volume of files (many files, roughly 500M compressed ) with Hive that reside in an S3 bucket.  Although the files share the same schema,  they have individual filenames that provide useful information that does not get captured and does not exist separately as a column within each file's data.  As a general problem, I'd like to be able to add a new column via Hive that contains the filename of the files read in that were present in the bucket.

My Hive CREATE EXTERNAL TABLE command points to the S3 container bucket, and I am thinking that at some point Hadoop or Hive must have a file handle with the filenames that perhaps could be of use.  My hope is that this information could be added in (upon request) via Hive.   Perhaps as this could be a new Hive feature request (if it does not currently exist) ??

Ideally, the syntax would look something like this:

create external table FOO (  <list of fields and types> )
row format delimited fields terminated by ','
add_filename as 'filename'
stored as textfile location 's3:/somebucket/';


Has anyone thought of this?  Is there a way to add a new column within Hive that contains the filename?



Many thanks in advance!!
-Avram



Avram Aelony
Senior Analyst, Matching
eHarmony.com


Re: adding filenames as new columns via Hive

Posted by Prasad Chakka <pc...@facebook.com>.
FYI, all partition columns can be used as any regular columns select queries. So it should be fine.


________________________________
From: Avram Aelony <Av...@eharmony.com>
Reply-To: <hi...@hadoop.apache.org>
Date: Wed, 16 Sep 2009 11:23:45 -0700
To: <hi...@hadoop.apache.org>
Subject: RE: adding filenames as new columns via Hive

Sounds great, Prasad.

As long as I can further parse the filename field to piece out (new) derived fields, I will be happy... :)
For example, in a later query I'd like to be able to do something like:

select
substr(filename, 4, 7) as  class_A,
substr(filename,  8, 10) as class_B
count( x ) as cnt
from FOO
group by
substr(filename, 4, 7),
substr(filename,  8, 10) ;


thanks,
-A



From: Prasad Chakka [mailto:pchakka@facebook.com]
Sent: Wednesday, September 16, 2009 11:10 AM
To: hive-user@hadoop.apache.org
Subject: Re: adding filenames as new columns via Hive

I think this can be a good feature though I would like the filename to be a partition column (one of such) instead of a separate type of column. Would that work?

create external table FOO (  <list of fields and types> )
row format delimited fields terminated by ','
partitioned by (file_name FILENAME)
stored as textfile location 's3:/somebucket/';

Or table partitioned by datestamp and filename

create external table FOO (  <list of fields and types> )
row format delimited fields terminated by ','
Partitioned by (ds STRING, file_name FILENAME)
stored as textfile location 's3:/somebucket/';


So FILENAME becomes a new type. I like this because partition columns are virtual columns just like the filename column and do not exist along with data on the disk.

Prasad

________________________________

From: Avram Aelony <Av...@eharmony.com>
Reply-To: <hi...@hadoop.apache.org>
Date: Wed, 16 Sep 2009 10:48:33 -0700
To: <hi...@hadoop.apache.org>
Subject: adding filenames as new columns via Hive

Dear Hive list,

I am processing a large volume of files (many files, roughly 500M compressed ) with Hive that reside in an S3 bucket.  Although the files share the same schema,  they have individual filenames that provide useful information that does not get captured and does not exist separately as a column within each file's data.  As a general problem, I'd like to be able to add a new column via Hive that contains the filename of the files read in that were present in the bucket.

My Hive CREATE EXTERNAL TABLE command points to the S3 container bucket, and I am thinking that at some point Hadoop or Hive must have a file handle with the filenames that perhaps could be of use.  My hope is that this information could be added in (upon request) via Hive.   Perhaps as this could be a new Hive feature request (if it does not currently exist) ??

Ideally, the syntax would look something like this:

create external table FOO (  <list of fields and types> )
row format delimited fields terminated by ','
add_filename as 'filename'
stored as textfile location 's3:/somebucket/';


Has anyone thought of this?  Is there a way to add a new column within Hive that contains the filename?



Many thanks in advance!!
-Avram



Avram Aelony
Senior Analyst, Matching
eHarmony.com



RE: adding filenames as new columns via Hive

Posted by Avram Aelony <Av...@eharmony.com>.
Sounds great, Prasad.

As long as I can further parse the filename field to piece out (new) derived fields, I will be happy... :)
For example, in a later query I'd like to be able to do something like:

select
substr(filename, 4, 7) as  class_A,
substr(filename,  8, 10) as class_B
count( x ) as cnt
from FOO
group by
substr(filename, 4, 7),
substr(filename,  8, 10) ;


thanks,
-A


From: Prasad Chakka [mailto:pchakka@facebook.com]
Sent: Wednesday, September 16, 2009 11:10 AM
To: hive-user@hadoop.apache.org
Subject: Re: adding filenames as new columns via Hive

I think this can be a good feature though I would like the filename to be a partition column (one of such) instead of a separate type of column. Would that work?

create external table FOO (  <list of fields and types> )
row format delimited fields terminated by ','
partitioned by (file_name FILENAME)
stored as textfile location 's3:/somebucket/';

Or table partitioned by datestamp and filename

create external table FOO (  <list of fields and types> )
row format delimited fields terminated by ','
Partitioned by (ds STRING, file_name FILENAME)
stored as textfile location 's3:/somebucket/';


So FILENAME becomes a new type. I like this because partition columns are virtual columns just like the filename column and do not exist along with data on the disk.

Prasad
________________________________
From: Avram Aelony <Av...@eharmony.com>
Reply-To: <hi...@hadoop.apache.org>
Date: Wed, 16 Sep 2009 10:48:33 -0700
To: <hi...@hadoop.apache.org>
Subject: adding filenames as new columns via Hive

Dear Hive list,

I am processing a large volume of files (many files, roughly 500M compressed ) with Hive that reside in an S3 bucket.  Although the files share the same schema,  they have individual filenames that provide useful information that does not get captured and does not exist separately as a column within each file's data.  As a general problem, I'd like to be able to add a new column via Hive that contains the filename of the files read in that were present in the bucket.

My Hive CREATE EXTERNAL TABLE command points to the S3 container bucket, and I am thinking that at some point Hadoop or Hive must have a file handle with the filenames that perhaps could be of use.  My hope is that this information could be added in (upon request) via Hive.   Perhaps as this could be a new Hive feature request (if it does not currently exist) ??

Ideally, the syntax would look something like this:

create external table FOO (  <list of fields and types> )
row format delimited fields terminated by ','
add_filename as 'filename'
stored as textfile location 's3:/somebucket/';


Has anyone thought of this?  Is there a way to add a new column within Hive that contains the filename?



Many thanks in advance!!
-Avram



Avram Aelony
Senior Analyst, Matching
eHarmony.com


Re: adding filenames as new columns via Hive

Posted by Prasad Chakka <pc...@facebook.com>.
I think this can be a good feature though I would like the filename to be a partition column (one of such) instead of a separate type of column. Would that work?

create external table FOO (  <list of fields and types> )
row format delimited fields terminated by ','
partitioned by (file_name FILENAME)
stored as textfile location 's3:/somebucket/';

Or table partitioned by datestamp and filename

create external table FOO (  <list of fields and types> )
row format delimited fields terminated by ','
Partitioned by (ds STRING, file_name FILENAME)
stored as textfile location 's3:/somebucket/';


So FILENAME becomes a new type. I like this because partition columns are virtual columns just like the filename column and do not exist along with data on the disk.

Prasad
________________________________
From: Avram Aelony <Av...@eharmony.com>
Reply-To: <hi...@hadoop.apache.org>
Date: Wed, 16 Sep 2009 10:48:33 -0700
To: <hi...@hadoop.apache.org>
Subject: adding filenames as new columns via Hive

Dear Hive list,

I am processing a large volume of files (many files, roughly 500M compressed ) with Hive that reside in an S3 bucket.  Although the files share the same schema,  they have individual filenames that provide useful information that does not get captured and does not exist separately as a column within each file's data.  As a general problem, I'd like to be able to add a new column via Hive that contains the filename of the files read in that were present in the bucket.

My Hive CREATE EXTERNAL TABLE command points to the S3 container bucket, and I am thinking that at some point Hadoop or Hive must have a file handle with the filenames that perhaps could be of use.  My hope is that this information could be added in (upon request) via Hive.   Perhaps as this could be a new Hive feature request (if it does not currently exist) ??

Ideally, the syntax would look something like this:

create external table FOO (  <list of fields and types> )
row format delimited fields terminated by ','
add_filename as 'filename'
stored as textfile location 's3:/somebucket/';


Has anyone thought of this?  Is there a way to add a new column within Hive that contains the filename?



Many thanks in advance!!
-Avram



Avram Aelony
Senior Analyst, Matching
eHarmony.com