You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@hive.apache.org by "Namit Jain (JIRA)" <ji...@apache.org> on 2009/09/16 21:05:57 UTC

[jira] Created: (HIVE-837) virtual column support (filename) in hive

virtual column support (filename) in hive
-----------------------------------------

                 Key: HIVE-837
                 URL: https://issues.apache.org/jira/browse/HIVE-837
             Project: Hadoop Hive
          Issue Type: New Feature
          Components: Query Processor
            Reporter: Namit Jain


Copying from some mails:


I am dumping files into a hive partion on five minute intervals. I am using LOAD DATA into a partition.

weblogs
web1.00
web1.05
web1.10
...
web2.00
web2.05
web1.10
....

Things that would be useful..

Select files from the folder with a regex or exact name

select * FROM logs where FILENAME LIKE(WEB1*)

select * FROM LOGS WHERE FILENAME=web2.00

Also it would be nice to be able to select offsets in a file, this would make sense with appends

select * from logs WHERE FILENAME=web2.00 FROMOFFSET=454644 [TOOFFSET=]




select  
substr(filename, 4, 7) as  class_A, 
substr(filename,  8, 10) as class_B
count( x ) as cnt
from FOO
group by
substr(filename, 4, 7), 
substr(filename,  8, 10) ;



Hive should support virtual columns

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Assigned: (HIVE-837) virtual column support (filename) in hive

Posted by "John Sichi (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Sichi reassigned HIVE-837:
-------------------------------

    Assignee: He Yongqiang

> virtual column support (filename) in hive
> -----------------------------------------
>
>                 Key: HIVE-837
>                 URL: https://issues.apache.org/jira/browse/HIVE-837
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: He Yongqiang
>
> Copying from some mails:
> I am dumping files into a hive partion on five minute intervals. I am using LOAD DATA into a partition.
> weblogs
> web1.00
> web1.05
> web1.10
> ...
> web2.00
> web2.05
> web1.10
> ....
> Things that would be useful..
> Select files from the folder with a regex or exact name
> select * FROM logs where FILENAME LIKE(WEB1*)
> select * FROM LOGS WHERE FILENAME=web2.00
> Also it would be nice to be able to select offsets in a file, this would make sense with appends
> select * from logs WHERE FILENAME=web2.00 FROMOFFSET=454644 [TOOFFSET=]
> select  
> substr(filename, 4, 7) as  class_A, 
> substr(filename,  8, 10) as class_B
> count( x ) as cnt
> from FOO
> group by
> substr(filename, 4, 7), 
> substr(filename,  8, 10) ;
> Hive should support virtual columns

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-837) virtual column support (filename) in hive

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790287#action_12790287 ] 

Carl Steinbach commented on HIVE-837:
-------------------------------------

I agree. I think most people will want to use this in conjunction with external tables, and in that
case they are really interested in the actual filename as opposed to the directory name.

> virtual column support (filename) in hive
> -----------------------------------------
>
>                 Key: HIVE-837
>                 URL: https://issues.apache.org/jira/browse/HIVE-837
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Namit Jain
>
> Copying from some mails:
> I am dumping files into a hive partion on five minute intervals. I am using LOAD DATA into a partition.
> weblogs
> web1.00
> web1.05
> web1.10
> ...
> web2.00
> web2.05
> web1.10
> ....
> Things that would be useful..
> Select files from the folder with a regex or exact name
> select * FROM logs where FILENAME LIKE(WEB1*)
> select * FROM LOGS WHERE FILENAME=web2.00
> Also it would be nice to be able to select offsets in a file, this would make sense with appends
> select * from logs WHERE FILENAME=web2.00 FROMOFFSET=454644 [TOOFFSET=]
> select  
> substr(filename, 4, 7) as  class_A, 
> substr(filename,  8, 10) as class_B
> count( x ) as cnt
> from FOO
> group by
> substr(filename, 4, 7), 
> substr(filename,  8, 10) ;
> Hive should support virtual columns

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-837) virtual column support (filename) in hive

Posted by "John Sichi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894714#action_12894714 ] 

John Sichi commented on HIVE-837:
---------------------------------

Yongqiang added the following virtual columns as part of HIVE-417:

INPUT__FILE__NAME
BLOCK__OFFSET__INSIDE__FILE

Those need to be documented, and we may still need a fine-grained row offset.


> virtual column support (filename) in hive
> -----------------------------------------
>
>                 Key: HIVE-837
>                 URL: https://issues.apache.org/jira/browse/HIVE-837
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Namit Jain
>
> Copying from some mails:
> I am dumping files into a hive partion on five minute intervals. I am using LOAD DATA into a partition.
> weblogs
> web1.00
> web1.05
> web1.10
> ...
> web2.00
> web2.05
> web1.10
> ....
> Things that would be useful..
> Select files from the folder with a regex or exact name
> select * FROM logs where FILENAME LIKE(WEB1*)
> select * FROM LOGS WHERE FILENAME=web2.00
> Also it would be nice to be able to select offsets in a file, this would make sense with appends
> select * from logs WHERE FILENAME=web2.00 FROMOFFSET=454644 [TOOFFSET=]
> select  
> substr(filename, 4, 7) as  class_A, 
> substr(filename,  8, 10) as class_B
> count( x ) as cnt
> from FOO
> group by
> substr(filename, 4, 7), 
> substr(filename,  8, 10) ;
> Hive should support virtual columns

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-837) virtual column support (filename) in hive

Posted by "Edward Capriolo (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756185#action_12756185 ] 

Edward Capriolo commented on HIVE-837:
--------------------------------------

also "describe partition show files" would be useful.

> virtual column support (filename) in hive
> -----------------------------------------
>
>                 Key: HIVE-837
>                 URL: https://issues.apache.org/jira/browse/HIVE-837
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Namit Jain
>
> Copying from some mails:
> I am dumping files into a hive partion on five minute intervals. I am using LOAD DATA into a partition.
> weblogs
> web1.00
> web1.05
> web1.10
> ...
> web2.00
> web2.05
> web1.10
> ....
> Things that would be useful..
> Select files from the folder with a regex or exact name
> select * FROM logs where FILENAME LIKE(WEB1*)
> select * FROM LOGS WHERE FILENAME=web2.00
> Also it would be nice to be able to select offsets in a file, this would make sense with appends
> select * from logs WHERE FILENAME=web2.00 FROMOFFSET=454644 [TOOFFSET=]
> select  
> substr(filename, 4, 7) as  class_A, 
> substr(filename,  8, 10) as class_B
> count( x ) as cnt
> from FOO
> group by
> substr(filename, 4, 7), 
> substr(filename,  8, 10) ;
> Hive should support virtual columns

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-837) virtual column support (filename) in hive

Posted by "Edward Capriolo (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756563#action_12756563 ] 

Edward Capriolo commented on HIVE-837:
--------------------------------------

It would be nice and very useful. sometimes I want to select my own 'partition' or 'datafile' explicitly.. something like below:

SELECT *  FROM weblogs PARTITION ('2009-09-17', '2009-09-18') WHERE col1='..' and col2= ...

Or users can select data files from directory:

SELECT *  FROM weblogs DATAFILE ('log1.txt', 'log2.txt') WHERE col1='..' and col2= ...

> virtual column support (filename) in hive
> -----------------------------------------
>
>                 Key: HIVE-837
>                 URL: https://issues.apache.org/jira/browse/HIVE-837
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Namit Jain
>
> Copying from some mails:
> I am dumping files into a hive partion on five minute intervals. I am using LOAD DATA into a partition.
> weblogs
> web1.00
> web1.05
> web1.10
> ...
> web2.00
> web2.05
> web1.10
> ....
> Things that would be useful..
> Select files from the folder with a regex or exact name
> select * FROM logs where FILENAME LIKE(WEB1*)
> select * FROM LOGS WHERE FILENAME=web2.00
> Also it would be nice to be able to select offsets in a file, this would make sense with appends
> select * from logs WHERE FILENAME=web2.00 FROMOFFSET=454644 [TOOFFSET=]
> select  
> substr(filename, 4, 7) as  class_A, 
> substr(filename,  8, 10) as class_B
> count( x ) as cnt
> from FOO
> group by
> substr(filename, 4, 7), 
> substr(filename,  8, 10) ;
> Hive should support virtual columns

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-837) virtual column support (filename) in hive

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12765394#action_12765394 ] 

He Yongqiang commented on HIVE-837:
-----------------------------------

Support filename as a virtual column is more easier for file pruning, and a udf called filename() will allow capturing file name in query.
i think both are very useful and we should support both.

> virtual column support (filename) in hive
> -----------------------------------------
>
>                 Key: HIVE-837
>                 URL: https://issues.apache.org/jira/browse/HIVE-837
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Namit Jain
>
> Copying from some mails:
> I am dumping files into a hive partion on five minute intervals. I am using LOAD DATA into a partition.
> weblogs
> web1.00
> web1.05
> web1.10
> ...
> web2.00
> web2.05
> web1.10
> ....
> Things that would be useful..
> Select files from the folder with a regex or exact name
> select * FROM logs where FILENAME LIKE(WEB1*)
> select * FROM LOGS WHERE FILENAME=web2.00
> Also it would be nice to be able to select offsets in a file, this would make sense with appends
> select * from logs WHERE FILENAME=web2.00 FROMOFFSET=454644 [TOOFFSET=]
> select  
> substr(filename, 4, 7) as  class_A, 
> substr(filename,  8, 10) as class_B
> count( x ) as cnt
> from FOO
> group by
> substr(filename, 4, 7), 
> substr(filename,  8, 10) ;
> Hive should support virtual columns

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-837) virtual column support (filename) in hive

Posted by "Prasad Chakka (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756658#action_12756658 ] 

Prasad Chakka commented on HIVE-837:
------------------------------------

buckets have other semantic meaning which is not the case for files so we should not lump buckets with meta/virtual columns. we could possibly add a virtual column/udf called bucket() for that.

mysql gives lot of virtual data as udfs (curtime(), database(), current_user(), default(column)) etc instead of virtual columns. i think it makes sense to make them udfs just incase some virtual columns need arguments.

> virtual column support (filename) in hive
> -----------------------------------------
>
>                 Key: HIVE-837
>                 URL: https://issues.apache.org/jira/browse/HIVE-837
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Namit Jain
>
> Copying from some mails:
> I am dumping files into a hive partion on five minute intervals. I am using LOAD DATA into a partition.
> weblogs
> web1.00
> web1.05
> web1.10
> ...
> web2.00
> web2.05
> web1.10
> ....
> Things that would be useful..
> Select files from the folder with a regex or exact name
> select * FROM logs where FILENAME LIKE(WEB1*)
> select * FROM LOGS WHERE FILENAME=web2.00
> Also it would be nice to be able to select offsets in a file, this would make sense with appends
> select * from logs WHERE FILENAME=web2.00 FROMOFFSET=454644 [TOOFFSET=]
> select  
> substr(filename, 4, 7) as  class_A, 
> substr(filename,  8, 10) as class_B
> count( x ) as cnt
> from FOO
> group by
> substr(filename, 4, 7), 
> substr(filename,  8, 10) ;
> Hive should support virtual columns

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-837) virtual column support (filename) in hive

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790277#action_12790277 ] 

Namit Jain commented on HIVE-837:
---------------------------------

You are right, but we need the association between the row and the actual file.
Directory is not good enough. 

> virtual column support (filename) in hive
> -----------------------------------------
>
>                 Key: HIVE-837
>                 URL: https://issues.apache.org/jira/browse/HIVE-837
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Namit Jain
>
> Copying from some mails:
> I am dumping files into a hive partion on five minute intervals. I am using LOAD DATA into a partition.
> weblogs
> web1.00
> web1.05
> web1.10
> ...
> web2.00
> web2.05
> web1.10
> ....
> Things that would be useful..
> Select files from the folder with a regex or exact name
> select * FROM logs where FILENAME LIKE(WEB1*)
> select * FROM LOGS WHERE FILENAME=web2.00
> Also it would be nice to be able to select offsets in a file, this would make sense with appends
> select * from logs WHERE FILENAME=web2.00 FROMOFFSET=454644 [TOOFFSET=]
> select  
> substr(filename, 4, 7) as  class_A, 
> substr(filename,  8, 10) as class_B
> count( x ) as cnt
> from FOO
> group by
> substr(filename, 4, 7), 
> substr(filename,  8, 10) ;
> Hive should support virtual columns

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-837) virtual column support (filename) in hive

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12765413#action_12765413 ] 

He Yongqiang commented on HIVE-837:
-----------------------------------

It seems supporting filename as a virtual column will also allow capturing file name in query.

> virtual column support (filename) in hive
> -----------------------------------------
>
>                 Key: HIVE-837
>                 URL: https://issues.apache.org/jira/browse/HIVE-837
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Namit Jain
>
> Copying from some mails:
> I am dumping files into a hive partion on five minute intervals. I am using LOAD DATA into a partition.
> weblogs
> web1.00
> web1.05
> web1.10
> ...
> web2.00
> web2.05
> web1.10
> ....
> Things that would be useful..
> Select files from the folder with a regex or exact name
> select * FROM logs where FILENAME LIKE(WEB1*)
> select * FROM LOGS WHERE FILENAME=web2.00
> Also it would be nice to be able to select offsets in a file, this would make sense with appends
> select * from logs WHERE FILENAME=web2.00 FROMOFFSET=454644 [TOOFFSET=]
> select  
> substr(filename, 4, 7) as  class_A, 
> substr(filename,  8, 10) as class_B
> count( x ) as cnt
> from FOO
> group by
> substr(filename, 4, 7), 
> substr(filename,  8, 10) ;
> Hive should support virtual columns

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-837) virtual column support (filename) in hive

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756630#action_12756630 ] 

Namit Jain commented on HIVE-837:
---------------------------------

yesterday, i was having a offline conversation with Raghu, and we were thinking that this is similar to the concept of buckets that exists currently.
So, do we enhance the tablesample clause to include filenames also, and not expose it as a virtual column at all ?

I think it is more intuitive to have filenames in the where clause - maybe, we should have some virtual columns for buckets also and leave the current
syntax for buckets as is for backward compatibility.

File pruning is must - so, having the filename as udf might be more difficult. The udf filename() will return the same value at compile time.
So, I would prefer virtual columns instead of filenames. 

SELECT * FROM weblogs DATAFILE ('log1.txt', 'log2.txt') WHERE col1='..' and col2= ...
would solve the pruning problem since the file names are part of the syntax, but how do you propose to select the filename in that case ?

So, I think the original syntax:
select * FROM logs where FILENAME LIKE(WEB1*)
might be easier

> virtual column support (filename) in hive
> -----------------------------------------
>
>                 Key: HIVE-837
>                 URL: https://issues.apache.org/jira/browse/HIVE-837
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Namit Jain
>
> Copying from some mails:
> I am dumping files into a hive partion on five minute intervals. I am using LOAD DATA into a partition.
> weblogs
> web1.00
> web1.05
> web1.10
> ...
> web2.00
> web2.05
> web1.10
> ....
> Things that would be useful..
> Select files from the folder with a regex or exact name
> select * FROM logs where FILENAME LIKE(WEB1*)
> select * FROM LOGS WHERE FILENAME=web2.00
> Also it would be nice to be able to select offsets in a file, this would make sense with appends
> select * from logs WHERE FILENAME=web2.00 FROMOFFSET=454644 [TOOFFSET=]
> select  
> substr(filename, 4, 7) as  class_A, 
> substr(filename,  8, 10) as class_B
> count( x ) as cnt
> from FOO
> group by
> substr(filename, 4, 7), 
> substr(filename,  8, 10) ;
> Hive should support virtual columns

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-837) virtual column support (filename) in hive

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790263#action_12790263 ] 

Carl Steinbach commented on HIVE-837:
-------------------------------------

I think this can be implemented as either a UDF or virtual column as long as we choose to define FILE() as the directory path that contains the table's/partition's data files, but I think things get significantly more complicated if we want FILE() to evaluate to the actual *file* that the current row is drawn from. At compile-time we have access to the table/partition directory and can retrieve a list of data files, but I don't think it is possible to make the row<->File association until runtime. Does this make sense or am I missing something?


> virtual column support (filename) in hive
> -----------------------------------------
>
>                 Key: HIVE-837
>                 URL: https://issues.apache.org/jira/browse/HIVE-837
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Namit Jain
>
> Copying from some mails:
> I am dumping files into a hive partion on five minute intervals. I am using LOAD DATA into a partition.
> weblogs
> web1.00
> web1.05
> web1.10
> ...
> web2.00
> web2.05
> web1.10
> ....
> Things that would be useful..
> Select files from the folder with a regex or exact name
> select * FROM logs where FILENAME LIKE(WEB1*)
> select * FROM LOGS WHERE FILENAME=web2.00
> Also it would be nice to be able to select offsets in a file, this would make sense with appends
> select * from logs WHERE FILENAME=web2.00 FROMOFFSET=454644 [TOOFFSET=]
> select  
> substr(filename, 4, 7) as  class_A, 
> substr(filename,  8, 10) as class_B
> count( x ) as cnt
> from FOO
> group by
> substr(filename, 4, 7), 
> substr(filename,  8, 10) ;
> Hive should support virtual columns

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-837) virtual column support (filename) in hive

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795904#action_12795904 ] 

Todd Lipcon commented on HIVE-837:
----------------------------------

bq. in that case they are really interested in the actual filename as opposed to the directory name. 

+1. I'm currently working with a 200G dataset that has lots of rows that Hive is interpreting as NULL. As far as I knew, there are no NULLs in the dataset to begin with, so I'd love to do: SELECT FILENAME(), FILEOFFSET() FROM t WHERE some_col IS NULL;


> virtual column support (filename) in hive
> -----------------------------------------
>
>                 Key: HIVE-837
>                 URL: https://issues.apache.org/jira/browse/HIVE-837
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Namit Jain
>
> Copying from some mails:
> I am dumping files into a hive partion on five minute intervals. I am using LOAD DATA into a partition.
> weblogs
> web1.00
> web1.05
> web1.10
> ...
> web2.00
> web2.05
> web1.10
> ....
> Things that would be useful..
> Select files from the folder with a regex or exact name
> select * FROM logs where FILENAME LIKE(WEB1*)
> select * FROM LOGS WHERE FILENAME=web2.00
> Also it would be nice to be able to select offsets in a file, this would make sense with appends
> select * from logs WHERE FILENAME=web2.00 FROMOFFSET=454644 [TOOFFSET=]
> select  
> substr(filename, 4, 7) as  class_A, 
> substr(filename,  8, 10) as class_B
> count( x ) as cnt
> from FOO
> group by
> substr(filename, 4, 7), 
> substr(filename,  8, 10) ;
> Hive should support virtual columns

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.