Posted to dev@hive.apache.org by "Sean McNamara (Created) (JIRA)" <ji...@apache.org> on 2012/03/21 19:15:42 UTC

[jira] [Created] (HIVE-2889) LOAD DATA IF NOT EXISTS functionality

LOAD DATA IF NOT EXISTS functionality
-------------------------------------

                 Key: HIVE-2889
                 URL: https://issues.apache.org/jira/browse/HIVE-2889
             Project: Hive
          Issue Type: Improvement
          Components: Import/Export
    Affects Versions: 0.8.1
            Reporter: Sean McNamara
             Fix For: 0.9.0


*Background:*
The behavior of LOAD DATA LOCAL INPATH has changed.  It used to give an error when you tried to copy in a log that already existed.  Now it renames the file with a _copy_N suffix (for example, test_b_copy_1.bz2), so the file always goes into HDFS.


*Original discussion:*
http://mail-archives.apache.org/mod_mbox/hive-user/201203.mbox/%3CCB8D2849.14F69%25sean.mcnamara%40webtrends.com%3E


*Issue:*
There is no longer an atomic way to insert files into Hive and guarantee that the same file won't go in twice.  Using OVERWRITE is not a workaround, because it deletes the other logs in the table/partition.


*Example:*
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_a.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"
/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"
/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"
/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"}}

*Result:*
{{test_a.bz2
test_b.bz2
test_b_copy_1.bz2
test_b_copy_2.bz2}}

_test_b data was inserted 3 times, which is not the desired behavior in this instance._
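
To make the renaming concrete, here is a small Python model of the behavior shown in the Result listing.  It is only a sketch of the observed naming, not Hive's actual implementation; the function names destination_name and load are made up for this illustration.

```python
import os


def destination_name(existing, filename):
    """Return the name a loaded file would end up with in the partition.

    Models the observed behavior only: if the name is taken, append
    _copy_N before the extension, using the first free N.
    """
    if filename not in existing:
        return filename
    stem, ext = os.path.splitext(filename)  # 'test_b.bz2' -> ('test_b', '.bz2')
    n = 1
    while f"{stem}_copy_{n}{ext}" in existing:
        n += 1
    return f"{stem}_copy_{n}{ext}"


def load(existing, filename):
    """Simulate one LOAD DATA call against a set of existing file names."""
    name = destination_name(existing, filename)
    existing.add(name)
    return name


# Same sequence as the Example: test_a once, test_b three times.
partition = set()
loaded = [load(partition, f)
          for f in ["test_a.bz2", "test_b.bz2", "test_b.bz2", "test_b.bz2"]]
print(loaded)
```

Every call succeeds, so the caller has no signal that test_b.bz2 was already present.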


*Proposal:*
Add an _IF NOT EXISTS_ flag to indicate copy semantics.  If the log file does not exist in the table/partition, the log would go in normally.  If the log does exist in the table/partition, Hive would return an error and exit with a failure code.


*Proposed HiveQL Example:*
{{LOAD DATA LOCAL IF NOT EXISTS INPATH 'test_a.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')}}
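
The semantics I have in mind can be sketched in Python as follows.  This is a single-process illustration under assumed names (FileAlreadyLoaded, load_if_not_exists, and run are hypothetical); in Hive the existence check and the copy would need to happen atomically on the server side, which this sketch does not attempt.

```python
class FileAlreadyLoaded(Exception):
    """Raised when the file name is already present in the table/partition."""


def load_if_not_exists(existing, filename):
    """Add filename to the partition, or fail if it is already there."""
    if filename in existing:
        raise FileAlreadyLoaded(f"{filename} already exists in table/partition")
    existing.add(filename)


def run(existing, filename):
    """Mimic a CLI call: return 0 on success, 1 if the load was rejected."""
    try:
        load_if_not_exists(existing, filename)
        return 0
    except FileAlreadyLoaded as err:
        print(f"FAILED: {err}")
        return 1


# Same sequence as the Example: test_a once, test_b three times.
partition = set()
codes = [run(partition, f)
         for f in ["test_a.bz2", "test_b.bz2", "test_b.bz2", "test_b.bz2"]]
print(codes, sorted(partition))
```

With this behavior a loader script can treat a failure code as "already loaded" and the partition never receives a second copy of test_b.bz2.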


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (HIVE-2889) LOAD DATA IF NOT EXISTS functionality

Posted by "Ashutosh Chauhan (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated HIVE-2889:
-----------------------------------

    Fix Version/s:     (was: 0.9.0)

Unlinking from 0.9 
                


        

[jira] [Updated] (HIVE-2889) LOAD DATA IF NOT EXISTS functionality

Posted by "Sean McNamara (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean McNamara updated HIVE-2889:
--------------------------------

    Description: 
*Background:*
The behavior of LOAD DATA LOCAL INPATH has changed.  It used to give an error when you tried to copy in a log that already existed.  Now it renames the file with a _copy_N suffix (for example, test_b_copy_1.bz2), so the file always goes into HDFS.


*Original discussion:*
http://mail-archives.apache.org/mod_mbox/hive-user/201203.mbox/%3CCB8D2849.14F69%25sean.mcnamara%40webtrends.com%3E


*Issue:*
There is no longer an atomic way to insert files into Hive and guarantee that the same file won't go in twice.  Using OVERWRITE is not a workaround, because it deletes the other logs in the table/partition.


*Example:*
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_a.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"}}
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"}}
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"}}
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"}}

*Result:*
{{test_a.bz2}}
{{test_b.bz2}}
{{test_b_copy_1.bz2}}
{{test_b_copy_2.bz2}}

_test_b data was inserted 3 times, which is not the desired behavior in this instance._


*Proposal:*
Add an _IF NOT EXISTS_ flag to indicate copy semantics.  If the log file does not exist in the table/partition, the log would go in normally.  If the log does exist in the table/partition, Hive would return an error and exit with a failure code.


*Proposed HiveQL Example:*
{{LOAD DATA LOCAL IF NOT EXISTS INPATH 'test_a.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')}}

  was:
*Background:*
The behavior of LOAD DATA LOCAL INPATH has changed.  It used to give an error when you tried to copy in a log that already existed.  Now it renames the file with a _copy_N suffix (for example, test_b_copy_1.bz2), so the file always goes into HDFS.


*Original discussion:*
http://mail-archives.apache.org/mod_mbox/hive-user/201203.mbox/%3CCB8D2849.14F69%25sean.mcnamara%40webtrends.com%3E


*Issue:*
There is no longer an atomic way to insert files into Hive and guarantee that the same file won't go in twice.  Using OVERWRITE is not a workaround, because it deletes the other logs in the table/partition.


*Example:*
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_a.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"}}
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"}}
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"}}
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"}}

*Result:*
{{test_a.bz2
test_b.bz2
test_b_copy_1.bz2
test_b_copy_2.bz2}}

_test_b data was inserted 3 times, which is not the desired behavior in this instance._


*Proposal:*
Add an _IF NOT EXISTS_ flag to indicate copy semantics.  If the log file does not exist in the table/partition, the log would go in normally.  If the log does exist in the table/partition, Hive would return an error and exit with a failure code.


*Proposed HiveQL Example:*
{{LOAD DATA LOCAL IF NOT EXISTS INPATH 'test_a.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')}}

    


        

[jira] [Updated] (HIVE-2889) LOAD DATA IF NOT EXISTS functionality

Posted by "Sean McNamara (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean McNamara updated HIVE-2889:
--------------------------------

    Description: 
*Background:*
The behavior of LOAD DATA LOCAL INPATH has changed.  It used to give an error when you tried to copy in a log that already existed.  Now it renames the file with a _copy_N suffix (for example, test_b_copy_1.bz2), so the file always goes into HDFS.


*Original discussion:*
http://mail-archives.apache.org/mod_mbox/hive-user/201203.mbox/%3CCB8D2849.14F69%25sean.mcnamara%40webtrends.com%3E


*Issue:*
There is no longer an atomic way to insert files into Hive and guarantee that the same file won't go in twice.  Using OVERWRITE is not a workaround, because it deletes the other logs in the table/partition.


*Example:*
{{usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_a.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"}}
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"}}
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"}}
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"}}

*Result:*
{{test_a.bz2
test_b.bz2
test_b_copy_1.bz2
test_b_copy_2.bz2}}

_test_b data was inserted 3 times, which is not the desired behavior in this instance._


*Proposal:*
Add an _IF NOT EXISTS_ flag to indicate copy semantics.  If the log file does not exist in the table/partition, the log would go in normally.  If the log does exist in the table/partition, Hive would return an error and exit with a failure code.


*Proposed HiveQL Example:*
{{LOAD DATA LOCAL IF NOT EXISTS INPATH 'test_a.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')}}

  was:
*Background:*
The behavior of LOAD DATA LOCAL INPATH has changed.  It used to give an error when you tried to copy in a log that already existed.  Now it renames the file with a _copy_N suffix (for example, test_b_copy_1.bz2), so the file always goes into HDFS.


*Original discussion:*
http://mail-archives.apache.org/mod_mbox/hive-user/201203.mbox/%3CCB8D2849.14F69%25sean.mcnamara%40webtrends.com%3E


*Issue:*
There is no longer an atomic way to insert files into Hive and guarantee that the same file won't go in twice.  Using OVERWRITE is not a workaround, because it deletes the other logs in the table/partition.


*Example:*
{{usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_a.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"
/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"
/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"
/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"}}

*Result:*
{{test_a.bz2
test_b.bz2
test_b_copy_1.bz2
test_b_copy_2.bz2}}

_test_b data was inserted 3 times, which is not the desired behavior in this instance._


*Proposal:*
Add an _IF NOT EXISTS_ flag to indicate copy semantics.  If the log file does not exist in the table/partition, the log would go in normally.  If the log does exist in the table/partition, Hive would return an error and exit with a failure code.


*Proposed HiveQL Example:*
{{LOAD DATA LOCAL IF NOT EXISTS INPATH 'test_a.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')}}


    


        

[jira] [Updated] (HIVE-2889) LOAD DATA IF NOT EXISTS functionality

Posted by "Sean McNamara (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean McNamara updated HIVE-2889:
--------------------------------

    Description: 
*Background:*
The behavior of LOAD DATA LOCAL INPATH has changed.  It used to give an error when you tried to copy in a log that already existed.  Now it renames the file with a _copy_N suffix (for example, test_b_copy_1.bz2), so the file always goes into HDFS.


*Original discussion:*
http://mail-archives.apache.org/mod_mbox/hive-user/201203.mbox/%3CCB8D2849.14F69%25sean.mcnamara%40webtrends.com%3E


*Issue:*
There is no longer an atomic way to insert files into Hive and guarantee that the same file won't go in twice.  Using OVERWRITE is not a workaround, because it deletes the other logs in the table/partition.


*Example:*
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_a.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"}}
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"}}
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"}}
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"}}

*Result:*
{{test_a.bz2
test_b.bz2
test_b_copy_1.bz2
test_b_copy_2.bz2}}

_test_b data was inserted 3 times, which is not the desired behavior in this instance._


*Proposal:*
Add an _IF NOT EXISTS_ flag to indicate copy semantics.  If the log file does not exist in the table/partition, the log would go in normally.  If the log does exist in the table/partition, Hive would return an error and exit with a failure code.


*Proposed HiveQL Example:*
{{LOAD DATA LOCAL IF NOT EXISTS INPATH 'test_a.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')}}

  was:
*Background:*
The behavior of LOAD DATA LOCAL INPATH has changed.  It used to give an error when you tried to copy in a log that already existed.  Now it renames the file with a _copy_N suffix (for example, test_b_copy_1.bz2), so the file always goes into HDFS.


*Original discussion:*
http://mail-archives.apache.org/mod_mbox/hive-user/201203.mbox/%3CCB8D2849.14F69%25sean.mcnamara%40webtrends.com%3E


*Issue:*
There is no longer an atomic way to insert files into Hive and guarantee that the same file won't go in twice.  Using OVERWRITE is not a workaround, because it deletes the other logs in the table/partition.


*Example:*
{{usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_a.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"}}
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"}}
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"}}
{{/usr/local/hive/bin/hive -e "LOAD DATA LOCAL INPATH 'test_b.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')"}}

*Result:*
{{test_a.bz2
test_b.bz2
test_b_copy_1.bz2
test_b_copy_2.bz2}}

_test_b data was inserted 3 times, which is not the desired behavior in this instance._


*Proposal:*
Add an _IF NOT EXISTS_ flag to indicate copy semantics.  If the log file does not exist in the table/partition, the log would go in normally.  If the log does exist in the table/partition, Hive would return an error and exit with a failure code.


*Proposed HiveQL Example:*
{{LOAD DATA LOCAL IF NOT EXISTS INPATH 'test_a.bz2' INTO TABLE logs PARTITION(ds='2012-03-19', hr='23')}}

    
