You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@tajo.apache.org by blrunner <gi...@git.apache.org> on 2015/06/16 00:50:06 UTC

[GitHub] tajo pull request: TAJO-1644: When inserting empty data into a par...

GitHub user blrunner opened a pull request:

    https://github.com/apache/tajo/pull/601

    TAJO-1644: When inserting empty data into a partitioned table, existing data would be removed.

    When inserting empty data into a partitioned table, existing data would be removed. Tajo provides column value partition which is hive-style partition. In hive, when inserting empty data into a partition, there are two cases. If you use dynamic partitions, existing data never would be removed. But if you don't use dynamic partitions, existing data would be removed. When inserting a data to partition, tajo user don't specify each column and each column value. So, it is similar to dynamic partition of hive. It seems to update deletion logic of partitioned table.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/blrunner/tajo TAJO-1644

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tajo/pull/601.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #601
    
----
commit 67f818e23e82099a4dd84f8deb21f9008c912faf
Author: JaeHwa Jung <bl...@apache.org>
Date:   2015-06-15T22:47:53Z

    TAJO-1644: When inserting empty data into a partitioned table, existing data would be removed.

commit 74c3e29bae6cba30dec33bf19f61af8141ab2dd4
Author: JaeHwa Jung <bl...@apache.org>
Date:   2015-06-15T22:49:14Z

    Remove unnecessary packages.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-1644: When inserting empty data into a par...

Posted by hyunsik <gi...@git.apache.org>.
Github user hyunsik commented on a diff in the pull request:

    https://github.com/apache/tajo/pull/601#discussion_r33110732
  
    --- Diff: tajo-common/src/main/java/org/apache/tajo/conf/TajoConf.java ---
    @@ -376,7 +375,10 @@ public static int setDateOrder(int dateOrder) {
         // Behavior Control ---------------------------------------------------------
         $BEHAVIOR_ARITHMETIC_ABORT("tajo.behavior.arithmetic-abort", false),
     
    -    // ResultSet ---------------------------------------------------------
    +    // When inserting empty data into a partitioned table, set to keep existing data.
    --- End diff --
    
    This comment needs to be updated by my above comment.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-1644: When inserting empty data into a par...

Posted by hyunsik <gi...@git.apache.org>.
Github user hyunsik commented on a diff in the pull request:

    https://github.com/apache/tajo/pull/601#discussion_r32494311
  
    --- Diff: tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/FileTablespace.java ---
    @@ -892,9 +892,8 @@ protected Path commitOutputData(OverridableConf queryContext, boolean changeFile
               boolean movedToOldTable = false;
               boolean committed = false;
               Path oldTableDir = new Path(stagingDir, TajoConstants.INSERT_OVERWIRTE_OLD_TABLE_NAME);
    -          ContentSummary summary = fs.getContentSummary(stagingResultDir);
     
    -          if (!queryContext.get(QueryVars.OUTPUT_PARTITIONS, "").isEmpty() && summary.getFileCount() > 0L) {
    +          if (!queryContext.get(QueryVars.OUTPUT_PARTITIONS, "").isEmpty()) {
    --- End diff --
    
    It would be better if it works according to session variable to allow users to decide whether original data set is removed or not when inserting rows is empty.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-1644: When inserting empty data into a par...

Posted by hyunsik <gi...@git.apache.org>.
Github user hyunsik commented on the pull request:

    https://github.com/apache/tajo/pull/601#issuecomment-115009452
  
    +1
    Looks good to me. I leaved one trivial comment. If you agree, you can commit the patch after you reflect my comment.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-1644: When inserting empty data into a par...

Posted by hyunsik <gi...@git.apache.org>.
Github user hyunsik commented on a diff in the pull request:

    https://github.com/apache/tajo/pull/601#discussion_r33110852
  
    --- Diff: tajo-common/src/main/java/org/apache/tajo/conf/TajoConf.java ---
    @@ -376,7 +375,10 @@ public static int setDateOrder(int dateOrder) {
         // Behavior Control ---------------------------------------------------------
         $BEHAVIOR_ARITHMETIC_ABORT("tajo.behavior.arithmetic-abort", false),
     
    -    // ResultSet ---------------------------------------------------------
    +    // When inserting empty data into a partitioned table, set to keep existing data.
    +    $TABLE_PARTITION_KEPT_EXISTING_DATA_ENABLED("tajo.table-partition.kept-existing-data.enabled", true),
    --- End diff --
    
    This also should be updated by my above comment. Also, it would be nice if the config key is changed to 'tajo.partition.overwrite.even-if-no-result', and its default config should be ``false``.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-1644: When inserting empty data into a par...

Posted by hyunsik <gi...@git.apache.org>.
Github user hyunsik commented on the pull request:

    https://github.com/apache/tajo/pull/601#issuecomment-113364944
  
    Hi @blrunner,
    
    Thank you for your work. How about the insert overwrite behavior for not partitioned table? Does that the same semantic?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-1644: When inserting empty data into a par...

Posted by blrunner <gi...@git.apache.org>.
Github user blrunner commented on the pull request:

    https://github.com/apache/tajo/pull/601#issuecomment-112447587
  
    Thanks @hyunsik 
    
    I've just updated the patch using your comments. :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-1644: When inserting empty data into a par...

Posted by hyunsik <gi...@git.apache.org>.
Github user hyunsik commented on a diff in the pull request:

    https://github.com/apache/tajo/pull/601#discussion_r32494015
  
    --- Diff: tajo-core/src/test/java/org/apache/tajo/engine/query/TestTablePartitions.java ---
    @@ -548,54 +526,61 @@ public final void testInsertIntoColumnPartitionedTableByThreeColumns() throws Ex
             + " where l_orderkey = 1 and l_partkey = 1 and  l_linenumber = 1");
         res.close();
     
    -    desc = catalog.getTableDesc(DEFAULT_DATABASE_NAME, tableName);
    -    assertTrue(fs.isDirectory(path));
    -    assertTrue(fs.isDirectory(new Path(path.toUri() + "/col1=1")));
    -    assertTrue(fs.isDirectory(new Path(path.toUri() + "/col1=1/col2=1")));
    -    assertTrue(fs.isDirectory(new Path(path.toUri() + "/col1=1/col2=1/col3=17.0")));
    -    assertTrue(fs.isDirectory(new Path(path.toUri() + "/col1=1/col2=1/col3=30.0")));
    -    assertTrue(fs.isDirectory(new Path(path.toUri() + "/col1=2")));
    -    assertTrue(fs.isDirectory(new Path(path.toUri() + "/col1=2/col2=2")));
    -    assertTrue(fs.isDirectory(new Path(path.toUri() + "/col1=2/col2=2/col3=38.0")));
    -    assertTrue(fs.isDirectory(new Path(path.toUri() + "/col1=3")));
    -    assertTrue(fs.isDirectory(new Path(path.toUri() + "/col1=3/col2=2")));
    -    assertTrue(fs.isDirectory(new Path(path.toUri() + "/col1=3/col2=3")));
    -    assertTrue(fs.isDirectory(new Path(path.toUri() + "/col1=3/col2=2/col3=45.0")));
    -    assertTrue(fs.isDirectory(new Path(path.toUri() + "/col1=3/col2=3/col3=49.0")));
    -
    +    verifyDirectoriesForThreeColumns(fs, path, 3);
         if (!testingCluster.isHiveCatalogStoreRunning()) {
           // TODO: If there is existing another partition directory, we must add its rows number to result row numbers.
    +      // desc = catalog.getTableDesc(DEFAULT_DATABASE_NAME, tableName);
           // assertEquals(6, desc.getStats().getNumRows().intValue());
         }
     
    -    res = executeString("select * from " + tableName + " where col2 = 1");
    -    resultSetData = resultSetToString(res);
    -    res.close();
    -    expected = "col4,col1,col2,col3\n" +
    -        "-------------------------------\n" +
    -        "N,1,1,17.0\n" +
    -        "N,1,1,17.0\n" +
    -        "N,1,1,30.0\n" +
    -        "N,1,1,36.0\n" +
    -        "N,1,1,36.0\n";
    -
    -    assertEquals(expected, resultSetData);
    +    verifyMaintainExistingData(res, tableName);
     
         // insert overwrite empty result to partitioned table
         res = executeString("insert overwrite into " + tableName
    -      + " select l_returnflag, l_orderkey, l_partkey, l_quantity from lineitem where l_orderkey" +
    -      " > 100");
    +      + " select l_returnflag, l_orderkey, l_partkey, l_quantity from lineitem where l_orderkey > 100");
         res.close();
     
    -    desc = catalog.getTableDesc(DEFAULT_DATABASE_NAME, tableName);
    +    verifyDirectoriesForThreeColumns(fs, path, 4);
    +    verifyMaintainExistingData(res, tableName);
     
    -    ContentSummary summary = fs.getContentSummary(new Path(desc.getPath()));
    +    executeString("DROP TABLE " + tableName + " PURGE").close();
    +  }
     
    -    assertEquals(summary.getDirectoryCount(), 1L);
    -    assertEquals(summary.getFileCount(), 0L);
    -    assertEquals(summary.getLength(), 0L);
    +  private final void verifyMaintainExistingData(ResultSet res, String tableName) throws Exception {
    --- End diff --
    
    It would be better if you rename verifyMaintainExistingData to verifyKeptExistingData.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-1644: When inserting empty data into a par...

Posted by hyunsik <gi...@git.apache.org>.
Github user hyunsik commented on a diff in the pull request:

    https://github.com/apache/tajo/pull/601#discussion_r33110868
  
    --- Diff: tajo-storage/tajo-storage-hdfs/src/main/java/org/apache/tajo/storage/FileTablespace.java ---
    @@ -894,7 +894,12 @@ protected Path commitOutputData(OverridableConf queryContext, boolean changeFile
               Path oldTableDir = new Path(stagingDir, TajoConstants.INSERT_OVERWIRTE_OLD_TABLE_NAME);
               ContentSummary summary = fs.getContentSummary(stagingResultDir);
     
    -          if (!queryContext.get(QueryVars.OUTPUT_PARTITIONS, "").isEmpty() && summary.getFileCount() > 0L) {
    +          // When inserting empty data into a partitioned table, check if keep existing data need to be remove or not.
    +          boolean keptEnabled = queryContext.getBool(SessionVars.TABLE_PARTITION_KEPT_EXISTING_DATA_ENABLED);
    +
    +          // If existing data doesn't need to keep, check if there are some files.
    +          if ( (!queryContext.get(QueryVars.OUTPUT_PARTITIONS, "").isEmpty())
    --- End diff --
    
    This condition should be changed depending on the changed default value.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-1644: When inserting empty data into a par...

Posted by hyunsik <gi...@git.apache.org>.
Github user hyunsik commented on a diff in the pull request:

    https://github.com/apache/tajo/pull/601#discussion_r33194715
  
    --- Diff: tajo-common/src/main/java/org/apache/tajo/conf/TajoConf.java ---
    @@ -376,7 +375,11 @@ public static int setDateOrder(int dateOrder) {
         // Behavior Control ---------------------------------------------------------
         $BEHAVIOR_ARITHMETIC_ABORT("tajo.behavior.arithmetic-abort", false),
     
    -    // ResultSet ---------------------------------------------------------
    +    // If True, a partitioned table is overwritten even if a sub query leads to no result.
    +    // Otherwise, the table data will be kept if there is no result
    +    $TABLE_PARTITION_NO_RESULT_OVERWRITE_ENABLED("tajo.partition.overwrite.even-if-no-result", false),
    --- End diff --
    
    I have one more suggestion. It would be great if you use PARTITION instead of TABLE_PARTITION. It is still clear because we only use the term 'partition' to indicate table partition.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-1644: When inserting empty data into a par...

Posted by blrunner <gi...@git.apache.org>.
Github user blrunner commented on the pull request:

    https://github.com/apache/tajo/pull/601#issuecomment-114696318
  
    @hyunsik 
    
    Thank you for your detailed review.
    I've just updated the patch using your comments.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-1644: When inserting empty data into a par...

Posted by hyunsik <gi...@git.apache.org>.
Github user hyunsik commented on a diff in the pull request:

    https://github.com/apache/tajo/pull/601#discussion_r33110575
  
    --- Diff: tajo-common/src/main/java/org/apache/tajo/SessionVars.java ---
    @@ -126,6 +126,9 @@
       NULL_CHAR(ConfVars.$TEXT_NULL, "null char of text file output", DEFAULT),
       CODEGEN(ConfVars.$CODEGEN, "Runtime code generation enabled (experiment)", DEFAULT),
     
    +  TABLE_PARTITION_KEPT_EXISTING_DATA_ENABLED(ConfVars.$TABLE_PARTITION_KEPT_EXISTING_DATA_ENABLED,
    --- End diff --
    
    How about {{TABLE_PARTITION_NO_RESULT_OVERWRITE_ENABLED}} for the config name? It seems to be more intuitive for me.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-1644: When inserting empty data into a par...

Posted by blrunner <gi...@git.apache.org>.
Github user blrunner commented on the pull request:

    https://github.com/apache/tajo/pull/601#issuecomment-113871681
  
    Hi @hyunsik 
    
    Thank you for your review. I also agree with you. But when inserting empty data into a non partitioned table in hive, existing data would be removed always. I'm a bit anxious about user's confusion between hive and tajo. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-1644: When inserting empty data into a par...

Posted by hyunsik <gi...@git.apache.org>.
Github user hyunsik commented on a diff in the pull request:

    https://github.com/apache/tajo/pull/601#discussion_r33194745
  
    --- Diff: tajo-common/src/main/java/org/apache/tajo/SessionVars.java ---
    @@ -126,6 +126,10 @@
       NULL_CHAR(ConfVars.$TEXT_NULL, "null char of text file output", DEFAULT),
       CODEGEN(ConfVars.$CODEGEN, "Runtime code generation enabled (experiment)", DEFAULT),
     
    +  TABLE_PARTITION_NO_RESULT_OVERWRITE_ENABLED(ConfVars.$TABLE_PARTITION_NO_RESULT_OVERWRITE_ENABLED,
    --- End diff --
    
    I have one more suggestion. It would be great if you use PARTITION instead of TABLE_PARTITION. It is still clear because we only use the term 'partition' to indicate table partition.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-1644: When inserting empty data into a par...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/tajo/pull/601


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-1644: When inserting empty data into a par...

Posted by hyunsik <gi...@git.apache.org>.
Github user hyunsik commented on a diff in the pull request:

    https://github.com/apache/tajo/pull/601#discussion_r33110710
  
    --- Diff: tajo-common/src/main/java/org/apache/tajo/SessionVars.java ---
    @@ -126,6 +126,9 @@
       NULL_CHAR(ConfVars.$TEXT_NULL, "null char of text file output", DEFAULT),
       CODEGEN(ConfVars.$CODEGEN, "Runtime code generation enabled (experiment)", DEFAULT),
     
    +  TABLE_PARTITION_KEPT_EXISTING_DATA_ENABLED(ConfVars.$TABLE_PARTITION_KEPT_EXISTING_DATA_ENABLED,
    +    "When inserting empty data into a partitioned table, set to keep existing data.", DEFAULT),
    --- End diff --
    
    It would be better if the description is 'If True, a partitioned table is overwritten even if a sub query leads to no result. Otherwise, the table data will be kept if there is no result.'.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---