You are viewing a plain text version of this content. The canonical link for it is here.
Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2021/05/26 11:18:55 UTC

[GitHub] [hive] oleksiy-sayankin opened a new pull request #2323: HIVE-21075 : Metastore: Drop partition performance downgrade with Pos…

oleksiy-sayankin opened a new pull request #2323:
URL: https://github.com/apache/hive/pull/2323


   …tgres DB
   
   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://cwiki.apache.org/confluence/display/Hive/HowToContribute
     2. Ensure that you have created an issue on the Hive project JIRA: https://issues.apache.org/jira/projects/HIVE/summary
     3. Ensure you have added or run the appropriate tests for your PR: 
     4. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP]HIVE-XXXXX:  Your PR title ...'.
     5. Be sure to keep the PR description updated to reflect all changes.
     6. Please write your PR title to summarize what this PR proposes.
     7. If possible, provide a concise example to reproduce the issue for a faster review.
   
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description, screenshot and/or a reproducable example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Hive versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #2323: HIVE-21075 : Metastore: Drop partition performance downgrade with Pos…

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2323:
URL: https://github.com/apache/hive/pull/2323#discussion_r639649310



##########
File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/ObjectStore.java
##########
@@ -5269,6 +5293,32 @@ private void preDropStorageDescriptor(MStorageDescriptor msd) {
     removeUnusedColumnDescriptor(mcd);
   }
 
+  /**
+   * Get a list of storage descriptors that reference a particular Column Descriptor
+   * @param oldCD the column descriptor to get storage descriptors for
+   * @return a list of storage descriptors
+   */
+  private List<MStorageDescriptor> listStorageDescriptorsWithCD(MColumnDescriptor oldCD, Query query) {
+    boolean success = false;
+    List<MStorageDescriptor> sds = null;
+    try {
+      openTransaction();

Review comment:
       Why did we move the transaction here?
   If the previous check runs in a different transaction then we might end up in a situation where the check did not found any usage for the descriptor but before the transaction for the removal started someone inserted new data and started to use the CD.
   
   I think it is only theoretical possibility, but we should check this out.
   
   Thanks,
   Peter




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] github-actions[bot] commented on pull request #2323: HIVE-21075 : Metastore: Drop partition performance downgrade with Pos…

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #2323:
URL: https://github.com/apache/hive/pull/2323#issuecomment-887111645


   This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
   Feel free to reach out on the dev@hive.apache.org list if the patch is in need of reviews.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] oleksiy-sayankin commented on a change in pull request #2323: HIVE-21075 : Metastore: Drop partition performance downgrade with Pos…

Posted by GitBox <gi...@apache.org>.
oleksiy-sayankin commented on a change in pull request #2323:
URL: https://github.com/apache/hive/pull/2323#discussion_r640409691



##########
File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/ObjectStore.java
##########
@@ -5269,6 +5293,32 @@ private void preDropStorageDescriptor(MStorageDescriptor msd) {
     removeUnusedColumnDescriptor(mcd);
   }
 
+  /**
+   * Get a list of storage descriptors that reference a particular Column Descriptor
+   * @param oldCD the column descriptor to get storage descriptors for
+   * @return a list of storage descriptors
+   */
+  private List<MStorageDescriptor> listStorageDescriptorsWithCD(MColumnDescriptor oldCD, Query query) {
+    boolean success = false;
+    List<MStorageDescriptor> sds = null;
+    try {
+      openTransaction();

Review comment:
       And what do you think of method _listStorageDescriptorsWithCD_? It also uses open/commit transaction and If I implement open/commit transaction in _removeUnusedColumnDescriptor()_ it seems to be like this
          
          removeUnusedColumnDescriptor(){
            openTransaction()
            if (isPostgres()) {
              listStorageDescriptorsWithCD {  
                openTransaction()  <--- do we need this?
                1. SELECT * FROM "SDS" "A0" WHERE "A0"."CD_ID" = $1 limit 1)
                commitTransaction()   <--- do we need this?
                }
              }
            2. select count(1) from org.apache.hadoop.hive.metastore.model.MStorageDescriptor where (this.cd == inCD)
            3. pm.deletePersistentAll(mConstraintsList);
            4. pm.deletePersistent(oldCD);
            commitTransaction()
           }
   
   
   so open/commit transaction  inside _listStorageDescriptorsWithCD()_ is redundant. Agree?
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] belugabehr commented on pull request #2323: HIVE-21075 : Metastore: Drop partition performance downgrade with Pos…

Posted by GitBox <gi...@apache.org>.
belugabehr commented on pull request #2323:
URL: https://github.com/apache/hive/pull/2323#issuecomment-849788882


   ```
   sqlite> create table a (id int, value int);
   sqlite> create table b (id int, a_id, value int);
   
   sqlite> insert into a values (1,1);
   sqlite> insert into a values (2,2);
   sqlite> insert into b values (1,1,3);
   
   sqlite> select * from a where exists (select * from b where a.id=1) and a.id=1;
   ```
   
   I think this is what you are attempting.  I'm not sure what DN will do with this or what the underlying DB will do but please investigate.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] yongzhi commented on pull request #2323: HIVE-21075 : Metastore: Drop partition performance downgrade with Pos…

Posted by GitBox <gi...@apache.org>.
yongzhi commented on pull request #2323:
URL: https://github.com/apache/hive/pull/2323#issuecomment-981943366


   The Change Looks good +2


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] oleksiy-sayankin commented on a change in pull request #2323: HIVE-21075 : Metastore: Drop partition performance downgrade with Pos…

Posted by GitBox <gi...@apache.org>.
oleksiy-sayankin commented on a change in pull request #2323:
URL: https://github.com/apache/hive/pull/2323#discussion_r640409691



##########
File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/ObjectStore.java
##########
@@ -5269,6 +5293,32 @@ private void preDropStorageDescriptor(MStorageDescriptor msd) {
     removeUnusedColumnDescriptor(mcd);
   }
 
+  /**
+   * Get a list of storage descriptors that reference a particular Column Descriptor
+   * @param oldCD the column descriptor to get storage descriptors for
+   * @return a list of storage descriptors
+   */
+  private List<MStorageDescriptor> listStorageDescriptorsWithCD(MColumnDescriptor oldCD, Query query) {
+    boolean success = false;
+    List<MStorageDescriptor> sds = null;
+    try {
+      openTransaction();

Review comment:
       And what do you think of method _listStorageDescriptorsWithCD_? It also uses open/commit transaction and If I implement open/commit transaction in _removeUnusedColumnDescriptor()_ it seems to be like this
          
          removeUnusedColumnDescriptor(){
            openTransaction()
            if (isPostgres()) {
              listStorageDescriptorsWithCD {  
                openTransaction()  <--- do we need this?
                1. SELECT * FROM "SDS" "A0" WHERE "A0"."CD_ID" = $1 limit 1)
                commitTransaction()   <--- do we need this?
                }
              } else {
            2. select count(1) from org.apache.hadoop.hive.metastore.model.MStorageDescriptor where (this.cd == inCD)
            }
            3. pm.deletePersistentAll(mConstraintsList);
            4. pm.deletePersistent(oldCD);
            commitTransaction()
           }
   
   
   so open/commit transaction  inside _listStorageDescriptorsWithCD()_ is redundant. Agree?
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] github-actions[bot] closed pull request #2323: HIVE-21075 : Metastore: Drop partition performance downgrade with Pos…

Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed pull request #2323:
URL: https://github.com/apache/hive/pull/2323


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] oleksiy-sayankin commented on a change in pull request #2323: HIVE-21075 : Metastore: Drop partition performance downgrade with Pos…

Posted by GitBox <gi...@apache.org>.
oleksiy-sayankin commented on a change in pull request #2323:
URL: https://github.com/apache/hive/pull/2323#discussion_r640402121



##########
File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/ObjectStore.java
##########
@@ -5269,6 +5293,32 @@ private void preDropStorageDescriptor(MStorageDescriptor msd) {
     removeUnusedColumnDescriptor(mcd);
   }
 
+  /**
+   * Get a list of storage descriptors that reference a particular Column Descriptor
+   * @param oldCD the column descriptor to get storage descriptors for
+   * @return a list of storage descriptors
+   */
+  private List<MStorageDescriptor> listStorageDescriptorsWithCD(MColumnDescriptor oldCD, Query query) {
+    boolean success = false;
+    List<MStorageDescriptor> sds = null;
+    try {
+      openTransaction();

Review comment:
       Ok, I see. So your suggestion is to use transactions this way
   
       openTransaction()
       1. SELECT * FROM "SDS" "A0" WHERE "A0"."CD_ID" = $1 limit 1)
       2. select count(1) from org.apache.hadoop.hive.metastore.model.MStorageDescriptor where (this.cd == inCD)
       3. pm.deletePersistentAll(mConstraintsList);
       4. pm.deletePersistent(oldCD);
       commitTransaction()
   
   to prevent any possible changes in tables we read. Correct? If so I will change the PR in this manner.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] oleksiy-sayankin commented on pull request #2323: HIVE-21075 : Metastore: Drop partition performance downgrade with Pos…

Posted by GitBox <gi...@apache.org>.
oleksiy-sayankin commented on pull request #2323:
URL: https://github.com/apache/hive/pull/2323#issuecomment-849601363


   Refactored according to the comments above.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] belugabehr commented on pull request #2323: HIVE-21075 : Metastore: Drop partition performance downgrade with Pos…

Posted by GitBox <gi...@apache.org>.
belugabehr commented on pull request #2323:
URL: https://github.com/apache/hive/pull/2323#issuecomment-849664125


   Let me take a look at this.  I would be very cautious about adding DB vendor specific code into Hive (Yes, I know there is a lot already, too much).  We should be farming out as much of this stuff to DN as possible.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] oleksiy-sayankin commented on a change in pull request #2323: HIVE-21075 : Metastore: Drop partition performance downgrade with Pos…

Posted by GitBox <gi...@apache.org>.
oleksiy-sayankin commented on a change in pull request #2323:
URL: https://github.com/apache/hive/pull/2323#discussion_r640351626



##########
File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/ObjectStore.java
##########
@@ -5269,6 +5293,32 @@ private void preDropStorageDescriptor(MStorageDescriptor msd) {
     removeUnusedColumnDescriptor(mcd);
   }
 
+  /**
+   * Get a list of storage descriptors that reference a particular Column Descriptor
+   * @param oldCD the column descriptor to get storage descriptors for
+   * @return a list of storage descriptors
+   */
+  private List<MStorageDescriptor> listStorageDescriptorsWithCD(MColumnDescriptor oldCD, Query query) {
+    boolean success = false;
+    List<MStorageDescriptor> sds = null;
+    try {
+      openTransaction();

Review comment:
       In my understanding transactions are usually used when one has CREATE, UPDATE or DELETE statements and one wants to have the atomic behavior, that is, either commit everything or commit nothing. Consider all queries we have here
   
   1. Query for postgres
   
           SELECT * FROM "SDS" "A0" WHERE "A0"."CD_ID" = $1 limit 1)
   
   2. Query for all DBs except postgres
   
           select count(1) from org.apache.hadoop.hive.metastore.model.MStorageDescriptor where (this.cd == inCD)
   
   3. Constraint deletion
   
           pm.deletePersistentAll(mConstraintsList);
   
   4. CD deletion
   
           pm.deletePersistent(oldCD);
   
   IMHO we need transaction only for items 3 and 4 and it should look this way:
   
           1. SELECT * FROM "SDS" "A0" WHERE "A0"."CD_ID" = $1 limit 1)
           2. select count(1) from org.apache.hadoop.hive.metastore.model.MStorageDescriptor where (this.cd == inCD)
           openTransaction()
           3. pm.deletePersistentAll(mConstraintsList);
           4. pm.deletePersistent(oldCD);
           commitTransaction()
   
   whilst the old implementation was
   
           openTransaction()        
           1. select count(1) from org.apache.hadoop.hive.metastore.model.MStorageDescriptor where (this.cd == inCD)        
           2. pm.deletePersistentAll(mConstraintsList);
           3. pm.deletePersistent(oldCD);
           commitTransaction()  
   
   which seems to me redundant. Make sense?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #2323: HIVE-21075 : Metastore: Drop partition performance downgrade with Pos…

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2323:
URL: https://github.com/apache/hive/pull/2323#discussion_r639647076



##########
File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/ObjectStore.java
##########
@@ -5218,31 +5218,55 @@ private void removeUnusedColumnDescriptor(MColumnDescriptor oldCD) {
       return;
     }
 
-    boolean success = false;
     Query query = null;
+    LOG.debug("execute removeUnusedColumnDescriptor");
+    DatabaseProduct dbProduct = DatabaseProduct.determineDatabaseProduct(MetaStoreDirectSql.getProductName(pm), conf);
 
-    try {
-      openTransaction();
-      LOG.debug("execute removeUnusedColumnDescriptor");
+    /**
+     * In order to workaround oracle not supporting limit statement caused performance issue, HIVE-9447 makes
+     * all the backend DB run select count(1) from SDS where SDS.CD_ID=? to check if the specific CD_ID is
+     * referenced in SDS table before drop a partition. This select count(1) statement does not scale well in
+     * Postgres, and there is no index for CD_ID column in SDS table.
+     * For a SDS table with with 1.5 million rows, select count(1) has average 700ms without index, while in
+     * 10-20ms with index. But the statement before
+     * HIVE-9447( SELECT * FROM "SDS" "A0" WHERE "A0"."CD_ID" = $1 limit 1) uses less than 10ms .
+     */
 
-      query = pm.newQuery("select count(1) from " +
-        "org.apache.hadoop.hive.metastore.model.MStorageDescriptor where (this.cd == inCD)");
+    if (dbProduct.isPOSTGRES()) {
+      query = pm.newQuery(MStorageDescriptor.class, "this.cd == inCD");
       query.declareParameters("MColumnDescriptor inCD");
-      long count = ((Long)query.execute(oldCD)).longValue();
-
+      List<MStorageDescriptor> referencedSDs = listStorageDescriptorsWithCD(oldCD, query);

Review comment:
       Is this really the fastest way to check if the `oldCD` is used?
   Since postgres supports `limit` we might want to use that here.
   Also my guess is that mysql could also use limit, so we might want to test the performance for different engines and use the appropriate query for them.
   
   What do you think?
   
   Thanks,
   Peter




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #2323: HIVE-21075 : Metastore: Drop partition performance downgrade with Pos…

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2323:
URL: https://github.com/apache/hive/pull/2323#discussion_r639647076



##########
File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/ObjectStore.java
##########
@@ -5218,31 +5218,55 @@ private void removeUnusedColumnDescriptor(MColumnDescriptor oldCD) {
       return;
     }
 
-    boolean success = false;
     Query query = null;
+    LOG.debug("execute removeUnusedColumnDescriptor");
+    DatabaseProduct dbProduct = DatabaseProduct.determineDatabaseProduct(MetaStoreDirectSql.getProductName(pm), conf);
 
-    try {
-      openTransaction();
-      LOG.debug("execute removeUnusedColumnDescriptor");
+    /**
+     * In order to workaround oracle not supporting limit statement caused performance issue, HIVE-9447 makes
+     * all the backend DB run select count(1) from SDS where SDS.CD_ID=? to check if the specific CD_ID is
+     * referenced in SDS table before drop a partition. This select count(1) statement does not scale well in
+     * Postgres, and there is no index for CD_ID column in SDS table.
+     * For a SDS table with with 1.5 million rows, select count(1) has average 700ms without index, while in
+     * 10-20ms with index. But the statement before
+     * HIVE-9447( SELECT * FROM "SDS" "A0" WHERE "A0"."CD_ID" = $1 limit 1) uses less than 10ms .
+     */
 
-      query = pm.newQuery("select count(1) from " +
-        "org.apache.hadoop.hive.metastore.model.MStorageDescriptor where (this.cd == inCD)");
+    if (dbProduct.isPOSTGRES()) {
+      query = pm.newQuery(MStorageDescriptor.class, "this.cd == inCD");
       query.declareParameters("MColumnDescriptor inCD");
-      long count = ((Long)query.execute(oldCD)).longValue();
-
+      List<MStorageDescriptor> referencedSDs = listStorageDescriptorsWithCD(oldCD, query);

Review comment:
       My guess is that mysql could also use limit, so we might want to test the performance for different engines and use the appropriate query for them.
   
   What do you think?
   
   Thanks,
   Peter




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on pull request #2323: HIVE-21075 : Metastore: Drop partition performance downgrade with Pos…

Posted by GitBox <gi...@apache.org>.
pvary commented on pull request #2323:
URL: https://github.com/apache/hive/pull/2323#issuecomment-849619348


   > So I can re-fix this way:
   > 
   > ```
   >     if (dbProduct.isPOSTGRES() || dbProduct.isMYSQL()) {
   >     // use limit clause here
   >     }
   > ```
   > 
   > to make it work faster on MySQL.
   
   Sounds good to me.
   
   Thanks for the investigation!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #2323: HIVE-21075 : Metastore: Drop partition performance downgrade with Pos…

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2323:
URL: https://github.com/apache/hive/pull/2323#discussion_r640364312



##########
File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/ObjectStore.java
##########
@@ -5269,6 +5293,32 @@ private void preDropStorageDescriptor(MStorageDescriptor msd) {
     removeUnusedColumnDescriptor(mcd);
   }
 
+  /**
+   * Get a list of storage descriptors that reference a particular Column Descriptor
+   * @param oldCD the column descriptor to get storage descriptors for
+   * @return a list of storage descriptors
+   */
+  private List<MStorageDescriptor> listStorageDescriptorsWithCD(MColumnDescriptor oldCD, Query query) {
+    boolean success = false;
+    List<MStorageDescriptor> sds = null;
+    try {
+      openTransaction();

Review comment:
       Transactions could be used also to prevent other queries modifying the data which was read by another transaction depending on the isolation level used (for SERIALIZABLE isolation level). [One good description](https://www.sqlservercentral.com/articles/isolation-levels-in-sql-server)
   
   If there are concurrent transactions that could be an issue:
   - T1 reads the data - decides that the CD is not used anymore
   - T2 inserts a new row
   - T1 drops the CD
   
   The previous version of the code read the data inside the transaction, we can change the behavior if we decide so, but that has to be a conscious decision.
   
   IMHO this is all theoretical since if I remember correctly the `openTransaction()` is called before this method call, so in our case the new `openTranscation()` call just increments a counter.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] belugabehr commented on pull request #2323: HIVE-21075 : Metastore: Drop partition performance downgrade with Pos…

Posted by GitBox <gi...@apache.org>.
belugabehr commented on pull request #2323:
URL: https://github.com/apache/hive/pull/2323#issuecomment-849725487


   Might this scenario be covered by `[NOT] EXISTS` Expressions?
   
   Find records where there does not exist a relationship.
   
   https://www.datanucleus.org/products/accessplatform_5_2/jpa/query.html


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] oleksiy-sayankin commented on pull request #2323: HIVE-21075 : Metastore: Drop partition performance downgrade with Pos…

Posted by GitBox <gi...@apache.org>.
oleksiy-sayankin commented on pull request #2323:
URL: https://github.com/apache/hive/pull/2323#issuecomment-849617289


    **MySQL**
    
   MySQL does support limit. See [here](https://www.mysqltutorial.org/mysql-limit.aspx)
   
   > SELECT 
   >     select_list
   > FROM
   >    table_name
   > LIMIT [offset,] row_count;
   
   **MS SQL**
   
   Does not support limit. See [here](https://stackoverflow.com/questions/603724/how-to-implement-limit-with-sql-server). There is an alternative
   
   > SELECT TOP 10 * FROM (SELECT TOP 20 FROM Table ORDER BY Id) ORDER BY Id DESC
   
   **ORACLE**
   
   Does not support limits. See [here](https://stackoverflow.com/questions/470542/how-do-i-limit-the-number-of-rows-returned-by-an-oracle-query-after-ordering). There is an alternative:
   
   > SELECT * 
   > FROM   sometable
   > ORDER BY name
   > OFFSET 20 ROWS FETCH NEXT 10 ROWS ONLY
   
   So I can re-fix this way:
   
           if (dbProduct.isPOSTGRES() || dbProduct.isMYSQL()) {
           // use limit clause here
           }
   
   to make it work faster on MySQL.
   
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on pull request #2323: HIVE-21075 : Metastore: Drop partition performance downgrade with Pos…

Posted by GitBox <gi...@apache.org>.
pvary commented on pull request #2323:
URL: https://github.com/apache/hive/pull/2323#issuecomment-849603523


   > Refactored according to the comments above.
   
   Thanks!
   
   Would it make sense to check which method is more performant with mysql / mssql db? I would guess that the query with `limit` performs better with mysql too. (not sure about mssql)
   
   Thanks,
   Peter


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] pvary commented on a change in pull request #2323: HIVE-21075 : Metastore: Drop partition performance downgrade with Pos…

Posted by GitBox <gi...@apache.org>.
pvary commented on a change in pull request #2323:
URL: https://github.com/apache/hive/pull/2323#discussion_r640453187



##########
File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/ObjectStore.java
##########
@@ -5269,6 +5293,32 @@ private void preDropStorageDescriptor(MStorageDescriptor msd) {
     removeUnusedColumnDescriptor(mcd);
   }
 
+  /**
+   * Get a list of storage descriptors that reference a particular Column Descriptor
+   * @param oldCD the column descriptor to get storage descriptors for
+   * @return a list of storage descriptors
+   */
+  private List<MStorageDescriptor> listStorageDescriptorsWithCD(MColumnDescriptor oldCD, Query query) {
+    boolean success = false;
+    List<MStorageDescriptor> sds = null;
+    try {
+      openTransaction();

Review comment:
       We the  goal is with `openTransaction()` and the `commitTransaction()` to allow embedding method calls and keeping / enforcing transactions around them.
   
   So if `listStorageDescriptorsWithCD()` is never called from non-transactional methods then adding `openTransaction()` and `commitTransaction()` is redundant. We wrap the calls into transactions if want to make sure to run the code in a transaction whenever it is called.
   
   Wrapping a read only query in a transaction does not really make sense. It is either good without a transaction, or it requires a transaction because of the wider context, and then it is the callers responsibility to handle the transaction open/close. **So you are right, we do not need the transaction for the select itself.**




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] oleksiy-sayankin commented on a change in pull request #2323: HIVE-21075 : Metastore: Drop partition performance downgrade with Pos…

Posted by GitBox <gi...@apache.org>.
oleksiy-sayankin commented on a change in pull request #2323:
URL: https://github.com/apache/hive/pull/2323#discussion_r640590281



##########
File path: standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/ObjectStore.java
##########
@@ -5218,31 +5218,55 @@ private void removeUnusedColumnDescriptor(MColumnDescriptor oldCD) {
       return;
     }
 
-    boolean success = false;
     Query query = null;
+    LOG.debug("execute removeUnusedColumnDescriptor");
+    DatabaseProduct dbProduct = DatabaseProduct.determineDatabaseProduct(MetaStoreDirectSql.getProductName(pm), conf);
 
-    try {
-      openTransaction();
-      LOG.debug("execute removeUnusedColumnDescriptor");
+    /**
+     * In order to workaround oracle not supporting limit statement caused performance issue, HIVE-9447 makes
+     * all the backend DB run select count(1) from SDS where SDS.CD_ID=? to check if the specific CD_ID is
+     * referenced in SDS table before drop a partition. This select count(1) statement does not scale well in
+     * Postgres, and there is no index for CD_ID column in SDS table.
+     * For a SDS table with with 1.5 million rows, select count(1) has average 700ms without index, while in
+     * 10-20ms with index. But the statement before
+     * HIVE-9447( SELECT * FROM "SDS" "A0" WHERE "A0"."CD_ID" = $1 limit 1) uses less than 10ms .
+     */
 
-      query = pm.newQuery("select count(1) from " +
-        "org.apache.hadoop.hive.metastore.model.MStorageDescriptor where (this.cd == inCD)");
+    if (dbProduct.isPOSTGRES()) {
+      query = pm.newQuery(MStorageDescriptor.class, "this.cd == inCD");
       query.declareParameters("MColumnDescriptor inCD");
-      long count = ((Long)query.execute(oldCD)).longValue();
-
+      List<MStorageDescriptor> referencedSDs = listStorageDescriptorsWithCD(oldCD, query);

Review comment:
       I have found [this](https://stackoverflow.com/questions/14667580/why-is-the-limit-clause-not-supported-in-sql-server). They say
   
   > MS SQL Server has TOP (n) clause. Example from SQL Server
   > SELECT TOP 5 * FROM Table
   
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org