Posted to issues@phoenix.apache.org by BinShi-SecularBird <gi...@git.apache.org> on 2018/09/17 19:58:50 UTC

[GitHub] phoenix pull request #351: PHOENIX-4008: UPDATE STATISTIC should run raw sca...

GitHub user BinShi-SecularBird opened a pull request:

    https://github.com/apache/phoenix/pull/351

    PHOENIX-4008: UPDATE STATISTIC should run raw scan with all versions of cells.

    @twdsilva @karanmehta93 
    
    The test should fail at the assertion shown below, but it didn't. Debugging shows that even though the "UPDATE STATISTICS" command sets the flag to read all versions of cells and passes it to the scanner, the scanner still returns only the latest version of each cell. Could you help check whether the change I made and the test case are correct? I'm currently debugging the internals.
    
                // Because I updated the values of the row with primary key 100,
                // I expected the estimated bytes to now include two versions of the cells
                // and the following assertion to fail, but the assertion actually succeeds.
                assertEquals((Long) 30L, info.getEstimatedBytes());
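
    As a side note, here is a minimal debugging sketch (assuming the HBase 2.x client API; the helper name is hypothetical and not part of this patch) to check, with a raw scan directly against HBase and independent of the Phoenix stats code path, how many cell versions are actually stored for the row:

        import java.io.IOException;
        import org.apache.hadoop.hbase.TableName;
        import org.apache.hadoop.hbase.client.Connection;
        import org.apache.hadoop.hbase.client.Result;
        import org.apache.hadoop.hbase.client.ResultScanner;
        import org.apache.hadoop.hbase.client.Scan;
        import org.apache.hadoop.hbase.client.Table;

        // Hypothetical helper for debugging only: counts how many cells (all stored
        // versions, including delete markers) exist for a single row.
        static int countStoredCells(Connection hbaseConn, String tableName, byte[] rowKey)
                throws IOException {
            Scan scan = new Scan()
                    .withStartRow(rowKey)
                    .withStopRow(rowKey, true) // inclusive stop: scan exactly this row
                    .setRaw(true)              // also surface delete markers
                    .readAllVersions();        // lift the default of one version per cell
            int cells = 0;
            try (Table table = hbaseConn.getTable(TableName.valueOf(tableName));
                 ResultScanner scanner = table.getScanner(scan)) {
                for (Result result : scanner) {
                    cells += result.rawCells().length;
                }
            }
            return cells;
        }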

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/BinShi-SecularBird/phoenix PHOENIX-4008

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/phoenix/pull/351.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #351
    
----
commit f4b9f2c2eb6151806c6e655b7c01d5e33a212d14
Author: Bin <bs...@...>
Date:   2018-09-17T19:50:46Z

    PHOENIX-4008: UPDATE STATISTIC should run raw scan with all versions of cells.

----


---

[GitHub] phoenix issue #351: PHOENIX-4008: UPDATE STATISTIC should run raw scan with ...

Posted by karanmehta93 <gi...@git.apache.org>.
Github user karanmehta93 commented on the issue:

    https://github.com/apache/phoenix/pull/351
  
    Patch looks good overall! I have a few nits, please address them and it should be good to go!


---

[GitHub] phoenix pull request #351: PHOENIX-4008: UPDATE STATISTIC should run raw sca...

Posted by karanmehta93 <gi...@git.apache.org>.
Github user karanmehta93 commented on a diff in the pull request:

    https://github.com/apache/phoenix/pull/351#discussion_r218565923
  
    --- Diff: phoenix-core/src/it/java/org/apache/phoenix/schema/stats/StatsCollectorIT.java ---
    @@ -38,6 +38,7 @@
     import java.util.Properties;
     import java.util.Random;
     
    +import com.google.common.collect.Lists;
    --- End diff --
    
    nit: unused import


---

[GitHub] phoenix pull request #351: PHOENIX-4008: UPDATE STATISTIC should run raw sca...

Posted by twdsilva <gi...@git.apache.org>.
Github user twdsilva commented on a diff in the pull request:

    https://github.com/apache/phoenix/pull/351#discussion_r218604090
  
    --- Diff: phoenix-core/src/main/java/org/apache/phoenix/schema/MetaDataClient.java ---
    @@ -1279,6 +1279,7 @@ private long updateStatisticsInternal(PName physicalName, PTable logicalTable, M
                 MutationPlan plan = compiler.compile(Collections.singletonList(tableRef), null, cfs, null, clientTimeStamp);
                 Scan scan = plan.getContext().getScan();
                 scan.setCacheBlocks(false);
    +            scan.readAllVersions();
    --- End diff --
    
    setRaw() will allow you to read all versions, even deleted cells.


---

[GitHub] phoenix pull request #351: PHOENIX-4008: UPDATE STATISTIC should run raw sca...

Posted by BinShi-SecularBird <gi...@git.apache.org>.
Github user BinShi-SecularBird commented on a diff in the pull request:

    https://github.com/apache/phoenix/pull/351#discussion_r219249172
  
    --- Diff: phoenix-core/src/it/java/org/apache/phoenix/schema/stats/StatsCollectorIT.java ---
    @@ -736,6 +737,71 @@ public void testEmptyGuidePostGeneratedWhenDataSizeLessThanGPWidth() throws Exce
             }
         }
     
    +    @Test
    +    public void testCollectingAllVersionsOfCells() throws Exception {
    +        String tableName = generateUniqueName();
    +        try (Connection conn = DriverManager.getConnection(getUrl())) {
    +            long guidePostWidth = 70;
    +            String ddl =
    +                    "CREATE TABLE " + tableName + " (k INTEGER PRIMARY KEY, c1.a bigint, c2.b bigint)"
    +                            + " GUIDE_POSTS_WIDTH=" + guidePostWidth
    +                            + ", USE_STATS_FOR_PARALLELIZATION=true" + ", VERSIONS=3";
    +            conn.createStatement().execute(ddl);
    +            conn.createStatement().execute("upsert into " + tableName + " values (100,100,3)");
    +            conn.commit();
    +            conn.createStatement().execute("UPDATE STATISTICS " + tableName);
    +
    +            ConnectionQueryServices queryServices =
    +                    conn.unwrap(PhoenixConnection.class).getQueryServices();
    +
    +            // The table only has one row. All cells have just one version, and the data size of the row
    +            // is less than the guide post width, so we generate an empty guide post.
    +            try (Table statsHTable =
    --- End diff --
    
    refactored


---

[GitHub] phoenix pull request #351: PHOENIX-4008: UPDATE STATISTIC should run raw sca...

Posted by karanmehta93 <gi...@git.apache.org>.
Github user karanmehta93 commented on a diff in the pull request:

    https://github.com/apache/phoenix/pull/351#discussion_r218635042
  
    --- Diff: phoenix-core/src/it/java/org/apache/phoenix/schema/stats/StatsCollectorIT.java ---
    @@ -38,6 +38,7 @@
     import java.util.Properties;
     import java.util.Random;
     
    +import com.google.common.collect.Lists;
    --- End diff --
    
    @twdsilva 
    I had a discussion with @BinShi-SecularBird offline. The scope of this JIRA is limited to making sure that `UPDATE STATISTICS` runs with all versions of the cells (not the deleted versions, just the actual number of allowed versions), since this is currently a feature gap (and actually a functional issue).
    We will make sure that the title of the JIRA is updated accordingly, and a new JIRA will be filed to discuss whether deleted rows need to be handled by the `UPDATE STATISTICS` SQL or not.


---

[GitHub] phoenix pull request #351: PHOENIX-4008: UPDATE STATISTIC should run raw sca...

Posted by BinShi-SecularBird <gi...@git.apache.org>.
Github user BinShi-SecularBird commented on a diff in the pull request:

    https://github.com/apache/phoenix/pull/351#discussion_r218903831
  
    --- Diff: phoenix-core/src/main/java/org/apache/phoenix/schema/MetaDataClient.java ---
    @@ -1279,6 +1279,7 @@ private long updateStatisticsInternal(PName physicalName, PTable logicalTable, M
                 MutationPlan plan = compiler.compile(Collections.singletonList(tableRef), null, cfs, null, clientTimeStamp);
                 Scan scan = plan.getContext().getScan();
                 scan.setCacheBlocks(false);
    +            scan.readAllVersions();
    --- End diff --
    
    BTW, TABLE SAMPLING(n) means running the query over n percent of the rows, and including multiple versions of cells in the stats doesn't change the count of rows, so the two shouldn't conflict, theoretically. For example, a table with 1,000 rows sampled at 10% should still target roughly 100 rows no matter how many versions each cell carries. Even if we want to reduce the time complexity from O(n) to O(m), where n is the number of rows in the table and m is the number of guide posts (or parallel scans), the sampling algorithm should adhere to this basic fact.


---

[GitHub] phoenix pull request #351: PHOENIX-4008: UPDATE STATISTIC should run raw sca...

Posted by karanmehta93 <gi...@git.apache.org>.
Github user karanmehta93 commented on a diff in the pull request:

    https://github.com/apache/phoenix/pull/351#discussion_r218652492
  
    --- Diff: phoenix-core/src/it/java/org/apache/phoenix/schema/stats/StatsCollectorIT.java ---
    @@ -736,6 +737,71 @@ public void testEmptyGuidePostGeneratedWhenDataSizeLessThanGPWidth() throws Exce
             }
         }
     
    +    @Test
    +    public void testCollectingAllVersionsOfCells() throws Exception {
    +        String tableName = generateUniqueName();
    +        try (Connection conn = DriverManager.getConnection(getUrl())) {
    +            long guidePostWidth = 70;
    +            String ddl =
    +                    "CREATE TABLE " + tableName + " (k INTEGER PRIMARY KEY, c1.a bigint, c2.b bigint)"
    +                            + " GUIDE_POSTS_WIDTH=" + guidePostWidth
    +                            + ", USE_STATS_FOR_PARALLELIZATION=true" + ", VERSIONS=3";
    +            conn.createStatement().execute(ddl);
    +            conn.createStatement().execute("upsert into " + tableName + " values (100,100,3)");
    +            conn.commit();
    +            conn.createStatement().execute("UPDATE STATISTICS " + tableName);
    +
    +            ConnectionQueryServices queryServices =
    +                    conn.unwrap(PhoenixConnection.class).getQueryServices();
    +
    +            // The table only has one row. All cells have just one version, and the data size of the row
    +            // is less than the guide post width, so we generate an empty guide post.
    +            try (Table statsHTable =
    --- End diff --
    
    nit: Extract out similar looking pieces of test code into a method


---

[GitHub] phoenix pull request #351: PHOENIX-4008: UPDATE STATISTIC should run raw sca...

Posted by twdsilva <gi...@git.apache.org>.
Github user twdsilva commented on a diff in the pull request:

    https://github.com/apache/phoenix/pull/351#discussion_r218604650
  
    --- Diff: phoenix-core/src/it/java/org/apache/phoenix/schema/stats/StatsCollectorIT.java ---
    @@ -38,6 +38,7 @@
     import java.util.Properties;
     import java.util.Random;
     
    +import com.google.common.collect.Lists;
    --- End diff --
    
    You should also add a test where you add a few rows and then delete them, call UPDATE STATISTICS with and without including deleted cells, and verify that the guide post width takes the deleted cells into account correctly.
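
    A rough sketch of such a test (intended to drop into StatsCollectorIT, which already provides generateUniqueName() and getUrl(); the final check is left as a hypothetical placeholder because how deleted cells should be counted is still open):

        @Test
        public void testStatsAfterDeletes() throws Exception {
            String tableName = generateUniqueName();
            try (Connection conn = DriverManager.getConnection(getUrl())) {
                conn.createStatement().execute("CREATE TABLE " + tableName
                        + " (k INTEGER PRIMARY KEY, v BIGINT)"
                        + " GUIDE_POSTS_WIDTH=70, VERSIONS=3");
                for (int i = 0; i < 5; i++) {
                    conn.createStatement().execute(
                            "UPSERT INTO " + tableName + " VALUES (" + i + ", " + i + ")");
                }
                conn.commit();
                // Delete some rows; the delete markers stay until major compaction.
                conn.createStatement().execute("DELETE FROM " + tableName + " WHERE k < 3");
                conn.commit();
                conn.createStatement().execute("UPDATE STATISTICS " + tableName);
                // Hypothetical check: once the follow-up JIRA decides how deleted cells
                // are counted, compare the estimated bytes collected with and without
                // delete markers included, e.g.
                // assertTrue(estimatedBytesWithDeletes >= estimatedBytesWithoutDeletes);
            }
        }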


---

[GitHub] phoenix pull request #351: PHOENIX-4008: UPDATE STATISTIC should run raw sca...

Posted by BinShi-SecularBird <gi...@git.apache.org>.
Github user BinShi-SecularBird commented on a diff in the pull request:

    https://github.com/apache/phoenix/pull/351#discussion_r219249109
  
    --- Diff: phoenix-core/src/it/java/org/apache/phoenix/schema/stats/StatsCollectorIT.java ---
    @@ -38,6 +38,7 @@
     import java.util.Properties;
     import java.util.Random;
     
    +import com.google.common.collect.Lists;
    --- End diff --
    
    removed unused import


---

[GitHub] phoenix pull request #351: PHOENIX-4008: UPDATE STATISTIC should run raw sca...

Posted by BinShi-SecularBird <gi...@git.apache.org>.
Github user BinShi-SecularBird commented on a diff in the pull request:

    https://github.com/apache/phoenix/pull/351#discussion_r218897162
  
    --- Diff: phoenix-core/src/main/java/org/apache/phoenix/schema/MetaDataClient.java ---
    @@ -1279,6 +1279,7 @@ private long updateStatisticsInternal(PName physicalName, PTable logicalTable, M
                 MutationPlan plan = compiler.compile(Collections.singletonList(tableRef), null, cfs, null, clientTimeStamp);
                 Scan scan = plan.getContext().getScan();
                 scan.setCacheBlocks(false);
    +            scan.readAllVersions();
    --- End diff --
    
    @twdsilva, @karanmehta93, could you please confirm that "setRaw() will provide all versions of cells"? I used setRaw(true) without the readAllVersions() API and debugged the internals; it returns only the deleted rows and the latest version of each cell, which matches the comment on Scan.setRaw() in the code.
    
    Now let's discuss the relationship between table sampling and the two cases: counting multiple versions in the stats, and including deleted rows in the stats.
    1. Counting multiple versions in the stats.
        Yes, it will make the table sampling result inaccurate given the current implementation of table sampling. The current implementation (see BaseResultIterators.getParallelScan(), which calls sampleScans(...) at the end of the function) iterates over all of the generated parallel scans and, for each scan, picks it if the hash code of the scan's start row key is less than the table sampling rate (see TableSamplerPredicate.java), otherwise discards it. The algorithm assumes that the ranges between two consecutive guide posts contain equal numbers of rows. Even without counting multiple versions in the stats, we know this assumption makes the sampling result inaccurate. The right table sampling algorithm should be based on the count of rows in each guide post; with that algorithm, counting multiple versions in the stats has no conflict with table sampling. (A simplified sketch of this pick/discard step appears at the end of this comment.)
    2. Counting deleted rows in the stats.
        We discussed this in https://issues.apache.org/jira/browse/PHOENIX-4008. What I propose now is that we call setRaw(true) to collect all deleted rows, which contribute only to the estimated size of a guide post and contribute nothing to its estimated row count. This solution has no conflict with table sampling, and query complexity estimation based on the estimated size stays accurate.
    
    Regarding "the select query will return the latest version ... then it would mean that we probably don't need this PR at all", I don't think so. It doesn't matter whether the select query returns the latest version or multiple versions; the important thing is that we include multiple versions of cells so that the estimated size of a guide post is more accurate for query complexity estimation and for the cost of a scan.
     



---

[GitHub] phoenix issue #351: PHOENIX-4008: UPDATE STATISTIC should collect all versio...

Posted by BinShi-SecularBird <gi...@git.apache.org>.
Github user BinShi-SecularBird commented on the issue:

    https://github.com/apache/phoenix/pull/351
  
    I checked the following two test failures; they fail even in my clean local repository (freshly cloned from master, without any local changes). Any suggestions?
    
    Test Result (2 failures / ±0)
    org.apache.phoenix.end2end.ConcurrentMutationsIT.testRowLockDuringPreBatchMutateWhenIndexed
    org.apache.phoenix.end2end.ConcurrentMutationsIT.testLockUntilMVCCAdvanced


---

[GitHub] phoenix pull request #351: PHOENIX-4008: UPDATE STATISTIC should run raw sca...

Posted by karanmehta93 <gi...@git.apache.org>.
Github user karanmehta93 commented on a diff in the pull request:

    https://github.com/apache/phoenix/pull/351#discussion_r218668144
  
    --- Diff: phoenix-core/src/main/java/org/apache/phoenix/schema/MetaDataClient.java ---
    @@ -1279,6 +1279,7 @@ private long updateStatisticsInternal(PName physicalName, PTable logicalTable, M
                 MutationPlan plan = compiler.compile(Collections.singletonList(tableRef), null, cfs, null, clientTimeStamp);
                 Scan scan = plan.getContext().getScan();
                 scan.setCacheBlocks(false);
    +            scan.readAllVersions();
    --- End diff --
    
    Then it would mean that we probably don't need this PR at all.


---

[GitHub] phoenix pull request #351: PHOENIX-4008: UPDATE STATISTIC should run raw sca...

Posted by twdsilva <gi...@git.apache.org>.
Github user twdsilva commented on a diff in the pull request:

    https://github.com/apache/phoenix/pull/351#discussion_r218663493
  
    --- Diff: phoenix-core/src/main/java/org/apache/phoenix/schema/MetaDataClient.java ---
    @@ -1279,6 +1279,7 @@ private long updateStatisticsInternal(PName physicalName, PTable logicalTable, M
                 MutationPlan plan = compiler.compile(Collections.singletonList(tableRef), null, cfs, null, clientTimeStamp);
                 Scan scan = plan.getContext().getScan();
                 scan.setCacheBlocks(false);
    +            scan.readAllVersions();
    --- End diff --
    
    scan.readAllVersions() isn't used when we run a query with table sampling. If you have 100 versions of a row and run a query, you will only see the latest one, or, if an SCN is set, you will see the latest row as of the timestamp just before the SCN. If the guideposts are calculated using all the versions, then sampling will be incorrect.


---

[GitHub] phoenix pull request #351: PHOENIX-4008: UPDATE STATISTIC should run raw sca...

Posted by karanmehta93 <gi...@git.apache.org>.
Github user karanmehta93 commented on a diff in the pull request:

    https://github.com/apache/phoenix/pull/351#discussion_r218230917
  
    --- Diff: phoenix-core/src/main/java/org/apache/phoenix/schema/MetaDataClient.java ---
    @@ -1279,6 +1279,7 @@ private long updateStatisticsInternal(PName physicalName, PTable logicalTable, M
                 MutationPlan plan = compiler.compile(Collections.singletonList(tableRef), null, cfs, null, clientTimeStamp);
                 Scan scan = plan.getContext().getScan();
                 scan.setCacheBlocks(false);
    +            scan.readAllVersions();
    --- End diff --
    
    I think the API you are looking for is https://github.com/apache/hbase/blob/master/hbase-client/src/main/java/org/apache/hadoop/hbase/client/Scan.java#L1017


---

[GitHub] phoenix pull request #351: PHOENIX-4008: UPDATE STATISTIC should run raw sca...

Posted by karanmehta93 <gi...@git.apache.org>.
Github user karanmehta93 commented on a diff in the pull request:

    https://github.com/apache/phoenix/pull/351#discussion_r218662932
  
    --- Diff: phoenix-core/src/main/java/org/apache/phoenix/schema/MetaDataClient.java ---
    @@ -1279,6 +1279,7 @@ private long updateStatisticsInternal(PName physicalName, PTable logicalTable, M
                 MutationPlan plan = compiler.compile(Collections.singletonList(tableRef), null, cfs, null, clientTimeStamp);
                 Scan scan = plan.getContext().getScan();
                 scan.setCacheBlocks(false);
    +            scan.readAllVersions();
    --- End diff --
    
    If the table is configured to have multiple versions as part of its schema, then it is functionally correct to consider all of them anyway (for table sampling as well as for update stats). `scan.readAllVersions()` provides the multiple versions of a cell that are configured at the HBase level; these are the cells that won't be removed by major compaction. This method returns all versions, whereas the default is just 1. It does `not` provide deleted versions of cells.
    `scan.setRaw()` provides all the versions, even the ones that are deleted. I was earlier suggesting that @BinShi-SecularBird use it, but within the scope of this JIRA we will not, since the problem we are trying to solve is different. Bin will file a separate JIRA to address the deleted cells issue. A minimal sketch of the distinction follows below.
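
    For illustration (assuming the HBase 2.x client API; this is not code from the patch):

        import org.apache.hadoop.hbase.client.Scan;

        // Illustrative only, not code from this PR.
        static Scan configureStatsScan(boolean includeDeleted) {
            Scan scan = new Scan();
            // Return every stored, non-deleted version of each cell (up to what the
            // column family's VERSIONS setting retains) instead of only the latest.
            scan.readAllVersions();
            if (includeDeleted) {
                // A raw scan additionally surfaces delete markers and deleted cells
                // not yet purged by major compaction; deferred to the follow-up JIRA.
                scan.setRaw(true);
            }
            return scan;
        }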


---

[GitHub] phoenix pull request #351: PHOENIX-4008: UPDATE STATISTIC should run raw sca...

Posted by karanmehta93 <gi...@git.apache.org>.
Github user karanmehta93 commented on a diff in the pull request:

    https://github.com/apache/phoenix/pull/351#discussion_r218663829
  
    --- Diff: phoenix-core/src/main/java/org/apache/phoenix/schema/MetaDataClient.java ---
    @@ -1279,6 +1279,7 @@ private long updateStatisticsInternal(PName physicalName, PTable logicalTable, M
                 MutationPlan plan = compiler.compile(Collections.singletonList(tableRef), null, cfs, null, clientTimeStamp);
                 Scan scan = plan.getContext().getScan();
                 scan.setCacheBlocks(false);
    +            scan.readAllVersions();
    --- End diff --
    
    Ohh that is strange. Does that mean even if we use `versions` while creating the table, the scan or select query would just return the latest version?


---

[GitHub] phoenix pull request #351: PHOENIX-4008: UPDATE STATISTIC should run raw sca...

Posted by twdsilva <gi...@git.apache.org>.
Github user twdsilva commented on a diff in the pull request:

    https://github.com/apache/phoenix/pull/351#discussion_r218665392
  
    --- Diff: phoenix-core/src/main/java/org/apache/phoenix/schema/MetaDataClient.java ---
    @@ -1279,6 +1279,7 @@ private long updateStatisticsInternal(PName physicalName, PTable logicalTable, M
                 MutationPlan plan = compiler.compile(Collections.singletonList(tableRef), null, cfs, null, clientTimeStamp);
                 Scan scan = plan.getContext().getScan();
                 scan.setCacheBlocks(false);
    +            scan.readAllVersions();
    --- End diff --
    
    The select query will return the latest version. See this JIRA, which proposes support for returning multiple versions: https://issues.apache.org/jira/browse/PHOENIX-590


---

[GitHub] phoenix pull request #351: PHOENIX-4008: UPDATE STATISTIC should run raw sca...

Posted by twdsilva <gi...@git.apache.org>.
Github user twdsilva commented on a diff in the pull request:

    https://github.com/apache/phoenix/pull/351#discussion_r218656332
  
    --- Diff: phoenix-core/src/main/java/org/apache/phoenix/schema/MetaDataClient.java ---
    @@ -1279,6 +1279,7 @@ private long updateStatisticsInternal(PName physicalName, PTable logicalTable, M
                 MutationPlan plan = compiler.compile(Collections.singletonList(tableRef), null, cfs, null, clientTimeStamp);
                 Scan scan = plan.getContext().getScan();
                 scan.setCacheBlocks(false);
    +            scan.readAllVersions();
    --- End diff --
    
    How is this going to work with the table sampling feature? When we query, we only see the latest version of a cell, so if UPDATE STATISTICS is run for a table which has rows with multiple versions, won't the sampling be off?


---