Posted to issues@trafodion.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/10/17 22:14:00 UTC

[jira] [Commented] (TRAFODION-3223) Row count estimation code works poorly on time-ordered aged-out data

    [ https://issues.apache.org/jira/browse/TRAFODION-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16654288#comment-16654288 ] 

ASF GitHub Bot commented on TRAFODION-3223:
-------------------------------------------

GitHub user DaveBirdsall opened a pull request:

    https://github.com/apache/trafodion/pull/1730

    [TRAFODION-3223] Don't scale down for non-Puts when estimating row counts

    The estimateRowCount code in HBaseClient.java tried to scale down row counts by the proportion of non-Put cells in the file. That is, it was trying to estimate row count from cell count, in part by discounting the effect of Delete tombstone cells. It was doing this on the basis of a sample of 500 rows in one HFile.
    
    We find, however, that with time-ordered data that is aged out, the Delete cells are not uniformly distributed but instead tend to clump in one place. If we are unlucky and get an HFile that begins with 500 Delete tombstones, we will incorrectly assume most of the table consists of deleted rows and drastically underestimate the number of rows.
    
    Drastically underestimating can be very bad. It is much better to overestimate. So the code that attempted to scale down row count based on the number of non-Put cells has been deleted. Also, if we find that the number of Puts in our sample is very small (< 50), we will instead ignore the sample and use the total number of entries.
    
    The changes described above are in HBaseClient.java.
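    As a rough sketch of the revised strategy (not the actual HBaseClient.java code; the class, method names, and the cells-per-row arithmetic below are illustrative, and only the "< 50 Puts" threshold comes from the description above):

    ```java
    public class RowCountEstimator {
        // Assumed from the description: a sample containing fewer than
        // 50 Put cells is considered unreliable and is ignored.
        static final int MIN_PUTS_IN_SAMPLE = 50;

        /**
         * Estimate the row count of an HFile.
         *
         * @param totalEntries total number of cells in the HFile
         * @param sampledRows  number of rows examined in the sample
         * @param sampledPuts  number of Put cells seen in those rows
         */
        static long estimateRows(long totalEntries, int sampledRows, int sampledPuts) {
            if (sampledRows == 0 || sampledPuts < MIN_PUTS_IN_SAMPLE) {
                // Too few Puts sampled (e.g. the sample landed on a clump
                // of Delete tombstones): ignore the sample and use the
                // total entry count, preferring to overestimate rather
                // than drastically underestimate.
                return totalEntries;
            }
            // Derive cells-per-row from Put cells only; there is no longer
            // any scaling down by the proportion of non-Put cells.
            double putsPerRow = Math.max((double) sampledPuts / sampledRows, 1.0);
            return (long) (totalEntries / putsPerRow);
        }
    }
    ```

    With this shape, an HFile whose sample happens to start with mostly Delete tombstones falls through to the `totalEntries` fallback instead of shrinking the estimate.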
    
    There are two other small, unrelated changes in this pull request as well:
    
    1. The regression test filter for filtering out SYSKEYs has been changed. The current minimum number of decimal digits in a SYSKEY is 15; the filter was assuming they were at least 16 digits. This could lead to regression failures if someone were very unlucky and got just the wrong Linux thread ID for their process.
    
    2. An uninitialized member of class ExRtFragTable is now initialized. This is a long-standing bug; the changes for pull request https://github.com/apache/trafodion/pull/1724 made it observable. For random parallel queries, the Executor GUI might come up at run time if the uninitialized value happened to be non-zero.
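    The SYSKEY filter change in item 1 above amounts to lowering a digit-count threshold. A hypothetical sketch of that idea (the actual regression filter, its pattern, and its replacement token are not shown in this message; everything here is illustrative):

    ```java
    import java.util.regex.Pattern;

    public class SyskeyFilter {
        // Match runs of 15 or more decimal digits, since a SYSKEY may now
        // have as few as 15 digits; a {16,} pattern would fail to mask a
        // 15-digit SYSKEY and cause a spurious diff in test output.
        private static final Pattern SYSKEY = Pattern.compile("\\b\\d{15,}\\b");

        // Replace each SYSKEY-like number with a fixed token so test
        // output comparisons are insensitive to the generated value.
        static String mask(String line) {
            return SYSKEY.matcher(line).replaceAll("<syskey>");
        }
    }
    ```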

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/DaveBirdsall/trafodion Trafodion3223

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/trafodion/pull/1730.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1730
    
----
commit 898812f84c510ab8798d5af6e3e63559f4078a07
Author: Dave Birdsall <db...@...>
Date:   2018-10-17T22:06:44Z

    [TRAFODION-3223] Don't scale down for non-Puts when estimating row counts

----


> Row count estimation code works poorly on time-ordered aged-out data
> --------------------------------------------------------------------
>
>                 Key: TRAFODION-3223
>                 URL: https://issues.apache.org/jira/browse/TRAFODION-3223
>             Project: Apache Trafodion
>          Issue Type: Bug
>          Components: sql-cmp
>    Affects Versions: any
>            Reporter: David Wayne Birdsall
>            Assignee: David Wayne Birdsall
>            Priority: Major
>
> The estimateRowCountBody method in module HBaseClient.java samples cells from the first 500 rows of the first HFile it sees in order to estimate the number of rows in a Trafodion table. If the table happens to have a time-ordered key, and data are aged out over time, we can get large clumps of "delete" tombstones in one or more HFiles. If estimateRowCountBody happens to look at such an HFile, it will incorrectly conclude that most cells are "delete" tombstones and therefore drastically underestimate the row count.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)