You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@tajo.apache.org by "Hyunsik Choi (JIRA)" <ji...@apache.org> on 2014/10/02 08:03:34 UTC
[jira] [Commented] (TAJO-1091) (Umbrella) Improve selectivity estimation by histograms

    [ https://issues.apache.org/jira/browse/TAJO-1091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156119#comment-14156119 ] 

Hyunsik Choi commented on TAJO-1091:
------------------------------------

++1 for this issue.

We also need histogram for distributed sort. Currently, a QueryMaster collects min-max values from all tasks and than the QueryMaster gets a merged min-max range from them. Next, a QueryMaster uses UniformRangePartition class to split one range space into uniform-sized sub ranges. But, this approach does not consider data volumes of sub ranges. As a result, this way is like to have data skewness problem, causing performance degradation. So, we also need histogram in order to split a range into subranges with evenly uniform data volume.

> (Umbrella) Improve selectivity estimation by histograms
> -------------------------------------------------------
>
>                 Key: TAJO-1091
>                 URL: https://issues.apache.org/jira/browse/TAJO-1091
>             Project: Tajo
>          Issue Type: Improvement
>            Reporter: Mai Hai Thanh
>
> Accurate selectivity estimations are very useful for the query optimizer to choose the best query plan. However, Tajo currently assumes that the selectivity of all single predicates is 0.1 (in DEFAULT_SELECTION_FACTOR), which is far from correct in many cases. Therefore, I want to improve Tajo's ability in selectivity estimation.
> There are alternative methods for selectivity estimation, such as histogram, wavelet transformation, singular value decomposition, discrete cosine transform, kernel estimators, and sampling. Among these methods, histograms have been shown to be one of the most popular and effective ways to obtain accurate selectivity estimates. Thus, I propose to implement and use histograms in Tajo soon. Other methods can be added later if needed. (An ensemble method that combines the results of many methods among the above ones to get the best selectivity estimate is also attractive)
> For histograms, more technically, there are some tasks to do as follows.
> 1. Implement a selectivity estimation interface
>     + For the adding of a new histogram
>     + For the use of a histogram in the query optimizer
>     + In doing this, should consider the use of different kinds of histograms as well as different kinds of selectivity estimation methods
> 2. Implement some popular kinds of histograms
>     + First candidates are EquiWidth and EquiDepth (i.e., EquiHeight)
>     + Including: histogram construction, maintenance, and selectivity estimation given a predicate (both simple and complex predicates)
> 3. Implement the registration of histogram information to Catalog Server
>     + Histograms can be a sub-part of a TableStats
>     + The content of a histogram is stored in a file (Question: should this file be stored only in TajoMaster ? or replicated in all TajoWorkers ?)
> 4. Implement a histogram creation and maintenance mechanism
>     + Constructing a histogram for a column requires at least a column scan (full or random partial scan), so it may take a substantial amount of time. By default, histogram creation and maintenance can be triggerred after a default amount of time that there is no query processing. On the other hand, the creation and maintenance process can also be forced to run by a command through the CLI
> 5. Improve the query optimizer to use histograms
>     + If a histogram is available, use histogram instead of the default value
> 6. Further future improvements
>     6.1. A table may have many columns. Creating and maintaining histograms for all columns of all tables is too costly. Hence, we should have a method to monitor the frequently accessed columns and we create histograms for only these special columns.
>     6.2. For simplicity, the construction and maintenance processes of a histogram are independent of those processes of other histograms. However, for efficiency, those processes of different histograms of the same table should be done at the same time to exploit cache locality.
>     6.3. So far, we have considered only one-dimensional histograms (in which EquiWidth and EquiDepth are representatives). These histograms are enough for us if the value distributions of all attributes are independent of each other. However, this is not always true in real-life data. So, additional techniques to detect and exploit dependency of the attributes will be useful.
>     6.4. The selectivity of past predicates (of past queries) should be use in some ways to improve the future selectivity estimation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)