Posted to issues@tajo.apache.org by "Hyunsik Choi (JIRA)" <ji...@apache.org> on 2014/10/17 10:54:33 UTC

[jira] [Comment Edited] (TAJO-1091) (Umbrella) Improve selectivity estimation by histograms

    [ https://issues.apache.org/jira/browse/TAJO-1091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174841#comment-14174841 ] 

Hyunsik Choi edited comment on TAJO-1091 at 10/17/14 8:54 AM:
--------------------------------------------------------------

Hi [~mvhlong],

Welcome to the Tajo community. This reply may be a bit late :)

Anyway, I'm very happy to discuss this with you all. I've left inline comments below.

bq. 1) which stats we should collect:   IMO, the ones that you mentioned in TAJO-1120 (histograms, # distinct values, # NULL values,...) are quite enough.

Yes, you are right. Those stats are enough. If so, we need to design how to keep them in our catalog system. For that, we also need to design the in-memory data structures and the catalog schema for them. In my opinion, this work should be done before the other tasks.
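
For illustration, a minimal sketch of such an in-memory structure could look like the following. All names here are hypothetical, not existing Tajo catalog classes; it is just a starting point for the schema discussion.

{code:java}
// Hypothetical per-column stats holder; field names are illustrative only.
public class ColumnStatsSketch {
  private String columnName;
  private long nullCount;        // number of NULL values
  private long distinctCount;    // (estimated) number of distinct values
  private double min;            // would be typed per column in a real design
  private double max;
  private double[] bucketBounds; // histogram bucket boundaries (n + 1 entries)
  private long[] bucketCounts;   // histogram bucket frequencies (n entries)
  // getters/setters and (de)serialization to the catalog store are omitted
}
{code}

Each field would map to a column in a catalog table keyed by (table id, column id), so the optimizer can look up stats per column.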

bq. 2) how we collect stats:   Also in TAJO-1120, Jihoon plans to collect the stats when storing a table. This approach is good if it can be done quickly and does not require much CPU and memory.

According to my experience, this approach is very limited in practice. Storing tables already involves a lot of network traffic, I/O overhead, and CPU overhead caused by compression and transformation. In addition, some stats types (histograms and the number of distinct values) require a sort or a large amount of memory. Only a few stats, like min/max and the number of rows, can be computed with little overhead.
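
To make the distinction concrete, here is a rough sketch of the kind of stats that can be piggybacked on a table write cheaply. It is single-pass and constant-memory, which is exactly what histograms and exact distinct counts are not. The class name is made up for illustration.

{code:java}
// Sketch: stats collectable during a write with negligible overhead.
public class CheapWriteStats {
  private long rowCount = 0;
  private double min = Double.POSITIVE_INFINITY;
  private double max = Double.NEGATIVE_INFINITY;

  // Called once per written value of a numeric column.
  public void observe(double value) {
    rowCount++;
    if (value < min) min = value;
    if (value > max) max = value;
  }
}
{code}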

bq. In addition to this approach, I suggest another one as follows: + Support a query to enforce statistics collection (for instance, "ANALYZE TABLE ..."). The users are recommended to run this query every day, or after the tables' data has changed to some extent.

With this approach, a user can choose when Tajo collects statistics. It is the most practical approach for now. If we implement only one way to collect stats, this one should come first, in my opinion.

bq. + A background process in Tajo automatically "run" the above query to collect statistics for all tables (or all important tables) when there is no or very low workload. For example, a few hours after midnight is a good time.

This is a variation of 'ANALYZE TABLE'. It can be achieved by scheduling 'ANALYZE TABLE' as a background job, for example as sketched below.
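
A minimal sketch of such a scheduler follows, assuming some runAnalyze() hook that submits "ANALYZE TABLE <name>"; the class and method names are placeholders, not existing Tajo APIs.

{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch: run ANALYZE for the registered tables once a day in a background job.
public class AnalyzeScheduler {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  // initialDelayHours lets the first run land in an off-peak window,
  // e.g. a few hours after midnight.
  public void start(Iterable<String> tables, long initialDelayHours) {
    scheduler.scheduleAtFixedRate(() -> {
      for (String table : tables) {
        runAnalyze(table); // e.g. submit "ANALYZE TABLE " + table
      }
    }, initialDelayHours, 24, TimeUnit.HOURS);
  }

  private void runAnalyze(String table) { /* submit the query to the master */ }
}
{code}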

bq. Stats of a big table may be collected in a distributed and partially incremental manner. A big data sample is split into a number of small-enough samples, each of which is sent to a worker. The stats are collected separately in different workers, then merged into a complete version.

I agree. We should mostly use incremental collection and merge approaches.
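
For equi-width histograms, the merge step is trivial, as in the sketch below, because we can fix the bucket boundaries up front (e.g. from min/max) and then just add the per-worker counts. Equi-depth histograms are harder to merge, since each worker would compute different boundaries. The method below is illustrative only.

{code:java}
import java.util.List;

public final class HistogramMerger {
  // Sketch: merging equi-width histograms computed independently on workers.
  // Assumes every worker used the same bucket boundaries, so the merge is an
  // element-wise addition of bucket counts.
  public static long[] merge(List<long[]> workerHistograms) {
    long[] merged = null;
    for (long[] h : workerHistograms) {
      if (merged == null) {
        merged = h.clone();
      } else {
        for (int i = 0; i < merged.length; i++) {
          merged[i] += h[i];
        }
      }
    }
    return merged;
  }
}
{code}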

In sum, the collection methods above can all be categorized as static stats collection. In addition, I'd like to propose runtime stats collection: we can also collect stats from running queries at runtime. The runtime-collected stats can be used for (1) improving the already-collected stats, (2) reoptimizing the remaining plan fragments, or (3) adjusting the degree of parallelism or the number of shuffle outputs at runtime. Actually, (3) is already used in the current Tajo. I usually call (2) *progressive optimization*.

I think that runtime stats collection is very important for big data. In Hadoop-scale systems, unlike OLTP or OLAP systems, a suboptimal query can take hours instead of minutes. Estimation alone still has the possibility of producing a suboptimal plan. So, we have to avoid the worst plans by reoptimizing the remaining plan fragments with runtime-collected stats. There are real research issues here, so it would be interesting.
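
As a concrete illustration of (3), the decision could look like the sketch below, which derives the number of shuffle outputs from the observed intermediate data size instead of a planning-time estimate. The 256 MB target is an arbitrary number for illustration, not a Tajo configuration value.

{code:java}
// Sketch: choose the shuffle output number from runtime-observed sizes.
public final class ShuffleSizer {
  private static final long TARGET_PARTITION_BYTES = 256L * 1024 * 1024;

  public static int decideShuffleOutputNum(long observedIntermediateBytes) {
    int num = (int) Math.ceil(
        (double) observedIntermediateBytes / TARGET_PARTITION_BYTES);
    return Math.max(num, 1); // always at least one partition
  }
}
{code}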

bq. 3) how/when Tajo exploits the collected stats:   At least, whenever the query planner needs to estimate the selectivity of a predicate, the stats can be used.

Yes, you are right. We can do selectivity estimation. In addition, the query optimizer can know or estimate the number of group-by keys, the intermediate data size, and the join output size. As you already know, this will help the current query optimizer find the best plan. Here, we may need to design how the optimizer retrieves the desired stats from the catalog. That design will include the interface, the catalog stats schema, and the catalog stats index.
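
To make the estimation side concrete, a range-predicate estimate over an equi-depth histogram could look like the sketch below. It assumes each bucket holds roughly the same number of rows and that values are uniform within a bucket; the names are made up for illustration.

{code:java}
// Sketch: selectivity of "col <= v" from an equi-depth histogram.
// bounds has numBuckets + 1 entries; bucket i spans [bounds[i], bounds[i+1]).
public final class EquiDepthEstimator {
  public static double estimateLessEqual(double[] bounds, double v) {
    int numBuckets = bounds.length - 1;
    if (v < bounds[0]) return 0.0;
    if (v >= bounds[numBuckets]) return 1.0;
    double covered = 0.0;
    for (int i = 0; i < numBuckets; i++) {
      if (v >= bounds[i + 1]) {
        covered += 1.0; // bucket entirely below v
      } else {
        // assume uniformity inside the bucket containing v
        covered += (v - bounds[i]) / (bounds[i + 1] - bounds[i]);
        break;
      }
    }
    return covered / numBuckets; // each bucket holds ~1/numBuckets of the rows
  }
}
{code}

This is exactly the place where the 0.1 DEFAULT_SELECTION_FACTOR would be replaced whenever a histogram is available.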

After we hear more opinions from others, we will design the overall parts discussed here and make a detailed roadmap, including a schedule. Then, we will create subtasks.


> (Umbrella) Improve selectivity estimation by histograms
> -------------------------------------------------------
>
>                 Key: TAJO-1091
>                 URL: https://issues.apache.org/jira/browse/TAJO-1091
>             Project: Tajo
>          Issue Type: Improvement
>            Reporter: Mai Hai Thanh
>
> Accurate selectivity estimates are very useful for the query optimizer to choose the best query plan. However, Tajo currently assumes that the selectivity of all single predicates is 0.1 (DEFAULT_SELECTION_FACTOR), which is far from correct in many cases. Therefore, I want to improve Tajo's selectivity estimation.
> There are alternative methods for selectivity estimation, such as histograms, wavelet transformation, singular value decomposition, discrete cosine transform, kernel estimators, and sampling. Among these methods, histograms have been shown to be one of the most popular and effective ways to obtain accurate selectivity estimates. Thus, I propose to implement and use histograms in Tajo first. Other methods can be added later if needed. (An ensemble method that combines the estimates of several of the above methods to obtain the best estimate is also attractive.)
> For histograms, more technically, there are some tasks to do as follows.
> 1. Implement a selectivity estimation interface
>     + For adding a new histogram
>     + For the use of a histogram in the query optimizer
>     + In doing this, we should consider the use of different kinds of histograms as well as different kinds of selectivity estimation methods
> 2. Implement some popular kinds of histograms
>     + First candidates are EquiWidth and EquiDepth (i.e., EquiHeight)
>     + Including: histogram construction, maintenance, and selectivity estimation given a predicate (both simple and complex predicates)
> 3. Implement the registration of histogram information to Catalog Server
>     + Histograms can be a sub-part of a TableStats
>     + The content of a histogram is stored in the Catalog Server
> 4. Implement a histogram creation and maintenance mechanism
>     + Constructing a histogram for a column requires at least a column scan (a full scan or a random partial scan), so it may take a substantial amount of time. By default, histogram creation and maintenance can be triggered after a default period during which there is no query processing. On the other hand, the creation and maintenance process can also be forced to run by a command, such as "ANALYZE table_name"
> 5. Improve the query optimizer to use histograms
>     + If a histogram is available, use the histogram instead of the default value
> 6. Further future improvements
>     6.1. A table may have many columns. Creating and maintaining histograms for all columns of all tables is too costly. Hence, we should monitor the frequently accessed columns and create histograms only for those columns.
>     6.2. For simplicity, the construction and maintenance processes of a histogram are independent of those processes of other histograms. However, for efficiency, those processes of different histograms of the same table should be done at the same time to exploit cache locality.
>     6.3. So far, we have considered only one-dimensional histograms (in which EquiWidth and EquiDepth are representatives). These histograms are enough for us if the value distributions of all attributes are independent of each other. However, this is not always true in real-life data. So, additional techniques to detect and exploit dependency of the attributes will be useful.
>     6.4. The selectivities of past predicates (from past queries) should be used in some way to improve future selectivity estimation.
>     6.5. For each column, we should find the list of Most Common Values and compute their selectivities. This kind of information will be very useful when used together with histograms.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)