You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@tajo.apache.org by mvhlong <gi...@git.apache.org> on 2014/10/14 03:32:38 UTC

[GitHub] tajo pull request: TAJO-1112: Implement histogram interface and a ...

GitHub user mvhlong opened a pull request:

    https://github.com/apache/tajo/pull/200

    TAJO-1112: Implement histogram interface and a candidate histogram

    Hi everyone,
    This patch contains:
    + a histogram interface with utility functions, including selectivity estimation
    + 2 candidate histograms: equi-width and equi-depth
    + some unit tests for integrity and accuracy of the histograms
    
    In the accuracy tests, given a 100k data set and a 10k random sample (10% of the data set), the estimation accuracy is about 80% - 95%, for both random data of uniform and Gaussian distributions. Histogram construction time (just consider the first construction time, without cache effect) is about 15 ms.
    
    Please review and advice me if anything should be improved. Sincerely!

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mvhlong/tajo TAJO-1112-new

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tajo/pull/200.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #200
    
----
commit 91b2cb7ab9d60020990250cd390e277f3f1d074a
Author: mvhlong <mv...@gmail.com>
Date:   2014-10-14T00:57:02Z

    internal histogram support

commit f40ec9d9d330da67aba30face5bae07366bfb444
Author: mvhlong <mv...@gmail.com>
Date:   2014-10-14T01:08:01Z

    fix minor bug

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-1112: Implement histogram interface and a ...

Posted by jihoonson <gi...@git.apache.org>.
Github user jihoonson commented on the pull request:

    https://github.com/apache/tajo/pull/200#issuecomment-59472317
  
    @mvhlong and @hyunsik, thanks for your great discussion.
    
    On using Datum, I agree with @mvhlong. IMO, a simple way is good for the first implementation. We can improve it if any performance problems are found. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-1112: Implement histogram interface and a ...

Posted by hyunsik <gi...@git.apache.org>.
Github user hyunsik commented on the pull request:

    https://github.com/apache/tajo/pull/200#issuecomment-59323696
  
    Your patch is a very nice starting point for histogram in Apache Tajo. The patch looks good to me. Additionally, I leave some comments.
    
    As you mentioned, precision is less important in sampling approaches. In many cases, using double may be a right solution.  But, when it comes to value range representation, Double still has limitation.
    
    in many cases, Text value can be easily longer than the representation range of DOUBLE or LONG. I already fixed many bugs of sort bugs caused by long text value (up to 256 bytes). For ETL workloads, it is usual. Also, can Double represent an entire Long range? We need to check it. In my opinion, we need to use four value types: Long, Double, Text, and Byte [].
    
    Also, I'll start other discussions in TAJO-1091 (https://issues.apache.org/jira/browse/TAJO-1091).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-1112: Implement histogram interface and a ...

Posted by jihoonson <gi...@git.apache.org>.
Github user jihoonson commented on the pull request:

    https://github.com/apache/tajo/pull/200#issuecomment-59176888
  
    @mvhlong sorry for confusing.
    I first intended that using NumericDatum will be useful for the later histogram extension.
    However, you are right. Double can contain a broad range of numeric types, and TEXT can be mapped to a numeric value. I missed it.
    In addition, the histogram is stored in only the catalog, so we don't need to use Datum.
    Thanks for your explanation!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-1112: Implement histogram interface and a ...

Posted by jihoonson <gi...@git.apache.org>.
Github user jihoonson commented on a diff in the pull request:

    https://github.com/apache/tajo/pull/200#discussion_r18819484
  
    --- Diff: tajo-catalog/tajo-catalog-common/src/main/java/org/apache/tajo/catalog/statistics/Histogram.java ---
    @@ -0,0 +1,238 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.tajo.catalog.statistics;
    +
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +import org.apache.tajo.catalog.json.CatalogGsonHelper;
    +import org.apache.tajo.catalog.proto.CatalogProtos;
    +import org.apache.tajo.common.ProtoObject;
    +import org.apache.tajo.json.GsonObject;
    +import org.apache.tajo.util.TUtil;
    +
    +import com.google.common.base.Objects;
    +import com.google.gson.annotations.Expose;
    +
    +public class Histogram implements ProtoObject<CatalogProtos.HistogramProto>, Cloneable, GsonObject {
    +  
    +  @Expose protected Long lastAnalyzed = null; // optional
    --- End diff --
    
    Would you add an explanation?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-1112: Implement histogram interface and a ...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/tajo/pull/200


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-1112: Implement histogram interface and a ...

Posted by jihoonson <gi...@git.apache.org>.
Github user jihoonson commented on a diff in the pull request:

    https://github.com/apache/tajo/pull/200#discussion_r18819683
  
    --- Diff: tajo-catalog/tajo-catalog-common/src/main/java/org/apache/tajo/catalog/statistics/Histogram.java ---
    @@ -0,0 +1,238 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.tajo.catalog.statistics;
    +
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +import org.apache.tajo.catalog.json.CatalogGsonHelper;
    +import org.apache.tajo.catalog.proto.CatalogProtos;
    +import org.apache.tajo.common.ProtoObject;
    +import org.apache.tajo.json.GsonObject;
    +import org.apache.tajo.util.TUtil;
    +
    +import com.google.common.base.Objects;
    +import com.google.gson.annotations.Expose;
    +
    +public class Histogram implements ProtoObject<CatalogProtos.HistogramProto>, Cloneable, GsonObject {
    +  
    +  @Expose protected Long lastAnalyzed = null; // optional
    +  @Expose protected List<HistogramBucket> buckets = null; // repeated
    +  @Expose protected boolean isReady; // whether this histogram is ready to be used for selectivity estimation
    +  protected int DEFAULT_MAX_BUCKETS = 100; // Same as PostgreSQL
    +
    +  public Histogram() {
    +    buckets = TUtil.newList();
    +    isReady = false;
    +  }
    +  
    +  public Histogram(CatalogProtos.HistogramProto proto) {
    +    if (proto.hasLastAnalyzed()) {
    +      this.lastAnalyzed = proto.getLastAnalyzed();
    +    }    
    +    buckets = TUtil.newList();
    +    for (CatalogProtos.HistogramBucketProto bucketProto : proto.getBucketsList()) {
    +      this.buckets.add(new HistogramBucket(bucketProto));
    +    }
    +    isReady = true;
    +  }
    +  
    +  public Long getLastAnalyzed() {
    +    return this.lastAnalyzed;
    +  }
    +  
    +  public void setLastAnalyzed(Long lastAnalyzed) {
    +    this.lastAnalyzed = lastAnalyzed;
    +  }
    +  
    +  public List<HistogramBucket> getBuckets() {
    +    return this.buckets;
    +  }
    +
    +  public void setBuckets(List<HistogramBucket> buckets) {
    +    this.buckets = new ArrayList<HistogramBucket>(buckets);
    +  }
    +
    +  public void addBucket(HistogramBucket bucket) {
    +    this.buckets.add(bucket);
    +  }
    +
    +  public int getBucketsCount() {
    +    return this.buckets.size();
    +  }
    +
    +  public boolean getIsReady() {
    +    return this.isReady;
    +  }
    +  
    +  public void setIsReady(boolean isReady) {
    +    this.isReady = isReady;
    +  }
    +  
    +  public boolean equals(Object obj) {
    +    if (obj instanceof Histogram) {
    +      Histogram other = (Histogram) obj;
    +      return getLastAnalyzed().equals(other.getLastAnalyzed())
    +          && TUtil.checkEquals(this.buckets, other.buckets);
    +    } else {
    +      return false;
    +    }
    +  }
    +
    +  public int hashCode() {
    +    return Objects.hashCode(this.lastAnalyzed, this.buckets);
    +  }
    +
    +  public Histogram clone() throws CloneNotSupportedException {
    +    Histogram hist = (Histogram) super.clone();
    +    hist.lastAnalyzed = this.lastAnalyzed;
    +    hist.buckets = new ArrayList<HistogramBucket>(this.buckets);
    +    hist.isReady = this.isReady;
    +    return hist;
    +  }
    +
    +  public String toString() {
    +    return CatalogGsonHelper.getPrettyInstance().toJson(this, Histogram.class);
    +  }
    +
    +  @Override
    +  public String toJson() {
    +    return CatalogGsonHelper.toJson(this, Histogram.class);
    +  }
    +
    +
    +  @Override
    +  public CatalogProtos.HistogramProto getProto() {
    +    CatalogProtos.HistogramProto.Builder builder = CatalogProtos.HistogramProto.newBuilder();
    --- End diff --
    
    It seems that the 'isReady' variable is missing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-1112: Implement histogram interface and a ...

Posted by jihoonson <gi...@git.apache.org>.
Github user jihoonson commented on a diff in the pull request:

    https://github.com/apache/tajo/pull/200#discussion_r18820375
  
    --- Diff: tajo-catalog/tajo-catalog-common/src/main/java/org/apache/tajo/catalog/statistics/Histogram.java ---
    @@ -0,0 +1,238 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.tajo.catalog.statistics;
    +
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +import org.apache.tajo.catalog.json.CatalogGsonHelper;
    +import org.apache.tajo.catalog.proto.CatalogProtos;
    +import org.apache.tajo.common.ProtoObject;
    +import org.apache.tajo.json.GsonObject;
    +import org.apache.tajo.util.TUtil;
    +
    +import com.google.common.base.Objects;
    +import com.google.gson.annotations.Expose;
    +
    +public class Histogram implements ProtoObject<CatalogProtos.HistogramProto>, Cloneable, GsonObject {
    +  
    +  @Expose protected Long lastAnalyzed = null; // optional
    +  @Expose protected List<HistogramBucket> buckets = null; // repeated
    +  @Expose protected boolean isReady; // whether this histogram is ready to be used for selectivity estimation
    +  protected int DEFAULT_MAX_BUCKETS = 100; // Same as PostgreSQL
    +
    +  public Histogram() {
    +    buckets = TUtil.newList();
    +    isReady = false;
    +  }
    +  
    +  public Histogram(CatalogProtos.HistogramProto proto) {
    +    if (proto.hasLastAnalyzed()) {
    +      this.lastAnalyzed = proto.getLastAnalyzed();
    +    }    
    +    buckets = TUtil.newList();
    +    for (CatalogProtos.HistogramBucketProto bucketProto : proto.getBucketsList()) {
    +      this.buckets.add(new HistogramBucket(bucketProto));
    +    }
    +    isReady = true;
    +  }
    +  
    +  public Long getLastAnalyzed() {
    +    return this.lastAnalyzed;
    +  }
    +  
    +  public void setLastAnalyzed(Long lastAnalyzed) {
    +    this.lastAnalyzed = lastAnalyzed;
    +  }
    +  
    +  public List<HistogramBucket> getBuckets() {
    +    return this.buckets;
    +  }
    +
    +  public void setBuckets(List<HistogramBucket> buckets) {
    +    this.buckets = new ArrayList<HistogramBucket>(buckets);
    +  }
    +
    +  public void addBucket(HistogramBucket bucket) {
    +    this.buckets.add(bucket);
    +  }
    +
    +  public int getBucketsCount() {
    +    return this.buckets.size();
    +  }
    +
    +  public boolean getIsReady() {
    +    return this.isReady;
    +  }
    +  
    +  public void setIsReady(boolean isReady) {
    +    this.isReady = isReady;
    +  }
    +  
    +  public boolean equals(Object obj) {
    +    if (obj instanceof Histogram) {
    +      Histogram other = (Histogram) obj;
    +      return getLastAnalyzed().equals(other.getLastAnalyzed())
    +          && TUtil.checkEquals(this.buckets, other.buckets);
    +    } else {
    +      return false;
    +    }
    +  }
    +
    +  public int hashCode() {
    +    return Objects.hashCode(this.lastAnalyzed, this.buckets);
    +  }
    +
    +  public Histogram clone() throws CloneNotSupportedException {
    +    Histogram hist = (Histogram) super.clone();
    +    hist.lastAnalyzed = this.lastAnalyzed;
    +    hist.buckets = new ArrayList<HistogramBucket>(this.buckets);
    +    hist.isReady = this.isReady;
    +    return hist;
    +  }
    +
    +  public String toString() {
    +    return CatalogGsonHelper.getPrettyInstance().toJson(this, Histogram.class);
    +  }
    +
    +  @Override
    +  public String toJson() {
    +    return CatalogGsonHelper.toJson(this, Histogram.class);
    +  }
    +
    +
    +  @Override
    +  public CatalogProtos.HistogramProto getProto() {
    +    CatalogProtos.HistogramProto.Builder builder = CatalogProtos.HistogramProto.newBuilder();
    +    if (this.lastAnalyzed != null) {
    +      builder.setLastAnalyzed(this.lastAnalyzed);
    +    }
    +    if (this.buckets != null) {
    +      for (HistogramBucket bucket : buckets) {
    +        builder.addBuckets(bucket.getProto());
    +      }
    +    }
    +    return builder.build();
    +  }
    +
    +  /**
    +   * Construct a histogram. Compute the number of buckets and the min, max, frequency values for each of them. This
    +   * method must be overridden by specific sub-classes. The number of buckets should be less than or equal to the
    +   * sample size.
    +   * 
    +   * @param samples
    +   *          Sample data points to construct the histogram. This collection should fit in the memory and Null values
    +   *          should never appear in it.
    +   * @return Return true if the computation is done without any problem. Otherwise, return false
    +   */
    +  public boolean construct(List<Double> samples) {
    --- End diff --
    
    Do you have any reasons for restricting the type of samples as double? 
    In Tajo, Datum is the basic class to contain values.
    IMO, NumericDatum will be more proper type rather than Double.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-1112: Implement histogram interface and a ...

Posted by jihoonson <gi...@git.apache.org>.
Github user jihoonson commented on a diff in the pull request:

    https://github.com/apache/tajo/pull/200#discussion_r18821699
  
    --- Diff: tajo-catalog/tajo-catalog-common/src/main/java/org/apache/tajo/catalog/statistics/EquiDepthHistogram.java ---
    @@ -0,0 +1,95 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.tajo.catalog.statistics;
    +
    +import java.util.ArrayList;
    +import java.util.Collections;
    +import java.util.List;
    +
    +import org.apache.tajo.catalog.proto.CatalogProtos.HistogramProto;
    +import org.apache.tajo.util.TUtil;
    +
    +public class EquiDepthHistogram extends Histogram {
    +
    +  public EquiDepthHistogram() {
    +    super();
    +  }
    +
    +  public EquiDepthHistogram(HistogramProto proto) {
    +    super(proto);
    +  }
    +
    +  @Override
    +  public boolean construct(List<Double> samples) {
    +    int numBuckets = samples.size() > DEFAULT_MAX_BUCKETS ? DEFAULT_MAX_BUCKETS : samples.size();
    +    return construct(samples, numBuckets);
    +  }
    +
    +  /**
    +   * An EquiDepth histogram is constructed by dividing the data points into buckets of equal frequencies (as equal as
    +   * possible).
    +   * 
    +   * Note that, this function, which allows the specification of the (maximum) number of buckets, should be called
    +   * directly only in the unit tests. In non-test cases, the construct(samples) version above should be used.
    +   */
    +  public boolean construct(List<Double> srcSamples, int numBuckets) {
    +    isReady = false;
    +    buckets = TUtil.newList();
    +
    +    ArrayList<Double> samples = new ArrayList<Double>(srcSamples); // sorted samples
    +    Collections.sort(samples);
    +
    +    int averageFrequency = Math.round((float) samples.size() / numBuckets);
    +    int bFrequency;
    +    int bMinIndex = 0, bMaxIndex;
    +    Double bMin, bMax;
    +
    +    for (int i = 0; i < numBuckets && bMinIndex < samples.size(); i++) {
    +      bMin = samples.get(bMinIndex);
    +      bFrequency = averageFrequency;
    +      bMaxIndex = bMinIndex + averageFrequency - 1;
    +
    +      if (bMaxIndex > samples.size() - 1) {
    +	bMaxIndex = samples.size() - 1;
    +	bFrequency = bMaxIndex - bMinIndex + 1;
    +      }
    +
    +      while (bMaxIndex + 1 < samples.size() && samples.get(bMaxIndex).equals(samples.get(bMaxIndex + 1))) {
    +	bMaxIndex++;
    +	bFrequency++;
    +      }
    +      bMax = samples.get(bMaxIndex);
    +      bMinIndex = bMaxIndex + 1;
    +      buckets.add(new HistogramBucket(bMin, bMax, bFrequency));
    +    }
    +
    +    if (bMinIndex < samples.size()) {
    +      int lastBucket = buckets.size() - 1;
    +      int additionalFrequency = samples.size() - bMinIndex;
    +      buckets.get(lastBucket).setFrequency(buckets.get(lastBucket).getFrequency() + additionalFrequency);
    +      buckets.get(lastBucket).setMax(samples.get(samples.size() - 1));
    +    }
    +
    +    samples.clear();
    +    isReady = true;
    +    lastAnalyzed = System.currentTimeMillis();
    --- End diff --
    
    You need to use org.apache.hadoop.yarn.util.Clock.
    You can refer SubQuery.setStartTime().


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-1112: Implement histogram interface and a ...

Posted by jihoonson <gi...@git.apache.org>.
Github user jihoonson commented on a diff in the pull request:

    https://github.com/apache/tajo/pull/200#discussion_r18822124
  
    --- Diff: tajo-catalog/tajo-catalog-common/src/main/java/org/apache/tajo/catalog/statistics/EquiWidthHistogram.java ---
    @@ -0,0 +1,92 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.tajo.catalog.statistics;
    +
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +import org.apache.tajo.catalog.proto.CatalogProtos.HistogramProto;
    +import org.apache.tajo.util.TUtil;
    +
    +public class EquiWidthHistogram extends Histogram {
    +
    +  public EquiWidthHistogram() {
    +    super();
    +  }
    +
    +  public EquiWidthHistogram(HistogramProto proto) {
    +    super(proto);
    +  }
    +
    +  @Override
    +  public boolean construct(List<Double> samples) {
    +    int numBuckets = samples.size() > DEFAULT_MAX_BUCKETS ? DEFAULT_MAX_BUCKETS : samples.size();
    +    return construct(samples, numBuckets);
    +  }
    +  
    +  /**
    +   * Originally, an EquiWidth histogram is constructed by dividing the entire value range of the data points into
    +   * buckets of equal sizes. Here, we construct it similarly, then improve it by removing empty buckets, to save disk
    +   * space and processing time, and by compacting the buckets' boundaries, to increase selectivity estimation accuracy.
    +   * 
    +   * Note that, this function, which allows the specification of the (maximum) number of buckets, should be called
    +   * directly only in the unit tests. In non-test cases, the construct(samples) version above should be used.
    +   */
    +  public boolean construct(List<Double> samples, int numBuckets) {
    +    isReady = false;
    +    buckets = TUtil.newList();
    +    Double globalMin = Double.MAX_VALUE;
    +    Double globalMax = -Double.MAX_VALUE;
    +    List<Double> bMinValues = new ArrayList<Double>(numBuckets);
    +    List<Double> bMaxValues = new ArrayList<Double>(numBuckets);
    +    List<Long> bFrequencies = new ArrayList<Long>(numBuckets);
    +
    +    for (Double p : samples) {
    +      if (p < globalMin) globalMin = p;
    +      if (p > globalMax) globalMax = p;
    +    }
    +    double bWidth = (globalMax - globalMin) / numBuckets;
    +
    +    for (int i = 0; i < numBuckets; i++) {
    +      bMinValues.add(Double.MAX_VALUE);
    +      bMaxValues.add(-Double.MAX_VALUE);
    +      bFrequencies.add(0l);
    +    }
    +
    +    for (Double p : samples) {
    +      int bIndex = (int) Math.round(Math.floor((p - globalMin) / bWidth));
    --- End diff --
    
    Do you have any reasons for conducting the round after floor?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-1112: Implement histogram interface and a ...

Posted by jihoonson <gi...@git.apache.org>.
Github user jihoonson commented on a diff in the pull request:

    https://github.com/apache/tajo/pull/200#discussion_r18819530
  
    --- Diff: tajo-catalog/tajo-catalog-common/src/main/java/org/apache/tajo/catalog/statistics/Histogram.java ---
    @@ -0,0 +1,238 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.tajo.catalog.statistics;
    +
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +import org.apache.tajo.catalog.json.CatalogGsonHelper;
    +import org.apache.tajo.catalog.proto.CatalogProtos;
    +import org.apache.tajo.common.ProtoObject;
    +import org.apache.tajo.json.GsonObject;
    +import org.apache.tajo.util.TUtil;
    +
    +import com.google.common.base.Objects;
    +import com.google.gson.annotations.Expose;
    +
    +public class Histogram implements ProtoObject<CatalogProtos.HistogramProto>, Cloneable, GsonObject {
    +  
    +  @Expose protected Long lastAnalyzed = null; // optional
    +  @Expose protected List<HistogramBucket> buckets = null; // repeated
    +  @Expose protected boolean isReady; // whether this histogram is ready to be used for selectivity estimation
    +  protected int DEFAULT_MAX_BUCKETS = 100; // Same as PostgreSQL
    +
    +  public Histogram() {
    +    buckets = TUtil.newList();
    +    isReady = false;
    +  }
    +  
    +  public Histogram(CatalogProtos.HistogramProto proto) {
    +    if (proto.hasLastAnalyzed()) {
    +      this.lastAnalyzed = proto.getLastAnalyzed();
    +    }    
    +    buckets = TUtil.newList();
    +    for (CatalogProtos.HistogramBucketProto bucketProto : proto.getBucketsList()) {
    +      this.buckets.add(new HistogramBucket(bucketProto));
    +    }
    +    isReady = true;
    +  }
    +  
    +  public Long getLastAnalyzed() {
    +    return this.lastAnalyzed;
    +  }
    +  
    +  public void setLastAnalyzed(Long lastAnalyzed) {
    +    this.lastAnalyzed = lastAnalyzed;
    +  }
    +  
    +  public List<HistogramBucket> getBuckets() {
    +    return this.buckets;
    +  }
    +
    +  public void setBuckets(List<HistogramBucket> buckets) {
    +    this.buckets = new ArrayList<HistogramBucket>(buckets);
    +  }
    +
    +  public void addBucket(HistogramBucket bucket) {
    +    this.buckets.add(bucket);
    +  }
    +
    +  public int getBucketsCount() {
    +    return this.buckets.size();
    +  }
    +
    +  public boolean getIsReady() {
    +    return this.isReady;
    +  }
    +  
    +  public void setIsReady(boolean isReady) {
    +    this.isReady = isReady;
    +  }
    +  
    +  public boolean equals(Object obj) {
    +    if (obj instanceof Histogram) {
    +      Histogram other = (Histogram) obj;
    +      return getLastAnalyzed().equals(other.getLastAnalyzed())
    --- End diff --
    
    It seems that the 'isReady' variable is missed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-1112: Implement histogram interface and a ...

Posted by jihoonson <gi...@git.apache.org>.
Github user jihoonson commented on the pull request:

    https://github.com/apache/tajo/pull/200#issuecomment-59030122
  
    This patch looks truly great!
    In my opinion, the proposed histogram interface looks nice.
    In addition, this patch contains plenty tests and comments.
    I've left some minor comments!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-1112: Implement histogram interface and a ...

Posted by mvhlong <gi...@git.apache.org>.
Github user mvhlong commented on the pull request:

    https://github.com/apache/tajo/pull/200#issuecomment-59173282
  
    I'm glad to discuss with you, guys.
    
    @jihoonson I am a little confusing. Your advice is to use NumericDatum (not Datum) instead of Double. In Tajo, NumericDatum represents INTx and FLOATx. I think that Double can cover a broader range of numeric values, not only int and float, but also boolean, bit, datetime, and char. Because the data is big, the sample data is big, too (if you take too small samples, the accuracy will be too low). Meanwhile, a table may contain hundreds of columns, each of which needs to construct a separate histogram. Hence, histogram construction time can be very long and we should alleviate this burden by using simple data type, such as Double instead of NumericDatum (or Datum). Data type conversion from Datum to Double can be done by a utility function.
    
    equi-width and equi-depth are simple histograms, thus obtain not-so-great estimation accuracy. When you want to improve the accuracy of selectivity estimation by implementing more complex histograms in the future (for example, multidimensional histograms - to catch the dependencies of data between different columns), you will see that histogram construction time is a real problem. So, it'd be better to keep it simple.
    
    For a fast extension, TEXT can be approximately mapped to a numerical value, too. For more complex types (including TEXT if you wish), in my opinion, Tajo needs a special treatment.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-1112: Implement histogram interface and a ...

Posted by hyunsik <gi...@git.apache.org>.
Github user hyunsik commented on the pull request:

    https://github.com/apache/tajo/pull/200#issuecomment-59467650
  
    Hi @mvhlong, 
    
    Why don't you make a separate histogram implementation for each type? We may need only three implementations for Long, Double, and byte []. They will cover all value types. I think that this approach would be good in practice.
    
    I have another question. Your proposed histogram implementation takes a list of Double values. Does this approach takes all values at a time? Otherwise, can we build a histogram in an incremental way?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-1112: Implement histogram interface and a ...

Posted by mvhlong <gi...@git.apache.org>.
Github user mvhlong commented on the pull request:

    https://github.com/apache/tajo/pull/200#issuecomment-59471438
  
    Hi @hyunsik and @jihoonson,
    I have just made an important update to my patch: change from using Double to Datum.
    In the source code, I throw Exceptions for non-numeric type because I do not want to mix too many changes at once and we need to update several important functions in Datum before other types can be supported. However, with this change, the histogram module can be extended to support other data types easily in the future.
    
    With the use of Datum, in contrast to my assumption, the histogram construction time is not slower in the tests. This is good for us.
    
    With a single histogram idea (for example, equi-width), I prefer to make only 1 implementation for all data types rather than to make 3 different versions for Long, Double, and byte[]. This makes the source code clean and easy for maintenance. Anyway, with the latest update, this problem has been solved.
    
    Currently, the histogram implementation takes a list of Datum values. You are right that it takes all values at a time. Building a histogram in an completely incremental way is difficult because many histograms require the sort of all data points. So, I think that we should build the histograms in a *partially* incremental way. More specifically, we first build many histograms with many different samples, then *merge* them. A function to merge the histogram will be implemented later.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-1112: Implement histogram interface and a ...

Posted by jihoonson <gi...@git.apache.org>.
Github user jihoonson commented on the pull request:

    https://github.com/apache/tajo/pull/200#issuecomment-59316651
  
    Welcome any discussion.
    I'll wait for your review.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-1112: Implement histogram interface and a ...

Posted by mvhlong <gi...@git.apache.org>.
Github user mvhlong commented on the pull request:

    https://github.com/apache/tajo/pull/200#issuecomment-59156973
  
    Thank @jihoonson,
    I added the "@VisibleForTesting" annotation and the explanation for "lastAnalyzed". I also removed the use of "round()" after "floor()" (it became redundant after a bug fix). However, I do not use "yarn.util.SystemClock.getTime()" instead of "System.currentTimeMillis()" because the source code of SystemClock.getTime() just contains a single call to System.currentTimeMillis(). The use of former function adds unnecessary package dependency while System.currentTimeMillis() is also used in many other places in Tajo source code. "isReady" is intentionally omitted somewhere because it is not related to the real content of a histogram. Finally, I try not to use NumericDatum since it's not necessary but increases processing time. As I understand, a Tajo member is trying to avoid the use of Datum in a Jira issue.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-1112: Implement histogram interface and a ...

Posted by hyunsik <gi...@git.apache.org>.
Github user hyunsik commented on the pull request:

    https://github.com/apache/tajo/pull/200#issuecomment-59313056
  
    I'm still reviewing this patch. I think that we need some discussion before committing it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-1112: Implement histogram interface and a ...

Posted by mvhlong <gi...@git.apache.org>.
Github user mvhlong commented on the pull request:

    https://github.com/apache/tajo/pull/200#issuecomment-59454847
  
    I'd like to explain again about the sampling and precision.
    It is best to create a histogram from the full column data. However, because it will take too long time, people often take a sample of the data and create a histogram based on this sample. When the sampling ratio is low, the histogram based on this sample will have low selectivity estimation accuracy. In contrast, when the sampling ratio is high, the histogram will have high estimation accuracy (samplingRatio = 100% means that we use the full column data). So, we will want to build a histogram based on a big sample (i.e., high sampling ratio). But, this may lead to a significant increase in histogram construction time. Consequently, we will want to reduce the time by using simple data structures instead of complex ones.
    
    I checked the classes Long and Double. Double cannot represent the entire value range of Long, thus a histogram built on Double cannot replace another one built for Long. This is my mistake in making a careless assumption.
    
    After reading your comments, I have thought a lot more about the supporting of other data types in histograms. Now, I think that I should use Datum since it is the only way to make a unified and clean implementation, although it takes longer processing time. In HistogramBucketProto, min and max will be changed from "double" to "bytes". ( @jihoonson please note that I changed my opinion about the use of Datum )


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-1112: Implement histogram interface and a ...

Posted by jihoonson <gi...@git.apache.org>.
Github user jihoonson commented on the pull request:

    https://github.com/apache/tajo/pull/200#issuecomment-59311684
  
    If Hyunsik and the other guys don't have any objections, I'll commit this patch.
    IMO, it will be better to proceed the histogram-related work in a new branch separated with the master branch.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-1112: Implement histogram interface and a ...

Posted by jihoonson <gi...@git.apache.org>.
Github user jihoonson commented on a diff in the pull request:

    https://github.com/apache/tajo/pull/200#discussion_r18821241
  
    --- Diff: tajo-catalog/tajo-catalog-common/src/main/java/org/apache/tajo/catalog/statistics/EquiDepthHistogram.java ---
    @@ -0,0 +1,95 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.tajo.catalog.statistics;
    +
    +import java.util.ArrayList;
    +import java.util.Collections;
    +import java.util.List;
    +
    +import org.apache.tajo.catalog.proto.CatalogProtos.HistogramProto;
    +import org.apache.tajo.util.TUtil;
    +
    +public class EquiDepthHistogram extends Histogram {
    +
    +  public EquiDepthHistogram() {
    +    super();
    +  }
    +
    +  public EquiDepthHistogram(HistogramProto proto) {
    +    super(proto);
    +  }
    +
    +  @Override
    +  public boolean construct(List<Double> samples) {
    +    int numBuckets = samples.size() > DEFAULT_MAX_BUCKETS ? DEFAULT_MAX_BUCKETS : samples.size();
    +    return construct(samples, numBuckets);
    +  }
    +
    +  /**
    +   * An EquiDepth histogram is constructed by dividing the data points into buckets of equal frequencies (as equal as
    +   * possible).
    +   * 
    +   * Note that, this function, which allows the specification of the (maximum) number of buckets, should be called
    +   * directly only in the unit tests. In non-test cases, the construct(samples) version above should be used.
    +   */
    --- End diff --
    
    To allow the function call only in unit tests, you can use the @VisibleForTesting annotation.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-1112: Implement histogram interface and a ...

Posted by jihoonson <gi...@git.apache.org>.
Github user jihoonson commented on the pull request:

    https://github.com/apache/tajo/pull/200#issuecomment-59166545
  
    Thanks for your comment!
    On using the system time function, you are right. Any unnecessary packaging dependencies must be avoided.
    
    On using Datum, I have a different opinion. I think that we need to extend the histogram later to support not only columns of the numeric type, but also columns of the text and more complex types. 
    However, I think the 'double' type is not sufficient to contain every type of data. 
    Even though you consider only the numeric types in this issue, we can easily extend the histogram interface later if it is designed for the more general type, Datum.
    What do you think about it?
    
    If any other guys have other opinions about this, please feel free to suggest!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-1112: Implement histogram interface and a ...

Posted by hyunsik <gi...@git.apache.org>.
Github user hyunsik commented on the pull request:

    https://github.com/apache/tajo/pull/200#issuecomment-59167141
  
    In several days, I was very busy due to business trip schedules. I'll give comments about the patch tomorrow. 
    
    In addition, I'd like to discuss the roadmap and future direction about statistic information. I'll start the discussion in TAJO-1091 soon.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-1112: Implement histogram interface and a ...

Posted by jihoonson <gi...@git.apache.org>.
Github user jihoonson commented on the pull request:

    https://github.com/apache/tajo/pull/200#issuecomment-58980445
  
    Great work!
    I'll review!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---