You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@orc.apache.org by wgtmac <gi...@git.apache.org> on 2017/05/16 23:17:34 UTC

[GitHub] orc pull request #123: ORC-184: Refactor ColumnStatistics classes for writer

GitHub user wgtmac opened a pull request:

    https://github.com/apache/orc/pull/123

    ORC-184: Refactor ColumnStatistics classes for writer

    1. Added setters for ColumnStatisticsImpl;
    2. Added merge, update, increase, reset, etc. to ColumnStatisticsImpl;
    3. Supported comparision for Decimal types;
    4. Added UTs for Decimal and ColumnStatistics.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wgtmac/orc ORC-184

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/orc/pull/123.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #123
    
----
commit 116049bfe262a418302b787ca8ab0aba4d46bfda
Author: Gang Wu <ga...@alibaba-inc.com>
Date:   2017-05-16T23:15:38Z

    ORC-184: Refactor ColumnStatistics classes for writer
    
    1. Added setters for ColumnStatisticsImpl;
    2. Added merge, update, increase, reset, etc. to ColumnStatisticsImpl;
    3. Supported comparision for Decimal types;
    4. Added UTs for Decimal and ColumnStatistics.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] orc pull request #123: ORC-184: Refactor ColumnStatistics classes for writer

Posted by wgtmac <gi...@git.apache.org>.
Github user wgtmac commented on a diff in the pull request:

    https://github.com/apache/orc/pull/123#discussion_r117616712
  
    --- Diff: c++/src/Statistics.hh ---
    @@ -106,15 +106,50 @@ namespace orc {
     
         void setMinimum(T min) { _minimum = min; }
     
    -    // GET / SET valueCount_
    +    // GET / SET _valueCount
         uint64_t getNumberOfValues() const { return _valueCount; }
     
         void setNumberOfValues(uint64_t numValues) { _valueCount = numValues; }
     
    -    // GET / SET hasNullValue_
    +    // GET / SET _hasNullValue
         bool hasNull() const { return _hasNull; }
     
         void setHasNull(bool hasNull) { _hasNull = hasNull; }
    +
    +    void reset() {
    +      _hasNull = false;
    +      _hasMinimum = false;
    +      _hasMaximum = false;
    +      _hasSum = false;
    +      _hasTotalLength = false;
    +      _totalLength = 0;
    +      _valueCount = 0;
    +    }
    +
    +    // sum is not merged here as we need to check overflow
    +    void merge(const InternalStatisticsImpl& other) {
    +      _hasNull = _hasNull || other._hasNull;
    +      _valueCount += other._valueCount;
    +
    +      if (other._hasMinimum) {
    +        if (!_hasMinimum) {
    +          _hasMinimum = _hasMaximum = true;
    +          _minimum = other._minimum;
    +          _maximum = other._maximum;
    +        } else {
    +          // all template types should support operator<
    +          if (_maximum < other._maximum) {
    --- End diff --
    
    @majetideepak I agree with you for not putting opeator< into public API. The season to implement operator< is to simplify the logic here, otherwise we have to specialize InternalStatisticsImpl<Decimal>::merge(const InternalStatisticsImpl& other). Now there are two options:
    1. add Decimal::compare() and use the template specilization for Decimal;
    2. move opeator< of decimal here to simplify the merge logic.
    Which do you prefer or you have other ideas? Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] orc issue #123: ORC-184: Refactor ColumnStatistics classes for writer

Posted by wgtmac <gi...@git.apache.org>.
Github user wgtmac commented on the issue:

    https://github.com/apache/orc/pull/123
  
    @majetideepak how does this PR look to you?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] orc issue #123: ORC-184: Refactor ColumnStatistics classes for writer

Posted by wgtmac <gi...@git.apache.org>.
Github user wgtmac commented on the issue:

    https://github.com/apache/orc/pull/123
  
    @prasanthj  I've rebased it just now. Please retry. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] orc issue #123: ORC-184: Refactor ColumnStatistics classes for writer

Posted by prasanthj <gi...@git.apache.org>.
Github user prasanthj commented on the issue:

    https://github.com/apache/orc/pull/123
  
    I am seeing some merged conflicts with this PR. @wgtmac could you please rebase the patch after the recent commits?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] orc issue #123: ORC-184: Refactor ColumnStatistics classes for writer

Posted by majetideepak <gi...@git.apache.org>.
Github user majetideepak commented on the issue:

    https://github.com/apache/orc/pull/123
  
    @wgtmac  looks like you have to resolve a merge conflict.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] orc issue #123: ORC-184: Refactor ColumnStatistics classes for writer

Posted by wgtmac <gi...@git.apache.org>.
Github user wgtmac commented on the issue:

    https://github.com/apache/orc/pull/123
  
    @majetideepak Resolved. Can you retry? Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] orc issue #123: ORC-184: Refactor ColumnStatistics classes for writer

Posted by prasanthj <gi...@git.apache.org>.
Github user prasanthj commented on the issue:

    https://github.com/apache/orc/pull/123
  
    Merged with master. Thanks @wgtmac for the rebase and the contribution! Thanks @majetideepak for the reviews!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] orc pull request #123: ORC-184: Refactor ColumnStatistics classes for writer

Posted by wgtmac <gi...@git.apache.org>.
Github user wgtmac commented on a diff in the pull request:

    https://github.com/apache/orc/pull/123#discussion_r117397315
  
    --- Diff: c++/include/orc/Statistics.hh ---
    @@ -392,7 +392,14 @@ namespace orc {
         virtual uint32_t getNumberOfRowIndexStats(uint32_t columnId) const = 0;
       };
     
    -
    +  /**
    +   * Create ColumnStatistics for writers
    +   * @param type of column
    +   * @param enableStringComparison whether enable string columns comparision
    +   * @return ColumnStatistics instances
    +   */
    +  std::unique_ptr<ColumnStatistics> createColumnStatistics(
    --- End diff --
    
    Make sense. I will move it to src/Statistics.hh


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] orc issue #123: ORC-184: Refactor ColumnStatistics classes for writer

Posted by majetideepak <gi...@git.apache.org>.
Github user majetideepak commented on the issue:

    https://github.com/apache/orc/pull/123
  
    @wgtmac I will take a look at it in some time tonight. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] orc pull request #123: ORC-184: Refactor ColumnStatistics classes for writer

Posted by wgtmac <gi...@git.apache.org>.
Github user wgtmac commented on a diff in the pull request:

    https://github.com/apache/orc/pull/123#discussion_r117397662
  
    --- Diff: c++/include/orc/Vector.hh ---
    @@ -194,6 +194,14 @@ namespace orc {
         int32_t scale;
       };
     
    +  /**
    +   * Compares two decimals
    +   * @param decimal1 1st decimal to compare
    +   * @param decimal2 2nd decimal to compare
    +   * @return true is decimal1 is less than decimal2; otherwise false
    +   */
    +  bool operator<(Decimal decimal1, Decimal decimal2);
    --- End diff --
    
    Decimal and Int128 are in the public APIs as well. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] orc pull request #123: ORC-184: Refactor ColumnStatistics classes for writer

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/orc/pull/123


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] orc issue #123: ORC-184: Refactor ColumnStatistics classes for writer

Posted by majetideepak <gi...@git.apache.org>.
Github user majetideepak commented on the issue:

    https://github.com/apache/orc/pull/123
  
    @prasanthj  can you make a pass and merge this too? Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] orc pull request #123: ORC-184: Refactor ColumnStatistics classes for writer

Posted by majetideepak <gi...@git.apache.org>.
Github user majetideepak commented on a diff in the pull request:

    https://github.com/apache/orc/pull/123#discussion_r117631604
  
    --- Diff: c++/src/Statistics.hh ---
    @@ -106,15 +106,50 @@ namespace orc {
     
         void setMinimum(T min) { _minimum = min; }
     
    -    // GET / SET valueCount_
    +    // GET / SET _valueCount
         uint64_t getNumberOfValues() const { return _valueCount; }
     
         void setNumberOfValues(uint64_t numValues) { _valueCount = numValues; }
     
    -    // GET / SET hasNullValue_
    +    // GET / SET _hasNullValue
         bool hasNull() const { return _hasNull; }
     
         void setHasNull(bool hasNull) { _hasNull = hasNull; }
    +
    +    void reset() {
    +      _hasNull = false;
    +      _hasMinimum = false;
    +      _hasMaximum = false;
    +      _hasSum = false;
    +      _hasTotalLength = false;
    +      _totalLength = 0;
    +      _valueCount = 0;
    +    }
    +
    +    // sum is not merged here as we need to check overflow
    +    void merge(const InternalStatisticsImpl& other) {
    +      _hasNull = _hasNull || other._hasNull;
    +      _valueCount += other._valueCount;
    +
    +      if (other._hasMinimum) {
    +        if (!_hasMinimum) {
    +          _hasMinimum = _hasMaximum = true;
    +          _minimum = other._minimum;
    +          _maximum = other._maximum;
    +        } else {
    +          // all template types should support operator<
    +          if (_maximum < other._maximum) {
    --- End diff --
    
    The first option of using templates seems better to me.
    Can use `common.hh` and `common.cc` to put the following.
    ```
    // Return True if val1 < val2
    // Return False otherwise
    template <typename T>
    static inline bool Compare(const T& val1, const T& val2) {
        return val1 < val2
    }
    // Specialize for Decimal
    template <>
    static inline bool Compare(const Decimal& val1, const Decimal& val2) {
       ....
    }
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] orc pull request #123: ORC-184: Refactor ColumnStatistics classes for writer

Posted by wgtmac <gi...@git.apache.org>.
Github user wgtmac commented on a diff in the pull request:

    https://github.com/apache/orc/pull/123#discussion_r117649572
  
    --- Diff: c++/src/Statistics.hh ---
    @@ -106,15 +106,50 @@ namespace orc {
     
         void setMinimum(T min) { _minimum = min; }
     
    -    // GET / SET valueCount_
    +    // GET / SET _valueCount
         uint64_t getNumberOfValues() const { return _valueCount; }
     
         void setNumberOfValues(uint64_t numValues) { _valueCount = numValues; }
     
    -    // GET / SET hasNullValue_
    +    // GET / SET _hasNullValue
         bool hasNull() const { return _hasNull; }
     
         void setHasNull(bool hasNull) { _hasNull = hasNull; }
    +
    +    void reset() {
    +      _hasNull = false;
    +      _hasMinimum = false;
    +      _hasMaximum = false;
    +      _hasSum = false;
    +      _hasTotalLength = false;
    +      _totalLength = 0;
    +      _valueCount = 0;
    +    }
    +
    +    // sum is not merged here as we need to check overflow
    +    void merge(const InternalStatisticsImpl& other) {
    +      _hasNull = _hasNull || other._hasNull;
    +      _valueCount += other._valueCount;
    +
    +      if (other._hasMinimum) {
    +        if (!_hasMinimum) {
    +          _hasMinimum = _hasMaximum = true;
    +          _minimum = other._minimum;
    +          _maximum = other._maximum;
    +        } else {
    +          // all template types should support operator<
    +          if (_maximum < other._maximum) {
    --- End diff --
    
    If we declare the template function in the Common.hh and define it in Common.cc. Then we have to declare it for every type as the template function is instantiated in Common.o while used in Statistics.o. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] orc issue #123: ORC-184: Refactor ColumnStatistics classes for writer

Posted by wgtmac <gi...@git.apache.org>.
Github user wgtmac commented on the issue:

    https://github.com/apache/orc/pull/123
  
    @omalley I have created a new PR here for ORC-184 based on the current master branch. Please review it, thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] orc pull request #123: ORC-184: Refactor ColumnStatistics classes for writer

Posted by majetideepak <gi...@git.apache.org>.
Github user majetideepak commented on a diff in the pull request:

    https://github.com/apache/orc/pull/123#discussion_r117394503
  
    --- Diff: c++/include/orc/Statistics.hh ---
    @@ -392,7 +392,14 @@ namespace orc {
         virtual uint32_t getNumberOfRowIndexStats(uint32_t columnId) const = 0;
       };
     
    -
    +  /**
    +   * Create ColumnStatistics for writers
    +   * @param type of column
    +   * @param enableStringComparison whether enable string columns comparision
    +   * @return ColumnStatistics instances
    +   */
    +  std::unique_ptr<ColumnStatistics> createColumnStatistics(
    --- End diff --
    
    Is this required in the public API?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] orc pull request #123: ORC-184: Refactor ColumnStatistics classes for writer

Posted by wgtmac <gi...@git.apache.org>.
Github user wgtmac commented on a diff in the pull request:

    https://github.com/apache/orc/pull/123#discussion_r117658958
  
    --- Diff: c++/src/Statistics.hh ---
    @@ -106,15 +106,50 @@ namespace orc {
     
         void setMinimum(T min) { _minimum = min; }
     
    -    // GET / SET valueCount_
    +    // GET / SET _valueCount
         uint64_t getNumberOfValues() const { return _valueCount; }
     
         void setNumberOfValues(uint64_t numValues) { _valueCount = numValues; }
     
    -    // GET / SET hasNullValue_
    +    // GET / SET _hasNullValue
         bool hasNull() const { return _hasNull; }
     
         void setHasNull(bool hasNull) { _hasNull = hasNull; }
    +
    +    void reset() {
    +      _hasNull = false;
    +      _hasMinimum = false;
    +      _hasMaximum = false;
    +      _hasSum = false;
    +      _hasTotalLength = false;
    +      _totalLength = 0;
    +      _valueCount = 0;
    +    }
    +
    +    // sum is not merged here as we need to check overflow
    +    void merge(const InternalStatisticsImpl& other) {
    +      _hasNull = _hasNull || other._hasNull;
    +      _valueCount += other._valueCount;
    +
    +      if (other._hasMinimum) {
    +        if (!_hasMinimum) {
    +          _hasMinimum = _hasMaximum = true;
    +          _minimum = other._minimum;
    +          _maximum = other._maximum;
    +        } else {
    +          // all template types should support operator<
    +          if (_maximum < other._maximum) {
    --- End diff --
    
    I've fixed it and created a new PR.  I have removed "static" identifier as this template function is used in several modules including UTs. Please review it again. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] orc pull request #123: ORC-184: Refactor ColumnStatistics classes for writer

Posted by majetideepak <gi...@git.apache.org>.
Github user majetideepak commented on a diff in the pull request:

    https://github.com/apache/orc/pull/123#discussion_r117394950
  
    --- Diff: c++/include/orc/Vector.hh ---
    @@ -194,6 +194,14 @@ namespace orc {
         int32_t scale;
       };
     
    +  /**
    +   * Compares two decimals
    +   * @param decimal1 1st decimal to compare
    +   * @param decimal2 2nd decimal to compare
    +   * @return true is decimal1 is less than decimal2; otherwise false
    +   */
    +  bool operator<(Decimal decimal1, Decimal decimal2);
    --- End diff --
    
    This also should not belong to the public API


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] orc pull request #123: ORC-184: Refactor ColumnStatistics classes for writer

Posted by majetideepak <gi...@git.apache.org>.
Github user majetideepak commented on a diff in the pull request:

    https://github.com/apache/orc/pull/123#discussion_r117610190
  
    --- Diff: c++/include/orc/Vector.hh ---
    @@ -194,6 +194,14 @@ namespace orc {
         int32_t scale;
       };
     
    +  /**
    +   * Compares two decimals
    +   * @param decimal1 1st decimal to compare
    +   * @param decimal2 2nd decimal to compare
    +   * @return true is decimal1 is less than decimal2; otherwise false
    +   */
    +  bool operator<(Decimal decimal1, Decimal decimal2);
    --- End diff --
    
    Libraries usually don't define these operators in the public API.
    It is also odd that we only define one operator and not the other operators.
    All other non-primitive types don't have these operators defined as well. It looks odd just to add this one. Can we just use `compare(Decimal, Decimal)` api inside? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---