You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@orc.apache.org by ajayyadava <gi...@git.apache.org> on 2017/09/18 03:06:51 UTC

[GitHub] orc pull request #169: ORC-203 Modify the StringStatistics to trim the minim...

GitHub user ajayyadava opened a pull request:

    https://github.com/apache/orc/pull/169

    ORC-203 Modify the StringStatistics to trim the minimum and maximum v…

    …alues.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ajayyadava/orc 203

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/orc/pull/169.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #169
    
----
commit 6a1e9de4bb700d972509ca4a240b894893c365df
Author: Ajay Yadava <aj...@apache.org>
Date:   2017-09-18T03:01:57Z

    ORC-203 Modify the StringStatistics to trim the minimum and maximum values.

----


---

[GitHub] orc issue #169: ORC-203 Modify the StringStatistics to trim the minimum and ...

Posted by ajayyadava <gi...@git.apache.org>.
Github user ajayyadava commented on the issue:

    https://github.com/apache/orc/pull/169
  
    To make sure that maximum and minimum are always correct, we need to store intermediate large strings and trim them only at the end. 
    
    If anyone has a better approach in mind, then please do let me know.


---

[GitHub] orc pull request #169: [WIP] ORC-203 Modify the StringStatistics to trim the...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/orc/pull/169


---

[GitHub] orc issue #169: [WIP] ORC-203 Modify the StringStatistics to trim the minimu...

Posted by xndai <gi...@git.apache.org>.
Github user xndai commented on the issue:

    https://github.com/apache/orc/pull/169
  
    I don't understand why you need to trim the strings. Protobuf doesn't support strings over 1024 characters?


---

[GitHub] orc issue #169: [WIP] ORC-203 Modify the StringStatistics to trim the minimu...

Posted by xndai <gi...@git.apache.org>.
Github user xndai commented on the issue:

    https://github.com/apache/orc/pull/169
  
    @dain if the concern is the performance, should we also limit the string length when generate stats in writer path, which in my opinion is more costly? I think if we keep the string compare up to 1024 bytes, we won't need to trim at the end.


---

[GitHub] orc issue #169: [WIP] ORC-203 Modify the StringStatistics to trim the minimu...

Posted by dain <gi...@git.apache.org>.
Github user dain commented on the issue:

    https://github.com/apache/orc/pull/169
  
    @xndai if the min or max happens to be a multi megabyte value it can be really expensive for the reader.  Additionally, for filtering the first few bytes are the most valuable (they establish the range).


---

[GitHub] orc issue #169: [WIP] ORC-203 Modify the StringStatistics to trim the minimu...

Posted by ajayyadava <gi...@git.apache.org>.
Github user ajayyadava commented on the issue:

    https://github.com/apache/orc/pull/169
  
    Thank you, everyone, for your comments! Sorry for being away for a while. I will start working on incorporating the feedback this week.


---

[GitHub] orc pull request #169: ORC-203 Modify the StringStatistics to trim the minim...

Posted by sadikovi <gi...@git.apache.org>.
Github user sadikovi commented on a diff in the pull request:

    https://github.com/apache/orc/pull/169#discussion_r139537518
  
    --- Diff: java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java ---
    @@ -608,8 +609,23 @@ public void merge(ColumnStatisticsImpl other) {
           OrcProto.StringStatistics.Builder str =
             OrcProto.StringStatistics.newBuilder();
           if (getNumberOfValues() != 0) {
    -        str.setMinimum(getMinimum());
    -        str.setMaximum(getMaximum());
    +        String value = getMinimum();
    +        if (value != null && value.length() > 1024) {
    --- End diff --
    
    I suggest you make it a constant somewhere and reference that.
    
    I would also consider making it configurable with default value of `1024`.


---

[GitHub] orc issue #169: [WIP] ORC-203 Modify the StringStatistics to trim the minimu...

Posted by omalley <gi...@git.apache.org>.
Github user omalley commented on the issue:

    https://github.com/apache/orc/pull/169
  
    Ok, starting with the representation. I'd suggest it look like:
    
        message StringStatistics {
          optional string minimum = 1;
          optional string maximum = 2;
          // sum will store the total length of all strings in a stripe
          optional sint64 sum = 3;
          // If the minimum or maximum value was longer than 1024 bytes, store a lower or upper
          // bound instead of the minimum or maximum values above.
          optional string lowerBound = 4;
          optional string upperBound = 5;
        }
    
    Now obviously the lowerBound can just be the string truncated (at a utf8 character boundary!) to at most 1024 bytes. The upperBound is the same with the last code point increased by one.
    
    In the StringStatisticsImpl, I'd keep two boolean flags as to whether it is a real value or an approximation for minimum or maximum. The value comparison is the same, since unless the current value is less than the lower bound, it won't change the lower bound and the same is true for the upper bound. If the new minimum/maximum is not truncated, the corresponding wasTruncated flag should be cleared. When merging, the flag follows the value. In the corner case of two identical values where one was truncated, the non-truncated one is the result.
    
    We should end up with four methods for each:
    * String getMinimum();
    * String getLowerBound();
    * String getMaximum();
    * String getUpperBound();
    
    If we only have a lower bound, getMinimum should be null and the same with upper bound and getMaximum. getLowerBound and getUpperBound should match getMinimum and getMaximum, if no truncation was done.


---