You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@orc.apache.org by Dain Sundstrom <da...@iq80.com> on 2017/06/06 04:53:08 UTC

Re: "For dictionary encodings the dictionary is sorted"

> On Dec 12, 2016, at 4:48 PM, Dain Sundstrom <da...@iq80.com> wrote:
> On Dec 12, 2016, at 4:36 PM, Owen O'Malley <om...@apache.org> wrote:
>>> I think this should also be documented in the statistics section which
>> also uses UTF-16 BE, which is at least consistent, but still annoying for
>> everything other than Java.
>> 
>> Yes, it should be documented and we should replace it with UTF-8. (Although
>> changes to the serialized form are always painful.)
> 
> I think we can do something similar to the bloom filter code, where we add a StringUtf8Stats object and have a transition period where we can produce both.

I was looking at the change proto changes to TimestampStatistics, and I think the same thing could work here.  We add:

    optional string minimumUtf8 = 4;
    optional string maximumUtf8 = 5;

and the update the writer write just the UTF-8 version (or both during a transition).

-dain

Re: "For dictionary encodings the dictionary is sorted"

Posted by Dain Sundstrom <da...@iq80.com>.
Sorry, I should have given a full example.  I mean we could change StringStatistics as follows:

message StringStatistics {
  optional string minimum = 1;
  optional string maximum = 2;
  optional sint64 sum = 3;
  optional string minimumUtf8 = 4;
  optional string maximumUtf8 = 5;
}

New writers would fill in both in the transition period, and then just the UTF8 ones.  Then we update readers to the favor the new ones.

We can also make this backwards compatible in the transition period by truncating values to at the first surrogate pair.  This is how we do that in Presto:

public static Slice getMaxSlice(String maximum)
{
    if (maximum == null) {
        return null;
    }

    int index = firstSurrogateCharacter(maximum);
    if (index == -1) {
        return Slices.utf8Slice(maximum);
    }
    // Append 0xFF so that it is larger than maximum
    return concatSlices(Slices.utf8Slice(maximum.substring(0, index)), MAX_BYTE);
}

public static Slice getMinSlice(String minimum)
{
    if (minimum == null) {
        return null;
    }

    int index = firstSurrogateCharacter(minimum);
    if (index == -1) {
        return Slices.utf8Slice(minimum);
    }
    // truncate the string at the first surrogate character
    return Slices.utf8Slice(minimum.substring(0, index));
}

// returns index of first surrogateCharacter in the string -1 if no surrogate character is found
static int firstSurrogateCharacter(String value)
{
    char[] chars = value.toCharArray();
    for (int i = 0; i < chars.length; i++) {
        if (chars[i] >= MIN_SURROGATE) {
            return i;
        }
    }
    return -1;
}

The getMaxSliceCode might need to be adjusted depending on how it it used.  We are using UTF-8 binary so we can use the illegal 0xFF as the stop byte, but if you want to transition back to Java Strings, we would need to be a bit smarter.

-dain

> On Jun 6, 2017, at 3:39 PM, Owen O'Malley <ow...@gmail.com> wrote:
> 
> I'm confused. TimestampStatistics uses integers not strings.
> 
> .. Owen
> 
> On Mon, Jun 5, 2017 at 9:53 PM, Dain Sundstrom <da...@iq80.com> wrote:
> 
>> 
>>> On Dec 12, 2016, at 4:48 PM, Dain Sundstrom <da...@iq80.com> wrote:
>>> On Dec 12, 2016, at 4:36 PM, Owen O'Malley <om...@apache.org> wrote:
>>>>> I think this should also be documented in the statistics section which
>>>> also uses UTF-16 BE, which is at least consistent, but still annoying
>> for
>>>> everything other than Java.
>>>> 
>>>> Yes, it should be documented and we should replace it with UTF-8.
>> (Although
>>>> changes to the serialized form are always painful.)
>>> 
>>> I think we can do something similar to the bloom filter code, where we
>> add a StringUtf8Stats object and have a transition period where we can
>> produce both.
>> 
>> I was looking at the change proto changes to TimestampStatistics, and I
>> think the same thing could work here.  We add:
>> 
>>    optional string minimumUtf8 = 4;
>>    optional string maximumUtf8 = 5;
>> 
>> and the update the writer write just the UTF-8 version (or both during a
>> transition).
>> 
>> -dain


Re: "For dictionary encodings the dictionary is sorted"

Posted by Owen O'Malley <ow...@gmail.com>.
I'm confused. TimestampStatistics uses integers not strings.

.. Owen

On Mon, Jun 5, 2017 at 9:53 PM, Dain Sundstrom <da...@iq80.com> wrote:

>
> > On Dec 12, 2016, at 4:48 PM, Dain Sundstrom <da...@iq80.com> wrote:
> > On Dec 12, 2016, at 4:36 PM, Owen O'Malley <om...@apache.org> wrote:
> >>> I think this should also be documented in the statistics section which
> >> also uses UTF-16 BE, which is at least consistent, but still annoying
> for
> >> everything other than Java.
> >>
> >> Yes, it should be documented and we should replace it with UTF-8.
> (Although
> >> changes to the serialized form are always painful.)
> >
> > I think we can do something similar to the bloom filter code, where we
> add a StringUtf8Stats object and have a transition period where we can
> produce both.
>
> I was looking at the change proto changes to TimestampStatistics, and I
> think the same thing could work here.  We add:
>
>     optional string minimumUtf8 = 4;
>     optional string maximumUtf8 = 5;
>
> and the update the writer write just the UTF-8 version (or both during a
> transition).
>
> -dain