You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@orc.apache.org by Owen O'Malley <ow...@gmail.com> on 2021/02/08 19:40:46 UTC

Should we remove the requirement that dictionaries should be sorted?

All,
   Now that Lei is working on creating a replacement for the red-black
string dictionaries, it is a good time to discuss whether we should
continue to sort the dictionaries as they are written.

Reason to stay sorted:

   1. Searching for values in the dictionaries can use binary search.
   2. Ranges in the column can be translated into ranges in the indexes.
   (eg. if you want myString >= "bar" and myString < "foo" that will translate
   into a range of indexes.

Reasons to stop sorting:

   1. Sorting the dictionary means that we need to hold all of the values
   in uncompressed memory until the stripe is flushed. (The values are
   translated to the sorted order.)
   2. As far as I know, there isn't any code that takes advantage of the
   sorted dictionaries.
   3. It makes the writing code much simpler.

Thoughts?

.. Owen

Re: Should we remove the requirement that dictionaries should be sorted?

Posted by Jacques Nadeau <ja...@apache.org>.
+1 to remove the sort requirement. Using a bitmap for range evaluation
generally works quite well.

On Mon, Feb 8, 2021, 11:41 AM Owen O'Malley <ow...@gmail.com> wrote:

> All,
>    Now that Lei is working on creating a replacement for the red-black
> string dictionaries, it is a good time to discuss whether we should
> continue to sort the dictionaries as they are written.
>
> Reason to stay sorted:
>
>    1. Searching for values in the dictionaries can use binary search.
>    2. Ranges in the column can be translated into ranges in the indexes.
>    (eg. if you want myString >= "bar" and myString < "foo" that will
> translate
>    into a range of indexes.
>
> Reasons to stop sorting:
>
>    1. Sorting the dictionary means that we need to hold all of the values
>    in uncompressed memory until the stripe is flushed. (The values are
>    translated to the sorted order.)
>    2. As far as I know, there isn't any code that takes advantage of the
>    sorted dictionaries.
>    3. It makes the writing code much simpler.
>
> Thoughts?
>
> .. Owen
>

Re: Should we remove the requirement that dictionaries should be sorted?

Posted by Dongjoon Hyun <do...@gmail.com>.
+1 for new suggestions.
I agree with "Reasons to stop sorting".

Bests,
Dongjoon.


On Mon, Feb 8, 2021 at 12:58 PM Owen O'Malley <ow...@gmail.com>
wrote:

> Ugh, I mean that the data stream doesn't need to be held in an array of
> longs. The dictionary itself needs to stay decompressed. :)
>
> .. Owen
>
> On Mon, Feb 8, 2021 at 8:57 PM Owen O'Malley <ow...@gmail.com>
> wrote:
>
> >
> >
> > On Mon, Feb 8, 2021 at 8:44 PM Gopal V <go...@apache.org> wrote:
> >
> >>
> >> > Reason to stay sorted:
> >> >
> >> >     1. Searching for values in the dictionaries can use binary search.
> >>
> >> We did get some compression advantages from this in the past, but the
> >> write-throughput is hurt by this one factor both on memory bloat and
> cpu.
> >>
> >> The alternative to sorting is to add an order vector to the dictionary
> >> after it gets built, which doesn't help compression with small windows,
> >> but can bring back binary search improvements if we ever want it in the
> >> format.
> >>
> >>  > Sorting the dictionary means that we need to hold all of the values
> >>
> >> I think we still need to hold values unless we allow duplicate entries
> >> in the dictionary after a partial flush, but can hold them in contiguous
> >> memory instead of a linked structure.
> >>
> >
> > We can hold them in a compressed stream rather than in arrays of longs.
> >
> > .. Owen
> >
> >
> >>  > Reasons to stop sorting:
> >>
> >> +1
> >>
> >> Making it optional with a stream or flag to mark it would be enough
> >> instead of forcing first insert slowness & that doesn't have to happen
> >> immediately.
> >>
> >> Cheers,
> >> Gopal
> >>
> >
>

Re: Should we remove the requirement that dictionaries should be sorted?

Posted by Owen O'Malley <ow...@gmail.com>.
Ugh, I mean that the data stream doesn't need to be held in an array of
longs. The dictionary itself needs to stay decompressed. :)

.. Owen

On Mon, Feb 8, 2021 at 8:57 PM Owen O'Malley <ow...@gmail.com> wrote:

>
>
> On Mon, Feb 8, 2021 at 8:44 PM Gopal V <go...@apache.org> wrote:
>
>>
>> > Reason to stay sorted:
>> >
>> >     1. Searching for values in the dictionaries can use binary search.
>>
>> We did get some compression advantages from this in the past, but the
>> write-throughput is hurt by this one factor both on memory bloat and cpu.
>>
>> The alternative to sorting is to add an order vector to the dictionary
>> after it gets built, which doesn't help compression with small windows,
>> but can bring back binary search improvements if we ever want it in the
>> format.
>>
>>  > Sorting the dictionary means that we need to hold all of the values
>>
>> I think we still need to hold values unless we allow duplicate entries
>> in the dictionary after a partial flush, but can hold them in contiguous
>> memory instead of a linked structure.
>>
>
> We can hold them in a compressed stream rather than in arrays of longs.
>
> .. Owen
>
>
>>  > Reasons to stop sorting:
>>
>> +1
>>
>> Making it optional with a stream or flag to mark it would be enough
>> instead of forcing first insert slowness & that doesn't have to happen
>> immediately.
>>
>> Cheers,
>> Gopal
>>
>

Re: Should we remove the requirement that dictionaries should be sorted?

Posted by Owen O'Malley <ow...@gmail.com>.
On Mon, Feb 8, 2021 at 8:44 PM Gopal V <go...@apache.org> wrote:

>
> > Reason to stay sorted:
> >
> >     1. Searching for values in the dictionaries can use binary search.
>
> We did get some compression advantages from this in the past, but the
> write-throughput is hurt by this one factor both on memory bloat and cpu.
>
> The alternative to sorting is to add an order vector to the dictionary
> after it gets built, which doesn't help compression with small windows,
> but can bring back binary search improvements if we ever want it in the
> format.
>
>  > Sorting the dictionary means that we need to hold all of the values
>
> I think we still need to hold values unless we allow duplicate entries
> in the dictionary after a partial flush, but can hold them in contiguous
> memory instead of a linked structure.
>

We can hold them in a compressed stream rather than in arrays of longs.

.. Owen


>  > Reasons to stop sorting:
>
> +1
>
> Making it optional with a stream or flag to mark it would be enough
> instead of forcing first insert slowness & that doesn't have to happen
> immediately.
>
> Cheers,
> Gopal
>

Re: Should we remove the requirement that dictionaries should be sorted?

Posted by Gopal V <go...@apache.org>.
> Reason to stay sorted:
> 
>     1. Searching for values in the dictionaries can use binary search.

We did get some compression advantages from this in the past, but the 
write-throughput is hurt by this one factor both on memory bloat and cpu.

The alternative to sorting is to add an order vector to the dictionary 
after it gets built, which doesn't help compression with small windows, 
but can bring back binary search improvements if we ever want it in the 
format.

 > Sorting the dictionary means that we need to hold all of the values

I think we still need to hold values unless we allow duplicate entries 
in the dictionary after a partial flush, but can hold them in contiguous 
memory instead of a linked structure.

 > Reasons to stop sorting:

+1

Making it optional with a stream or flag to mark it would be enough 
instead of forcing first insert slowness & that doesn't have to happen 
immediately.

Cheers,
Gopal