Posted to dev@drill.apache.org by Timothy Chen <tn...@gmail.com> on 2013/04/16 20:30:50 UTC

Questions

Hi Jacques,

Want to ask some questions I forgot to bring up in the meetup:

1, Can you elaborate some items on the last slide what they are:
 -  Execution fragment format
 -  Forman

2, The in-memory format that supports either ValueVector, RLE or Dict, I
assume RLE or Dict will be leveraging either Orc or Parquet right?


Tim

Re: Questions

Posted by Jacques Nadeau <ja...@apache.org>.

I think that it is likely that ORC and Parquet will experiment with
alternative encoding techniques for compression/performance purposes.
Also, as you point out, a field level encoding may actually be
sub-composed of multiple types of data structures.  While these are
fine at the storage layer, it is hard for the execution layer to
directly operate on these variations.  When I say container, I am
trying to call out these points of flexibility and clarify that these
explorations are generally outside the domain of what we're initially
focused on for Drill.

As for your other statement regarding a shift in encoding formats:
ProtoBuf, Thrift and Avro all describe a set of things including apis,
schemas and serialization formats.  I agree that the on disk format of
these objects is morphing (and thus serialization is changing).  In
fact, that is part of what we're betting on with Drill's more
pipelined vectorized model of execution.  Exciting times!

On Tue, Apr 30, 2013 at 9:02 PM, Owen O'Malley <om...@apache.org> wrote:
>> > 2, The in-memory format that supports either ValueVector, RLE or Dict, I
>> > assume RLE or Dict will be leveraging either Orc or Parquet right?
>> >
>> >
>> Kind of.  RLE and Dict are abstractions where a particular operator can take
>> advantage of the nature of that encoding.  Parquet and ORC are really
>> container formats as opposed to field level formats.
>
>
> Not really. Unless you mean something very specific that I'm missing, they
> are field level formats. ORC relies on the fact that the types are known to
> pick the right encoder for each column. For example, ORC uses RLE for
> integer data. (In fact, because the dictionary encoding includes integer
> data, so do string columns.) In some cases, the ORC writer has a choice of
> encodings, but it is focused on picking the right encoding for a particular
> set of data. For example, if a string column has enough duplicated values
> it will choose a dictionary encoder instead of a direct encoder. But it is
> certainly not the case that ORC is a container format where the choice of
> serialization is an additional choice.
>
> Unlike RCFile, SequenceFile, TFile, or HFile, it doesn't make sense to
> store ProtoBuf or Writables in an ORC file. One of the amusing
> characteristics of these new file formats is EXACTLY that. In 2 years, I
> would be surprised if anyone is writing new data to files in ProtoBuf,
> Thrift, or Avro. It will be one of these new formats. That is a big change.
>
> -- Owen

Re: Questions

Posted by Owen O'Malley <om...@apache.org>.

> > 2, The in-memory format that supports either ValueVector, RLE or Dict, I
> > assume RLE or Dict will be leveraging either Orc or Parquet right?
> >
> >
> Kind of.  RLE and Dict are abstractions where a particular operator can take
> advantage of the nature of that encoding.  Parquet and ORC are really
> container formats as opposed to field level formats.


Not really. Unless you mean something very specific that I'm missing, they
are field level formats. ORC relies on the fact that the types are known to
pick the right encoder for each column. For example, ORC uses RLE for
integer data. (In fact, because the dictionary encoding includes integer
data, so do string columns.) In some cases, the ORC writer has a choice of
encodings, but it is focused on picking the right encoding for a particular
set of data. For example, if a string column has enough duplicated values
it will choose a dictionary encoder instead of a direct encoder. But it is
certainly not the case that ORC is a container format where the choice of
serialization is an additional choice.
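
To make the "enough duplicated values" point concrete, here is a minimal
sketch of a writer choosing between a dictionary and a direct encoder for a
string column. The function name and the 0.5 threshold are assumptions made
up for illustration; this is not ORC's actual selection rule.

```python
def choose_string_encoding(values, max_distinct_ratio=0.5):
    """Pick an encoding name for one string column.

    Hypothetical rule: use a dictionary when few of the values are distinct,
    otherwise store the strings directly. The threshold is an assumption.
    """
    if not values:
        return "direct"
    distinct_ratio = len(set(values)) / len(values)
    return "dictionary" if distinct_ratio <= max_distinct_ratio else "direct"


repetitive = ["sf", "sf", "ny", "sf", "ny", "sf"]
unique = ["a1", "b2", "c3", "d4"]
repetitive_choice = choose_string_encoding(repetitive)
unique_choice = choose_string_encoding(unique)
```

The choice is driven by the data in the column, not by the user picking a
serialization, which is the distinction drawn above.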

Unlike RCFile, SequenceFile, TFile, or HFile, it doesn't make sense to
store ProtoBuf or Writables in an ORC file. One of the amusing
characteristics of these new file formats is EXACTLY that. In 2 years, I
would be surprised if anyone is writing new data to files in ProtoBuf,
Thrift, or Avro. It will be one of these new formats. That is a big change.

-- Owen

Re: Questions

Posted by Jacques Nadeau <ja...@apache.org>.

See below

On Tue, Apr 16, 2013 at 11:30 AM, Timothy Chen <tn...@gmail.com> wrote:

> Hi Jacques,
>
> Want to ask some questions I forgot to bring up in the meetup:
>
> 1, Can you elaborate some items on the last slide what they are:
>  -  Execution fragment format
>

A physical plan provides information about parallelization but doesn't
include actual node assignments.  The execution engine is responsible for
converting the physical plan into an execution plan which includes node
level assignments.  This is then broken into pieces where each particular
node only gets its respective piece.  These are the execution fragments.
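
A rough sketch of that flow, for illustration only. The plan shape (operator
name mapped to a parallel width), the round-robin placement and all names
below are assumptions, not Drill's actual data structures or scheduler.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Assignment:
    operator: str  # operator name taken from the physical plan
    slice_id: int  # which parallel slice of that operator
    node: str      # node chosen to run the slice


def assign_nodes(physical_plan, nodes):
    """Physical plan -> execution plan: add node-level assignments.

    Placement here is plain round robin, which is an assumption.
    """
    assignments = []
    i = 0
    for operator, width in physical_plan.items():
        for slice_id in range(width):
            assignments.append(Assignment(operator, slice_id, nodes[i % len(nodes)]))
            i += 1
    return assignments


def split_into_fragments(execution_plan):
    """Group the execution plan by node so each node gets only its own piece."""
    fragments = defaultdict(list)
    for assignment in execution_plan:
        fragments[assignment.node].append(assignment)
    return dict(fragments)


physical_plan = {"scan": 4, "aggregate": 2}  # parallelization only, no nodes
nodes = ["node-a", "node-b", "node-c"]
execution_plan = assign_nodes(physical_plan, nodes)
fragments = split_into_fragments(execution_plan)
```

Here `physical_plan` carries only the degree of parallelism, `execution_plan`
adds the node for every slice, and `fragments` is what would be shipped out,
one entry per node.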


>  -  Forman
>
This is an accidental misspelling of Foreman.  The Foreman drives execution
of one particular query, dealing with bit level status messages, warnings,
errors and cancellation.



> 2, The in-memory format that supports either ValueVector, RLE or Dict, I
> assume RLE or Dict will be leveraging either Orc or Parquet right?
>
>
Kind of.  RLE and Dict are abstractions where a particular operator can take
advantage of the nature of that encoding.  Parquet and ORC are really
container formats as opposed to field level formats.  I believe both are
going to support multiple internal encodings within the container (for
example, Parquet uses RLE to manage repetition level storage and ORC has a
dictionary coding capability).  Once we start to work through Dict and RLE,
we could very likely leverage one of the encoding formats used within one
of these systems.  The hope would be that whatever we pick would be a cheap
translation from/to either format if it isn't exactly the same.
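
As a minimal illustration of an operator taking advantage of the encoding,
the sketch below sums a run-length encoded column without expanding it and
filters a dictionary encoded column by comparing integer codes. The layouts
and function names are assumptions for illustration and are not Drill's
ValueVector, Parquet's or ORC's actual formats.

```python
def rle_encode(values):
    """Encode a list as (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]


def rle_sum(runs):
    """Sum the column with one multiply per run instead of one add per row."""
    return sum(v * n for v, n in runs)


def dict_encode(values):
    """Encode a list as (dictionary, codes), codes indexing into dictionary."""
    dictionary, index, codes = [], {}, []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes


def dict_filter_equals(dictionary, codes, target):
    """Row positions equal to target, found by comparing codes, not values."""
    if target not in dictionary:
        return []
    code = dictionary.index(target)
    return [i for i, c in enumerate(codes) if c == code]


column = [7, 7, 7, 2, 2, 9]
runs = rle_encode(column)
total = rle_sum(runs)

cities = ["sf", "ny", "sf", "sf", "la"]
dictionary, codes = dict_encode(cities)
matches = dict_filter_equals(dictionary, codes, "sf")
```

The sum touches three runs rather than six rows, and the filter looks up the
target string once and then compares small integers.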




>
> Tim
>