Posted to dev@drill.apache.org by Azuryy Yu <az...@gmail.com> on 2012/09/15 14:30:41 UTC

Storage file format

Hi All,

I am interested in working on the storage format. (sign up?)

I wrote an HDFS file format similar to SequenceFile (row storage, block
management, compression), and I provide an InputFormat and OutputFormat
for it.

Sometimes it gets great performance and sometimes not, depending on the data.

For Drill, we should implement columnar storage, which lets a query skip
unneeded columns and skip some rows within one column file. This columnar
storage should be built on a distributed file system such as HDFS or
MapR DFS. I like MapR DFS because of its HA.

We can implement the following columnar storage file format; I think it is
enough for us.

http://arxiv.org/pdf/1105.4252.pdf

Re: Storage file format

Posted by Ted Dunning <te...@gmail.com>.

On some file systems, such as the one in the MapR distribution, you can
force files to have identical locality, but on less advanced systems like
HDFS this is not typically possible.

This is one of the major motivations for storing row-wise chunks in a
columnar form in the same file.  You can still get moderately large
sequential reads and you get columnar collocation.

Column databases are essentially the same as column-major matrix formats.
If the columns are compressible, life is good.  Sparse vectors are only one
kind of compression.
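
The idea of storing row-wise chunks in columnar form within one file can be sketched roughly as follows. This is only an illustration under stated assumptions: the names RowGroupSketch, RowGroup, pivot and sumColumn are hypothetical, the data lives in memory rather than in a DFS file, and every column is an int.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: rows are split into fixed-size row groups, and inside
// each group the values are laid out column by column.
final class RowGroupSketch {

    // One row group: columns[c][r] is the value of column c for the r-th row of the group.
    static final class RowGroup {
        final int[][] columns;

        RowGroup(int[][] columns) {
            this.columns = columns;
        }
    }

    // Pivot row-major data (rows[r][c]) into row groups of at most groupSize rows.
    static List<RowGroup> pivot(int[][] rows, int numColumns, int groupSize) {
        List<RowGroup> groups = new ArrayList<>();
        for (int start = 0; start < rows.length; start += groupSize) {
            int n = Math.min(groupSize, rows.length - start);
            int[][] columns = new int[numColumns][n];
            for (int r = 0; r < n; r++) {
                for (int c = 0; c < numColumns; c++) {
                    columns[c][r] = rows[start + r][c];
                }
            }
            groups.add(new RowGroup(columns));
        }
        return groups;
    }

    // Sum a single column; only that column's chunk in each group is touched,
    // so the other columns are skipped entirely.
    static long sumColumn(List<RowGroup> groups, int column) {
        long sum = 0;
        for (RowGroup group : groups) {
            for (int value : group.columns[column]) {
                sum += value;
            }
        }
        return sum;
    }
}
```

In a real file each column chunk would be a contiguous byte range, so a scan over one column reads one moderately large sequential range per row group, while all columns of the same rows stay in the same file and therefore share locality.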

On Sat, Sep 15, 2012 at 8:09 AM, Dharm Raj <dh...@gmail.com> wrote:

> For columnar storage, IMO each column can be managed in a separate file.
> Dremel also seems to have each column in a separate file. This should be
> easy to manage, and updates are possible. Please see
> https://issues.apache.org/jira/browse/AVRO-806
>
> The Drill architecture slides show AVRO-806 and Trevni in the column storage
> box. Are we looking at them as candidates for Drill's storage format?
>
> If we have a lot of data with high sparsity and the major use case is
> to only read data once it is written, another way could be to store it in a
> column-major sparse matrix format. It looks easy to implement but updates
> may be problematic. Just a thought.
>
> Regards,
> Dharm
>
> On Sat, Sep 15, 2012 at 7:24 PM, NAVEEN MAANJU <
> naveen.maanju.apache@gmail.com> wrote:
>
> > make sense..
> >
> > On Sat, Sep 15, 2012 at 6:44 AM, Ted Dunning <te...@gmail.com>
> > wrote:
> >
> > > The key goal here is to get something simple working quickly in a way
> > > that allows additional, more advanced implementations.
> > >
> > > On Sat, Sep 15, 2012 at 5:47 AM, moon soo Lee <le...@gmail.com>
> > > wrote:
> > >
> > > > for column-storage, how about leverage Hbase or Accumulo?
> > > >
> > > > they'll also give a chance to data update (future work?)
> > > >
> > > >
> > > > On Sat, Sep 15, 2012 at 9:30 PM, Azuryy Yu <az...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > I am interested in working on storage format. (sign up?)
> > > > >
> > > > > I wrote a HDFS  file format, which is similar to Sequence file (row
> > > > > storage, block management, compress), I provide InputFormat and
> > > > > OutputFormat,
> > > > >
> > > > > sometimes it get a great performance, sometimes not, depends on the
> > > > > data.
> > > > >
> > > > > for Drill, we should implement a column-storage, this can skip some
> > > > > columns during query, and skip some rows within one column file. but
> > > > > this column-storage should based on the distributed file system, such
> > > > > as HDFS, Mapr DFS, I like Mapr DFS because of HA.
> > > > >
> > > > > we can implement the following column storage file format, I think
> > > > > it's enough to us.
> > > > >
> > > > > http://arxiv.org/pdf/1105.4252.pdf

Re: Storage file format

Posted by Tsuyoshi OZAWA <oz...@gmail.com>.

Sure :-) I'll create a ticket in the Drill JIRA.

Thanks,
Tsuyoshi

On Sun, Sep 16, 2012 at 6:11 AM, Ted Dunning <te...@gmail.com> wrote:
> There is no project-wide roadmap in a real open source project.
>
> There are vision documents that various people use to try to motivate
> consensus.
>
> There are also individual roadmaps that describe what the individual
> contributors plan to do.
>
> Power Drill style in memory data is definitely intriguing and once Drill
> works and works fast on simpler structures, I would expect that somebody
> would be interested in implementing it.
>
> Perhaps that would be you?
>



-- 
OZAWA Tsuyoshi

Re: Storage file format

Posted by Julien Le Dem <ju...@twitter.com>.

Hi,
I'm prototyping something to use the Dremel format in Hadoop
Map/Reduce jobs and Pig.
So far I've implemented the record assembly algorithm from the paper
and defined interfaces to abstract the other layers (see below).
What I'm missing for now is the low-level columnar storage that has
been discussed on this thread.
Basically: the ability to collocate chunks of data while still being
able to skip the columns that are not needed.
I'd be interested to see what people have done in that regard. I was
thinking of starting with either RCFile, TFile or HFile, with each row
containing one chunk of data (instead of a single record).

Here is an overview of what I've got:

Column abstraction: calling consume() advances to the next record;
repetitionLevel and definitionLevel are then accessible. This is the
low-level interface. For now I've implemented a very simple, inefficient
version. There is a corresponding ColumnWriter.
https://github.com/julienledem/redelm/blob/master/redelm-column/src/main/java/redelm/column/ColumnReader.java
public interface ColumnReader {
  boolean isFullyConsumed();
  void consume();
  int getCurrentRepetitionLevel();
  int getCurrentDefinitionLevel();
  String getString();
  int getInt();
  boolean getBool();
  byte[] getBinary();
}
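
A minimal sketch of how a caller might drive this interface, under assumptions that are not confirmed by the thread: the reader is taken to be positioned on its first entry when created, consume() moves to the next entry, and an entry whose definition level is below the column's maximum is treated as a null. ColumnReaderSketch, IntArrayColumnReader and readDefinedInts are hypothetical names; only the ColumnReader interface itself is copied from above.

```java
import java.util.ArrayList;
import java.util.List;

final class ColumnReaderSketch {

    // Copied from the interface quoted in this message.
    interface ColumnReader {
        boolean isFullyConsumed();
        void consume();
        int getCurrentRepetitionLevel();
        int getCurrentDefinitionLevel();
        String getString();
        int getInt();
        boolean getBool();
        byte[] getBinary();
    }

    // Hypothetical in-memory reader for a single int column.
    static final class IntArrayColumnReader implements ColumnReader {
        private final int[] repetitionLevels;
        private final int[] definitionLevels;
        private final int[] values;
        private int position = 0; // assumption: starts on the first entry

        IntArrayColumnReader(int[] repetitionLevels, int[] definitionLevels, int[] values) {
            this.repetitionLevels = repetitionLevels;
            this.definitionLevels = definitionLevels;
            this.values = values;
        }

        public boolean isFullyConsumed() { return position >= values.length; }
        public void consume() { position++; }
        public int getCurrentRepetitionLevel() { return repetitionLevels[position]; }
        public int getCurrentDefinitionLevel() { return definitionLevels[position]; }
        public int getInt() { return values[position]; }

        // This sketch only stores ints, so the other accessors are unsupported.
        public String getString() { throw new UnsupportedOperationException(); }
        public boolean getBool() { throw new UnsupportedOperationException(); }
        public byte[] getBinary() { throw new UnsupportedOperationException(); }
    }

    // Collect every defined value, skipping entries whose definition level
    // says the field was absent in that record.
    static List<Integer> readDefinedInts(ColumnReader reader, int maxDefinitionLevel) {
        List<Integer> result = new ArrayList<>();
        while (!reader.isFullyConsumed()) {
            if (reader.getCurrentDefinitionLevel() == maxDefinitionLevel) {
                result.add(reader.getInt());
            }
            reader.consume();
        }
        return result;
    }
}
```

The loop never inspects other columns, which is the property the columnar layout is meant to provide; a full record assembler would also use the repetition level to decide where each value belongs in the nested record.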

Record abstraction: this is used either to write records to a column
store or to read records from it, in a push-parser style.
https://github.com/julienledem/redelm/blob/master/redelm-column/src/main/java/redelm/io/RecordConsumer.java
abstract public class RecordConsumer {
  abstract public void startMessage();
  abstract public void endMessage();
  abstract public void startField(String field, int index);
  abstract public void endField(String field, int index);
  abstract public void startGroup();
  abstract public void endGroup();
  abstract public void addInt(int value);
  abstract public void addString(String value);
  abstract public void addBoolean(boolean value);
  abstract public void addBinary(byte[] value);
}
The tests use a SimpleGroup implementation but obviously that's just for tests.
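
To illustrate the push-parser style, here is a rough sketch of a producer emitting one nested record into a consumer. The event order shown (fields wrapped in startField/endField, nested groups wrapped in startGroup/endGroup) is an assumption about how the calls are meant to nest, and RecordConsumerSketch, TracingConsumer and writeExample are hypothetical names; only the RecordConsumer class is copied from above.

```java
final class RecordConsumerSketch {

    // Copied from the abstract class quoted in this message.
    abstract static class RecordConsumer {
        abstract public void startMessage();
        abstract public void endMessage();
        abstract public void startField(String field, int index);
        abstract public void endField(String field, int index);
        abstract public void startGroup();
        abstract public void endGroup();
        abstract public void addInt(int value);
        abstract public void addString(String value);
        abstract public void addBoolean(boolean value);
        abstract public void addBinary(byte[] value);
    }

    // Hypothetical consumer that renders the event stream as text.
    static final class TracingConsumer extends RecordConsumer {
        private final StringBuilder out = new StringBuilder();

        public void startMessage() { out.append("message{"); }
        public void endMessage() { out.append("}"); }
        public void startField(String field, int index) { out.append(field).append("="); }
        public void endField(String field, int index) { out.append(";"); }
        public void startGroup() { out.append("{"); }
        public void endGroup() { out.append("}"); }
        public void addInt(int value) { out.append(value); }
        public void addString(String value) { out.append('"').append(value).append('"'); }
        public void addBoolean(boolean value) { out.append(value); }
        public void addBinary(byte[] value) { out.append("<").append(value.length).append(" bytes>"); }

        String trace() { return out.toString(); }
    }

    // Push the record {id: 7, owner: {name: "drill"}} into any consumer.
    static void writeExample(RecordConsumer consumer) {
        consumer.startMessage();
        consumer.startField("id", 0);
        consumer.addInt(7);
        consumer.endField("id", 0);
        consumer.startField("owner", 1);
        consumer.startGroup();
        consumer.startField("name", 0);
        consumer.addString("drill");
        consumer.endField("name", 0);
        consumer.endGroup();
        consumer.endField("owner", 1);
        consumer.endMessage();
    }
}
```

Because the producer only sees the abstract class, the same writeExample call could feed a column writer, a Pig tuple builder or a test double without changes.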

I've implemented this interface for Pig Tuples here:
for read:
https://github.com/julienledem/redelm/blob/master/redelm-pig/src/main/java/redelm/pig/TupleRecordConsumer.java
for write:
https://github.com/julienledem/redelm/blob/master/redelm-pig/src/main/java/redelm/pig/TupleWriter.java

I've also worked on a class generation framework to avoid List and
Object based tuples here:
https://github.com/julienledem/brennus/blob/master/brennus-asm/src/test/java/brennus/asm/TestTuple.java

I'd be happy to contribute to Drill as well.

Julien

On Sat, Sep 15, 2012 at 5:07 PM, Azuryy Yu <az...@gmail.com> wrote:
>
> there should be two separate topics here:
> 1) storage file format
> 2) DFS
>
> because we should support map/reduce output data to Drill,(maybe this is
> the only way for Drill to load data)
>
> for the second topic, I mentioned in this thread, I prefer Mapr DFS, which
> is really HA.
>
> as for the first topic, we should try to find mature open source project
> and do some modification to fit for us.

Re: Storage file format

Posted by Azuryy Yu <az...@gmail.com>.

There should be two separate topics here:
1) storage file format
2) DFS

This is because we should support loading map/reduce output data into Drill
(maybe this is the only way for Drill to load data).

For the second topic, as I mentioned in this thread, I prefer MapR DFS, which
is really HA.

As for the first topic, we should try to find a mature open source project
and modify it to fit our needs.



On Sun, Sep 16, 2012 at 5:11 AM, Ted Dunning <te...@gmail.com> wrote:

> There is no project-wide roadmap in a real open source project.
>
> There are vision documents that various people use to try to motivate
> consensus.
>
> There are also individual roadmaps that describe what the individual
> contributors plan to do.
>
> Power Drill style in memory data is definitely intriguing and once Drill
> works and works fast on simpler structures, I would expect that somebody
> would be interested in implementing it.
>
> Perhaps that would be you?
>

Re: Storage file format

Posted by Ted Dunning <te...@gmail.com>.

There is no project-wide roadmap in a real open source project.

There are vision documents that various people use to try to motivate
consensus.

There are also individual roadmaps that describe what the individual
contributors plan to do.

PowerDrill-style in-memory data is definitely intriguing, and once Drill
works and works fast on simpler structures, I would expect that somebody
would be interested in implementing it.

Perhaps that would be you?

On Sat, Sep 15, 2012 at 10:16 AM, Tsuyoshi OZAWA
<oz...@gmail.com> wrote:

> Hello,
>
> Is there a roadmap to support in-memory index and storage like
> PowerDrill? It's one kind of storage, though its format is different
> from the columnar storage format in the Dremel paper that you mentioned.
>
> IMO, in-memory index and storage are very useful for analysis with a
> small cluster.
>
> Thanks,
> - Tsuyoshi
>

Re: Storage file format

Posted by Tsuyoshi OZAWA <oz...@gmail.com>.

Hello,

Is there a roadmap to support in-memory index and storage like
PowerDrill? It's one kind of storage, though its format is different
from the columnar storage format in the Dremel paper that you mentioned.

IMO, in-memory index and storage are very useful for analysis with a
small cluster.

Thanks,
- Tsuyoshi

On Sun, Sep 16, 2012 at 2:02 AM, Dharm Raj <dh...@gmail.com> wrote:
> You are right Camuel. While thinking  storage format I was thinking about
> append. Misplaced update.
>

Re: Storage file format

Posted by Dharm Raj <dh...@gmail.com>.

You are right, Camuel. While thinking about the storage format I was thinking
about appends; "update" was the wrong word.

On Sat, Sep 15, 2012 at 9:49 PM, Camuel Gilyadov <ca...@gmail.com> wrote:

> Drill doesn't support updates. It is append only data store and append is
> usually expected to be a nice data chunk not a single row
>

Re: Storage file format

Posted by Camuel Gilyadov <ca...@gmail.com>.

Drill doesn't support updates. It is an append-only data store, and an append
is usually expected to be a nice data chunk, not a single row.

On Sat, Sep 15, 2012 at 8:09 AM, Dharm Raj <dh...@gmail.com> wrote:

> For columnar storage, IMO each column can be managed in a separate file.
> Dremel also seems to have each column in a separate file. This should be
> easy to manage and update are possible. Please see
> https://issues.apache.org/jira/browse/AVRO-806
>
> Drill architecture slides shows AVRO-806 and trevni in Column storage box.
> Are we looking them as candidate for storage format for drill?
>
> If we have lot of data with high amount of sparsity and major use case is
> to read only once data is written - Another way could be to store in a
> column major sparse matrix format. It  looks easy to implement but updates
> may be problematic. just a thought.
>
> Regards,
> Dharm
>
> On Sat, Sep 15, 2012 at 7:24 PM, NAVEEN MAANJU <
> naveen.maanju.apache@gmail.com> wrote:
>
> > make sense..
> >
> > On Sat, Sep 15, 2012 at 6:44 AM, Ted Dunning <te...@gmail.com>
> > wrote:
> >
> > > The key goal here is to get something simple working quickly in a way
> > that
> > > allows additional, more advanced implementations.
> > >
> > > On Sat, Sep 15, 2012 at 5:47 AM, moon soo Lee <le...@gmail.com>
> > > wrote:
> > >
> > > > for column-storage, how about leverage Hbase or Accumulo?
> > > >
> > > > they'll also give a chance to data update (future work?)
> > > >
> > > >
> > > > On Sat, Sep 15, 2012 at 9:30 PM, Azuryy Yu <az...@gmail.com>
> wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > I am interested in working on storage format. (sign up?)
> > > > >
> > > > > I wrote a HDFS  file format, which is similar to Sequence file (row
> > > > > storage, block management, compress), I provide InputFormat and
> > > > > OutputFormat,
> > > > >
> > > > > sometimes it get a great performance, sometimes not, depends on the
> > > data.
> > > > >
> > > > > for Drill, we should implement a column-storage, this can skip some
> > > > columns
> > > > > during query, and skip some rows within one column file. but this
> > > > > column-storage should based on the distributed file system, such as
> > > HDFS,
> > > > > Mapr DFS, I like Mapr DFS because of HA.
> > > > >
> > > > > we can implement the following column storage file format, I think
> > it's
> > > > > enough to us.
> > > > >
> > > > > http://arxiv.org/pdf/1105.4252.pdf
> > > > >
> > > >
> > >
> >
>

Re: Storage file format

Posted by Dharm Raj <dh...@gmail.com>.

For columnar storage, IMO each column can be managed in a separate file.
Dremel also seems to keep each column in a separate file. This should be
easy to manage, and updates are possible. Please see
https://issues.apache.org/jira/browse/AVRO-806

The Drill architecture slides show AVRO-806 and Trevni in the column storage
box. Are we looking at them as candidates for Drill's storage format?

If we have a lot of highly sparse data and the major use case is read-only
access once the data is written, another option could be to store it in a
column-major sparse matrix format. It looks easy to implement, but updates
may be problematic. Just a thought.
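
To make the column-major sparse idea concrete, here is a minimal sketch,
assuming an in-memory layout similar to compressed sparse column (CSC). The
function names and the three-array layout are hypothetical illustrations, not
part of Drill, AVRO-806, or Trevni.

```python
from typing import Dict, List, Tuple


def to_column_major_sparse(
    rows: List[Dict[str, object]], columns: List[str]
) -> Tuple[List[int], List[int], List[object]]:
    """Pack rows into a CSC-like layout (hypothetical helper).

    Only present (non-None) values are stored. col_ptr[c]..col_ptr[c + 1]
    delimits the slice of row_idx/values that belongs to column c.
    """
    col_ptr: List[int] = [0]
    row_idx: List[int] = []
    values: List[object] = []
    for name in columns:
        for i, row in enumerate(rows):
            if row.get(name) is not None:
                row_idx.append(i)
                values.append(row[name])
        col_ptr.append(len(values))
    return col_ptr, row_idx, values


def read_column(
    col_ptr: List[int], row_idx: List[int], values: List[object], col: int
) -> List[Tuple[int, object]]:
    """Return (row_index, value) pairs for one column, touching no others."""
    start, end = col_ptr[col], col_ptr[col + 1]
    return list(zip(row_idx[start:end], values[start:end]))


rows = [{"a": 1}, {"b": "x"}, {"a": 3, "b": "y"}]
col_ptr, row_idx, values = to_column_major_sparse(rows, ["a", "b"])
column_b = read_column(col_ptr, row_idx, values, 1)
```

Reading one column only touches its own slice, which is the benefit for
read-mostly data. Updates are awkward in this layout because adding a value to
an early column shifts every later entry and every later offset in col_ptr.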

Regards,
Dharm

On Sat, Sep 15, 2012 at 7:24 PM, NAVEEN MAANJU <
naveen.maanju.apache@gmail.com> wrote:

> make sense..
>
> On Sat, Sep 15, 2012 at 6:44 AM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > The key goal here is to get something simple working quickly in a way
> that
> > allows additional, more advanced implementations.
> >
> > On Sat, Sep 15, 2012 at 5:47 AM, moon soo Lee <le...@gmail.com>
> > wrote:
> >
> > > for column-storage, how about leverage Hbase or Accumulo?
> > >
> > > they'll also give a chance to data update (future work?)
> > >
> > >
> > > On Sat, Sep 15, 2012 at 9:30 PM, Azuryy Yu <az...@gmail.com> wrote:
> > >
> > > > Hi All,
> > > >
> > > > I am interested in working on storage format. (sign up?)
> > > >
> > > > I wrote a HDFS  file format, which is similar to Sequence file (row
> > > > storage, block management, compress), I provide InputFormat and
> > > > OutputFormat,
> > > >
> > > > sometimes it get a great performance, sometimes not, depends on the
> > data.
> > > >
> > > > for Drill, we should implement a column-storage, this can skip some
> > > columns
> > > > during query, and skip some rows within one column file. but this
> > > > column-storage should based on the distributed file system, such as
> > HDFS,
> > > > Mapr DFS, I like Mapr DFS because of HA.
> > > >
> > > > we can implement the following column storage file format, I think
> it's
> > > > enough to us.
> > > >
> > > > http://arxiv.org/pdf/1105.4252.pdf
> > > >
> > >
> >
>

Re: Storage file format

Posted by NAVEEN MAANJU <na...@gmail.com>.

Makes sense.

On Sat, Sep 15, 2012 at 6:44 AM, Ted Dunning <te...@gmail.com> wrote:

> The key goal here is to get something simple working quickly in a way that
> allows additional, more advanced implementations.
>
> On Sat, Sep 15, 2012 at 5:47 AM, moon soo Lee <le...@gmail.com>
> wrote:
>
> > for column-storage, how about leverage Hbase or Accumulo?
> >
> > they'll also give a chance to data update (future work?)
> >
> >
> > On Sat, Sep 15, 2012 at 9:30 PM, Azuryy Yu <az...@gmail.com> wrote:
> >
> > > Hi All,
> > >
> > > I am interested in working on storage format. (sign up?)
> > >
> > > I wrote a HDFS  file format, which is similar to Sequence file (row
> > > storage, block management, compress), I provide InputFormat and
> > > OutputFormat,
> > >
> > > sometimes it get a great performance, sometimes not, depends on the
> data.
> > >
> > > for Drill, we should implement a column-storage, this can skip some
> > columns
> > > during query, and skip some rows within one column file. but this
> > > column-storage should based on the distributed file system, such as
> HDFS,
> > > Mapr DFS, I like Mapr DFS because of HA.
> > >
> > > we can implement the following column storage file format, I think it's
> > > enough to us.
> > >
> > > http://arxiv.org/pdf/1105.4252.pdf
> > >
> >
>

Re: Storage file format

Posted by Ted Dunning <te...@gmail.com>.

The key goal here is to get something simple working quickly in a way that
allows additional, more advanced implementations.

On Sat, Sep 15, 2012 at 5:47 AM, moon soo Lee <le...@gmail.com> wrote:

> for column-storage, how about leverage Hbase or Accumulo?
>
> they'll also give a chance to data update (future work?)
>
>
> On Sat, Sep 15, 2012 at 9:30 PM, Azuryy Yu <az...@gmail.com> wrote:
>
> > Hi All,
> >
> > I am interested in working on storage format. (sign up?)
> >
> > I wrote a HDFS  file format, which is similar to Sequence file (row
> > storage, block management, compress), I provide InputFormat and
> > OutputFormat,
> >
> > sometimes it get a great performance, sometimes not, depends on the data.
> >
> > for Drill, we should implement a column-storage, this can skip some
> columns
> > during query, and skip some rows within one column file. but this
> > column-storage should based on the distributed file system, such as HDFS,
> > Mapr DFS, I like Mapr DFS because of HA.
> >
> > we can implement the following column storage file format, I think it's
> > enough to us.
> >
> > http://arxiv.org/pdf/1105.4252.pdf
> >
>

Re: Storage file format

Posted by moon soo Lee <le...@gmail.com>.

For column storage, how about leveraging HBase or Accumulo?

They would also make data updates possible (future work?).


On Sat, Sep 15, 2012 at 9:30 PM, Azuryy Yu <az...@gmail.com> wrote:

> Hi All,
>
> I am interested in working on storage format. (sign up?)
>
> I wrote a HDFS  file format, which is similar to Sequence file (row
> storage, block management, compress), I provide InputFormat and
> OutputFormat,
>
> sometimes it get a great performance, sometimes not, depends on the data.
>
> for Drill, we should implement a column-storage, this can skip some columns
> during query, and skip some rows within one column file. but this
> column-storage should based on the distributed file system, such as HDFS,
> Mapr DFS, I like Mapr DFS because of HA.
>
> we can implement the following column storage file format, I think it's
> enough to us.
>
> http://arxiv.org/pdf/1105.4252.pdf
>