Posted to dev@hive.apache.org by Joydeep Sen Sarma <js...@facebook.com> on 2009/06/12 23:42:37 UTC

can RCFile* be also exported as a hadoop contrib project?

Is this columnar format reusable by other folks?

Is it possible to demonstrate the use (in case of simple delimited text data) by writing a configurable inputformat on top of this that can project out the configured columns?
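
To make the question concrete, column projection over simple delimited text could look roughly like the Python sketch below. It is only an illustration of the idea, not the RCFile or Hadoop InputFormat API; the function name `project_columns` and its `columns` and `delimiter` parameters are hypothetical.

```python
from typing import Iterable, Iterator, List, Sequence


def project_columns(
    lines: Iterable[str],
    columns: Sequence[int],
    delimiter: str = "\t",
) -> Iterator[List[str]]:
    """Yield only the configured column positions from each delimited line."""
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        # Columns missing from a short row are returned as empty strings.
        yield [fields[i] if i < len(fields) else "" for i in columns]


# Keep columns 0 and 2 of a small comma-delimited input.
rows = ["1,alice,NY\n", "2,bob,SF\n"]
projected = list(project_columns(rows, columns=[0, 2], delimiter=","))
```

With row-oriented text, every field is still read and split before the unwanted ones are dropped; a columnar layout is what would let a reader skip the unneeded columns altogether.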

The reason I am asking is that there is some parallel work going on in yahoo/pig on columnar format and no one's aware of this format - whereas it should be usable by other people (and not just Hive).

Comments?



RE: can RCFile* be also exported as a hadoop contrib project?

Posted by Ashish Thusoo <at...@facebook.com>.
It should be reusable. The work in yahoo/pig is along a different line, though, as Alan explained to me. They are actually storing the columns in different files (I think they are using the TFile stuff) rather than having columnar storage within a block...
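
The "columnar storage within a block" idea can be sketched in a few lines of Python. This is only an illustration, not RCFile's actual on-disk format; the names `to_row_groups` and `read_column` and the `group_size` parameter are hypothetical, and all rows are assumed to have the same number of fields.

```python
from typing import List, Sequence


def to_row_groups(
    rows: Sequence[Sequence[str]], group_size: int
) -> List[List[List[str]]]:
    """Split rows into groups, then store each group column by column."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        # zip(*chunk) transposes the chunk: one list per column.
        groups.append([list(col) for col in zip(*chunk)])
    return groups


def read_column(groups: List[List[List[str]]], index: int) -> List[str]:
    """Collect one column across all groups without touching the others."""
    values: List[str] = []
    for group in groups:
        values.extend(group[index])
    return values


rows = [("1", "alice"), ("2", "bob"), ("3", "carol")]
groups = to_row_groups(rows, group_size=2)
names = read_column(groups, 1)
```

Here all columns of a given row stay in the same group, whereas a file-per-column layout would write each column to its own file.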

Ashish
________________________________________
From: Joydeep Sen Sarma [jssarma@facebook.com]
Sent: Friday, June 12, 2009 2:42 PM
To: hive-dev@hadoop.apache.org
Subject: can RCFile* be also exported as a hadoop contrib project?

Is this columnar format reusable by other folks?

Is it possible to demonstrate the use (in case of simple delimited text data) by writing a configurable inputformat on top of this that can project out the configured columns?

The reason I am asking is that there is some parallel work going on in yahoo/pig on columnar format and no one's aware of this format - whereas it should be usable by other people (and not just Hive).

Comments?



Re: can RCFile* be also exported as a hadoop contrib project?

Posted by Edward Capriolo <ed...@gmail.com>.
Joydeep,

If I understand you correctly, I think the format would be useful. In
particular, it might be nice to be able to access the underlying Hive data
directly from a M/R job.

Edward

On Fri, Jun 12, 2009 at 5:42 PM, Joydeep Sen Sarma<js...@facebook.com> wrote:
> Is this columnar format reusable by other folks?
>
> Is it possible to demonstrate the use (in case of simple delimited text data) by writing a configurable inputformat on top of this that can project out the configured columns?
>
> The reason I am asking is that there is some parallel work going on in yahoo/pig on columnar format and no one's aware of this format - whereas it should be usable by other people (and not just Hive).
>
> Comments?
>
>
>

Re: can RCFile* be also exported as a hadoop contrib project?

Posted by He Yongqiang <he...@software.ict.ac.cn>.
I think it would be great to introduce RCFile to the community and gather
more feedback. I will try to get some test results.

To Jeff
>> Might be worth checking out what's in TFile also and
>> see if you can merge one into the other.
TFile is quite different from RCFile in that:
1) TFile's data layout is somewhat complicated.
2) TFile supports an InputStream pointing to the value, to avoid large memory
copies or OOM.
3) I think TFile does not support append once it is created?
I looked at TFile a long time ago, so I am not sure that what I am saying is
right for the current TFile work.

I think merging one into the other is not easy. Hadoop currently has
many file formats, such as SequenceFile, TFile, IFile, MapFile, HFile, and
RCFile. Instead of merging them, I think a better way may be to put them
into the same collection, for example, into the same package.



On 09-6-15 4:23 PM, "Zheng Shao" <zs...@gmail.com> wrote:

> Yes, we did take a look at TFile before we started. The conclusion at the
> time was that TFile was not ready yet, and as a result RCFile is
> implemented on top of SequenceFile (most of the SequenceFile code got
> reused).
> Once TFile is ready, we should think about adding columnar support to
> TFile in the same way we did for SequenceFile.
> 
> 
> Another thing: as soon as we get time, we should publish the current
> RCFile work to the Hadoop user mailing lists to get feedback from
> there. We should include an introduction to the design as well as
> performance numbers. What do you think, Yongqiang?
> 
> Zheng



Re: can RCFile* be also exported as a hadoop contrib project?

Posted by Zheng Shao <zs...@gmail.com>.
Yes, we did take a look at TFile before we started. The conclusion at the
time was that TFile was not ready yet, and as a result RCFile is
implemented on top of SequenceFile (most of the SequenceFile code got
reused).
Once TFile is ready, we should think about adding columnar support to
TFile in the same way we did for SequenceFile.


Another thing: as soon as we get time, we should publish the current
RCFile work to the Hadoop user mailing lists to get feedback from
there. We should include an introduction to the design as well as
performance numbers. What do you think, Yongqiang?

Zheng

2009/6/14 Jeff Hammerbacher <ha...@cloudera.com>

> Hey,
>
> The folks from Yahoo say that their work on TFile (see
> http://issues.apache.org/jira/browse/HADOOP-3315) is quite similar to the
> Hive work on RCFile. Might be worth checking out what's in TFile also and
> see if you can merge one into the other.
>
> Later,
> Jeff



-- 
Yours,
Zheng

Re: can RCFile* be also exported as a hadoop contrib project?

Posted by Jeff Hammerbacher <ha...@cloudera.com>.
Hey,

The folks from Yahoo say that their work on TFile (see
http://issues.apache.org/jira/browse/HADOOP-3315) is quite similar to the
Hive work on RCFile. Might be worth checking out what's in TFile also and
see if you can merge one into the other.

Later,
Jeff

2009/6/12 He Yongqiang <he...@software.ict.ac.cn>

> I think this would be great. We should avoid duplicate work and
> combine our development efforts instead of isolating them. I completely
> support this :).

Re: can RCFile* be also exported as a hadoop contrib project?

Posted by He Yongqiang <he...@software.ict.ac.cn>.
I think this would be great. We should avoid duplicate work and
combine our development efforts instead of isolating them. I completely
support this :).


On 09-6-13 5:42 AM, "Joydeep Sen Sarma" <js...@facebook.com> wrote:

> Is this columnar format reusable by other folks?
> 
> Is it possible to demonstrate the use (in case of simple delimited text data)
> by writing a configurable inputformat on top of this that can project out the
> configured columns?
> 
> The reason I am asking is that there is some parallel work going on in
> yahoo/pig on columnar format and no one's aware of this format - whereas it
> should be usable by other people (and not just Hive).
> 
> Comments?
> 
>