Posted to common-user@hadoop.apache.org by Juho Mäkinen <ju...@gmail.com> on 2008/09/02 09:53:46 UTC

Reading and writing Thrift data from MapReduce

We are already using Thrift to move and store our log data, and I'm
looking into how I could read the stored log data into MapReduce
processes. This article,
http://www.lexemetech.com/2008/07/rpc-and-serialization-with-hadoop.html,
talks about using Thrift for the I/O, but it doesn't say anything
specific.

What's the current status of Thrift with Hadoop? Is there any
documentation online or even some code in the SVN which I could look
into?

 - Juho Mäkinen

RE: Reading and writing Thrift data from MapReduce

Posted by Ashish Thusoo <at...@facebook.com>.
Hi Juho,

Hive can support your partitioning scheme. Just use the partition by clause in the create table statement to identify the partitioning columns, with the top-level partitioning column being the first one in the list.

Ashish 

-----Original Message-----
From: Jeff Hammerbacher [mailto:jeff.hammerbacher@gmail.com] 
Sent: Thursday, September 04, 2008 5:24 AM
To: core-user@hadoop.apache.org
Subject: Re: Reading and writing Thrift data from MapReduce

Hey Juho,

Hive works with 0.17 and is in active use at Facebook using that version of Hadoop, so you should be good to go using Hive right now.

Regards,
Jeff

On Wed, Sep 3, 2008 at 12:52 PM, Juho Mäkinen <ju...@gmail.com> wrote:
> Thanks Pete!
>
> So far it looks good on my end: I can use my event logging framework
> to store the events to HDFS, run vanilla MapReduce jobs on them (using
> SerDe to deserialize the Thrift data into Java objects) and keep the
> possibility of using Hive in the future. I'm also interested in trying
> out Pig (I might need to implement some glue from SerDe to Pig, but I
> believe that's easy). But I still have one question remaining:
>
> I'm currently storing (I just got the storing code ready this
> afternoon) the events in SequenceFiles, with the timestamp as the key
> and the Thrift-serialized bytes (as BytesWritable) as the value. One
> file will contain one hour's worth of data, so for one day I'll have 24
> different files. Also, each file is placed into a subdirectory which
> contains one week's files (identified by week number). So today's files
> will be placed under "/events/thrift/MyThriftStructureName/36/".
>
> Can Hive understand this type of data partitioning? Should I change
> this partitioning, or can I create some class which makes Hive
> understand it with little effort?
>
> So far Hive looks like a good way to solve some of our problems, but I
> want to wait at least for the Hadoop 0.19.0 release (which should have
> Hive included). But I also want to start collecting the data right now
> and keep my system ready to easily support Hive in the future.
>
> Thanks,
>
>  - Juho Mäkinen
>
>
> On Wed, Sep 3, 2008 at 8:30 PM, Pete Wyckoff <pw...@facebook.com> wrote:
>>
>> Hi Juho,
>>
>> Excellent - yes, we do something very similar using Thrift, Scribe
>> (our soon-to-be-open-sourced logging framework) and SerDes.
>>
>> The SerDe is a uniform interface to data for serialization and
>> deserialization. If you look at the interface, it provides four methods:
>> serialize, deserialize, and two methods to query the type information.
>> It supports Thrift, Jute, and control-delimited data and can be
>> easily extended to support things like Protocol Buffers.
>>
>> So, Hive doesn't care about the actual data format in HDFS, although
>> its default native serialization for Text data is control-separated.
>>
>> For Thrift, you could use ThriftSerDe, passing in a Properties object
>> to its initializer that includes the name of the Thrift class it is representing.
>> In Hive, we use the Hive Metastore to store this information, so the
>> runtime just passes the name of the "table" to the MS, which returns
>> the information needed to instantiate the SerDe.
>>
>> We are interested, however, in integrating with 
>> https://issues.apache.org/jira/browse/HADOOP-3787, but we haven't 
>> looked at this much yet.
>>
>> Thanks, pete
>>
>>
>> On 9/3/08 12:32 AM, "Juho Mäkinen" <ju...@gmail.com> wrote:
>>
>>> Thanks Jeff. I believe you mean the serde module inside Hadoop
>>> (hadoop-core-trunk\src\contrib\hive\serde)?
>>> I'm currently looking into it, but it seems to lack a lot of useful
>>> documentation, so it'll take me some time to figure it out (all
>>> additional info is appreciated).
>>>
>>> I've already put some effort into this and designed a partial
>>> solution for my log analysis, which so far seems OK to me. As I don't
>>> know the details of SerDe yet, I'm not sure if this is the way I
>>> should go, or whether I should change my implementation and plans so
>>> that I could use SerDe (if it makes my job easier). I'm not yet
>>> interested in Hive, but I'd like to keep the option open for the
>>> future, so that I could easily run Hive on my data (so that I would
>>> not need to transform my data for Hive if I choose to use it later).
>>>
>>> Currently I've come up with the following design:
>>> 1) Each log event type has its own Thrift structure. The structure is
>>> compiled into PHP code. The log entry creator creates and populates
>>> the structure's PHP object with data and sends it to be stored.
>>> 2) The log sender object receives this object ($tbase) and serializes
>>> it with TBinaryProtocol over a memory buffer transport, adds the
>>> structure name to the beginning and sends the byte array to the log
>>> receiver over UDP. The following code does this:
>>>
>>> $this->transport = new TResetableMemoryBuffer(); // a TMemoryBuffer with a reset() method
>>> $this->protocol = new TBinaryProtocol($this->transport);
>>> $this->transport->open();
>>>
>>> $this->transport->reset(); // Reset the memory buffer array
>>> $this->protocol->writeByte(1); // version 1: we have the TBase name in string
>>> $this->protocol->writeString($tbase->getName()); // Name of the structure
>>> $tbase->write($this->protocol); // Serialize our thrift structure to the memory buffer
>>>
>>> $this->sendBytes($this->transport->getBuffer());
>>>
>>> 3) The log receiver reads the structure name and stores the byte array
>>> (without the version byte and structure name) into the HDFS file
>>> "/events/<insert structure name here>/<week
>>> number>/<timestamp>.datafile"
>>>
>>> My plan is that I could read the stored entries using MapReduce,
>>> deserialize them into Java objects (the map-reducer would need to have
>>> the compiled Thrift structures available) and use the structures
>>> directly in Map operations. (How) can SerDe help me with this part?
>>> Should I modify my plans so that I could use Hive directly in the
>>> future? How does Hive store the Thrift-serialized log data in HDFS?
>>>
>>>  - Juho Mäkinen
>>>
>>> On Wed, Sep 3, 2008 at 7:37 AM, Jeff Hammerbacher 
>>> <je...@gmail.com> wrote:
>>>> Hey Juho,
>>>>
>>>> You should check out Hive
>>>> (https://issues.apache.org/jira/browse/HADOOP-3601), which was just 
>>>> committed to the Hadoop trunk today. It's what we use at Facebook 
>>>> to query our collection of Thrift-serialized logfiles. Inside of 
>>>> the Hive code, you'll find a pure-Java (using JavaCC) parser for 
>>>> Thrift-serialized data structures.
>>>>
>>>> Regards,
>>>> Jeff
>>>>
>>>> On Tue, Sep 2, 2008 at 6:57 AM, Stuart Sierra <ma...@stuartsierra.com> wrote:
>>>>> On Tue, Sep 2, 2008 at 3:53 AM, Juho Mäkinen <ju...@gmail.com> wrote:
>>>>>> What's the current status of Thrift with Hadoop? Is there any 
>>>>>> documentation online or even some code in the SVN which I could 
>>>>>> look into?
>>>>>
>>>>> I think you have two choices: 1) wrap your Thrift code in a class 
>>>>> that implements Writable, or 2) use Thrift to serialize your data 
>>>>> to byte arrays and store them as BytesWritable.
>>>>> -Stuart
>

Re: Reading and writing Thrift data from MapReduce

Posted by Jeff Hammerbacher <je...@gmail.com>.
Hey Juho,

Hive works with 0.17 and is in active use at Facebook using that
version of Hadoop, so you should be good to go using Hive right now.

Regards,
Jeff

On Wed, Sep 3, 2008 at 12:52 PM, Juho Mäkinen <ju...@gmail.com> wrote:
> Thanks Pete!
>
> So far it looks good on my end: I can use my event logging framework
> to store the events to HDFS, run vanilla MapReduce jobs on them (using
> SerDe to deserialize the Thrift data into Java objects) and keep the
> possibility of using Hive in the future. I'm also interested in trying
> out Pig (I might need to implement some glue from SerDe to Pig, but I
> believe that's easy). But I still have one question remaining:
>
> I'm currently storing (I just got the storing code ready this
> afternoon) the events in SequenceFiles, with the timestamp as the key
> and the Thrift-serialized bytes (as BytesWritable) as the value. One
> file will contain one hour's worth of data, so for one day I'll have 24
> different files. Also, each file is placed into a subdirectory which
> contains one week's files (identified by week number). So today's files
> will be placed under "/events/thrift/MyThriftStructureName/36/".
>
> Can Hive understand this type of data partitioning? Should I change
> this partitioning, or can I create some class which makes Hive
> understand it with little effort?
>
> So far Hive looks like a good way to solve some of our problems, but I
> want to wait at least for the Hadoop 0.19.0 release (which should have
> Hive included). But I also want to start collecting the data right now
> and keep my system ready to easily support Hive in the future.
>
> Thanks,
>
>  - Juho Mäkinen
>
>
> On Wed, Sep 3, 2008 at 8:30 PM, Pete Wyckoff <pw...@facebook.com> wrote:
>>
>> Hi Juho,
>>
>> Excellent - yes, we do something very similar using Thrift, Scribe
>> (our soon-to-be-open-sourced logging framework) and SerDes.
>>
>> The SerDe is a uniform interface to data for serialization and
>> deserialization. If you look at the interface, it provides four methods:
>> serialize, deserialize, and two methods to query the type information.
>> It supports Thrift, Jute, and control-delimited data and can be
>> easily extended to support things like Protocol Buffers.
>>
>> So, Hive doesn't care about the actual data format in HDFS, although
>> its default native serialization for Text data is control-separated.
>>
>> For Thrift, you could use ThriftSerDe, passing in a Properties object
>> to its initializer that includes the name of the Thrift class it is representing.
>> In Hive, we use the Hive Metastore to store this information, so the
>> runtime just passes the name of the "table" to the MS, which returns
>> the information needed to instantiate the SerDe.
>>
>> We are interested, however, in integrating with
>> https://issues.apache.org/jira/browse/HADOOP-3787, but we haven't looked at
>> this much yet.
>>
>> Thanks, pete
>>
>>
>> On 9/3/08 12:32 AM, "Juho Mäkinen" <ju...@gmail.com> wrote:
>>
>>> Thanks Jeff. I believe you mean the serde module inside Hadoop
>>> (hadoop-core-trunk\src\contrib\hive\serde)?
>>> I'm currently looking into it, but it seems to lack a lot of useful
>>> documentation, so it'll take me some time to figure it out (all
>>> additional info is appreciated).
>>>
>>> I've already put some effort into this and designed a partial
>>> solution for my log analysis, which so far seems OK to me. As I don't
>>> know the details of SerDe yet, I'm not sure if this is the way I
>>> should go, or whether I should change my implementation and plans so
>>> that I could use SerDe (if it makes my job easier). I'm not yet
>>> interested in Hive, but I'd like to keep the option open for the
>>> future, so that I could easily run Hive on my data (so that I would
>>> not need to transform my data for Hive if I choose to use it later).
>>>
>>> Currently I've come up with the following design:
>>> 1) Each log event type has its own Thrift structure. The structure is
>>> compiled into PHP code. The log entry creator creates and populates
>>> the structure's PHP object with data and sends it to be stored.
>>> 2) The log sender object receives this object ($tbase) and serializes
>>> it with TBinaryProtocol over a memory buffer transport, adds the
>>> structure name to the beginning and sends the byte array to the log
>>> receiver over UDP. The following code does this:
>>>
>>> $this->transport = new TResetableMemoryBuffer(); // a TMemoryBuffer
>>> with a reset() method
>>> $this->protocol = new TBinaryProtocol($this->transport);
>>> $this->transport->open();
>>>
>>> $this->transport->reset(); // Reset the memory buffer array
>>> $this->protocol->writeByte(1); // version 1: we have the TBase name in string
>>> $this->protocol->writeString($tbase->getName()); // Name of the structure
>>> $tbase->write($this->protocol); // Serialize our thrift structure to
>>> the memory buffer
>>>
>>> $this->sendBytes($this->transport->getBuffer());
>>>
>>> 3) The log receiver reads the structure name and stores the byte array
>>> (without the version byte and structure name) into the HDFS file
>>> "/events/<insert structure name here>/<week
>>> number>/<timestamp>.datafile"
>>>
>>> My plan is that I could read the stored entries using MapReduce,
>>> deserialize them into Java objects (the map-reducer would need to have
>>> the compiled Thrift structures available) and use the structures
>>> directly in Map operations. (How) can SerDe help me with this part?
>>> Should I modify my plans so that I could use Hive directly in the
>>> future? How does Hive store the Thrift-serialized log data in HDFS?
>>>
>>>  - Juho Mäkinen
>>>
>>> On Wed, Sep 3, 2008 at 7:37 AM, Jeff Hammerbacher
>>> <je...@gmail.com> wrote:
>>>> Hey Juho,
>>>>
>>>> You should check out Hive
>>>> (https://issues.apache.org/jira/browse/HADOOP-3601), which was just
>>>> committed to the Hadoop trunk today. It's what we use at Facebook to
>>>> query our collection of Thrift-serialized logfiles. Inside of the Hive
>>>> code, you'll find a pure-Java (using JavaCC) parser for
>>>> Thrift-serialized data structures.
>>>>
>>>> Regards,
>>>> Jeff
>>>>
>>>> On Tue, Sep 2, 2008 at 6:57 AM, Stuart Sierra <ma...@stuartsierra.com> wrote:
>>>>> On Tue, Sep 2, 2008 at 3:53 AM, Juho Mäkinen <ju...@gmail.com> wrote:
>>>>>> What's the current status of Thrift with Hadoop? Is there any
>>>>>> documentation online or even some code in the SVN which I could look
>>>>>> into?
>>>>>
>>>>> I think you have two choices: 1) wrap your Thrift code in a class that
>>>>> implements Writable, or 2) use Thrift to serialize your data to byte
>>>>> arrays and store them as BytesWritable.
>>>>> -Stuart
>

Re: Reading and writing Thrift data from MapReduce

Posted by Juho Mäkinen <ju...@gmail.com>.
Thanks Pete!

So far it looks good on my end: I can use my event logging framework
to store the events to HDFS, run vanilla MapReduce jobs on them (using
SerDe to deserialize the Thrift data into Java objects) and keep the
possibility of using Hive in the future. I'm also interested in trying
out Pig (I might need to implement some glue from SerDe to Pig, but I
believe that's easy). But I still have one question remaining:

I'm currently storing (I just got the storing code ready this
afternoon) the events in SequenceFiles, with the timestamp as the key
and the Thrift-serialized bytes (as BytesWritable) as the value. One
file will contain one hour's worth of data, so for one day I'll have 24
different files. Also, each file is placed into a subdirectory which
contains one week's files (identified by week number). So today's files
will be placed under "/events/thrift/MyThriftStructureName/36/".
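
For illustration, a minimal sketch of that storage step might look like
the following, using Hadoop's SequenceFile API. The path, the hourly file
name and the getSerializedEvent() helper are made up here, and exact API
details vary a little between Hadoop versions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;

public class EventFileWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Hypothetical hourly file under the weekly directory described above
    Path path = new Path("/events/thrift/MyThriftStructureName/36/2008-09-03-14.seq");

    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, path, LongWritable.class, BytesWritable.class);
    try {
      long timestamp = System.currentTimeMillis();
      byte[] payload = getSerializedEvent();  // hypothetical helper: the Thrift-serialized event bytes
      writer.append(new LongWritable(timestamp), new BytesWritable(payload));
    } finally {
      writer.close();
    }
  }

  private static byte[] getSerializedEvent() {
    return new byte[0];  // placeholder for the bytes received from the log sender
  }
}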

Can Hive understand this type of data partitioning? Should I change
this partitioning, or can I create some class which makes Hive
understand it with little effort?

So far Hive looks like a good way to solve some of our problems, but I
want to wait at least for the Hadoop 0.19.0 release (which should have
Hive included). But I also want to start collecting the data right now
and keep my system ready to easily support Hive in the future.

Thanks,

 - Juho Mäkinen


On Wed, Sep 3, 2008 at 8:30 PM, Pete Wyckoff <pw...@facebook.com> wrote:
>
> Hi Juho,
>
> Excellent - yes, we do something very similar using Thrift, Scribe
> (our soon-to-be-open-sourced logging framework) and SerDes.
>
> The SerDe is a uniform interface to data for serialization and
> deserialization. If you look at the interface, it provides four methods:
> serialize, deserialize, and two methods to query the type information.
> It supports Thrift, Jute, and control-delimited data and can be
> easily extended to support things like Protocol Buffers.
>
> So, Hive doesn't care about the actual data format in HDFS, although
> its default native serialization for Text data is control-separated.
>
> For Thrift, you could use ThriftSerDe, passing in a Properties object
> to its initializer that includes the name of the Thrift class it is representing.
> In Hive, we use the Hive Metastore to store this information, so the
> runtime just passes the name of the "table" to the MS, which returns
> the information needed to instantiate the SerDe.
>
> We are interested, however, in integrating with
> https://issues.apache.org/jira/browse/HADOOP-3787, but we haven't looked at
> this much yet.
>
> Thanks, pete
>
>
> On 9/3/08 12:32 AM, "Juho Mäkinen" <ju...@gmail.com> wrote:
>
>> Thanks Jeff. I believe you mean the serde module inside Hadoop
>> (hadoop-core-trunk\src\contrib\hive\serde)?
>> I'm currently looking into it, but it seems to lack a lot of useful
>> documentation, so it'll take me some time to figure it out (all
>> additional info is appreciated).
>>
>> I've already put some effort into this and designed a partial
>> solution for my log analysis, which so far seems OK to me. As I don't
>> know the details of SerDe yet, I'm not sure if this is the way I
>> should go, or whether I should change my implementation and plans so
>> that I could use SerDe (if it makes my job easier). I'm not yet
>> interested in Hive, but I'd like to keep the option open for the
>> future, so that I could easily run Hive on my data (so that I would
>> not need to transform my data for Hive if I choose to use it later).
>>
>> Currently I've come up with the following design:
>> 1) Each log event type has its own Thrift structure. The structure is
>> compiled into PHP code. The log entry creator creates and populates
>> the structure's PHP object with data and sends it to be stored.
>> 2) The log sender object receives this object ($tbase) and serializes
>> it with TBinaryProtocol over a memory buffer transport, adds the
>> structure name to the beginning and sends the byte array to the log
>> receiver over UDP. The following code does this:
>>
>> $this->transport = new TResetableMemoryBuffer(); // a TMemoryBuffer
>> with a reset() method
>> $this->protocol = new TBinaryProtocol($this->transport);
>> $this->transport->open();
>>
>> $this->transport->reset(); // Reset the memory buffer array
>> $this->protocol->writeByte(1); // version 1: we have the TBase name in string
>> $this->protocol->writeString($tbase->getName()); // Name of the structure
>> $tbase->write($this->protocol); // Serialize our thrift structure to
>> the memory buffer
>>
>> $this->sendBytes($this->transport->getBuffer());
>>
>> 3) The log receiver reads the structure name and stores the byte array
>> (without the version byte and structure name) into the HDFS file
>> "/events/<insert structure name here>/<week
>> number>/<timestamp>.datafile"
>>
>> My plan is that I could read the stored entries using MapReduce,
>> deserialize them into Java objects (the map-reducer would need to have
>> the compiled Thrift structures available) and use the structures
>> directly in Map operations. (How) can SerDe help me with this part?
>> Should I modify my plans so that I could use Hive directly in the
>> future? How does Hive store the Thrift-serialized log data in HDFS?
>>
>>  - Juho Mäkinen
>>
>> On Wed, Sep 3, 2008 at 7:37 AM, Jeff Hammerbacher
>> <je...@gmail.com> wrote:
>>> Hey Juho,
>>>
>>> You should check out Hive
>>> (https://issues.apache.org/jira/browse/HADOOP-3601), which was just
>>> committed to the Hadoop trunk today. It's what we use at Facebook to
>>> query our collection of Thrift-serialized logfiles. Inside of the Hive
>>> code, you'll find a pure-Java (using JavaCC) parser for
>>> Thrift-serialized data structures.
>>>
>>> Regards,
>>> Jeff
>>>
>>> On Tue, Sep 2, 2008 at 6:57 AM, Stuart Sierra <ma...@stuartsierra.com> wrote:
>>>> On Tue, Sep 2, 2008 at 3:53 AM, Juho Mäkinen <ju...@gmail.com> wrote:
>>>>> What's the current status of Thrift with Hadoop? Is there any
>>>>> documentation online or even some code in the SVN which I could look
>>>>> into?
>>>>
>>>> I think you have two choices: 1) wrap your Thrift code in a class that
>>>> implements Writable, or 2) use Thrift to serialize your data to byte
>>>> arrays and store them as BytesWritable.
>>>> -Stuart

Re: Reading and writing Thrift data from MapReduce

Posted by Tom White <to...@gmail.com>.
Hi Juho,

I think you should be able to use the Thrift serialization stuff that
I've been working on in
https://issues.apache.org/jira/browse/HADOOP-3787 - at least as a
basis. Since you are not using sequence files, you will need to write
an InputFormat (probably one that extends FileInputFormat) and an
associated RecordReader that knows how to break the input into logical
records. See SequenceFileInputFormat for the kind of thing. Also,
since you store the Thrift type's name in the data, you can use a
variant of ThriftDeserializer that first reads the type name and
instantiates an instance of the type before reading its fields from
the stream.
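
Roughly, such a variant could look like the sketch below. This is not the
HADOOP-3787 code, just an illustration of the idea: read the type name
first, instantiate that class reflectively, and let the generated Thrift
object read its own fields. It assumes the stored name maps to a loadable
Java class, and note that the Thrift Java package name has changed between
releases (com.facebook.thrift vs. org.apache.thrift), so adjust the imports
to your version:

import java.io.InputStream;
import org.apache.thrift.TBase;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TProtocol;
import org.apache.thrift.transport.TIOStreamTransport;

public class NamedThriftReader {
  // Reads one record framed as: <type name written with writeString()><serialized struct>
  public static TBase readNamedStruct(InputStream in) throws Exception {
    TProtocol protocol = new TBinaryProtocol(new TIOStreamTransport(in));
    String typeName = protocol.readString();              // type name stored ahead of the struct
    TBase struct = (TBase) Class.forName(typeName).newInstance();
    struct.read(protocol);                                // the generated class reads its own fields
    return struct;
  }
}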

Hope this helps.

Tom

On Wed, Sep 3, 2008 at 8:32 AM, Juho Mäkinen <ju...@gmail.com> wrote:
> Thanks Jeff. I believe you mean the serde module inside Hadoop
> (hadoop-core-trunk\src\contrib\hive\serde)?
> I'm currently looking into it, but it seems to lack a lot of useful
> documentation, so it'll take me some time to figure it out (all
> additional info is appreciated).
>
> I've already put some effort into this and designed a partial
> solution for my log analysis, which so far seems OK to me. As I don't
> know the details of SerDe yet, I'm not sure if this is the way I
> should go, or whether I should change my implementation and plans so
> that I could use SerDe (if it makes my job easier). I'm not yet
> interested in Hive, but I'd like to keep the option open for the
> future, so that I could easily run Hive on my data (so that I would
> not need to transform my data for Hive if I choose to use it later).
>
> Currently I've come up with the following design:
> 1) Each log event type has its own Thrift structure. The structure is
> compiled into PHP code. The log entry creator creates and populates
> the structure's PHP object with data and sends it to be stored.
> 2) The log sender object receives this object ($tbase) and serializes
> it with TBinaryProtocol over a memory buffer transport, adds the
> structure name to the beginning and sends the byte array to the log
> receiver over UDP. The following code does this:
>
> $this->transport = new TResetableMemoryBuffer(); // a TMemoryBuffer
> with a reset() method
> $this->protocol = new TBinaryProtocol($this->transport);
> $this->transport->open();
>
> $this->transport->reset(); // Reset the memory buffer array
> $this->protocol->writeByte(1); // version 1: we have the TBase name in string
> $this->protocol->writeString($tbase->getName()); // Name of the structure
> $tbase->write($this->protocol); // Serialize our thrift structure to
> the memory buffer
>
> $this->sendBytes($this->transport->getBuffer());
>
> 3) The log receiver reads the structure name and stores the byte array
> (without the version byte and structure name) into the HDFS file
> "/events/<insert structure name here>/<week
> number>/<timestamp>.datafile"
>
> My plan is that I could read the stored entries using MapReduce,
> deserialize them into Java objects (the map-reducer would need to have
> the compiled Thrift structures available) and use the structures
> directly in Map operations. (How) can SerDe help me with this part?
> Should I modify my plans so that I could use Hive directly in the
> future? How does Hive store the Thrift-serialized log data in HDFS?
>
>  - Juho Mäkinen
>
> On Wed, Sep 3, 2008 at 7:37 AM, Jeff Hammerbacher
> <je...@gmail.com> wrote:
>> Hey Juho,
>>
>> You should check out Hive
>> (https://issues.apache.org/jira/browse/HADOOP-3601), which was just
>> committed to the Hadoop trunk today. It's what we use at Facebook to
>> query our collection of Thrift-serialized logfiles. Inside of the Hive
>> code, you'll find a pure-Java (using JavaCC) parser for
>> Thrift-serialized data structures.
>>
>> Regards,
>> Jeff
>>
>> On Tue, Sep 2, 2008 at 6:57 AM, Stuart Sierra <ma...@stuartsierra.com> wrote:
>>> On Tue, Sep 2, 2008 at 3:53 AM, Juho Mäkinen <ju...@gmail.com> wrote:
>>>> What's the current status of Thrift with Hadoop? Is there any
>>>> documentation online or even some code in the SVN which I could look
>>>> into?
>>>
>>> I think you have two choices: 1) wrap your Thrift code in a class that
>>> implements Writable, or 2) use Thrift to serialize your data to byte
>>> arrays and store them as BytesWritable.
>>> -Stuart
>>>
>>
>

Re: Reading and writing Thrift data from MapReduce

Posted by Pete Wyckoff <pw...@facebook.com>.
Hi Juho,

Excellent - yes, we do something very similar using Thrift, Scribe (our
soon-to-be-open-sourced logging framework) and SerDes.

The SerDe is a uniform interface to data for serialization and
deserialization. If you look at the interface, it provides four methods:
serialize, deserialize, and two methods to query the type information. It
supports Thrift, Jute, and control-delimited data, and can be easily
extended to support things like Protocol Buffers.

So, Hive doesn't care about the actual data format in HDFS, although its
default native serialization for Text data is control-separated.

For Thrift, you could use ThriftSerDe, passing in a Properties object to its
initializer that includes the name of the Thrift class it is representing.
In Hive, we use the Hive Metastore to store this information, so the runtime
just passes the name of the "table" to the MS, which returns the information
needed to instantiate the SerDe.

We are interested, however, in integrating with
https://issues.apache.org/jira/browse/HADOOP-3787, but we haven't looked at
this much yet.

Thanks, pete


On 9/3/08 12:32 AM, "Juho Mäkinen" <ju...@gmail.com> wrote:

> Thanks Jeff. I believe you mean the serde module inside Hadoop
> (hadoop-core-trunk\src\contrib\hive\serde)?
> I'm currently looking into it, but it seems to lack a lot of useful
> documentation, so it'll take me some time to figure it out (all
> additional info is appreciated).
>
> I've already put some effort into this and designed a partial
> solution for my log analysis, which so far seems OK to me. As I don't
> know the details of SerDe yet, I'm not sure if this is the way I
> should go, or whether I should change my implementation and plans so
> that I could use SerDe (if it makes my job easier). I'm not yet
> interested in Hive, but I'd like to keep the option open for the
> future, so that I could easily run Hive on my data (so that I would
> not need to transform my data for Hive if I choose to use it later).
>
> Currently I've come up with the following design:
> 1) Each log event type has its own Thrift structure. The structure is
> compiled into PHP code. The log entry creator creates and populates
> the structure's PHP object with data and sends it to be stored.
> 2) The log sender object receives this object ($tbase) and serializes
> it with TBinaryProtocol over a memory buffer transport, adds the
> structure name to the beginning and sends the byte array to the log
> receiver over UDP. The following code does this:
> 
> $this->transport = new TResetableMemoryBuffer(); // a TMemoryBuffer
> with a reset() method
> $this->protocol = new TBinaryProtocol($this->transport);
> $this->transport->open();
> 
> $this->transport->reset(); // Reset the memory buffer array
> $this->protocol->writeByte(1); // version 1: we have the TBase name in string
> $this->protocol->writeString($tbase->getName()); // Name of the structure
> $tbase->write($this->protocol); // Serialize our thrift structure to
> the memory buffer
> 
> $this->sendBytes($this->transport->getBuffer());
> 
> 3) The log receiver reads the structure name and stores the byte array
> (without the version byte and structure name) into the HDFS file
> "/events/<insert structure name here>/<week
> number>/<timestamp>.datafile"
>
> My plan is that I could read the stored entries using MapReduce,
> deserialize them into Java objects (the map-reducer would need to have
> the compiled Thrift structures available) and use the structures
> directly in Map operations. (How) can SerDe help me with this part?
> Should I modify my plans so that I could use Hive directly in the
> future? How does Hive store the Thrift-serialized log data in HDFS?
> 
>  - Juho Mäkinen
> 
> On Wed, Sep 3, 2008 at 7:37 AM, Jeff Hammerbacher
> <je...@gmail.com> wrote:
>> Hey Juho,
>> 
>> You should check out Hive
>> (https://issues.apache.org/jira/browse/HADOOP-3601), which was just
>> committed to the Hadoop trunk today. It's what we use at Facebook to
>> query our collection of Thrift-serialized logfiles. Inside of the Hive
>> code, you'll find a pure-Java (using JavaCC) parser for
>> Thrift-serialized data structures.
>> 
>> Regards,
>> Jeff
>> 
>> On Tue, Sep 2, 2008 at 6:57 AM, Stuart Sierra <ma...@stuartsierra.com> wrote:
>>> On Tue, Sep 2, 2008 at 3:53 AM, Juho Mäkinen <ju...@gmail.com> wrote:
>>>> What's the current status of Thrift with Hadoop? Is there any
>>>> documentation online or even some code in the SVN which I could look
>>>> into?
>>> 
>>> I think you have two choices: 1) wrap your Thrift code in a class that
>>> implements Writable, or 2) use Thrift to serialize your data to byte
>>> arrays and store them as BytesWritable.
>>> -Stuart
>>> 
>> 


Re: Reading and writing Thrift data from MapReduce

Posted by Juho Mäkinen <ju...@gmail.com>.
Thanks Jeff. I believe you mean the serde module inside Hadoop
(hadoop-core-trunk\src\contrib\hive\serde)?
I'm currently looking into it, but it seems to lack a lot of useful
documentation, so it'll take me some time to figure it out (all
additional info is appreciated).

I've already put some effort into this and designed a partial
solution for my log analysis, which so far seems OK to me. As I don't
know the details of SerDe yet, I'm not sure if this is the way I
should go, or whether I should change my implementation and plans so
that I could use SerDe (if it makes my job easier). I'm not yet
interested in Hive, but I'd like to keep the option open for the
future, so that I could easily run Hive on my data (so that I would
not need to transform my data for Hive if I choose to use it later).

Currently I've come up with the following design:
1) Each log event type has its own Thrift structure. The structure is
compiled into PHP code. The log entry creator creates and populates
the structure's PHP object with data and sends it to be stored.
2) The log sender object receives this object ($tbase) and serializes
it with TBinaryProtocol over a memory buffer transport, adds the
structure name to the beginning and sends the byte array to the log
receiver over UDP. The following code does this:

$this->transport = new TResetableMemoryBuffer(); // a TMemoryBuffer
with a reset() method
$this->protocol = new TBinaryProtocol($this->transport);
$this->transport->open();

$this->transport->reset(); // Reset the memory buffer array
$this->protocol->writeByte(1); // version 1: we have the TBase name in string
$this->protocol->writeString($tbase->getName()); // Name of the structure
$tbase->write($this->protocol); // Serialize our thrift structure to
the memory buffer
		
$this->sendBytes($this->transport->getBuffer());

3) The log receiver reads the structure name and stores the byte array
(without the version byte and structure name) into the HDFS file
"/events/<insert structure name here>/<week
number>/<timestamp>.datafile"
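
As an illustration of step 3, the receiver-side parsing might look like
the hedged sketch below. It is not the actual receiver code, it just
mirrors the framing produced by the PHP sender above, where writeByte()
emits a single byte and TBinaryProtocol's writeString() emits a 4-byte
big-endian length followed by the UTF-8 name:

import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

public class EventFrameParser {
  // Strips the version byte and structure name, returning only the serialized struct bytes.
  public static byte[] stripHeader(byte[] datagram) throws IOException {
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(datagram));
    byte version = in.readByte();                    // version 1 in the sender above
    int nameLength = in.readInt();                   // writeString() prefixes a 4-byte length
    byte[] nameBytes = new byte[nameLength];
    in.readFully(nameBytes);
    String structureName = new String(nameBytes, "UTF-8");  // picks the HDFS directory, e.g. /events/<name>/...

    byte[] payload = new byte[in.available()];       // whatever remains is the Thrift-serialized struct
    in.readFully(payload);
    return payload;
  }
}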

My plan is that I could read the stored entries using MapReduce,
deserialize them into Java objects (the map-reducer would need to have
the compiled Thrift structures available) and use the structures
directly in Map operations. (How) can SerDe help me with this part?
Should I modify my plans so that I could use Hive directly in the
future? How does Hive store the Thrift-serialized log data in HDFS?
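
A rough sketch of that reading side (plain MapReduce, not SerDe) might
look like this, using the old org.apache.hadoop.mapred API. MyEvent stands
in for one of the generated Thrift classes, the output is purely
illustrative, and the exact BytesWritable and Thrift package details vary
slightly between releases:

import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.thrift.TDeserializer;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;

public class EventMapper extends MapReduceBase
    implements Mapper<LongWritable, BytesWritable, Text, LongWritable> {

  private final TDeserializer deserializer = new TDeserializer(new TBinaryProtocol.Factory());

  public void map(LongWritable timestamp, BytesWritable value,
                  OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    // Copy only the valid portion of the BytesWritable buffer before deserializing
    byte[] bytes = new byte[value.getLength()];
    System.arraycopy(value.getBytes(), 0, bytes, 0, bytes.length);

    MyEvent event = new MyEvent();                 // hypothetical generated Thrift class
    try {
      deserializer.deserialize(event, bytes);
    } catch (TException e) {
      throw new IOException("Could not deserialize event", e);
    }
    // From here the Thrift struct's fields are available directly to the map logic
    output.collect(new Text("events"), new LongWritable(1));
  }
}

The job would be configured with SequenceFileInputFormat so that each map()
call receives one (timestamp, serialized event) record from the hourly files.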

 - Juho Mäkinen

On Wed, Sep 3, 2008 at 7:37 AM, Jeff Hammerbacher
<je...@gmail.com> wrote:
> Hey Juho,
>
> You should check out Hive
> (https://issues.apache.org/jira/browse/HADOOP-3601), which was just
> committed to the Hadoop trunk today. It's what we use at Facebook to
> query our collection of Thrift-serialized logfiles. Inside of the Hive
> code, you'll find a pure-Java (using JavaCC) parser for
> Thrift-serialized data structures.
>
> Regards,
> Jeff
>
> On Tue, Sep 2, 2008 at 6:57 AM, Stuart Sierra <ma...@stuartsierra.com> wrote:
>> On Tue, Sep 2, 2008 at 3:53 AM, Juho Mäkinen <ju...@gmail.com> wrote:
>>> What's the current status of Thrift with Hadoop? Is there any
>>> documentation online or even some code in the SVN which I could look
>>> into?
>>
>> I think you have two choices: 1) wrap your Thrift code in a class that
>> implements Writable, or 2) use Thrift to serialize your data to byte
>> arrays and store them as BytesWritable.
>> -Stuart
>>
>

Re: Reading and writing Thrift data from MapReduce

Posted by Jeff Hammerbacher <je...@gmail.com>.
Hey Juho,

You should check out Hive
(https://issues.apache.org/jira/browse/HADOOP-3601), which was just
committed to the Hadoop trunk today. It's what we use at Facebook to
query our collection of Thrift-serialized logfiles. Inside of the Hive
code, you'll find a pure-Java (using JavaCC) parser for
Thrift-serialized data structures.

Regards,
Jeff

On Tue, Sep 2, 2008 at 6:57 AM, Stuart Sierra <ma...@stuartsierra.com> wrote:
> On Tue, Sep 2, 2008 at 3:53 AM, Juho Mäkinen <ju...@gmail.com> wrote:
>> What's the current status of Thrift with Hadoop? Is there any
>> documentation online or even some code in the SVN which I could look
>> into?
>
> I think you have two choices: 1) wrap your Thrift code in a class that
> implements Writable, or 2) use Thrift to serialize your data to byte
> arrays and store them as BytesWritable.
> -Stuart
>

Re: Reading and writing Thrift data from MapReduce

Posted by Stuart Sierra <ma...@stuartsierra.com>.
On Tue, Sep 2, 2008 at 3:53 AM, Juho Mäkinen <ju...@gmail.com> wrote:
> What's the current status of Thrift with Hadoop? Is there any
> documentation online or even some code in the SVN which I could look
> into?

I think you have two choices: 1) wrap your Thrift code in a class that
implements Writable, or 2) use Thrift to serialize your data to byte
arrays and store them as BytesWritable.
-Stuart
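
For reference, option 1 might look something like the sketch below: a
Writable that wraps a generated Thrift class (here a hypothetical MyEvent)
and uses the Thrift binary protocol for the byte work. The exact Thrift
Java package and helper classes vary by release, so treat the imports as
illustrative:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;
import org.apache.thrift.TDeserializer;
import org.apache.thrift.TException;
import org.apache.thrift.TSerializer;
import org.apache.thrift.protocol.TBinaryProtocol;

public class MyEventWritable implements Writable {
  private MyEvent event = new MyEvent();   // hypothetical generated Thrift class

  public MyEvent get() { return event; }

  public void write(DataOutput out) throws IOException {
    try {
      byte[] bytes = new TSerializer(new TBinaryProtocol.Factory()).serialize(event);
      out.writeInt(bytes.length);          // length prefix so readFields knows how much to read back
      out.write(bytes);
    } catch (TException e) {
      throw new IOException("Could not serialize event", e);
    }
  }

  public void readFields(DataInput in) throws IOException {
    byte[] bytes = new byte[in.readInt()];
    in.readFully(bytes);
    event = new MyEvent();                 // start from a fresh object for each record
    try {
      new TDeserializer(new TBinaryProtocol.Factory()).deserialize(event, bytes);
    } catch (TException e) {
      throw new IOException("Could not deserialize event", e);
    }
  }
}

Instances of such a wrapper could then be used directly as SequenceFile
values or as map output types, instead of passing raw BytesWritable around.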