Posted to dev@hive.apache.org by "Steve Corona (JIRA)" <ji...@apache.org> on 2009/03/07 23:52:56 UTC

[jira] Created: (HIVE-333) Add TFileTransport deserializer

Add TFileTransport deserializer
-------------------------------

                 Key: HIVE-333
                 URL: https://issues.apache.org/jira/browse/HIVE-333
             Project: Hadoop Hive
          Issue Type: New Feature
          Components: Serializers/Deserializers
         Environment: Linux
            Reporter: Steve Corona


I've been googling around all night and haven't really found what I am looking for. Basically, I want to transfer some data from my web servers to Hive in a format that's a little richer than plain CSV files. It seems like JSON or Thrift would be perfect for this. I am planning on sending this serialized JSON or Thrift data through Scribe and loading it into Hive. I just can't figure out how to tell Hive that the input data is a bunch of serialized Thrift records (all of the records are the "struct" type) in a TFileTransport. Hopefully this makes sense.

Reply from Joydeep Sen Sarma (jssarma@facebook.com)

Unfortunately the open source code base does not have the loaders we run to convert Thrift records in a TFileTransport into a SequenceFile that Hadoop/Hive can work with. One option is to add this to the Hive code base (it should be straightforward).

No process required. Please file a JIRA; I will try to upload a patch this weekend (mostly cut-and-paste). I would appreciate some help in finessing it, since the internal code is hardwired to some assumptions.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Assigned: (HIVE-333) Add TFileTransport deserializer

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joydeep Sen Sarma reassigned HIVE-333:
--------------------------------------

    Assignee: Joydeep Sen Sarma

> Add TFileTransport deserializer
> -------------------------------
>
>                 Key: HIVE-333
>                 URL: https://issues.apache.org/jira/browse/HIVE-333
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Serializers/Deserializers
>         Environment: Linux
>            Reporter: Steve Corona
>            Assignee: Joydeep Sen Sarma
>



[jira] Updated: (HIVE-333) Add TFileTransport deserializer

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joydeep Sen Sarma updated HIVE-333:
-----------------------------------

    Attachment: libthrift_asf.jar
                hive-333.patch.1

> Add TFileTransport deserializer
> -------------------------------
>
>                 Key: HIVE-333
>                 URL: https://issues.apache.org/jira/browse/HIVE-333
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Serializers/Deserializers
>         Environment: Linux
>            Reporter: Steve Corona
>            Assignee: Joydeep Sen Sarma
>         Attachments: hive-333.patch.1, libthrift_asf.jar
>
>



[jira] Updated: (HIVE-333) Add TFileTransport deserializer

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joydeep Sen Sarma updated HIVE-333:
-----------------------------------

    Attachment: hive-333.patch.2

> Add TFileTransport deserializer
> -------------------------------
>
>                 Key: HIVE-333
>                 URL: https://issues.apache.org/jira/browse/HIVE-333
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Serializers/Deserializers
>         Environment: Linux
>            Reporter: Steve Corona
>            Assignee: Joydeep Sen Sarma
>         Attachments: hive-333.patch.1, hive-333.patch.2, libthrift_asf.jar
>
>



[jira] Issue Comment Edited: (HIVE-333) Add TFileTransport deserializer

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12698174#action_12698174 ] 

Joydeep Sen Sarma edited comment on HIVE-333 at 4/12/09 12:00 AM:
------------------------------------------------------------------

This turned out to be way more complicated than I had thought. Here's the rundown:

- THRIFT-377 - I have attached the TFileTransport Java ports in it; more on this later.

- HIVE-333 - contains a new contrib/thrift module that has:
  * lib/libthrift_asf.jar - a Thrift jar built from Thrift trunk plus THRIFT-377 (so it includes TFileTransport).
     I had to add a new libthrift to Hive because the current one uses the com.facebook namespace, which is not compatible with Thrift trunk. All of contrib/thrift uses the latest Thrift trunk version.

     Note that contrib/thrift/lib/libthrift_asf.jar is submitted as a separate attachment from the patch.

  * a trivial rewrite of the existing Thrift serde in Hive (the new one is called org.apache.hadoop.hive.serde.asfthrift.ThriftBytesWritabledeserializer) that uses the Thrift trunk library instead of the old one. This is required to read Thrift objects embedded inside BytesWritable objects in Hive.

  * a TFileTransportInputFormat and TFileTransportRecordReader - these allow TFileTransport files to be processed as inputs to Hadoop MapReduce. Files are split so that the splits align with TFileTransport chunk boundaries.

  * an example MapReduce program (TConverter/TMapper) that shows how to convert a TFileTransport into a SequenceFile with Thrift objects embedded inside BytesWritable objects. The example does not do any reduction, but you can extend it to hash/reduce on a specific key (which is what we do at Facebook). Output compression can be controlled via command-line options (the program extends Tool; more on usage below).

  * Aside from libthrift_asf.jar, everything else is produced as a single jar by contrib/thrift (see build/contrib-thrift/hive_contrib-thrift.jar, produced by ant jar or ant package).

In short, the work done so far allows conversion of files in TFileTransport format into the Hive-friendly SequenceFile + BytesWritable format (and also provides the serde to read these files). Example run of TConverter:

hadoop jar build/contrib-thrift/hive_contrib-thrift.jar org.apache.hadoop.hive.thrift.TConverter -libjars contrib/thrift/lib/libthrift_asf.jar,build/ql/hive_exec.jar -Dthrift.filetransport.classname=org.apache.hadoop.thrift.TestClass -inputpath /tmp/tfiletransportfile -output /tmp/sequencefile

// More options (including those that produce compressed SequenceFiles) can simply be added as further -Dkey=value options; a hedged example follows below.
// You will also need to add the jar file containing TestClass in this example to the -libjars switch.
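As an illustration of the compression point above - these are the stock Hadoop output-compression properties rather than anything specific to TConverter, so treat this as a sketch - a block-compressed SequenceFile could presumably be produced with:

hadoop jar build/contrib-thrift/hive_contrib-thrift.jar org.apache.hadoop.hive.thrift.TConverter -libjars contrib/thrift/lib/libthrift_asf.jar,build/ql/hive_exec.jar -Dthrift.filetransport.classname=org.apache.hadoop.thrift.TestClass -Dmapred.output.compress=true -Dmapred.output.compression.type=BLOCK -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec -inputpath /tmp/tfiletransportfile -output /tmp/sequencefile

// mapred.output.compression.type (NONE/RECORD/BLOCK) applies to SequenceFile outputs only.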

Once the files are converted, it's trivial to create a Hive table with the right properties so that these files can be queried. A few points about Hive integration:
- I need to ask Prasad about the exact CLI statements to create these tables; I will post instructions once I have them (a rough sketch follows this list).
- The jar files hive_contrib-thrift.jar and libthrift_asf.jar will need to be in the Hive execution environment. This can be arranged by copying them into auxlib/ under the Hive distribution directory; I haven't integrated this into ant yet.
- Jar files for the classes that are serialized into the SequenceFile and need to be queried by Hive must be deposited into auxlib/ as well.
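Since the exact CREATE TABLE statements were never posted to this issue, the following is only a rough sketch of what they might look like, assuming the serde class named in this patch and SERDEPROPERTIES keys (serialization.class, serialization.format) modeled on Hive's existing Thrift serde - the property keys, table name, and Thrift class are illustrative, not confirmed by the patch:

add jar auxlib/hive_contrib-thrift.jar;
add jar auxlib/libthrift_asf.jar;

CREATE TABLE thrift_log
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde.asfthrift.ThriftBytesWritabledeserializer'
  WITH SERDEPROPERTIES (
    'serialization.class' = 'org.apache.hadoop.thrift.TestClass',
    'serialization.format' = 'org.apache.thrift.protocol.TBinaryProtocol')
  STORED AS SEQUENCEFILE;

The two add jar statements are only needed if the jars were not already deposited into auxlib/ as described above.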

Two more options exist:
- convert Thrift files into text using TConverter-style programs, or
- arrange for Hive to query TFileTransport directly. That's not that hard (since the InputFormat is now done), but it needs some more work, testing, and new code.

CAVEAT regarding THRIFT-377 - I am finding a few (1-5) spurious empty records at the beginning of each TFileTransport chunk when trying to read TFileTransport files produced on the C++ side from Java (and only when seeking to split boundaries). I just don't have the time to debug this any more. The simple workaround is to disable splitting of TFileTransport files by setting mapred.min.split.size to an effectively infinite value. If the files are not split, there is no problem.
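Concretely, the workaround can be expressed as one more -D option on the conversion job sketched above (mapred.min.split.size is a standard Hadoop property; 9223372036854775807 is Long.MAX_VALUE, so no file is ever large enough to be split):

hadoop jar build/contrib-thrift/hive_contrib-thrift.jar org.apache.hadoop.hive.thrift.TConverter -libjars contrib/thrift/lib/libthrift_asf.jar,build/ql/hive_exec.jar -Dthrift.filetransport.classname=org.apache.hadoop.thrift.TestClass -Dmapred.min.split.size=9223372036854775807 -inputpath /tmp/tfiletransportfile -output /tmp/sequencefile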

I am hoping you can take things from here. If we really need Hive to query TFileTransport directly, it's probably another couple of hours of work, but I will wait for your input to see whether this is required (it seems to me that SequenceFiles are a better long-term data container in Hadoop since they allow compression).

> Add TFileTransport deserializer
> -------------------------------
>
>                 Key: HIVE-333
>                 URL: https://issues.apache.org/jira/browse/HIVE-333
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Serializers/Deserializers
>         Environment: Linux
>            Reporter: Steve Corona
>            Assignee: Joydeep Sen Sarma
>         Attachments: hive-333.patch.1, libthrift_asf.jar
>
>



[jira] Commented: (HIVE-333) Add TFileTransport deserializer

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857143#action_12857143 ] 

Joydeep Sen Sarma commented on HIVE-333:
----------------------------------------

I think a lot has changed in the Hive code base since this patch was posted. For one, Hive now uses the ASF namespace of Thrift (org.apache.thrift), which was a big part of this patch (I believe I bundled a separate jar based on the ASF distribution).

The other question is how the input Thrift files are generated. The original request was for reading TFileTransport-formatted files; that is why there is a lot of code in the patch, as well as a dependency on an uncommitted Thrift patch.

However, TFileTransport (despite its early use at Facebook) is not widely used. It suffers from numerous performance problems (single-threaded performance is poor) and it bloats the data. It does have the useful property that the data is chunked, but my understanding is that TFramedTransport and its ilk may have similar properties.

So I think the first step may be to identify the container format for the Thrift files, since the integration into Hive depends heavily on that.

> Add TFileTransport deserializer
> -------------------------------
>
>                 Key: HIVE-333
>                 URL: https://issues.apache.org/jira/browse/HIVE-333
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Serializers/Deserializers
>         Environment: Linux
>            Reporter: Steve Corona
>            Assignee: Joydeep Sen Sarma
>         Attachments: hive-333.patch.1, hive-333.patch.2, libthrift_asf.jar
>
>


        

[jira] Commented: (HIVE-333) Add TFileTransport deserializer

Posted by "Matt Hackett (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856956#action_12856956 ] 

Matt Hackett commented on HIVE-333:
-----------------------------------

I am curious about the status of this feature request -- it looks like it did not make it into the codebase, though I imagine it would be extremely useful to me and to others.

The ability to move Thrift object stores in TFileTransport format into more Hive/Hadoop-friendly SequenceFiles would seem to complete the loop for a common use case: namely, logging data to the ThriftFile store in Scribe. From what I gather, this is also what is done internally at Facebook.

Apologies in advance if this has already been superseded by other changes or discussions.

> Add TFileTransport deserializer
> -------------------------------
>
>                 Key: HIVE-333
>                 URL: https://issues.apache.org/jira/browse/HIVE-333
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Serializers/Deserializers
>         Environment: Linux
>            Reporter: Steve Corona
>            Assignee: Joydeep Sen Sarma
>         Attachments: hive-333.patch.1, hive-333.patch.2, libthrift_asf.jar
>
>


        
