Posted to common-user@hadoop.apache.org by Jay Vyas <ja...@gmail.com> on 2012/09/25 19:27:05 UTC

Python + hdfs written thrift sequence files: lots of moving parts!

Hi guys!

I'm trying to read some hadoop-written thrift sequence files in plain old
java (without using SequenceFile.Reader).  The reason for this is that I

(1) want to understand the sequence file format better, and
(2) would like to be able to port this code to a language which doesn't have
robust hadoop sequence file i/o or thrift support (python).  My code is
linked further down.

So, before reading further, if anyone has:

1) some general hints on how to create a sequence file with thrift-encoded
key/value pairs in python, or
2) some tips on the generic approach for reading a sequencefile (the
SequenceFile javadocs seem a bit underspecified about the header format),

I'd appreciate it!

Now, here is my adventure into thrift/hdfs sequence file i/o :

I've written a simple stub which, I think, should be the start of a
sequence file reader (it just tries to skip the header and get straight to
the data).

But it doesn't handle compression.

http://pastebin.com/vyfgjML9

So, the code in that paste appears to fail with a cryptic error: "don't
know what type: 15".

This error comes from a case statement in thrift's protocol code, which
attempts to determine what type of thrift field is being read in:
"fail 127 don't know what type: 15"

  private byte getTType(byte type) throws TProtocolException {
    switch ((byte)(type & 0x0f)) {
      case TType.STOP:
        return TType.STOP;
      case Types.BOOLEAN_FALSE:
      case Types.BOOLEAN_TRUE:
        return TType.BOOL;
      // ... other cases elided ...
      case Types.STRUCT:
        return TType.STRUCT;
      default:
        throw new TProtocolException("don't know what type: " +
            (byte)(type & 0x0f));
    }
  }

Upon further investigation, I have found that the Configuration object is
(of course) heavily used by SequenceFile.Reader, in particular to determine
the compression codec.  That supports my hypothesis that the data needs to
be decompressed (or otherwise decoded) before it can be deserialized by
thrift.
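
(For reference, this is roughly what I believe the reader does internally
once it has read the codec class name out of the header; just a sketch,
assuming the usual hadoop classes are on the classpath:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CodecFromHeader {
  // codecClassName is the string read from the file header,
  // e.g. "org.apache.hadoop.io.compress.DefaultCodec"
  static CompressionCodec lookup(String codecClassName) throws ClassNotFoundException {
    Configuration conf = new Configuration();
    Class<?> codecClass = conf.getClassByName(codecClassName);
    return (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
  }
}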

So... what I think is missing here is that I don't know how to manually
reproduce the codec/gzip logic inside SequenceFile.Reader in plain old java
(i.e. without cheating and using the SequenceFile.Reader class that is
configured in our mapreduce source code).

With my end goal being to read the file in python, I think it would be nice
to be able to read the sequencefile in java first and use that as a template
(since I know that my thrift objects and serialization work correctly in my
current java codebase when read in through the SequenceFile.Reader API).

Any suggestions on how I can distill the logic of the SequenceFile.Reader
class into a simplified version, specific to my data, that I can start
porting into a python script capable of scanning a few real sequencefiles
off of HDFS, would be much appreciated!
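
To make that concrete, the kind of simplified reader I have in mind starts
out something like this: a rough sketch in plain java (no hadoop classes)
that just walks the version-6 header.  It assumes short class names, so the
vint-encoded string lengths fit in a single byte, and it ignores block
compression entirely:

import java.io.DataInputStream;
import java.io.FileInputStream;

public class SeqHeaderPeek {

  // Hadoop writes these strings as a vint length followed by UTF-8 bytes;
  // this only handles the single-byte vint case (length < 128), which is
  // plenty for class names.
  static String readShortText(DataInputStream in) throws Exception {
    int len = in.readByte();
    byte[] buf = new byte[len];
    in.readFully(buf);
    return new String(buf, "UTF-8");
  }

  public static void main(String[] args) throws Exception {
    DataInputStream in = new DataInputStream(new FileInputStream(args[0]));

    byte[] magic = new byte[3];
    in.readFully(magic);                    // 'S','E','Q'
    int version = in.readByte();            // 6 for files written by recent hadoop

    String keyClass = readShortText(in);    // e.g. org.apache.hadoop.io.Text
    String valueClass = readShortText(in);  // e.g. my thrift-writable wrapper

    boolean valuesCompressed = in.readBoolean();
    boolean blockCompressed = in.readBoolean();
    String codecClass = valuesCompressed ? readShortText(in) : "none";

    int metaCount = in.readInt();           // metadata: a count, then key/value string pairs
    for (int i = 0; i < metaCount; i++) {
      System.out.println("meta " + readShortText(in) + " = " + readShortText(in));
    }

    byte[] sync = new byte[16];             // sync marker; repeated between records later
    in.readFully(sync);

    System.out.println("version=" + version + " key=" + keyClass
        + " value=" + valueClass + " compressed=" + valuesCompressed
        + " blockCompressed=" + blockCompressed + " codec=" + codecClass);
    in.close();
  }
}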

In general... what are the core steps for doing i/o with sequence files
that are compressed and/or serialized in different formats?  Do we
decompress first and then deserialize, or do both at the same time?  Thanks!
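
To be concrete, my current best guess at the per-record loop (picking up
right after the header sketch above, i.e. just past the 16-byte sync marker)
is something like the following.  It assumes record compression rather than
block compression; DefaultCodec stands in for whatever codec the header
actually names, and MyThriftRecord is a placeholder for my generated thrift
class.  If the header says the values aren't compressed, step 1 would just
be skipped:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.util.ReflectionUtils;
import org.apache.thrift.TDeserializer;
import org.apache.thrift.protocol.TCompactProtocol;

public class SeqRecordScan {

  static void scanRecords(DataInputStream in) throws Exception {
    CompressionCodec codec =
        ReflectionUtils.newInstance(DefaultCodec.class, new Configuration());
    TDeserializer thrift = new TDeserializer(new TCompactProtocol.Factory());

    while (true) {
      int recordLen;
      try {
        recordLen = in.readInt();            // key length + value length
      } catch (EOFException done) {
        break;
      }
      if (recordLen == -1) {                 // sync escape: a 16-byte sync marker follows
        in.readFully(new byte[16]);
        continue;
      }
      int keyLen = in.readInt();
      byte[] key = new byte[keyLen];
      in.readFully(key);
      byte[] value = new byte[recordLen - keyLen];
      in.readFully(value);                   // still codec-compressed if the header said so

      // step 1: decompress the value bytes with the codec from the header
      InputStream unzipped = codec.createInputStream(new ByteArrayInputStream(value));
      ByteArrayOutputStream plain = new ByteArrayOutputStream();
      byte[] buf = new byte[4096];
      for (int n; (n = unzipped.read(buf)) != -1; ) {
        plain.write(buf, 0, n);
      }

      // step 2: only then hand the plain serialized bytes to thrift
      MyThriftRecord record = new MyThriftRecord();  // placeholder for my generated class
      thrift.deserialize(record, plain.toByteArray());
      System.out.println(record);
    }
  }
}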

PS: I've added an issue on github here,
https://github.com/matteobertozzi/Hadoop/issues/5, for a python
SequenceFile reader.  If I get some helpful hints on this thread, maybe I
can directly implement an example on matteobertozzi's python hadoop trunk.

-- 
Jay Vyas
MMSB/UCHC

Re: Python + hdfs written thrift sequence files: lots of moving parts!

Posted by Jay Vyas <ja...@gmail.com>.
Thanks, Harsh.  In any case, I'm really curious about how sequence file
headers are formatted, as the documentation in the SequenceFile javadocs
seems to be very generic.

To make my questions more concrete:

1) I notice that the FileSplit class has a getStart() function.  It is
documented as returning the position at which to start "processing".  Does
that imply that a FileSplit does, or does not, include the file header?

http://hadoop.apache.org/docs/r0.20.2/api/org/apache/hadoop/mapreduce/lib/input/FileSplit.html#getStart%28%29

2) Also, it's not clear to me how compression and serialization are
related.  These are two intricately coupled aspects of writing files to
HDFS, and I'm not sure what the idiom is for coordinating the compression
of records with their (de)serialization.

Re: Python + hdfs written thrift sequence files: lots of moving parts!

Posted by Harsh J <ha...@cloudera.com>.
Hi Jay,

This may be off-topic for you, but I feel it's related: use Avro
DataFiles.  There's Python support already available, as well as support
for several other languages.
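
On the Java side, reading one back is only a couple of lines; a quick
sketch (the Python avro package exposes an equivalent reader in
avro.datafile, if I remember right):

import java.io.File;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class AvroPeek {
  public static void main(String[] args) throws Exception {
    // Avro data files carry their schema and codec in the file header,
    // so no external configuration is needed to read them back.
    DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
        new File(args[0]), new GenericDatumReader<GenericRecord>());
    for (GenericRecord record : reader) {
      System.out.println(record);
    }
    reader.close();
  }
}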




-- 
Harsh J