
Posted to common-dev@hadoop.apache.org by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org> on 2008/09/04 00:55:44 UTC

[jira] Created: (HADOOP-4065) support for reading binary data from flat files

support for reading binary data from flat files
-----------------------------------------------

                 Key: HADOOP-4065
                 URL: https://issues.apache.org/jira/browse/HADOOP-4065
             Project: Hadoop Core
          Issue Type: Bug
          Components: mapred
            Reporter: Joydeep Sen Sarma


like textinputformat - looking for a concrete implementation to read binary records from a flat file (that may be compressed).

it's assumed that hadoop can't split such a file. so the inputformat can set splittable to false.

tricky aspects are:
- how to know what class the file contains (has to be in a configuration somewhere).
- how to determine EOF (would be nice if hadoop can determine EOF and not have the deserializer throw an exception (which is hard to distinguish from an exception due to corruption?)). this is easy for non-compressed streams - for compressed streams - DecompressorStream has a useful looking getAvailable() call - except the class is marked package private.
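
One possible way to tell a clean EOF from a corrupt record without touching DecompressorStream is to peek one byte before each record. The sketch below is only an illustration under assumptions: it uses plain java.io and java.util.zip instead of Hadoop's codec classes, the record format (length-prefixed UTF strings) is made up, and the names FlatFileEofSketch, hasMore, readAll and gzipRecords are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class FlatFileEofSketch {
  /** Returns true if at least one more byte is available; the byte is pushed back. */
  static boolean hasMore(PushbackInputStream in) throws IOException {
    int b = in.read();
    if (b < 0) {
      return false; // clean end of stream, between records
    }
    in.unread(b);
    return true;
  }

  /** Reads records (assumed format: writeUTF strings) until a clean EOF. */
  static List<String> readAll(InputStream raw) throws IOException {
    PushbackInputStream in = new PushbackInputStream(raw, 1);
    DataInputStream data = new DataInputStream(in);
    List<String> records = new ArrayList<>();
    while (hasMore(in)) {
      // An EOFException thrown here means a truncated record, not a normal EOF.
      records.add(data.readUTF());
    }
    return records;
  }

  /** Builds a small gzip-compressed flat file in memory for the demo. */
  static byte[] gzipRecords(String... records) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(new GZIPOutputStream(bytes))) {
      for (String r : records) {
        out.writeUTF(r);
      }
    }
    return bytes.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    byte[] file = gzipRecords("a", "b", "c");
    System.out.println(readAll(new GZIPInputStream(new ByteArrayInputStream(file))));
  }
}
```

Because the peek happens on the decompressed side, the same loop works for compressed and uncompressed input; any exception raised inside the record read can then be treated as corruption.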

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4065) support for reading binary data from flat files

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629305#action_12629305 ] 

Pete Wyckoff commented on HADOOP-4065:
--------------------------------------

Nice. Will the TFileRecordReader fit into this paradigm?  How were you going to implement the TFileRecordReader?

TFile is very similar to TRecordStream - but more full featured - sorted and seek by key. But, TRecordStream is meant to be readable/writable in many languages (first cut c++, java, python and perl). Its primary use case is non-hadoop - just a robust way of logging data, but secondarily, there's no reason not to enable directly reading/writing to one from Hadoop as it's a waste to open one, read it and write it out as a TFile or SF if one doesn't need the richer functionality that sequence file and tfile support.



-- pete




[jira] Commented: (HADOOP-4065) support for reading binary data from flat files

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629659#action_12629659 ] 

Pete Wyckoff commented on HADOOP-4065:
--------------------------------------

I will submit a patch with its own RecordReader so we don't need to change SequenceFileRR.


> support for reading binary data from flat files
> -----------------------------------------------
>
>                 Key: HADOOP-4065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4065
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Joydeep Sen Sarma
>         Attachments: HADOOP-4065.0.txt, ThriftFlatFile.java
>
>

[jira] Updated: (HADOOP-4065) support for reading binary data from flat files

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pete Wyckoff updated HADOOP-4065:
---------------------------------

    Description: 
Implement a generic FlatFileDeserializationRecordReader which assumes a Serialization Implementation is specified in the JobConf and that, once instantiated, that Serialization Implementation can figure out the actual class being deserialized from the JobConf. E.g., the JobConf specifies RecordIOSerialization and then the specific class is LogRecordObject.

Another way one might do this is to use the SerializationFactory to do the lookup of the Serialization Implementation; however, this requires all Serialization Implementations to be known a priori and registered, and goes against the spirit of a very generic FlatFileDeserializeRecordReader. (see below re: adding Serialization implementations to contrib).

To ensure it is generic, I propose implementing the following Serialization implementations:

1. RecordIOSerialization
2. LineReaderSerialization
3. ThriftSerialization

The first 2 should go in io/serialization and the Thrift one in contrib somewhere. 
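
To make the contrast between the two lookup styles concrete, here is a rough sketch. It is not Hadoop code: Serialization is a local stand-in that models only accept(), java.util.Properties stands in for the JobConf, and the configuration key "mapred.input.io.serialization" and the class SerializationLookupSketch are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

class SerializationLookupSketch {
  /** Local stand-in for a Serialization; only accept() is modeled. */
  interface Serialization {
    boolean accept(Class<?> c);
  }

  /** Placeholder record type taken from the example in the description. */
  static class LogRecordObject {}

  /** Toy serialization that claims to handle LogRecordObject. */
  static class RecordIOSerialization implements Serialization {
    public boolean accept(Class<?> c) {
      return LogRecordObject.class.isAssignableFrom(c);
    }
  }

  /** Factory-style lookup: only implementations registered up front can be found. */
  static Serialization lookup(List<Serialization> registered, Class<?> c) {
    for (Serialization s : registered) {
      if (s.accept(c)) {
        return s;
      }
    }
    return null;
  }

  /** Direct lookup: the configuration names the implementation class itself. */
  static Serialization fromConf(Properties conf, String key)
      throws ReflectiveOperationException {
    String name = conf.getProperty(key);
    return Class.forName(name).asSubclass(Serialization.class)
        .getDeclaredConstructor().newInstance();
  }

  public static void main(String[] args) throws Exception {
    // Nothing registered, so the factory-style lookup finds nothing.
    List<Serialization> registered = new ArrayList<>();
    System.out.println(lookup(registered, LogRecordObject.class));

    // The direct lookup needs no registration, only a class name in the conf.
    Properties conf = new Properties();
    conf.setProperty("mapred.input.io.serialization",
        RecordIOSerialization.class.getName());
    Serialization s = fromConf(conf, "mapred.input.io.serialization");
    System.out.println(s.accept(LogRecordObject.class));
  }
}
```

The direct route keeps the record reader generic at the cost of one extra configuration entry per job.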



  was:
Implement a generic FlatFileDeserializationRecordReader which assumes a Serialization Implementation is specified in the JobConf and that, once instantiated, that Serialization Implementation can figure out the actual class being deserialized from the JobConf. E.g., the JobConf specifies RecordIOSerialization and then the specific class is LogRecordObject.

Another way one might do this is to use the SerializationFactory to do the lookup of the Serialization Implementation; however, this requires all Deserializers to be known a priori and registered, and goes against the spirit of a very generic FlatFileDeserializeRecordReader.

To ensure it is generic, I propose implementing the following Serialization implementations:

1. RecordIOSerialization
2. LineReaderSerialization
3. ThriftSerialization

The first 2 should go in io/serialization and the Thrift one in contrib somewhere. 




> support for reading binary data from flat files
> -----------------------------------------------
>
>                 Key: HADOOP-4065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4065
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Joydeep Sen Sarma
>         Attachments: HADOOP-4065.0.txt, HADOOP-4065.1.txt, ThriftFlatFile.java
>
>

[jira] Commented: (HADOOP-4065) support for reading binary data from flat files

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628192#action_12628192 ] 

Joydeep Sen Sarma commented on HADOOP-4065:
-------------------------------------------

see https://issues.apache.org/jira/browse/HADOOP-4065 as well.


[jira] Updated: (HADOOP-4065) support for reading binary data from flat files

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pete Wyckoff updated HADOOP-4065:
---------------------------------

    Attachment: HADOOP-4065.0.txt

This is what the proposal would look like. It:

1. adds the TypedSplittableFile interface
2. changes SequenceFile.Reader to implement TypedSplittableFile and adds an initialize method and an empty constructor
3. implements TypedSplittableFileRecordReader - just copied the code from SequenceFileRecordReader and changed the constructor only
4. changes SequenceFileRecordReader to extend TypedSplittableFileRecordReader and have its constructor just construct the parent class.

Still a work in progress, but this kind of shows the API and the changes to SequenceFileRecordReader.
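
For readers without the attachment, the shape of the four steps above might look roughly like the sketch below. This is a guess at the structure only: the method names on TypedSplittableFile, the in-memory "file", and the class TypedSplittableFileSketch are all hypothetical and are not taken from HADOOP-4065.0.txt.

```java
import java.util.Arrays;
import java.util.Iterator;

class TypedSplittableFileSketch {
  /** Hypothetical interface; the method names are assumptions, not the patch's API. */
  interface TypedSplittableFile {
    void initialize(String[] records); // stand-in for opening a path from a split
    Class<?> getValueClass();
    Object next();                     // returns null when the file is exhausted
  }

  /** Toy file type with the empty constructor that reflective creation needs. */
  static class InMemoryFile implements TypedSplittableFile {
    private Iterator<String> it;
    public void initialize(String[] records) {
      it = Arrays.asList(records).iterator();
    }
    public Class<?> getValueClass() {
      return String.class;
    }
    public Object next() {
      return it.hasNext() ? it.next() : null;
    }
  }

  /** Generic reader: creates any TypedSplittableFile reflectively, then initializes it. */
  static class TypedSplittableFileRecordReader {
    private final TypedSplittableFile file;
    TypedSplittableFileRecordReader(Class<? extends TypedSplittableFile> type,
        String[] records) throws ReflectiveOperationException {
      file = type.getDeclaredConstructor().newInstance();
      file.initialize(records);
    }
    Object next() {
      return file.next();
    }
  }

  /** Format-specific reader whose constructor only constructs the parent class. */
  static class InMemoryRecordReader extends TypedSplittableFileRecordReader {
    InMemoryRecordReader(String[] records) throws ReflectiveOperationException {
      super(InMemoryFile.class, records);
    }
  }

  public static void main(String[] args) throws Exception {
    InMemoryRecordReader reader = new InMemoryRecordReader(new String[] {"r1", "r2"});
    for (Object v = reader.next(); v != null; v = reader.next()) {
      System.out.println(v);
    }
  }
}
```

The empty constructor plus initialize method is what lets the generic reader build the file object without knowing its concrete type at compile time.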



> support for reading binary data from flat files
> -----------------------------------------------
>
>                 Key: HADOOP-4065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4065
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Joydeep Sen Sarma
>         Attachments: HADOOP-4065.0.txt
>
>

[jira] Updated: (HADOOP-4065) support for reading binary data from flat files

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pete Wyckoff updated HADOOP-4065:
---------------------------------

    Attachment: HADOOP-4065.2.txt

namespaced into hive and also brought back the RowContainer so it will work without HADOOP-1230 being fixed.  

The testcase uses JavaSerialization, WritableSerialization (Record) and ThriftSerialization.


> support for reading binary data from flat files
> -----------------------------------------------
>
>                 Key: HADOOP-4065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4065
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/serialization, mapred
>            Reporter: Joydeep Sen Sarma
>         Attachments: FlatFileReader.java, HADOOP-4065.0.txt, HADOOP-4065.1.txt, HADOOP-4065.1.txt, HADOOP-4065.2.txt, ThriftFlatFile.java
>
>

[jira] Commented: (HADOOP-4065) support for reading binary data from flat files

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12632752#action_12632752 ] 

Pete Wyckoff commented on HADOOP-4065:
--------------------------------------

bq. I'm still not convinced about the utility of this class outside of Hive. What is the advantage of storing the data this way?

1. You don't need a loader.  
2. Tools outside of hadoop can use the data - python, perl, c++, ...
3. There are other file formats that are splittable and self or non self-describing. Hadoop is generally pretty pluggable, but not at the file level. Would be nice to have generic file interfaces that one can implement to get *First Class* hadoop treatment for any file format.
 
To be clear, Hive writes and reads binary data to sequence files only now. We load all binary data into sequence files.

bq. i really don't care - we can put this into Hive. 

-1

This is a general FlatFileRecordReader - HADOOP-3566 seems to be a non-general version of this? (with the issue of that being <String, Void>)

And note my intention is to put this in contrib/serialization






> support for reading binary data from flat files
> -----------------------------------------------
>
>                 Key: HADOOP-4065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4065
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Joydeep Sen Sarma
>         Attachments: FlatFileReader.java, HADOOP-4065.0.txt, HADOOP-4065.1.txt, HADOOP-4065.1.txt, ThriftFlatFile.java
>
>

[jira] Commented: (HADOOP-4065) support for reading binary data from flat files

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12631913#action_12631913 ] 

Pete Wyckoff commented on HADOOP-4065:
--------------------------------------

proposal (which does not touch any existing code):

1. extend Deserializer with an interface that requires returning the actual type being deserialized
{code:title=ParameterizedDeserializer.java}
public interface ParameterizedDeserializer<T> extends Deserializer<T> {
  Class<? extends T> getRealClass() ;
}
{code}
2. create a Serialization implementation which gets the Serializer/Deserializer from the JobConf - e.g.,
{code}
  public ParameterizedDeserializer<R> getDeserializer(Class<R> c) {
    // ignore c. doesn't matter, it is coming from the configuration
    Class<? extends ParameterizedDeserializer> t = conf.getClass("mapred.input.io.deserializer", null, ParameterizedDeserializer.class);
    return ReflectionUtils.newInstance(t, conf);
  }
{code}
3. Parameterized deserializers will (typically) get the specific class (? extends T - e.g., T= Record) they are implementing from the JobConf. e.g.,
{code}
    this.recordClass = conf.getClass("mapred.input.io.record_class", null, Record.class);
{code}

4. RecordReader.createValue looks like:
{code}
  public R createValue() {
    return (R)ReflectionUtils.newInstance(deserializer.getRealClass(),conf);
  }
{code}

5. Setting up the JobConf:
{code}
      job.setClass("mapred.input.io.deserializer", .serializer.RecordIOSerialization.RecordIODeserializer.class, serializer.ParameterizedDeserializer.class);

      // Set this so the RecordIO Deserializer knows the specific Record class in the file
      job.setClass("mapred.input.io.recordio_class", FlatFileDeserializerTestObj.class, record.Record.class);
{code}

These classes could all go into a contrib directory devoted to contributing Serialization implementations.  Or it could be in core and mapred.
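
Putting steps 1, 3 and 4 together, a self-contained sketch might look like the following. It is an illustration under assumptions rather than the patch itself: Deserializer is a local stand-in with the same three methods as Hadoop's interface, java.util.Properties replaces the JobConf, and Record, LogRecord, RecordDeserializer and ParameterizedDeserializerSketch are hypothetical names. Only the "mapred.input.io.record_class" key and getRealClass come from the proposal above.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

class ParameterizedDeserializerSketch {
  /** Local stand-in mirroring the shape of a Deserializer. */
  interface Deserializer<T> {
    void open(InputStream in) throws IOException;
    T deserialize(T reuse) throws IOException;
    void close() throws IOException;
  }

  /** The extension from step 1: also report the concrete type being read. */
  interface ParameterizedDeserializer<T> extends Deserializer<T> {
    Class<? extends T> getRealClass();
  }

  /** Hypothetical base record type and one concrete record. */
  abstract static class Record {
    abstract void readFields(DataInputStream in) throws IOException;
  }

  static class LogRecord extends Record {
    int value;
    void readFields(DataInputStream in) throws IOException {
      value = in.readInt();
    }
  }

  /** Step 3: the deserializer learns the concrete record class from the conf. */
  static class RecordDeserializer implements ParameterizedDeserializer<Record> {
    private final Class<? extends Record> recordClass;
    private DataInputStream in;

    RecordDeserializer(Properties conf) throws ClassNotFoundException {
      String name = conf.getProperty("mapred.input.io.record_class");
      recordClass = Class.forName(name).asSubclass(Record.class);
    }
    public Class<? extends Record> getRealClass() {
      return recordClass;
    }
    public void open(InputStream stream) {
      in = new DataInputStream(stream);
    }
    public Record deserialize(Record reuse) throws IOException {
      reuse.readFields(in);
      return reuse;
    }
    public void close() throws IOException {
      in.close();
    }
  }

  /** Step 4: the record reader creates values from the deserializer's real class. */
  static <T> T createValue(ParameterizedDeserializer<T> d)
      throws ReflectiveOperationException {
    return d.getRealClass().getDeclaredConstructor().newInstance();
  }

  public static void main(String[] args) throws Exception {
    Properties conf = new Properties();
    conf.setProperty("mapred.input.io.record_class", LogRecord.class.getName());
    RecordDeserializer d = new RecordDeserializer(conf);
    d.open(new ByteArrayInputStream(new byte[] {0, 0, 0, 42}));
    Record r = d.deserialize(createValue(d));
    System.out.println(((LogRecord) r).value);
    d.close();
  }
}
```

The record reader itself never names LogRecord; it only sees the base type and whatever getRealClass returns.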

Issues - why not get the deserialization-specific class in Serialization.getDeserializer (and implement getRealClass in this Serialization implementation - no need for any new interface) and pass that in to any normal Deserializer?  This would make things more uniform and also mean the Deserializer is any old deserializer.
{code}
Class<? extends R> realClass = conf.getClass("mapred.input.io.deserializer.class", null, Class<R>);
return ReflectionUtils.newInstance(conf.getClass("mapred.input.deserializer",null,Deserializer.class));
{code}
And then we can construct an instance of realClass and pass that into any Deserializer.deserialize(T t).

I am just learning generics and it doesn't seem that conf.getClass("mapred.input.io.deserializer.class", null, Class<R>) is possible, because it doesn't accept Class<R>. It wants something more like Record.class, forcing us into doing this in the deserializer.
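
The underlying reason is that Class<R> is a type, not a value: a type variable has no class literal at runtime, so the upper bound has to be handed in as an actual Class object. A minimal sketch of that pattern, using java.util.Properties in place of the JobConf and the hypothetical names ClassTokenSketch, resolveClass, Record and LogRecord:

```java
import java.util.Properties;

class ClassTokenSketch {
  /** Hypothetical record hierarchy for the demo. */
  static class Record {}
  static class LogRecord extends Record {}

  /**
   * Resolves a configured class name against a bound passed as a runtime token.
   * Writing Class<R> as an argument does not compile; a Class<R> object does.
   */
  static <R> Class<? extends R> resolveClass(Properties conf, String key, Class<R> bound)
      throws ClassNotFoundException {
    // asSubclass throws ClassCastException if the configured class is not an R.
    return Class.forName(conf.getProperty(key)).asSubclass(bound);
  }

  public static void main(String[] args) throws Exception {
    Properties conf = new Properties();
    conf.setProperty("mapred.input.io.deserializer.class", LogRecord.class.getName());
    Class<? extends Record> real =
        resolveClass(conf, "mapred.input.io.deserializer.class", Record.class);
    Record r = real.getDeclaredConstructor().newInstance();
    System.out.println(r.getClass().getSimpleName());
  }
}
```

So a generic Serialization could still do the lookup itself, provided whoever constructs it passes along a token such as Record.class for the bound.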



> support for reading binary data from flat files
> -----------------------------------------------
>
>                 Key: HADOOP-4065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4065
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Joydeep Sen Sarma
>         Attachments: HADOOP-4065.0.txt, HADOOP-4065.1.txt, ThriftFlatFile.java
>
>

[jira] Commented: (HADOOP-4065) support for reading binary data from flat files

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628679#action_12628679 ] 

Pete Wyckoff commented on HADOOP-4065:
--------------------------------------

I think in Joy's case, it would be the filename and/or with some configuration info in the jobconf.  In the TRecordStream case, we would need to use some code from TFixedFrameTransport to read the frame headers - TFixedFrameTransport is splittable. (TRS is a thin layer on top of TFFT).

Joy or I can post a proposed API.

good point, we should make it general and not just for binary.

pete




[jira] Commented: (HADOOP-4065) support for reading binary data from flat files

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12642527#action_12642527 ] 

Joydeep Sen Sarma commented on HADOOP-4065:
-------------------------------------------

+1. 

> support for reading binary data from flat files
> -----------------------------------------------
>
>                 Key: HADOOP-4065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4065
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hive
>            Reporter: Joydeep Sen Sarma
>            Assignee: Pete Wyckoff
>             Fix For: 0.20.0
>
>         Attachments: FlatFileReader.java, HADOOP-4065.0.txt, HADOOP-4065.1.txt, HADOOP-4065.1.txt, HADOOP-4065.2.txt, ThriftFlatFile.java
>
>

[jira] Commented: (HADOOP-4065) support for reading binary data from flat files

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12631187#action_12631187 ] 

Pete Wyckoff commented on HADOOP-4065:
--------------------------------------


Should also mention we could put this in contrib as it is self contained.


> support for reading binary data from flat files
> -----------------------------------------------
>
>                 Key: HADOOP-4065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4065
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Joydeep Sen Sarma
>         Attachments: HADOOP-4065.0.txt, HADOOP-4065.1.txt, ThriftFlatFile.java
>
>

[jira] Updated: (HADOOP-4065) support for reading binary data from flat files

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pete Wyckoff updated HADOOP-4065:
---------------------------------

    Status: Open  (was: Patch Available)

Hudson gave -1 on contrib tests; I think this wasn't due to this patch.

> support for reading binary data from flat files
> -----------------------------------------------
>
>                 Key: HADOOP-4065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4065
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hive
>            Reporter: Joydeep Sen Sarma
>         Attachments: FlatFileReader.java, HADOOP-4065.0.txt, HADOOP-4065.1.txt, HADOOP-4065.1.txt, HADOOP-4065.2.txt, ThriftFlatFile.java
>
>

[jira] Commented: (HADOOP-4065) support for reading binary data from flat files

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12634030#action_12634030 ] 

Hadoop QA commented on HADOOP-4065:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12390798/HADOOP-4065.2.txt
  against trunk revision 698385.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 9 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 core tests.  The patch passed core unit tests.

    -1 contrib tests.  The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3358/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3358/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3358/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3358/console

This message is automatically generated.


[jira] Commented: (HADOOP-4065) support for reading binary data from flat files

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12633822#action_12633822 ] 

Doug Cutting commented on HADOOP-4065:
--------------------------------------

> my intention was to put this in contrib/serialization, but if there is objection, i can change the patch to contrib/hive.

+1 I'd rather not have contrib/serialization just become a grab-bag of io-related stuff.  If this is needed by Hive only, then it belongs in contrib/hive.  If we decide (subsequently, perhaps) that it has wide utility as a generic API for access to files in a variety of formats for a variety of applications, then perhaps it could be moved to mapred.  But that doesn't yet sound like the consensus, so contrib/hive is probably best for now.


[jira] Commented: (HADOOP-4065) support for reading binary data from flat files

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629298#action_12629298 ] 

Mahadev konar commented on HADOOP-4065:
---------------------------------------

We at Yahoo have been working on a similar kind of file, where data is stored as plain binary and the file is splittable:

http://issues.apache.org/jira/browse/HADOOP-3315

The spec is old and needs to be updated. TFile is meant to be a SequenceFile replacement.


  A TFile is a container of key-value pairs. Both keys and values are type-less
  byte arrays. Keys can be up to 64KB, value length is not restricted. TFile
  further provides the following features:
- Block Compression.
- Named meta data blocks.
- Sorted or unsorted keys.
- Seek by key or by file offset.

We will update the specs on HADOOP-3315 by the end of this week. 




[jira] Updated: (HADOOP-4065) support for reading binary data from flat files

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pete Wyckoff updated HADOOP-4065:
---------------------------------

    Status: Patch Available  (was: Open)

re-submitting for hudson.


[jira] Updated: (HADOOP-4065) support for reading binary data from flat files

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pete Wyckoff updated HADOOP-4065:
---------------------------------

      Component/s:     (was: contrib/serialization)
                       (was: mapred)
                   contrib/hive
    Fix Version/s: 0.20.0
         Assignee: Pete Wyckoff


[jira] Updated: (HADOOP-4065) support for reading binary data from flat files

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pete Wyckoff updated HADOOP-4065:
---------------------------------

    Status: Patch Available  (was: Open)


[jira] Updated: (HADOOP-4065) support for reading binary data from flat files

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pete Wyckoff updated HADOOP-4065:
---------------------------------

    Component/s: contrib/serialization


[jira] Commented: (HADOOP-4065) support for reading binary data from flat files

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629315#action_12629315 ] 

Mahadev konar commented on HADOOP-4065:
---------------------------------------

The TFileRecordReader would just return raw bytes for keys and values, and it's up to the application to convert the raw bytes to types. As far as our implementation goes right now, though, we don't have a map-reduce interface for TFiles. We intend to provide one.

The TFile is also supposed to be readable/writable in other languages, with the spec being clear enough on how to read and write TFiles. We intend to provide only a Java implementation of TFiles, though.


[jira] Commented: (HADOOP-4065) support for reading binary data from flat files

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12633082#action_12633082 ] 

Pete Wyckoff commented on HADOOP-4065:
--------------------------------------

if there are implications, why not make the signature <Void, T> and when someone wants the sorting, have them use the InverseMapper instead of the IdentityMapper?
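A minimal sketch of the idea above, in plain Java rather than against the Hadoop API: the reader emits <Void, T> pairs, and a job that wants sorting swaps key and value first so the row becomes the sort key. InverseSketch, row and sortedRows are hypothetical names used only for illustration, and the in-memory sort merely stands in for the framework's shuffle.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustration only: none of these names are Hadoop classes.
class InverseSketch {
    // What a <Void, T> reader would emit: no key, the row as the value.
    static <T> Map.Entry<Void, T> row(T value) {
        return new SimpleImmutableEntry<>(null, value);
    }

    // Swap key and value, the way an inverse-style mapper would.
    static <K, V> Map.Entry<V, K> invert(Map.Entry<K, V> record) {
        return new SimpleImmutableEntry<>(record.getValue(), record.getKey());
    }

    // After inversion the row is the key, so records can be ordered by it.
    static <T extends Comparable<T>> List<T> sortedRows(List<Map.Entry<Void, T>> input) {
        List<Map.Entry<T, Void>> inverted = new ArrayList<>();
        for (Map.Entry<Void, T> record : input) {
            inverted.add(invert(record));
        }
        inverted.sort(Map.Entry.comparingByKey());
        List<T> rows = new ArrayList<>();
        for (Map.Entry<T, Void> record : inverted) {
            rows.add(record.getKey());
        }
        return rows;
    }
}
```

Jobs that do not need sorting would keep the identity mapping and never pay for comparing rows.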



[jira] Commented: (HADOOP-4065) support for reading binary data from flat files

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629366#action_12629366 ] 

Joydeep Sen Sarma commented on HADOOP-4065:
-------------------------------------------

oh - and my code does not work without a change to DecompressorStream class to make that a public class (it's marked package private right now). Truthfully - that forced me to file this bug :-)


[jira] Commented: (HADOOP-4065) support for reading binary data from flat files

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12632806#action_12632806 ] 

Pete Wyckoff commented on HADOOP-4065:
--------------------------------------

RE: FlatFileRecordReader's signature.

What would the implications be of changing the signature to <T, Void>? Owen points out on HADOOP-3566 that there can be benefits to this, namely sorting, but that JIRA is for Strings, whereas here T could be anything.

(Assuming for a moment that HADOOP-1230 is implemented. Now it would be <RowContainer<T>, Void>)

thanks, pete



[jira] Commented: (HADOOP-4065) support for reading binary data from flat files

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628530#action_12628530 ] 

Pete Wyckoff commented on HADOOP-4065:
--------------------------------------

It would be nice to also take care of the case where the file is self describing like sequence files. If we had a FlatFileRecordReader(conf, split, serializerFactory) where the serializerFactory could be initialized with the conf, split and the input stream, it could optionally read the self describing data from the inputstream. Then regardless, it could be used to implement getKey and getValue which it could do from the self described data or as you said based on the path. Or it could just instantiate the factory from a conf variable.

Then the user is free to implement the serializer lookup however they want much like most of the rest of the system.

This puts the serializer lookup very low in the stack, but one must implement next(), and as Joy points out, you can't do that without the serializer.

This solves this case, but also the case of self describing thrift TRecordStream since the serializer class info would be in the header itself.

It would also be nice if the underlying input stream could actually be an interface because sometimes there's a flat file, but other times it may be compressed or some other format, but that format is capable of producing a stream of bytes. 

So, I guess I'm advocating for flexibility in defining the serializer lookup logic as well as how to read from the file.
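To make the proposal concrete, here is a rough sketch of what such a pluggable lookup could look like. It is plain Java, not the patch: RowClassResolver, ConfResolver, HeaderResolver and the key "flatfile.row.class" are all hypothetical, a Map stands in for the job configuration, and the self-describing header is assumed to be a single writeUTF string.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.util.Map;

// Hypothetical lookup hook: given the conf, the path and the open stream,
// decide which row class the file contains.
interface RowClassResolver {
    // May consume a header; must leave the stream at the first row.
    String resolveRowClass(Map<String, String> conf, String path, DataInputStream in)
            throws IOException;
}

// Lookup driven purely by configuration; the stream is left untouched.
class ConfResolver implements RowClassResolver {
    static final String ROW_CLASS_KEY = "flatfile.row.class"; // assumed key name

    public String resolveRowClass(Map<String, String> conf, String path, DataInputStream in) {
        return conf.get(ROW_CLASS_KEY);
    }
}

// Lookup for a self-describing file: the class name is read from the header.
class HeaderResolver implements RowClassResolver {
    public String resolveRowClass(Map<String, String> conf, String path, DataInputStream in)
            throws IOException {
        return in.readUTF(); // assumed header layout: one UTF string
    }
}
```

A record reader built this way would call the resolver once when it opens the file and then hand the same stream to the deserializer, so both the configured case and the self-describing case go through the same code path.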




[jira] Commented: (HADOOP-4065) support for reading binary data from flat files

Posted by "Tom White (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12632217#action_12632217 ] 

Tom White commented on HADOOP-4065:
-----------------------------------

A few comments:

Could the types be called FlatFileInputFormat and FlatFileRecordReader?

Is a SerializationContext class needed? The Serialization can be got from the SerializationFactory. It just needs to know the base class (Writable, TBase etc). A second configuration parameter is needed to specify the concrete class, but I don't see why the FlatFileDeserializerRecordReader can't just get these two classes from the Configuration itself.
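As an illustration of the two-parameter approach, the sketch below reads a base class and a concrete row class from a configuration and checks that they are compatible. It uses java.util.Properties as a stand-in for Configuration, and RowTypeConfig and both key names are made up for this example.

```java
import java.util.Properties;

// Illustration only: Properties stands in for the job configuration.
class RowTypeConfig {
    static final String BASE_KEY = "flatfile.serialization.base.class"; // assumed key name
    static final String ROW_KEY = "flatfile.row.class";                 // assumed key name

    // Load the class named by a configuration key, failing clearly if it is absent.
    private static Class<?> load(Properties conf, String key) {
        String name = conf.getProperty(key);
        if (name == null) {
            throw new IllegalArgumentException("missing configuration key " + key);
        }
        try {
            return Class.forName(name);
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException("unknown class " + name, e);
        }
    }

    // Return the concrete row class after checking it extends the base class.
    static Class<?> rowClass(Properties conf) {
        Class<?> base = load(conf, BASE_KEY);
        Class<?> row = load(conf, ROW_KEY);
        if (!base.isAssignableFrom(row)) {
            throw new IllegalArgumentException(row.getName() + " is not a " + base.getName());
        }
        return row;
    }
}
```

The base class is what a serialization lookup would key on; the concrete class is what gets instantiated for each row.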

Can the classes go in the org.apache.hadoop.contrib.serialization.mapred package to echo the main mapred package? When HADOOP-1230 is done an equivalent could then go in the mapreduce package.

I agree it would be good to have tests for Writable, Java Serialization and Thrift to test the abstraction.

Shouldn't keys be file offsets, similar to TextInputFormat? The row numbers you have are actually the row number within the split, which might be confusing (and they're not unique per file).
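For illustration, a reader that uses byte offsets as keys might look roughly like the sketch below. It is not the patch: OffsetKeyedReader is a hypothetical name, and the file layout (each record prefixed by a 4-byte big-endian length) is an assumption made so the example is self-contained. It also shows one way to tell a clean end of file from a truncated record.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

// Illustration only: assumes length-prefixed records, not any real file format.
class OffsetKeyedReader {
    private final DataInputStream in;
    private long pos;      // byte offset of the next record
    private long key;      // byte offset of the current record
    private byte[] value;  // payload of the current record

    OffsetKeyedReader(InputStream raw, long startOffset) {
        this.in = new DataInputStream(raw);
        this.pos = startOffset;
    }

    // Returns false on a clean EOF (no bytes left before the next record).
    // A record cut off part-way surfaces as EOFException instead.
    boolean next() throws IOException {
        int first = in.read();
        if (first < 0) {
            return false;
        }
        int len = (first << 24)
                | (in.readUnsignedByte() << 16)
                | (in.readUnsignedByte() << 8)
                | in.readUnsignedByte();
        if (len < 0) {
            throw new IOException("corrupt record length at offset " + pos);
        }
        byte[] buf = new byte[len];
        in.readFully(buf);
        key = pos;
        value = buf;
        pos += 4L + len;
        return true;
    }

    long getKey() { return key; }

    byte[] getValue() { return value; }
}
```

Because the key is derived from the position in the file rather than a per-split counter, it stays unique within the file regardless of how the input is divided.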


[jira] Commented: (HADOOP-4065) support for reading binary data from flat files

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629652#action_12629652 ] 

Owen O'Malley commented on HADOOP-4065:
---------------------------------------

I think this is too complicated. What is the justification for these new interfaces? We already have RecordReader that already 
expresses these concepts.

Once the TFile stuff is ready, I think it would make a lot of sense to build an ObjectFile that uses the pluggable serializer
framework to save any objects. At that point, it becomes a potential replacement for SequenceFile. By using the serializer
framework, it should work fine with Java serialization, Thrift, or Protocol Buffers. I don't think having a Thrift file format is very
compelling at that point.


[jira] Commented: (HADOOP-4065) support for reading binary data from flat files

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629654#action_12629654 ] 

Pete Wyckoff commented on HADOOP-4065:
--------------------------------------

Yes, good point.

I will change it to DeserializerTypedFile.

But, the SequenceFileRecordReader is re-usable for all these.  From the reader of a file that does its own deserializing of its types, it's all the same.

With this interface, the SequenceFileRecordReader can read SequenceFiles, DeserializerTypedFiles (Thrift, Protocol Buffers, Record I/O, whatever) and any other self-describing typed files, SequenceFiles being one example of these.

Otherwise, I don't see how to avoid re-implementing the current SequenceFileRecordReader functionality for all these use cases.

-- pete



[jira] Updated: (HADOOP-4065) support for reading binary data from flat files

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pete Wyckoff updated HADOOP-4065:
---------------------------------

    Attachment: HADOOP-4065.1.txt

This patch implements: FlatFileDeserializerRecordReader (and input format), which reads rows of any kind of data using a Deserializer that is given in the JobConf.

For the current test case, I created a simple deserializer for \n separated plain text as a thin wrapper around LineRecordReader.LineReader to show what LineRecordReader would look like using FlatFileDeserializerRecordReader.
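As a rough picture of what a newline-delimited text deserializer could look like, here is a plain-Java sketch. It is not the code in the patch: RowDeserializer is a local, hypothetical interface with an open/deserialize/close shape, and LineDeserializer wraps a BufferedReader instead of LineRecordReader.LineReader.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical local interface, defined here so the sketch is self-contained.
interface RowDeserializer<T> {
    void open(InputStream in) throws IOException;
    T deserialize() throws IOException; // returns null at end of input
    void close() throws IOException;
}

// One row per line of UTF-8 text.
class LineDeserializer implements RowDeserializer<String> {
    private BufferedReader reader;

    public void open(InputStream in) {
        reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    }

    public String deserialize() throws IOException {
        return reader.readLine();
    }

    public void close() throws IOException {
        reader.close();
    }
}
```

A record reader parameterized on such a deserializer only needs to open it on the (possibly decompressed) stream and call deserialize() until it returns null.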

One thing is this is my first foray into Java generics, so I would appreciate an experienced generics person looking at the code.

-- pete

I want to also implement a thrift or record io test case.


[jira] Commented: (HADOOP-4065) support for reading binary data from flat files

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629362#action_12629362 ] 

Joydeep Sen Sarma commented on HADOOP-4065:
-------------------------------------------

@Pete - i am still trying to understand the proposed interface. in my case - the data is not splitable (so maybe there's a different base class - UnsplittableFileInputFormat). The factory approach for getting a record reader sounds interesting - but on second thoughts - since the getRR() already takes in a Configuration object - we don't need a new interface for this i think. (Meaning that we could write a ConfigurableRecordReader that could look at the config and instantiate the right record reader and then redirect all RR api calls to the contained record reader.)

the main motivation i had for filing this bug was to extract out the common parts for dealing with compressed, non-splitable binary data into a base class (and most of this functionality is in the record reader) and make it easy to handle new kinds of binary data with minimal code. Binary files don't have keys and values - they just have rows of data. So one part is to supply some default key (like record number). The remaining part is to get the class of the row and the deserialization method. Appropriately - it would be nice to have a deserializerFactory that takes in a Configuration object (which is i think one of the key missing parts of hadoop-1986). that way - the deserializer can be configured by application - instead of hadoop maintaining a mapping from class -> Deserializer.

so my proposal would be something like this (admittedly - i am not being very ambitious here - just want to cover the issue mentioned in this jira):

/**
  * forced to make this class since maprunner tries to use same object for all next() calls.
  * This way we can swap out the actual 'row' object on each call to next()
  */
public class RowContainer {
   public Object row;
}

/**
 * Application can return right deserializer based on configuration
 */
public interface RowSource {
   public Deserializer<?> getDeserializer(Configuration conf) throws IOException;
   // renamed from getClass(): Object.getClass() is final and cannot be redeclared in an interface
   public Class<?> getRowClass();
}

/**
 * Reads a non-splitable binary flat file.
 */
public class RowSourceFileInputFormat extends FileInputFormat<LongWritable, RowContainer> {
  protected boolean isSplitable(FileSystem fs, Path file) { return false; }
  public RecordReader<LongWritable, RowContainer> getRecordReader(InputSplit genericSplit, JobConf job,
                                                                      Reporter reporter)
    throws IOException {
    reporter.setStatus(genericSplit.toString());
    return new RowSourceRecordReader(job, (FileSplit) genericSplit);
  }
}

/**
 * Reads one row at a time. The key is the row number and the actual row is returned inside the RowContainer
 */
public class RowSourceRecordReader implements RecordReader<LongWritable, RowContainer> {
    private long rnum;
    private final DataInputStream in;  // Deserializer.open() takes an InputStream, not a DataInput
    private final DecompressorStream dcin;
    private final FSDataInputStream fsin;
    private final long end;
    private final Deserializer deserializer;

    public RowSourceRecordReader(Configuration job,
                                FileSplit split) throws IOException {

      final Path file = split.getPath();
      CompressionCodecFactory compressionCodecs = new CompressionCodecFactory(job);
      final CompressionCodec codec = compressionCodecs.getCodec(file);
      FileSystem fs = file.getFileSystem(job);
      fsin = fs.open(split.getPath());

      if(codec != null) {
        dcin = (DecompressorStream)codec.createInputStream(fsin);
        in = new DataInputStream(dcin);
      } else {
        dcin = null;
        in = fsin;
      }
      rnum = 0;
      end = split.getLength();

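      // ReflectionUtils.newInstance takes (class, conf); the original line also had an unbalanced parenthesis
      deserializer = ReflectionUtils.newInstance(job.getClass("mapred.input.rowsource", null, RowSource.class), job).getDeserializer(job);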
      deserializer.open(in);
    }

    public LongWritable createKey() {
      return new LongWritable();
    }

    public RowContainer createValue() {
       return new RowContainer();
    }

    public synchronized boolean next(LongWritable key, RowContainer value) throws IOException {
      if(dcin != null) {
        if (dcin.available() == 0) {
          return false;
        }
      } else {
        if(fsin.getPos() >= end) {
          return false;
        }
      }
      key.set(rnum++);
      Object row = deserializer.deserialize(value.row);
      value.row = row;
      return true;
    }

    public synchronized float getProgress() throws IOException {
      // this assumes no splitting
      if (end == 0) {
        return 0.0f;
      } else {
        // gives progress over uncompressed stream
        return Math.min(1.0f, fsin.getPos()/(float)(end));
      }
    }

    public synchronized long getPos() throws IOException {
      // position over uncompressed stream. not sure what
      // effect this has on stats about job
      return fsin.getPos();
    }

    public synchronized void close() throws IOException {
       // assuming that this closes the underlying streams
       deserializer.close();
    }
}
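
On the EOF question from the issue description: one way to avoid depending on DecompressorStream.available() is to probe for one more byte and push it back. The sketch below is an alternative technique, not what the proposal above does. It uses only JDK classes (GZIPInputStream stands in for a Hadoop codec stream) and the EofProbe name is made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: detects a clean end of stream before the deserializer runs.
final class EofProbe {
  private final PushbackInputStream in;

  EofProbe(InputStream raw) {
    this.in = new PushbackInputStream(raw, 1);
  }

  // True when at least one more byte can be read; the probed byte is pushed back.
  boolean hasMore() throws IOException {
    int b = in.read();
    if (b < 0) return false;
    in.unread(b);
    return true;
  }

  InputStream stream() {
    return in;
  }

  // Counts fixed-width int records in a gzip-compressed buffer.
  static int countInts(byte[] gz) throws IOException {
    EofProbe probe = new EofProbe(new GZIPInputStream(new ByteArrayInputStream(gz)));
    DataInputStream data = new DataInputStream(probe.stream());
    int n = 0;
    while (probe.hasMore()) {
      data.readInt(); // a truncated record still throws EOFException here
      n++;
    }
    return n;
  }

  // Test data: the given ints, written big-endian and gzip-compressed.
  static byte[] gzipInts(int... values) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(new GZIPOutputStream(bytes))) {
      for (int v : values) out.writeInt(v);
    }
    return bytes.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    System.out.println(countInts(gzipInts(10, 20, 30)));
  }
}
```

With this arrangement a clean end of file shows up as hasMore() returning false, while an exception from the deserializer means a record was cut short or corrupted, which is the distinction the issue asks for.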


> support for reading binary data from flat files
> -----------------------------------------------
>
>                 Key: HADOOP-4065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4065
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Joydeep Sen Sarma
>         Attachments: HADOOP-4065.0.txt, ThriftFlatFile.java

[jira] Commented: (HADOOP-4065) support for reading binary data from flat files

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12633566#action_12633566 ] 

Pete Wyckoff commented on HADOOP-4065:
--------------------------------------

This is what HADOOP-3566 looks like as an instance of a FlatFileRecordReader (with signature <Void, String>, not <String, Void>). This assumes there is a StringSerialization implementation (based on LineRecordReader) and that HADOOP-1230 is implemented. It should hopefully demonstrate that FlatFileRecordReader can be used for non-binary records. Even without those two pieces, it can still be used for anything that implements the Serialization interface.

{code:title=StringInputFormat.java}
public class StringInputFormat extends FileInputFormat<Void, String> implements JobConfigurable {
  private CompressionCodecFactory compressionCodecs = null;

  public void configure(JobConf conf) {
    compressionCodecs = new CompressionCodecFactory(conf);
  }

  // FileInputFormat's hook is spelled isSplitable (one 't'); a method named isSplittable is never called.
  protected boolean isSplitable(FileSystem fs, Path file) {
    return compressionCodecs.getCodec(file) == null;
  }

  public RecordReader<Void, String> getRecordReader(InputSplit split,
                                                    JobConf job, Reporter reporter)
    throws IOException {

    reporter.setStatus(split.toString());

    //
    // Set this so the SerializerFromConf can lookup our deserializer.
    //
    job.setClass(FlatFileRecordReader.SerializationContextFromConf.SerializationImplKey,
                 org.apache.hadoop.contrib.serialization.string.StringSerialization.class,
                 org.apache.hadoop.io.serializer.Serialization.class);

    job.setClass(FlatFileRecordReader.SerializationContextFromConf.SerializationSubclassKey,
                 java.lang.String.class, java.lang.String.class);

    return new FlatFileRecordReader<String>(job, (FileSplit) split);
  }
}
{code}

> support for reading binary data from flat files
> -----------------------------------------------
>
>                 Key: HADOOP-4065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4065
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/serialization, mapred
>            Reporter: Joydeep Sen Sarma
>         Attachments: FlatFileReader.java, HADOOP-4065.0.txt, HADOOP-4065.1.txt, HADOOP-4065.1.txt, ThriftFlatFile.java

[jira] Updated: (HADOOP-4065) support for reading binary data from flat files

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pete Wyckoff updated HADOOP-4065:
---------------------------------

    Description: 
Implement generic FlatFileDeserializationRecordReader which assumes a Serialization Implementation is specified in the JobConf and that, once instantiated, that Serialization Implementation can figure out the actual class being deserialized from the JobConf. E.g., the JobConf specifies RecordIOSerialization and then the specific class is LogRecordObject.

Another way one might do this is to use the SerializationFactory to do the lookup of the Serialization Implementation; however, this requires all Deserializers to be known a priori and registered, and goes against the spirit of a very generic FlatFileDeserializeRecordReader.

To ensure it is generic, I propose implementing the following Serialization implementations:

1. RecordIOSerialization
2. LineReaderSerialization
3. ThriftSerialization

The first 2 should go in io/serialization and the Thrift one in contrib somewhere. 
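
A minimal sketch of the configuration-driven lookup described above, assuming a plain Map in place of JobConf and a made-up key name. RowFormat is a hypothetical stand-in for the Serialization interface; nothing here is taken from the patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a Serialization implementation.
interface RowFormat {
  String name();
}

final class LineFormat implements RowFormat {
  public String name() {
    return "line";
  }
}

final class ConfiguredFormats {
  // Hypothetical configuration key; the real patch defines its own.
  static final String IMPL_KEY = "flatfile.serialization.impl";

  // Instantiates whatever implementation the configuration names; no registry involved.
  static RowFormat fromConf(Map<String, String> conf) throws ReflectiveOperationException {
    String className = conf.get(IMPL_KEY);
    if (className == null) {
      throw new IllegalArgumentException("missing " + IMPL_KEY);
    }
    Class<? extends RowFormat> cls = Class.forName(className).asSubclass(RowFormat.class);
    return cls.getDeclaredConstructor().newInstance();
  }

  public static void main(String[] args) throws ReflectiveOperationException {
    Map<String, String> conf = new HashMap<>();
    conf.put(IMPL_KEY, LineFormat.class.getName());
    System.out.println(fromConf(conf).name());
  }
}
```

The point of the design is that a new format only needs to be on the classpath and named in the job configuration; nothing has to be registered with a factory ahead of time.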



  was:
like textinputformat - looking for a concrete implementation to read binary records from a flat file (that may be compressed).

it's assumed that hadoop can't split such a file. so the inputformat can set splittable to false.

tricky aspects are:
- how to know what class the file contains (has to be in a configuration somewhere).
- how to determine EOF (would be nice if hadoop can determine EOF and not have the deserializer throw an exception  (which is hard to distinguish from a exception due to corruptions?)). this is easy for non-compressed streams - for compressed streams - DecompressorStream has a useful looking getAvailable() call - except the class is marked package private.


> support for reading binary data from flat files
> -----------------------------------------------
>
>                 Key: HADOOP-4065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4065
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Joydeep Sen Sarma
>         Attachments: HADOOP-4065.0.txt, HADOOP-4065.1.txt, ThriftFlatFile.java

[jira] Commented: (HADOOP-4065) support for reading binary data from flat files

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12632501#action_12632501 ] 

Owen O'Malley commented on HADOOP-4065:
---------------------------------------

I'm still not convinced about the utility of this class outside of Hive. What is the advantage of storing the data this way?
If you put it in a sequence file or t-file, a single bug in the serialization code for the application type doesn't destroy
your entire file. With this format, that is exactly what will happen. Furthermore, since the types have to be configured,
you can't use multiple ones in different contexts.

Maybe we should just put this into Hive?
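
To make the robustness argument concrete, here is a small sketch of length-prefixed framing, the general idea behind container formats that record each record's length. It is only an illustration under simplified assumptions, not the SequenceFile or TFile layout, and the "ok:" record convention and class name are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class FramedRecords {
  // Writes each record as a 4-byte length followed by its bytes.
  static byte[] frame(String... records) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    for (String r : records) {
      byte[] body = r.getBytes(StandardCharsets.UTF_8);
      out.writeInt(body.length);
      out.write(body);
    }
    out.flush();
    return bytes.toByteArray();
  }

  // Hypothetical application parser: accepts only records that start with "ok:".
  static String parse(byte[] body) {
    String s = new String(body, StandardCharsets.UTF_8);
    if (!s.startsWith("ok:")) {
      throw new IllegalArgumentException("bad record: " + s);
    }
    return s.substring(3);
  }

  // The container knows where each record ends, so a parse failure costs one record.
  static List<String> readSkippingBad(byte[] framed) throws IOException {
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(framed));
    List<String> good = new ArrayList<>();
    while (in.available() > 0) { // exact for an in-memory buffer; not a general EOF test
      byte[] body = new byte[in.readInt()];
      in.readFully(body);
      try {
        good.add(parse(body));
      } catch (IllegalArgumentException e) {
        // skip just this record and continue with the next frame
      }
    }
    return good;
  }

  public static void main(String[] args) throws IOException {
    System.out.println(readSkippingBad(frame("ok:a", "garbage", "ok:b")));
  }
}
```

In a bare flat file the reader only learns where a record ends by deserializing it, so the same parse failure would leave the stream position unknown for everything that follows.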

> support for reading binary data from flat files
> -----------------------------------------------
>
>                 Key: HADOOP-4065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4065
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Joydeep Sen Sarma
>         Attachments: FlatFileReader.java, HADOOP-4065.0.txt, HADOOP-4065.1.txt, HADOOP-4065.1.txt, ThriftFlatFile.java

[jira] Commented: (HADOOP-4065) support for reading binary data from flat files

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12632451#action_12632451 ] 

Joydeep Sen Sarma commented on HADOOP-4065:
-------------------------------------------

couple of comments on the code:

SerializationContext<R> sinfo = (SerializationContext<R>)ReflectionUtils.newInstance(sinfoClass, conf);
sinfo.setConf(conf);

the setConf call is redundant: SerializationContext is Configurable, so ReflectionUtils.newInstance(sinfoClass, conf) already calls setConf(conf) on it

key.set(rnum++);
if (key == null)
    key = createKey();

switch order? (or maybe the createKey()/createValue() is not required?)
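
A sketch of the reordering suggested above, with the null check ahead of the first use. LongBox and RowCounter are hypothetical stand-ins (LongBox for LongWritable) so that the snippet runs without Hadoop.

```java
final class RowCounter {
  // Hypothetical stand-in for LongWritable.
  static final class LongBox {
    long value;

    void set(long v) {
      value = v;
    }
  }

  private long rnum = 0;

  LongBox createKey() {
    return new LongBox();
  }

  // Allocate before use; calling key.set() first would throw on a null key.
  LongBox nextKey(LongBox key) {
    if (key == null) {
      key = createKey();
    }
    key.set(rnum++);
    return key; // returned because reassigning a parameter is invisible to the caller
  }

  public static void main(String[] args) {
    RowCounter counter = new RowCounter();
    LongBox key = counter.nextKey(null);
    counter.nextKey(key);
    System.out.println(key.value);
  }
}
```

Note that in a next(key, value) method returning boolean, a key created inside the method never reaches the caller, which supports the second reading: if the caller always obtains the key from createKey(), the null check can simply go.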

otherwise looks good.

wrt some of Tom's comments:

> The row numbers you have are actually the row number within the split, which might be confusing
the inputformat is not splittable - so we are safe here

> Is a SerializationContext class needed? 

Very much so. Let me walk through the Hive use case:
- Hive knows the deserialization class for each file. However - it knows this through metadata about the _file_.  (The file belongs to a table that has some metadata). This metadata is passed to mappers through the configuration.
- In this case the mapping is not from a class -> deserializer but from a file -> deserializer - and the ability to bootstrap the serialization factory from the configuration is critical (the configuration has both the file name and the metadata about the file)

This also seems to be the hadoop style of doing things (all implementations can be configurable) - and i think if it covers the hive case - it would help others as well. In fact - i think we should try to make this (configurable serialization factory pattern) a more fundamental part of the infrastructure. it seems more general than the class->serialization way of bootstrapping (de)serialization.
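
A rough sketch of the file -> deserializer lookup described above, using a plain Map in place of the job configuration. The "table.deserializer." prefix, the directory layout, and the class names in the values are all invented for illustration; "map.input.file" is assumed here to be the property through which the framework exposes the current input file to a map task. This is not Hive's actual mechanism.

```java
import java.util.HashMap;
import java.util.Map;

final class FileScopedLookup {
  // Assumed name of the property carrying the current input file.
  static final String INPUT_FILE_KEY = "map.input.file";
  // Hypothetical prefix: table.deserializer.<directory> = <deserializer class name>
  static final String PREFIX = "table.deserializer.";

  // Picks the deserializer registered for the longest directory prefix of the input file.
  static String deserializerFor(Map<String, String> conf) {
    String file = conf.get(INPUT_FILE_KEY);
    if (file == null) {
      throw new IllegalStateException("no input file in configuration");
    }
    String best = null;
    int bestLen = -1;
    for (Map.Entry<String, String> e : conf.entrySet()) {
      if (!e.getKey().startsWith(PREFIX)) continue;
      String dir = e.getKey().substring(PREFIX.length());
      if (file.startsWith(dir) && dir.length() > bestLen) {
        best = e.getValue();
        bestLen = dir.length();
      }
    }
    if (best == null) {
      throw new IllegalArgumentException("no deserializer configured for " + file);
    }
    return best;
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    conf.put(INPUT_FILE_KEY, "/warehouse/logs/part-00000");
    conf.put(PREFIX + "/warehouse/", "example.GenericRowDeserializer");
    conf.put(PREFIX + "/warehouse/logs/", "example.ThriftLogDeserializer");
    System.out.println(deserializerFor(conf));
  }
}
```

Nothing in the lookup depends on the Java class of the rows, which is why a class -> Deserializer registry cannot express it.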








> support for reading binary data from flat files
> -----------------------------------------------
>
>                 Key: HADOOP-4065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4065
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Joydeep Sen Sarma
>         Attachments: FlatFileReader.java, HADOOP-4065.0.txt, HADOOP-4065.1.txt, HADOOP-4065.1.txt, ThriftFlatFile.java

[jira] Commented: (HADOOP-4065) support for reading binary data from flat files

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12632760#action_12632760 ] 

Pete Wyckoff commented on HADOOP-4065:
--------------------------------------

bq. And note my intention is to put this in contrib/serialization

Sorry, I meant: my intention was (not is) to put this in contrib/serialization, but if there is an objection, I can move the patch to contrib/hive.



> support for reading binary data from flat files
> -----------------------------------------------
>
>                 Key: HADOOP-4065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4065
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/serialization, mapred
>            Reporter: Joydeep Sen Sarma
>         Attachments: FlatFileReader.java, HADOOP-4065.0.txt, HADOOP-4065.1.txt, HADOOP-4065.1.txt, ThriftFlatFile.java

[jira] Updated: (HADOOP-4065) support for reading binary data from flat files

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pete Wyckoff updated HADOOP-4065:
---------------------------------

    Description: 
like textinputformat - looking for a concrete implementation to read binary records from a flat file (that may be compressed).

it's assumed that hadoop can't split such a file. so the inputformat can set splittable to false.

tricky aspects are:
- how to know what class the file contains (has to be in a configuration somewhere).
- how to determine EOF (would be nice if hadoop can determine EOF and not have the deserializer throw an exception  (which is hard to distinguish from a exception due to corruptions?)). this is easy for non-compressed streams - for compressed streams - DecompressorStream has a useful looking getAvailable() call - except the class is marked package private.



  was:
Implement generic FlatFileDeserializationRecordReader which assumes a Serialization Implementation is specified in the JobConf and that, once instantiated, that Serialization Implementation can figure out the actual class being deserialized from the JobConf. E.g., the JobConf specifies RecordIOSerialization and then the specific class is LogRecordObject.

Another way one might do this is to use the SerializationFactory to do the lookup of the Serialization Implementation; however, this requires all Serialization Implementations to be known a priori and registered, and goes against the spirit of a very generic FlatFileDeserializeRecordReader. (see below re: adding Serialization implementations to contrib).

To ensure it is generic, I propose implementing the following Serialization implementations:

1. RecordIOSerialization
2. LineReaderSerialization
3. ThriftSerialization

The first 2 should go in io/serialization and the Thrift one in contrib somewhere. 




reverting to original description :)


> support for reading binary data from flat files
> -----------------------------------------------
>
>                 Key: HADOOP-4065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4065
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Joydeep Sen Sarma
>         Attachments: HADOOP-4065.0.txt, HADOOP-4065.1.txt, ThriftFlatFile.java

[jira] Commented: (HADOOP-4065) support for reading binary data from flat files

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628658#action_12628658 ] 

Owen O'Malley commented on HADOOP-4065:
---------------------------------------

Ok, a dispatching FileInputFormat could make sense: one that looks at the
filenames or headers of the files and picks the appropriate reader. At that
point, it isn't about binary files, because you'd want it to work for text
files also. What would be the approach? Filenames, like we do with the
compression of text files? Or sampling the first 80 bytes looking for a
header?

-- Owen
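
The two dispatch options raised above can be sketched side by side. The reader names returned here are placeholders and the ".seq" suffix is an invented convention; the byte checks use the gzip magic number (0x1f 0x8b) and the "SEQ" prefix that SequenceFile headers start with.

```java
import java.nio.charset.StandardCharsets;

final class ReaderDispatch {
  // Option 1: choose by file name, as is done for compressed text files.
  static String bySuffix(String fileName) {
    if (fileName.endsWith(".gz")) return "gzip";
    if (fileName.endsWith(".seq")) return "sequencefile"; // hypothetical naming convention
    return "text";
  }

  // Option 2: choose by sniffing the first bytes of the file.
  static String byHeader(byte[] head) {
    if (head.length >= 2 && (head[0] & 0xff) == 0x1f && (head[1] & 0xff) == 0x8b) {
      return "gzip";
    }
    if (head.length >= 3 && head[0] == 'S' && head[1] == 'E' && head[2] == 'Q') {
      return "sequencefile";
    }
    return "text"; // fall back when no known header matches
  }

  public static void main(String[] args) {
    System.out.println(bySuffix("part-00000.gz"));
    System.out.println(byHeader("SEQ".getBytes(StandardCharsets.US_ASCII)));
  }
}
```

Suffix matching needs no I/O but trusts the file name; header sniffing costs one small read per file and only works for formats that actually write a header, which a headerless flat file of binary rows does not.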


> support for reading binary data from flat files
> -----------------------------------------------
>
>                 Key: HADOOP-4065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4065
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Joydeep Sen Sarma

[jira] Issue Comment Edited: (HADOOP-4065) support for reading binary data from flat files

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629362#action_12629362 ] 

jsensarma edited comment on HADOOP-4065 at 9/8/08 6:47 PM:
-------------------------------------------------------------------

@Pete - i am still trying to understand the proposed interface. in my case - the data is not splitable (so maybe there's a different base class - UnsplittableFileInputFormat). The factory approach for getting a record reader sounds interesting - but on second thoughts - since the getRR() already takes in a Configuration object - we don't need a new interface for this i think. (Meaning that we could write a ConfigurableRecordReader that could look at the config and instantiate the right record reader and then redirect all RR api calls to the contained record reader.)

the main motivation i had for filing this bug was to extract out the common parts for dealing with compressed, non-splitable binary data into a base class (and most of this functionality is in the record reader) and make it easy to handle new kinds of binary data with minimal code. Binary files don't have keys and values - they just have rows of data. So one part is to supply some default key (like record number). The remaining part is to get the class of the row and the deserialization method. Appropriately - it would be nice to have a deserializerFactory that takes in a Configuration object (which is i think one of the key missing parts of hadoop-1986). that way - the deserializer can be configured by application - instead of hadoop maintaining a mapping from class -> Deserializer.

so my proposal would be something like this (admittedly - i am not being very ambitious here - just want to cover the issue mentioned in this jira):

{code}
import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.DecompressorStream;
import org.apache.hadoop.io.serializer.Deserializer;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.ReflectionUtils;

/**
 * Needed because MapRunner reuses the same value object for all next() calls.
 * Wrapping the row lets us swap out the actual 'row' object on each call to next().
 */
public class RowContainer {
  public Object row;
}

/**
 * Lets the application return the right deserializer based on configuration.
 */
public interface RowSource {
  public Deserializer<?> getDeserializer(Configuration conf) throws IOException;
  // named getRowClass() because Object.getClass() is final and cannot be redeclared
  public Class<?> getRowClass();
}

/**
 * Reads a non-splittable binary flat file.
 */
public class RowSourceFileInputFormat extends FileInputFormat<LongWritable, RowContainer> {
  @Override
  protected boolean isSplitable(FileSystem fs, Path file) { return false; }

  public RecordReader<LongWritable, RowContainer> getRecordReader(InputSplit genericSplit, JobConf job,
                                                                  Reporter reporter)
    throws IOException {
    reporter.setStatus(genericSplit.toString());
    return new RowSourceRecordReader(job, (FileSplit) genericSplit);
  }
}

/**
 * Reads one row at a time. The key is the row number and the actual row is returned inside the RowContainer.
 */
public class RowSourceRecordReader implements RecordReader<LongWritable, RowContainer> {
  private long rnum;
  private final InputStream in;           // Deserializer.open() takes an InputStream
  private final DecompressorStream dcin;  // null when the file is not compressed
  private final FSDataInputStream fsin;
  private final long end;
  private final Deserializer<Object> deserializer;

  @SuppressWarnings("unchecked")
  public RowSourceRecordReader(Configuration job, FileSplit split) throws IOException {
    final Path file = split.getPath();
    CompressionCodecFactory compressionCodecs = new CompressionCodecFactory(job);
    final CompressionCodec codec = compressionCodecs.getCodec(file);
    FileSystem fs = file.getFileSystem(job);
    fsin = fs.open(file);

    if (codec != null) {
      // assumes the codec hands back a DecompressorStream and that the class is accessible
      dcin = (DecompressorStream) codec.createInputStream(fsin);
      in = dcin;
    } else {
      dcin = null;
      in = fsin;
    }
    rnum = 0;
    end = split.getLength();

    RowSource source = (RowSource) ReflectionUtils.newInstance(
        job.getClass("mapred.input.rowsource", null, RowSource.class), job);
    deserializer = (Deserializer<Object>) source.getDeserializer(job);
    deserializer.open(in);
  }

  public LongWritable createKey() {
    return new LongWritable();
  }

  public RowContainer createValue() {
    return new RowContainer();
  }

  public synchronized boolean next(LongWritable key, RowContainer value) throws IOException {
    if (dcin != null) {
      // compressed: rely on the decompressor to report end of stream
      if (dcin.available() == 0) {
        return false;
      }
    } else {
      // uncompressed: compare the file position with the split length
      if (fsin.getPos() >= end) {
        return false;
      }
    }
    key.set(rnum++);
    value.row = deserializer.deserialize(value.row);
    return true;
  }

  public synchronized float getProgress() throws IOException {
    // this assumes no splitting
    if (end == 0) {
      return 0.0f;
    } else {
      // progress is measured on the underlying (possibly compressed) file
      return Math.min(1.0f, fsin.getPos() / (float) end);
    }
  }

  public synchronized long getPos() throws IOException {
    // position in the underlying (possibly compressed) file. not sure what
    // effect this has on stats about the job
    return fsin.getPos();
  }

  public synchronized void close() throws IOException {
    // assuming that this closes the underlying streams
    deserializer.close();
  }
}
{code}
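
One way to tell a clean end of file apart from a truncated record, without depending on available(), is to peek one byte before each read. The following is a minimal sketch in plain java.io, not Hadoop code: PeekingRecordStream is a hypothetical name, and the 4-byte length prefix is an assumed framing used only to keep the example self-contained (a real deserializer would find its own record boundaries).

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: detects end of stream by peeking one byte ahead. */
final class PeekingRecordStream {
  private final PushbackInputStream in;
  private final DataInputStream data;

  PeekingRecordStream(InputStream raw) {
    this.in = new PushbackInputStream(raw, 1);
    this.data = new DataInputStream(in); // DataInputStream does not buffer, so peeking stays in sync
  }

  /** Returns true if at least one more byte is available; works on any InputStream. */
  boolean hasMore() throws IOException {
    int b = in.read();
    if (b < 0) {
      return false; // clean EOF at a record boundary
    }
    in.unread(b);
    return true;
  }

  /** Reads one record; assumed framing is a 4-byte big-endian length followed by the payload. */
  byte[] readRecord() throws IOException {
    int len = data.readInt();
    if (len < 0) {
      throw new IOException("corrupt record length: " + len);
    }
    byte[] buf = new byte[len];
    data.readFully(buf); // throws EOFException if the record is truncated
    return buf;
  }

  /** Reads every record; an IOException here means corruption, not normal EOF. */
  static List<byte[]> readAll(InputStream raw) {
    try {
      PeekingRecordStream s = new PeekingRecordStream(raw);
      List<byte[]> out = new ArrayList<>();
      while (s.hasMore()) {
        out.add(s.readRecord());
      }
      return out;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

With this shape, next() can return false when hasMore() is false, and any exception raised while deserializing can be treated as corruption. The same wrapper could sit on top of a decompressed stream, at the cost of one extra read per record.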

> support for reading binary data from flat files
> -----------------------------------------------
>
>                 Key: HADOOP-4065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4065
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Joydeep Sen Sarma
>         Attachments: HADOOP-4065.0.txt, ThriftFlatFile.java
>
>
> like textinputformat - looking for a concrete implementation to read binary records from a flat file (that may be compressed).
> it's assumed that hadoop can't split such a file. so the inputformat can set splittable to false.
> tricky aspects are:
> - how to know what class the file contains (has to be in a configuration somewhere).
> - how to determine EOF (would be nice if hadoop can determine EOF and not have the deserializer throw an exception  (which is hard to distinguish from a exception due to corruptions?)). this is easy for non-compressed streams - for compressed streams - DecompressorStream has a useful looking getAvailable() call - except the class is marked package private.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4065) support for reading binary data from flat files

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12632309#action_12632309 ] 

Pete Wyckoff commented on HADOOP-4065:
--------------------------------------

bq. Could the types be called FlatFileInputFormat and FlatFileRecordReader?
Yes, better names.

bq. Is a SerializationContext class needed?

If the Serialization is in contrib, one would need to use ReflectionUtils to instantiate it and it wouldn't be in any Factory, would it?  So, in this case, it needs to know the name of the Class to instantiate it, no?  

bq. Can't just get these two classes from the Configuration itself.

wanted to make it extensible so it could come from the configuration or maybe some place else - the name of the file or some external store or something, depending on the application. Of course, in that case, one could argue a higher level is setting that up anyway, so why don't they just do the lookup and store the info in the configuration.

bq. Shouldn't keys be file offsets, similar to TextInputFormat? The row numbers you have are actually the row number within the split, which might be confusing (and they're not unique per file).

Are the file offsets useful anywhere?  Maybe we should just always return the same instance of some dummy Writable for performance if the key isn't used anyway??



[jira] Commented: (HADOOP-4065) support for reading binary data from flat files

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629398#action_12629398 ] 

Pete Wyckoff commented on HADOOP-4065:
--------------------------------------

It looks like we posted complementary patches :) I think we should keep my RecordReader, but use your implementation of the TypedFileTransport as it's more general since it uses the serialization framework.

This would address the use case of files whose serializers are set in the config file, and of self-describing files like sequence files and TRecordStream and, I would guess, TFile, but I haven't looked at it.

-- pete



[jira] Commented: (HADOOP-4065) support for reading binary data from flat files

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12632529#action_12632529 ] 

Joydeep Sen Sarma commented on HADOOP-4065:
-------------------------------------------

yes - given that this has no dependency on core hadoop now - i really don't care - we can put this into Hive. The generic ThriftDeserializer is trivial - we could duplicate the code for now and then remove it once 3787 provides those classes as well.

btw - we also don't store data in this manner. agree with all your observations. however
- this request originated from outside Hive/Facebook. I get the impression (perhaps wrong) that quite a few people just dump thrift logs into a flat file (just like people dump apache logs into a flat file). This is also because Thrift does not have (so far) a good framed file format.
- the same counter argument can be made for TextInputFormat. The general observation is that data originates outside the hadoop ecosystem and the general format it originates in is flat files. We should strive for the easiest way to absorb this data and transform it into a better format (like SequenceFile).

That is the general effort with Hive at least. We expect users to create temporary tables by pointing to flat files, then quickly do some transformations (using sql and potentially scripts) and load the result into tables in a sequencefile(-like) format for longer term storage. Being able to point to thrift flat files (and potentially other binary files) is part of the data integration story.

> Furthermore, since the types have to be configured, you can't use multiple ones in different contexts. 

not sure what you mean - but this is not true. the deserializer is obtained from a combination of the file name and file name->deserializer metadata from an external source. Different files can be read using different deserializers and then operated on in the same map-reduce program (the application has logic to deal with different classes based on the file name). we will be only too happy to demonstrate a join of two different thrift classes (in different files/tables) using Hive and a generic flat file reader like this.
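
The per-file lookup described in this comment can be sketched roughly as follows. This is plain Java, not Hive or Hadoop code: DecoderRegistry, the suffix-based matching and the use of Function<byte[], Object> in place of a real Deserializer are all assumptions for illustration, and the in-memory map stands in for the external metadata source.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical registry: picks a row decoder from the name of the file being read. */
final class DecoderRegistry {
  // file-name suffix -> decoder; stands in for file name->deserializer metadata
  private final Map<String, Function<byte[], Object>> bySuffix = new LinkedHashMap<>();

  /** Registers a decoder for files whose name ends with the given suffix. */
  DecoderRegistry register(String suffix, Function<byte[], Object> decoder) {
    bySuffix.put(suffix, decoder);
    return this;
  }

  /** Returns the first registered decoder whose suffix matches the file name. */
  Function<byte[], Object> forFile(String fileName) {
    for (Map.Entry<String, Function<byte[], Object>> e : bySuffix.entrySet()) {
      if (fileName.endsWith(e.getKey())) {
        return e.getValue();
      }
    }
    throw new IllegalArgumentException("no decoder registered for " + fileName);
  }
}
```

A record reader would call forFile() once with the path of its split, so two inputs with different row classes can feed the same job without any single globally configured type.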


[jira] Commented: (HADOOP-4065) support for reading binary data from flat files

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12631938#action_12631938 ] 

Owen O'Malley commented on HADOOP-4065:
---------------------------------------

I've lost the motivation for this. It complicates the public interfaces a lot and doesn't have any payback.

In our experience, given a file format, the code is pretty independent, but it is tied to the fragment splitting. 

Is the goal of this jira to:
  1. Make a generic / self-detecting format?
  2. Make a generic file format?

In either case, changes to the serialization framework seem like serious overkill.


[jira] Commented: (HADOOP-4065) support for reading binary data from flat files

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628484#action_12628484 ] 

Owen O'Malley commented on HADOOP-4065:
---------------------------------------

How are you going to define/find record boundaries? Is it going to be key/value or single blob? How is it different from KeyValueInputFormat?


[jira] Commented: (HADOOP-4065) support for reading binary data from flat files

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629339#action_12629339 ] 

Pete Wyckoff commented on HADOOP-4065:
--------------------------------------

Actual interface

{code:title=TypedSplittableFile.java}
public interface TypedSplittableFile {
  public void initialize(FileSystem fileSys, Path path, Configuration conf) throws IOException;

  public Class getKeyClass();
  public Class getValueClass();

  public Object next(Object key) throws IOException;
  public Object getCurrentValue(Object val) throws IOException;

  public boolean syncSeen(); // i.e., atEOF()
  public void sync(long position) throws IOException; // skip to past last frame boundary
  public long getPosition() throws IOException;
  public void seek(long position) throws IOException;
  public void close() throws IOException;
}
{code}



[jira] Commented: (HADOOP-4065) support for reading binary data from flat files

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628526#action_12628526 ] 

Pete Wyckoff commented on HADOOP-4065:
--------------------------------------

I think what he means is that file format and record boundary finding could be decoupled, and the latter made into some kind of interface that may be related to the deserializer. He's talking about binary data for which only really the deserializer can figure out record boundaries. 



[jira] Commented: (HADOOP-4065) support for reading binary data from flat files

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629661#action_12629661 ] 

Doug Cutting commented on HADOOP-4065:
--------------------------------------

> With this interface, the SequenceFileRecordReader can read SequenceFiles, DeserializerTypedFiles (thrift, proto buffers, record io whatever) and any other self-describing typed files; SequenceFiles being one example of these.

I'm all for reducing code duplication.  So if SequenceFileRecordReader can mostly be replaced with code that's shared by other file formats, that'd be great.  But we need those file formats to exist before we perform this factoring.  Is there a splittable thrift or protocol-buffer input file format implementation yet that can share code with SequenceFileInputFormat?  Let's not refactor until we have these.


[jira] Commented: (HADOOP-4065) support for reading binary data from flat files

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12634394#action_12634394 ] 

Hadoop QA commented on HADOOP-4065:
-----------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12390798/HADOOP-4065.2.txt
  against trunk revision 698721.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 9 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3367/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3367/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3367/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3367/console

This message is automatically generated.

> support for reading binary data from flat files
> -----------------------------------------------
>
>                 Key: HADOOP-4065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4065
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hive
>            Reporter: Joydeep Sen Sarma
>            Assignee: Pete Wyckoff
>             Fix For: 0.20.0
>
>         Attachments: FlatFileReader.java, HADOOP-4065.0.txt, HADOOP-4065.1.txt, HADOOP-4065.1.txt, HADOOP-4065.2.txt, ThriftFlatFile.java
>
>
> like textinputformat - looking for a concrete implementation to read binary records from a flat file (that may be compressed).
> it's assumed that hadoop can't split such a file. so the inputformat can set splittable to false.
> tricky aspects are:
> - how to know what class the file contains (has to be in a configuration somewhere).
> - how to determine EOF (would be nice if hadoop can determine EOF and not have the deserializer throw an exception  (which is hard to distinguish from a exception due to corruptions?)). this is easy for non-compressed streams - for compressed streams - DecompressorStream has a useful looking getAvailable() call - except the class is marked package private.


[jira] Commented: (HADOOP-4065) support for reading binary data from flat files

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629342#action_12629342 ] 

Pete Wyckoff commented on HADOOP-4065:
--------------------------------------

bq. The TFile is also supposed to be readable/writable in other languages with the spec being clear enough on how to read and write Tfiles. We intend to provide just java implementation of Tfiles though.

Fair enough. If we had TFile in a number of languages it could probably be put in as a thrift transport. TRS is much simpler though and thus much easier to write in other languages.


> support for reading binary data from flat files
> -----------------------------------------------
>
>                 Key: HADOOP-4065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4065
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Joydeep Sen Sarma
>
> like textinputformat - looking for a concrete implementation to read binary records from a flat file (that may be compressed).
> it's assumed that hadoop can't split such a file. so the inputformat can set splittable to false.
> tricky aspects are:
> - how to know what class the file contains (has to be in a configuration somewhere).
> - how to determine EOF (would be nice if hadoop can determine EOF and not have the deserializer throw an exception  (which is hard to distinguish from a exception due to corruptions?)). this is easy for non-compressed streams - for compressed streams - DecompressorStream has a useful looking getAvailable() call - except the class is marked package private.


[jira] Commented: (HADOOP-4065) support for reading binary data from flat files

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12632058#action_12632058 ] 

Joydeep Sen Sarma commented on HADOOP-4065:
-------------------------------------------

Hi Owen - the motivation was based on different families of binary (or even non binary) data embedded within flat files (by which i mean they are unsplittable and not self-describing, except for compression).

- We should be able to write one concrete implementation that covers no-splits and compression related code
- One should be able to plug in different deserializers for different binary formats

The desire was that once this is written out - different deserializers can be plugged in easily. In that sense - this does not follow the general pattern that you observed of having to write custom code to deal with splitting (since there's no splitting here). Existing interfaces should not have to be changed (although things got pretty complicated in the intermediate discussion) - and i don't think they are although i am going to send back feedback on the code separately. The code should be really simple i would think.

Do you think this is a reasonable thing to add?
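
The two goals above (one reader that owns the no-split and compression handling, with a pluggable deserializer per format) can be sketched as follows. This is a JDK-only illustration, not the attached patch: `RecordDeserializer`, `SketchFlatFileReader` and the gzip flag are hypothetical names, and in Hadoop the decompressed stream would come from a compression codec rather than `GZIPInputStream`. Peeking one byte between records lets the reader report a clean EOF itself, so an `EOFException` raised by the deserializer can be treated as a truncated or corrupt record.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.PushbackInputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical plug-in point: reads exactly one record from the stream. */
interface RecordDeserializer<T> {
  T deserialize(InputStream in) throws IOException;
}

/** Hypothetical reader for an unsplittable flat file; format knowledge lives in the deserializer. */
class SketchFlatFileReader<T> {
  private final PushbackInputStream in;
  private final RecordDeserializer<T> deserializer;

  SketchFlatFileReader(InputStream raw, boolean gzipped, RecordDeserializer<T> deserializer)
      throws IOException {
    // Compression is handled once here, independent of the record format.
    this.in = new PushbackInputStream(gzipped ? new GZIPInputStream(raw) : raw);
    this.deserializer = deserializer;
  }

  /** Returns the next record, or null at a clean end of file. */
  T next() throws IOException {
    int first = in.read();   // peek one byte to detect EOF between records
    if (first < 0) {
      return null;
    }
    in.unread(first);
    // An EOFException from here on means the record itself was cut short.
    return deserializer.deserialize(in);
  }
}

public class FlatFileReaderDemo {
  /** Writes each value with DataOutputStream.writeUTF, optionally gzip-compressed. */
  static byte[] encode(boolean gzipped, String... values) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    OutputStream sink = gzipped ? new GZIPOutputStream(bytes) : bytes;
    try (DataOutputStream out = new DataOutputStream(sink)) {
      for (String value : values) {
        out.writeUTF(value);
      }
    }
    return bytes.toByteArray();
  }

  /** Reads every record using a deserializer for writeUTF-encoded strings. */
  static List<String> readAll(byte[] data, boolean gzipped) throws IOException {
    RecordDeserializer<String> utf = in -> new DataInputStream(in).readUTF();
    SketchFlatFileReader<String> reader =
        new SketchFlatFileReader<>(new ByteArrayInputStream(data), gzipped, utf);
    List<String> records = new ArrayList<>();
    for (String r = reader.next(); r != null; r = reader.next()) {
      records.add(r);
    }
    return records;
  }

  public static void main(String[] args) throws IOException {
    System.out.println(readAll(encode(true, "alpha", "beta"), true));
    System.out.println(readAll(encode(false, "gamma"), false));
  }
}
```

Swapping in another format only means passing a different `RecordDeserializer`; the reader and its EOF handling stay unchanged.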


> support for reading binary data from flat files
> -----------------------------------------------
>
>                 Key: HADOOP-4065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4065
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Joydeep Sen Sarma
>         Attachments: HADOOP-4065.0.txt, HADOOP-4065.1.txt, ThriftFlatFile.java
>
>
> like textinputformat - looking for a concrete implementation to read binary records from a flat file (that may be compressed).
> it's assumed that hadoop can't split such a file. so the inputformat can set splittable to false.
> tricky aspects are:
> - how to know what class the file contains (has to be in a configuration somewhere).
> - how to determine EOF (would be nice if hadoop can determine EOF and not have the deserializer throw an exception  (which is hard to distinguish from a exception due to corruptions?)). this is easy for non-compressed streams - for compressed streams - DecompressorStream has a useful looking getAvailable() call - except the class is marked package private.


[jira] Commented: (HADOOP-4065) support for reading binary data from flat files

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628806#action_12628806 ] 

Pete Wyckoff commented on HADOOP-4065:
--------------------------------------

I propose we re-use the code from SequenceFileRecordReader by making it depend on a SplittableTypedFile interface (below), which conveniently is already implemented by SequenceFile.  Then we're basically done.

I am not super-familiar with this code and the devil is probably in the details, but looking at SequenceFileRecordReader, there are only about 5 methods it uses from SequenceFile, and those are all well defined and seem needed for any implementation of a self-describing file that is splittable.

We could also leave SequenceFileRecordReader untouched, but it seems we'd just be duplicating all of its code.


{code:title=TypedFile Interfaces}

public interface TypedFile {
  public void initialize(Configuration conf, InputStream in);
  public Class getKeyClass();
  public Class getValueClass();

  public boolean next(Writable key);
  public boolean next(Writable key, Writable value);
  public Writable getCurrentValue();
}

// An interface extends (not implements) another interface.
public interface SplittableTypedFile extends TypedFile {
  public boolean syncSeen(); // i.e., atEOF()
  public boolean sync(long position); // skip to past last frame boundary
}

{code}

{code:title=TypedSplittableRecordReader}

// This is almost a complete cut-n-paste of the existing SequenceFileRecordReader, which would be removed

public class TypedSplittableRecordReader<K, V> implements RecordReader<K, V> {
  private SplittableTypedFile in;

  public TypedSplittableRecordReader(Configuration conf, FileSplit split,
      SplittableTypedFileFactory<K, V> fileFactory) throws IOException {
    // fs and path are derived from the split, as SequenceFileRecordReader does
    Path path = split.getPath();
    FileSystem fs = path.getFileSystem(conf);
    this.in = fileFactory.getFileReader(fs, path, conf);
  }
  // the rest is basically exactly like the current sequence file implementation.
}

{code}

{code:title=SequenceFile}

-public class SequenceFile {
+public class SequenceFile implements SplittableTypedFile {

{code}

{code:title=SequenceFileInputFormat}

public class SequenceFileInputFormat<K, V> extends FileInputFormat<K, V> {
  public RecordReader<K, V> getRecordReader(InputSplit split, JobConf job, Reporter reporter)
      throws IOException {
    return new TypedSplittableRecordReader<K, V>(job, (FileSplit) split, new SequenceFileFactory<K, V>());
  }
}

{code}

{code:title=SelfDescribingFileExample}
public class TFixedFrameTransportInputFormat implements SplittableTypedFile {
  // implementing all the above should be straightforward
}

public class TFixedFrameTransportFileInputFormat<K, V> extends FileInputFormat<K, V> {
  public RecordReader<K, V> getRecordReader(InputSplit split, JobConf job, Reporter reporter)
      throws IOException {
    return new TypedSplittableRecordReader<K, V>(job, (FileSplit) split, new TFixedFrameFileFactory<K, V>());
  }
}

{code}

One problem is that for non-splittable files I would have to create another record reader with almost the same code. It may be better to put everything in one interface, add a boolean isSplittable, and have sync just do a seek(0) and syncSeen just check for EOF.
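
A rough sketch of that single-interface idea, using only the JDK so it runs on its own. `UnifiedTypedFile`, `FlatTypedFile` and `readSplit` are hypothetical names, and an in-memory list stands in for the file; the point is only that one read loop can serve both kinds of file when a flat file treats `sync` as a rewind and `syncSeen` as an EOF check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical merged interface: one contract for splittable and flat files. */
interface UnifiedTypedFile {
  boolean isSplittable();
  void sync(long position);  // splittable: skip to the next sync marker; flat: rewind
  boolean syncSeen();        // splittable: a marker was just passed; flat: at EOF
  long getPosition();
  String next();             // next record, or null when exhausted
}

/** Flat (unsplittable) file; the list of records stands in for a real stream. */
class FlatTypedFile implements UnifiedTypedFile {
  private final List<String> records;
  private int pos = 0;

  FlatTypedFile(List<String> records) {
    this.records = records;
  }

  public boolean isSplittable() { return false; }
  public void sync(long position) { pos = 0; }                 // the seek(0) case
  public boolean syncSeen() { return pos >= records.size(); }  // just look at EOF
  public long getPosition() { return pos; }
  public String next() { return pos < records.size() ? records.get(pos++) : null; }
}

public class UnifiedReaderDemo {
  /** One loop for both kinds of file: sync to the split start, read until the split end. */
  static List<String> readSplit(UnifiedTypedFile file, long start, long end) {
    List<String> out = new ArrayList<>();
    file.sync(start);
    while (true) {
      long before = file.getPosition();
      String record = file.next();
      // Stop when the file is exhausted, or when a record starting at or past
      // the split end crossed a sync point (it belongs to the next split).
      if (record == null || (before >= end && file.syncSeen())) {
        break;
      }
      out.add(record);
    }
    return out;
  }

  public static void main(String[] args) {
    UnifiedTypedFile flat = new FlatTypedFile(Arrays.asList("r1", "r2", "r3"));
    System.out.println(readSplit(flat, 0, 3));
  }
}
```

Because `sync` rewinds regardless of the requested offset, the input format still has to hand out exactly one split per flat file (checking `isSplittable()`); otherwise every split would re-read the whole file.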




> support for reading binary data from flat files
> -----------------------------------------------
>
>                 Key: HADOOP-4065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4065
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Joydeep Sen Sarma
>
> like textinputformat - looking for a concrete implementation to read binary records from a flat file (that may be compressed).
> it's assumed that hadoop can't split such a file. so the inputformat can set splittable to false.
> tricky aspects are:
> - how to know what class the file contains (has to be in a configuration somewhere).
> - how to determine EOF (would be nice if hadoop can determine EOF and not have the deserializer throw an exception  (which is hard to distinguish from a exception due to corruptions?)). this is easy for non-compressed streams - for compressed streams - DecompressorStream has a useful looking getAvailable() call - except the class is marked package private.


[jira] Commented: (HADOOP-4065) support for reading binary data from flat files

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629296#action_12629296 ] 

Pete Wyckoff commented on HADOOP-4065:
--------------------------------------

I just wanted to post pseudo-code for this design that actually addresses this JIRA :) and not only self describing files like SequenceFile and Thrift's TRecordStream.

In the case for this JIRA, the file's metadata is stored in some external store or dictionary or something.  The only way to look up the file would be through the filename/path, so I think it's fair that on job submission, the mapping is put in the JobConf.

Given this use case, and looking at line 43 of SequenceFileRecordReader ( this.in = new SequenceFile.Reader(fs, path, conf); ), the TypedFile interface should be changed:

- public void initialize(Configuration conf, InputStream in);
+ public void initialize(FileSystem fs, Path path, Configuration conf);

Obviously it has to open the input stream anyway ( :) ).

And a typo: SequenceFile would not implement SplittableTypedFile; SequenceFile.Reader would.
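
A small sketch of the path-to-class mapping described above. `java.util.Properties` stands in for the JobConf, and the key prefix `flatfile.record.class.` is made up for illustration (the thread does not name a configuration key); falling back to parent directories is also an assumption, added so that one entry can cover every file under a directory.

```java
import java.util.Properties;

public class PathClassLookup {
  // Hypothetical key prefix, not an actual Hadoop configuration key.
  static final String PREFIX = "flatfile.record.class.";

  /** At job submission: record which class the files at this path contain. */
  static void register(Properties conf, String path, Class<?> recordClass) {
    conf.setProperty(PREFIX + path, recordClass.getName());
  }

  /** In the reader: resolve the class from the path, trying the file and then each parent. */
  static Class<?> lookup(Properties conf, String path) throws ClassNotFoundException {
    String p = path;
    while (!p.isEmpty()) {
      String className = conf.getProperty(PREFIX + p);
      if (className != null) {
        return Class.forName(className);
      }
      int slash = p.lastIndexOf('/');
      p = slash < 0 ? "" : p.substring(0, slash);
    }
    throw new IllegalArgumentException("no record class configured for " + path);
  }

  public static void main(String[] args) throws ClassNotFoundException {
    Properties conf = new Properties();
    // java.util.Date is only a placeholder for a generated record class.
    register(conf, "/logs/app", java.util.Date.class);
    System.out.println(lookup(conf, "/logs/app/part-00000").getName());
  }
}
```

This is why the reader needs the path, and not just an already-open InputStream, at initialization time.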
                                         





> support for reading binary data from flat files
> -----------------------------------------------
>
>                 Key: HADOOP-4065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4065
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Joydeep Sen Sarma
>
> like textinputformat - looking for a concrete implementation to read binary records from a flat file (that may be compressed).
> it's assumed that hadoop can't split such a file. so the inputformat can set splittable to false.
> tricky aspects are:
> - how to know what class the file contains (has to be in a configuration somewhere).
> - how to determine EOF (would be nice if hadoop can determine EOF and not have the deserializer throw an exception  (which is hard to distinguish from a exception due to corruptions?)). this is easy for non-compressed streams - for compressed streams - DecompressorStream has a useful looking getAvailable() call - except the class is marked package private.


[jira] Updated: (HADOOP-4065) support for reading binary data from flat files

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pete Wyckoff updated HADOOP-4065:
---------------------------------

    Attachment: ThriftFlatFile.java

Pseudo-esque code that implements Thrift in FlatFiles, where the filename is used as the key to get the Thrift class from the configuration.  Implements <LongWritable, ThriftWritable>.

Of course, it may be that we want to use the serialization classes.


> support for reading binary data from flat files
> -----------------------------------------------
>
>                 Key: HADOOP-4065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4065
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Joydeep Sen Sarma
>         Attachments: HADOOP-4065.0.txt, ThriftFlatFile.java
>
>
> like textinputformat - looking for a concrete implementation to read binary records from a flat file (that may be compressed).
> it's assumed that hadoop can't split such a file. so the inputformat can set splittable to false.
> tricky aspects are:
> - how to know what class the file contains (has to be in a configuration somewhere).
> - how to determine EOF (would be nice if hadoop can determine EOF and not have the deserializer throw an exception  (which is hard to distinguish from a exception due to corruptions?)). this is easy for non-compressed streams - for compressed streams - DecompressorStream has a useful looking getAvailable() call - except the class is marked package private.


[jira] Commented: (HADOOP-4065) support for reading binary data from flat files

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12631544#action_12631544 ] 

Doug Cutting commented on HADOOP-4065:
--------------------------------------

Please don't edit descriptions.  It's very difficult to tell what's changed.  The description should describe the problem.  The discussion below should present solutions.  Editing descriptions and comments makes it very hard to follow an issue.  This is discussed in the "Jira Guidelines" section of http://wiki.apache.org/hadoop/HowToContribute.

> support for reading binary data from flat files
> -----------------------------------------------
>
>                 Key: HADOOP-4065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4065
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Joydeep Sen Sarma
>         Attachments: HADOOP-4065.0.txt, HADOOP-4065.1.txt, ThriftFlatFile.java
>
>
> Implement a generic FlatFileDeserializationRecordReader which assumes a Serialization implementation is specified in the JobConf and that, once instantiated, that Serialization implementation can figure out the actual class being deserialized from the JobConf.  e.g., the JobConf specifies RecordIOSerialization and then the specific class is LogRecordObject.
> Another way one might do this is to use the SerializationFactory to do the lookup of the Serialization implementation; however, this requires all Serialization implementations to be known a priori and registered, and goes against the spirit of a very generic FlatFileDeserializeRecordReader. (see below re: adding Serialization implementations to contrib).
> To ensure it is generic, I propose implementing the following Serialization implementations:
> 1. RecordIOSerialization
> 2. LineReaderSerialization
> 3. ThriftSerialization
> The first 2 should go in io/serialization and the Thrift one in contrib somewhere. 


[jira] Commented: (HADOOP-4065) support for reading binary data from flat files

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12632067#action_12632067 ] 

Pete Wyckoff commented on HADOOP-4065:
--------------------------------------

bq. In either case, changes to the serialization framework seems like serious overkill.

you are right - I don't know why I went that direction. It should be exactly the opposite of the way I coded it up.

All one needs is some way of getting SerializationContext information (to instantiate the right Serialization Object and then the actual subclass we want to deserialize; e.g., Record/MyRecordObj). [this was called RowSource in joy's code example]

And then a simple record reader that uses that info to instantiate a deserializer and done.

No changes to the serialization framework.

I actually have that with unit tests and just need to clean up the documentation and such.

-- pete
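
For illustration, the two pieces named in this comment (context information that picks the serialization and the concrete record class, plus a reader that just instantiates them) might look roughly like this. It is a JDK-only sketch, not the attached patch: `SketchSerialization` is a stand-in rather than Hadoop's Serialization interface, `Properties` stands in for the JobConf, and both configuration keys are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.math.BigDecimal;
import java.util.Properties;

/** Stand-in for a serialization plug-in; not Hadoop's Serialization interface. */
interface SketchSerialization {
  Object readRecord(DataInputStream in, Class<?> recordClass) throws IOException;
}

/** Example plug-in: each record is a writeUTF string handed to a (String) constructor. */
class UtfCtorSerialization implements SketchSerialization {
  public Object readRecord(DataInputStream in, Class<?> recordClass) throws IOException {
    try {
      return recordClass.getConstructor(String.class).newInstance(in.readUTF());
    } catch (ReflectiveOperationException e) {
      throw new IOException("cannot build " + recordClass.getName(), e);
    }
  }
}

public class SerializationContextDemo {
  // Hypothetical configuration keys.
  static final String SERIALIZATION_KEY = "flatfile.serialization";
  static final String RECORD_CLASS_KEY = "flatfile.record.class";

  /** Instantiates the configured serialization by reflection and reads one record with it. */
  static Object readFirst(Properties conf, byte[] data) throws Exception {
    SketchSerialization serialization = (SketchSerialization)
        Class.forName(conf.getProperty(SERIALIZATION_KEY)).getDeclaredConstructor().newInstance();
    Class<?> recordClass = Class.forName(conf.getProperty(RECORD_CLASS_KEY));
    return serialization.readRecord(
        new DataInputStream(new ByteArrayInputStream(data)), recordClass);
  }

  /** Builds a one-record file holding the given text. */
  static byte[] oneRecord(String text) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeUTF(text);
    }
    return bytes.toByteArray();
  }

  public static void main(String[] args) throws Exception {
    Properties conf = new Properties();
    conf.setProperty(SERIALIZATION_KEY, UtfCtorSerialization.class.getName());
    conf.setProperty(RECORD_CLASS_KEY, BigDecimal.class.getName());
    System.out.println(readFirst(conf, oneRecord("12.50")));
  }
}
```

Nothing here needs a registry of known serializations: the reader only sees two class names, which matches the "no changes to the serialization framework" point.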


> support for reading binary data from flat files
> -----------------------------------------------
>
>                 Key: HADOOP-4065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4065
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Joydeep Sen Sarma
>         Attachments: HADOOP-4065.0.txt, HADOOP-4065.1.txt, ThriftFlatFile.java
>
>
> like textinputformat - looking for a concrete implementation to read binary records from a flat file (that may be compressed).
> it's assumed that hadoop can't split such a file. so the inputformat can set splittable to false.
> tricky aspects are:
> - how to know what class the file contains (has to be in a configuration somewhere).
> - how to determine EOF (would be nice if hadoop can determine EOF and not have the deserializer throw an exception  (which is hard to distinguish from a exception due to corruptions?)). this is easy for non-compressed streams - for compressed streams - DecompressorStream has a useful looking getAvailable() call - except the class is marked package private.


[jira] Updated: (HADOOP-4065) support for reading binary data from flat files

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pete Wyckoff updated HADOOP-4065:
---------------------------------

    Attachment: HADOOP-4065.1.txt

Here's the simple patch; the code lives in contrib/serialization/readers. There is no build file with this yet, as I will coordinate with Tom White on that.

Although the unit test is only for Java now, I did have it working with Thrift and RecordIO; since neither of those is checked in yet, I didn't include those tests.



> support for reading binary data from flat files
> -----------------------------------------------
>
>                 Key: HADOOP-4065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4065
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Joydeep Sen Sarma
>         Attachments: FlatFileReader.java, HADOOP-4065.0.txt, HADOOP-4065.1.txt, HADOOP-4065.1.txt, ThriftFlatFile.java
>
>
> like textinputformat - looking for a concrete implementation to read binary records from a flat file (that may be compressed).
> it's assumed that hadoop can't split such a file. so the inputformat can set splittable to false.
> tricky aspects are:
> - how to know what class the file contains (has to be in a configuration somewhere).
> - how to determine EOF (would be nice if hadoop can determine EOF and not have the deserializer throw an exception  (which is hard to distinguish from a exception due to corruptions?)). this is easy for non-compressed streams - for compressed streams - DecompressorStream has a useful looking getAvailable() call - except the class is marked package private.


[jira] Updated: (HADOOP-4065) support for reading binary data from flat files

Posted by "Pete Wyckoff (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pete Wyckoff updated HADOOP-4065:
---------------------------------

    Attachment: FlatFileReader.java

this is all this is - don't know why i went the other way :(


> support for reading binary data from flat files
> -----------------------------------------------
>
>                 Key: HADOOP-4065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4065
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Joydeep Sen Sarma
>         Attachments: FlatFileReader.java, HADOOP-4065.0.txt, HADOOP-4065.1.txt, ThriftFlatFile.java
>
>
> like textinputformat - looking for a concrete implementation to read binary records from a flat file (that may be compressed).
> it's assumed that hadoop can't split such a file. so the inputformat can set splittable to false.
> tricky aspects are:
> - how to know what class the file contains (has to be in a configuration somewhere).
> - how to determine EOF (would be nice if hadoop can determine EOF and not have the deserializer throw an exception  (which is hard to distinguish from a exception due to corruptions?)). this is easy for non-compressed streams - for compressed streams - DecompressorStream has a useful looking getAvailable() call - except the class is marked package private.
