Posted to common-issues@hadoop.apache.org by "Aaron Kimball (JIRA)" <ji...@apache.org> on 2010/04/16 00:20:52 UTC

[jira] Created: (HADOOP-6708) New file format for very large records

New file format for very large records
--------------------------------------

                 Key: HADOOP-6708
                 URL: https://issues.apache.org/jira/browse/HADOOP-6708
             Project: Hadoop Common
          Issue Type: New Feature
          Components: io
            Reporter: Aaron Kimball
            Assignee: Aaron Kimball


A file format that handles multi-gigabyte records efficiently, with lazy disk access


[jira] Commented: (HADOOP-6708) New file format for very large records

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857637#action_12857637 ] 

Aaron Kimball commented on HADOOP-6708:
---------------------------------------

Also, how does TFile handle splits and resynchronization? It doesn't seem like there's an InputFormat for it.



[jira] Updated: (HADOOP-6708) New file format for very large records

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball updated HADOOP-6708:
----------------------------------

    Attachment: lobfile.pdf


[jira] Commented: (HADOOP-6708) New file format for very large records

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858015#action_12858015 ] 

Aaron Kimball commented on HADOOP-6708:
---------------------------------------

I'm definitely able to help test. I can do some manual testing of large records / performance after there's a patch available.

I wasn't suggesting that we use CountingInputStream in there; I was mostly using it as an example of dual interfaces that retrieve the same value -- one as an integer ({{getCount()}}) and the other as a long ({{getByteCount()}}). That's not my preference; I'd prefer to modify the existing methods. But since that is an incompatible change, I figured I'd check first.
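
For illustration, here is a minimal sketch of that dual-interface pattern: an int-valued accessor kept for compatibility next to a long-valued one. The class and method names are hypothetical (loosely modeled on CountingInputStream), not TFile or Commons IO code.

{code}
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch of the dual-interface pattern; not TFile code.
public class ByteCountingInputStream extends FilterInputStream {
  private long byteCount = 0;

  public ByteCountingInputStream(InputStream in) {
    super(in);
  }

  @Override
  public int read() throws IOException {
    int b = super.read();
    if (b != -1) {
      byteCount++;
    }
    return b;
  }

  @Override
  public int read(byte[] buf, int off, int len) throws IOException {
    int n = super.read(buf, off, len);
    if (n > 0) {
      byteCount += n;
    }
    return n;
  }

  // Legacy int-valued accessor; overflows once more than 2 GB has been read.
  public int getCount() {
    return (int) byteCount;
  }

  // Long-valued accessor added alongside the int one for compatibility.
  public long getByteCount() {
    return byteCount;
  }
}
{code}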




[jira] Resolved: (HADOOP-6708) New file format for very large records

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-6708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Kimball resolved HADOOP-6708.
-----------------------------------

    Resolution: Won't Fix

After thinking more about this, I don't think this issue is going to suit Sqoop's needs for the time being. I'd like the next release of Sqoop to be compatible with the Hadoop 0.21 release, which wouldn't happen if we depended on this. I also don't know that modifying TFile in this way is feasible on our internal timeline. Hong, thanks for the guidance thus far.



[jira] Commented: (HADOOP-6708) New file format for very large records

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857621#action_12857621 ] 

Aaron Kimball commented on HADOOP-6708:
---------------------------------------

I don't think that works in this scenario. Suppose I have a record that is 8 GB long; I read the first kilobyte or two out of the record, then intend to discard the rest and start with the next record.

If we have a chunk size of 1 MB, then skipping through this record will require roughly 8,000 seeks; at about 8 ms per seek, that's ~64,000 milliseconds. Even upping the chunk size drastically to 100 MB still means a ~640 ms delay.
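
To spell out that arithmetic (a back-of-the-envelope sketch; the ~8 ms per seek figure is an assumption, and the class name is just for illustration):

{code}
// Rough estimate of the cost of chunk-wise skipping through one record,
// assuming ~8 ms per seek.
public class SkipCostEstimate {
  public static void main(String[] args) {
    long recordBytes = 8L * 1024 * 1024 * 1024;                 // 8 GB record
    double seekMillis = 8.0;                                    // assumed seek latency

    for (long chunkBytes : new long[] {1L << 20, 100L << 20}) { // 1 MB and 100 MB chunks
      long seeks = recordBytes / chunkBytes;
      System.out.printf("chunk=%d MB -> %d seeks, ~%.0f ms%n",
          chunkBytes >> 20, seeks, seeks * seekMillis);
    }
  }
}
{code}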



[jira] Commented: (HADOOP-6708) New file format for very large records

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857609#action_12857609 ] 

Aaron Kimball commented on HADOOP-6708:
---------------------------------------

Hong,

bq. * Length fields are encoded as integers, not longs. This does not support records > 2 GB.
bq. This is an intentional restriction. All integers are in VInt/VLong format, which is fully wire compatible. You can easily make a case to request that such a limit be lifted.

So does this mean that the API for TFile could be changed without complication to accept/return {{long}} values? I read the TFile spec, and it points out the 2 GB value limit in several different places; from that it sounds as though other aspects of TFile may break if the assumed integer size changes.

bq. Even if you do not know the length of the record you write (namely, specifying -1 during writing), you can still efficiently skip a record (even after partially consuming some bytes of it). Isn't that sufficient for your case? Searching for a synchronization boundary is much less efficient than length-prefixed encoding.

Data comes to me from JDBC through an InputStream or a Reader whose length I do not know in advance. I read from that InputStream/Reader and write its contents into an OutputStream/Writer that dumps into a file (LobFile). In the case where I have a character-based Reader, I know how many characters I have, which is a lower bound on the number of bytes but not an exact count. So my plan was to seek ahead by that much and then search for the boundary. Assuming most characters are one byte, the search will be pretty quick.
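
To make that plan concrete, here is a minimal sketch of the skip logic (hypothetical; the sync-marker layout and the names are illustrative, not taken from the attached spec): seek forward by the known lower bound, then scan for the record-boundary marker.

{code}
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;

// Hypothetical helper: skip at least minLength bytes (the character-count
// lower bound), then scan for a sync marker delimiting records. Assumes the
// marker is made of random bytes with no self-overlapping prefix.
public class BoundaryScanner {
  public static long skipToNextRecord(FSDataInputStream in, long recordStart,
      long minLength, byte[] syncMarker) throws IOException {
    in.seek(recordStart + minLength);                  // skip the guaranteed prefix
    int matched = 0;
    int b;
    while ((b = in.read()) != -1) {
      if ((byte) b == syncMarker[matched]) {
        matched++;                                     // extend the partial match
        if (matched == syncMarker.length) {
          return in.getPos();                          // positioned just past the marker
        }
      } else {
        matched = ((byte) b == syncMarker[0]) ? 1 : 0; // restart the match
      }
    }
    return -1;                                         // end of file: no further record
  }
}
{code}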

How does TFile support length skipping if you don't pre-declare the lengths?



[jira] Commented: (HADOOP-6708) New file format for very large records

Posted by "Hong Tang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857620#action_12857620 ] 

Hong Tang commented on HADOOP-6708:
-----------------------------------

bq. So does this mean that the API for TFile could be changed without complication to accept/return long values?

Yes, if by "complication" you mean wire compatibility. You may also need to remove the checks for length violations in various places. Note that there is still a key-length restriction of 64 KB.

bq. How does TFile support length skipping if you don't pre-declare the lengths?

It uses chunk encoding. The whole value stream is encoded as a chain of chunks. We use an internal buffer to accumulate small writes and, once it is full, flush the buffer out as one chunk. Each chunk is length-prefixed; all but the terminal chunk have their lengths written out as negative integers. The chunk size is controlled by the parameter "tfile.io.chunk.size" (default 1 MB). When skipping, it skips chunk by chunk.
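
As a sketch of the skip side of that scheme (following the description here, not TFile's actual implementation; the class name is illustrative):

{code}
import java.io.DataInput;
import java.io.IOException;
import org.apache.hadoop.io.WritableUtils;

// Skip one chunk-encoded value: every chunk is prefixed with a VInt length,
// negative for every chunk except the terminal one.
public class ChunkSkipper {
  public static void skipValue(DataInput in) throws IOException {
    while (true) {
      int prefix = WritableUtils.readVInt(in); // length prefix of this chunk
      int remaining = Math.abs(prefix);
      while (remaining > 0) {                  // skipBytes may skip fewer bytes than asked
        int skipped = in.skipBytes(remaining);
        if (skipped <= 0) {
          throw new IOException("truncated chunk");
        }
        remaining -= skipped;
      }
      if (prefix >= 0) {
        return;                                // non-negative prefix marks the terminal chunk
      }
    }
  }
}
{code}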



[jira] Commented: (HADOOP-6708) New file format for very large records

Posted by "Hong Tang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857588#action_12857588 ] 

Hong Tang commented on HADOOP-6708:
-----------------------------------

bq. Shortcomings of TFile: 

bq. • Length fields are encoded as integers, not longs. This does not support records > 2 GB. 

This is an intentional restriction. All integers are in VInt/VLong format, which is fully wire compatible. You can easily make a case to request that such a limit be lifted.

bq. • The expected length of the value must be precisely-known or written as -1. I have records where I have a minimum bound on their size (more precisely: Ahead of time, I know their length in characters but not in bytes), but not necessarily a more precise byte count value. A goal of this format is the ability to efficiently partially-read a character stream-oriented record, then skip quickly to the next record. The format I propose will allow you to specify that a record is at least a certain length. That way you can efficiently skip most of a record and then search for a synchronization boundary after that. (Since I expect most characters to be encoded in a single byte, but not necessarily all of them.) 

Even if you do not know the length of the record you write (namely, specifying -1 during writing), you can still efficiently skip a record (even after partially consuming some bytes of it). Isn't that sufficient for your case? Searching for a synchronization boundary is much less efficient than length-prefixed encoding.



[jira] Commented: (HADOOP-6708) New file format for very large records

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857578#action_12857578 ] 

Aaron Kimball commented on HADOOP-6708:
---------------------------------------

In working on Sqoop, I need to import records which may each be several gigabytes in size. I require a file format that allows me to store these records in an efficient, grouped fashion.

Users may then want to open a file containing many such records and partially read individual records, while still accessing subsequent records efficiently.

I'm attaching to this issue a proposal for a _LobFile_ format that will store these large objects. (This work arises from the import of BLOB- and CLOB-typed columns.) The proposal analyzes the available file formats and explains my understanding of why they aren't appropriate here.


[jira] Commented: (HADOOP-6708) New file format for very large records

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859474#action_12859474 ] 

Doug Cutting commented on HADOOP-6708:
--------------------------------------

> Shortcomings of Avro File Format:
> • Data is expected to have a schema

But that schema can be just "bytes".

> • No lazy reading API (yet?)

True.  Would this be hard to add?

Also, is it important to add this to Common, or should it rather belong in Sqoop?



[jira] Commented: (HADOOP-6708) New file format for very large records

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857937#action_12857937 ] 

Aaron Kimball commented on HADOOP-6708:
---------------------------------------

Hong,

This sounds like TFile could be adapted to my needs; thanks for explaining all of that so thoroughly. Is this block offset/length index already maintained? It sounds like the skip-to-the-next-block optimization is not yet implemented. Do you know what the next steps are to make that happen?

The other thing that needs to happen is an API to support long-valued lengths. Should I submit a patch that modifies the existing method signatures, or provide additional methods (e.g., as in http://commons.apache.org/io/api-1.4/org/apache/commons/io/input/CountingInputStream.html)?


[jira] Commented: (HADOOP-6708) New file format for very large records

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857635#action_12857635 ] 

Aaron Kimball commented on HADOOP-6708:
---------------------------------------

I'm not sure what you mean by this optimization. Can you please explain further?

What's the relationship between "blocks" and "chunks" in a TFile? It sounds like a record can span multiple chunks. Is a record fully contained in a block? If an 8 GB record compresses down to, say, 2 GB, will skipping it still require moving chunk-wise through the compressed data?

I do plan on using compression. Given the very large record lengths I'm designing for, I expect it's acceptable to compress each record individually. The current writeup doesn't propose how to handle compression elegantly, but I'm leaning toward writing out a table of compressed record lengths at the end of the file.
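
As a rough sketch of that idea (hypothetical layout and names, not the attached proposal): compress each record with its own codec stream and append an index of record start offsets (from which compressed lengths follow) at the end of the file.

{code}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.DeflaterOutputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.io.WritableUtils;

// Hypothetical writer: one compression stream per record, plus a trailing
// index of record offsets so a reader can seek to any record directly.
public class PerRecordCompressedWriter {
  private final FSDataOutputStream out;
  private final List<Long> recordOffsets = new ArrayList<Long>();

  public PerRecordCompressedWriter(FSDataOutputStream out) {
    this.out = out;
  }

  public void writeRecord(byte[] record) throws IOException {
    recordOffsets.add(out.getPos());          // remember where this record starts
    DeflaterOutputStream codec = new DeflaterOutputStream(out);
    codec.write(record);
    codec.finish();                           // flush the codec without closing out
  }

  public void close() throws IOException {
    long indexStart = out.getPos();
    WritableUtils.writeVInt(out, recordOffsets.size());
    for (long offset : recordOffsets) {
      WritableUtils.writeVLong(out, offset);  // consecutive offsets give compressed lengths
    }
    out.writeLong(indexStart);                // fixed-size trailer: where the index begins
    out.close();
  }
}
{code}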


[jira] Commented: (HADOOP-6708) New file format for very large records

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861161#action_12861161 ] 

Aaron Kimball commented on HADOOP-6708:
---------------------------------------

Possibly. But then we'd have to wait for the next Avro release as well, and then ensure that that Avro version works with the available Hadoop releases, which would introduce further dependency-management complications. The point of doing this large-object format work here is broader compatibility; until Avro compatibility in Hadoop improves more generally (e.g., with the resolution of MAPREDUCE-815), depending on Avro doesn't seem to serve that goal directly.



[jira] Commented: (HADOOP-6708) New file format for very large records

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860026#action_12860026 ] 

Aaron Kimball commented on HADOOP-6708:
---------------------------------------

| But that schema can be just "bytes".

Of course, and Sqoop would use such a file in this manner. But in building this feature into Avro's file format, would it be possible to include a {{getRecordAsByteStream()}} / {{getRecordAsCharStream()}} that makes sense in the context of a file format where many underlying schemata don't necessarily have a meaningful byte-wise form?

| True. Would this be hard to add?

:smile: You'd be in a better position than I to comment on that.

As for the Common/Sqoop question: I have written prototype code that provides this file format in Sqoop itself, but I haven't pushed it out yet. If it's infeasible to add this to Hadoop Common, then I'll continue to polish that prototype and just include it directly in Sqoop. However, in discussion with other engineers it has come up that such a very-large-record format may have broader applications than just Sqoop. Furthermore, people will want InputFormats, etc., that operate over these records. Folks could link against Sqoop's jar to get these file formats and InputFormat classes, but that's One More Dependency they may not want to manage. Given that these records are just byte or character streams, it doesn't seem necessary to restrict the format to Sqoop. Also, expanding the scope of an existing format to encapsulate these records could lower maintenance costs over time for clients who store data in that format.




[jira] Commented: (HADOOP-6708) New file format for very large records

Posted by "Aaron Kimball (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857580#action_12857580 ] 

Aaron Kimball commented on HADOOP-6708:
---------------------------------------

Please review the attached proposal. It is intended as a basis for discussion; hopefully we can arrive at a conclusion on how best to implement this. Some still-open points are listed at the end of the document.



[jira] Commented: (HADOOP-6708) New file format for very large records

Posted by "Hong Tang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857639#action_12857639 ] 

Hong Tang commented on HADOOP-6708:
-----------------------------------

bq. What's the relationship between "blocks" and "chunks" in a TFile?
A TFile contains zero or more compressed blocks. Each block contains a sequence of key/value pairs, and each value can contain one or more chunks. A block has a minimum size of 256 KB; whenever we accumulate enough data to exceed the minimum block size, we "close" the current block and start a new one. All blocks have their offsets and lengths recorded in an index section.

bq. Is a record fully contained in a block?
Yes.

bq.  If it compresses an 8 GB record down to, say, 2 GB, will that still require skipping chunk-wise through the compressed data?
No, because it would be the last record in that block. With my suggested optimization, it would be an O(1) operation to skip that record.

bq. Also how does TFile handle splits and resynchronizing? It doesn't seem like there's an InputFormat for it. 
Writing an input format for it is pretty easy; I believe Owen has a prototype of OFile on top of TFile on his laptop. :) Generally, you would extend FileInputFormat, and your record reader would be backed by a TFile.Reader.Scanner created by TFile.Reader.createScannerByByteRange(long offset, long length). Internally, this method moves the byte range to the boundaries of TFile compression blocks (through the block index the reader maintains).
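
As a rough sketch of that recipe (only createScannerByByteRange is spelled out above; the Reader constructor and the atEnd()/advance() iteration are my assumptions about the TFile API and should be checked against the class itself):

{code}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.file.tfile.TFile;

// Sketch of backing a record reader with a TFile scanner over one split's
// byte range; a real InputFormat would wrap this logic in a RecordReader.
public class TFileSplitScan {
  public static long countRecords(Configuration conf, Path file,
      long splitStart, long splitLength) throws IOException {
    FileSystem fs = file.getFileSystem(conf);
    long fileLength = fs.getFileStatus(file).getLen();
    FSDataInputStream in = fs.open(file);
    TFile.Reader reader = new TFile.Reader(in, fileLength, conf);
    // The reader aligns the requested range to compression-block boundaries
    // via its block index, so adjacent splits do not double-read records.
    TFile.Reader.Scanner scanner =
        reader.createScannerByByteRange(splitStart, splitLength);
    long records = 0;
    while (!scanner.atEnd()) {
      records++;             // a real reader would consume the key/value streams here
      scanner.advance();
    }
    scanner.close();
    reader.close();
    in.close();
    return records;
  }
}
{code}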


[jira] Commented: (HADOOP-6708) New file format for very large records

Posted by "Tom White (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858017#action_12858017 ] 

Tom White commented on HADOOP-6708:
-----------------------------------

FWIW I've marked the TFile interfaces as "Evolving" in HADOOP-6668, which would be consistent with making an incompatible change for the next release.


[jira] Commented: (HADOOP-6708) New file format for very large records

Posted by "Hong Tang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12858011#action_12858011 ] 

Hong Tang commented on HADOOP-6708:
-----------------------------------

bq. Is this block offset/length index already maintained?
Yes.

bq. It sounds like the skip-to-the-next-block optimization is not already implemented. Do you know what are the next steps required to make that happen?
Correct. There is no immediate plan for me to work on this. The implementation could be quite simple, but I'll need a bit more time to refresh my memory of the code structure. Also, would you be able to help test this?

bq. The other thing that needs to happen is an API to support long-valued lengths. Should I submit a patch that modifies the existing method signatures? 
I think modifying the existing method signatures is fine. (It's not clear to me why we would need CountingInputStream in TFile; users should be able to create one on top of the input stream we provide, right?)


[jira] Commented: (HADOOP-6708) New file format for very large records

Posted by "Hong Tang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857627#action_12857627 ] 

Hong Tang commented on HADOOP-6708:
-----------------------------------

bq. I don't think that works in this scenario. Suppose I have a record that is 8 GB long; I read the first kilobyte or two out of the record, then intend to discard the rest and start with the next record.

Your analysis is almost right. However, there is one optimization that could be done to support this: TFile does block compression, and an 8 GB record is likely to exceed the TFile block size even after compression (unless it is something like all zeros), so it would be the last record in the block. We can speed up the skipping of the last record in a block by positioning the cursor at the beginning of the next block, without any chunk decoding. On the other hand, if your 8 GB record actually compresses down to within one TFile block (256 KB by default), then skipping it is really just a sequential read of 256 KB from HDFS.

From your description, it seems that you do not plan to use compression, which sounds a bit surprising to me...
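
A sketch of the optimization described above (hypothetical reader-side logic, not TFile code; blockOffsets stands in for the per-block offset/length index):

{code}
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;

// Hypothetical skip logic: if the record being skipped is the last one in
// its compressed block, jump straight to the next block's start offset.
public class RecordSkipper {
  public static void skipRecord(FSDataInputStream in, long[] blockOffsets,
      int currentBlock, boolean isLastRecordInBlock) throws IOException {
    if (isLastRecordInBlock && currentBlock + 1 < blockOffsets.length) {
      in.seek(blockOffsets[currentBlock + 1]); // O(1): no chunk decoding at all
    } else {
      // Fall back to chunk-by-chunk skipping within the block, as in the
      // ChunkSkipper sketch earlier in this thread (FSDataInputStream is a DataInput).
      ChunkSkipper.skipValue(in);
    }
  }
}
{code}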




[jira] Commented: (HADOOP-6708) New file format for very large records

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-6708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860056#action_12860056 ] 

Doug Cutting commented on HADOOP-6708:
--------------------------------------

> would it be possible to include a getRecordAsByteStream() / getRecordAsCharStream()

What might be simpler is to implement a DatumReader<InputStream> that only works when the schema is "bytes" and whose #read() implementation returns a clone of BinaryDecoder#getInputStream().  You'd want to use a "direct" binary decoder and pass it an input stream implementation that you know how to clone.  Then you could use the vanilla DataFileReader, seek to entries and read them.
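
A very rough sketch of that suggestion (the DatumReader/Decoder shapes are Avro's; the getInputStream() call and the stream-cloning idea come from the comment above and would need to be checked against the Avro release in use):

{code}
import java.io.IOException;
import java.io.InputStream;
import org.apache.avro.Schema;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.Decoder;

// Sketch of a DatumReader<InputStream> that only accepts the "bytes" schema
// and hands back the decoder's underlying stream instead of materializing
// the datum; the exact BinaryDecoder accessor may differ by Avro version.
public class ByteStreamDatumReader implements DatumReader<InputStream> {
  public void setSchema(Schema schema) {
    if (schema.getType() != Schema.Type.BYTES) {
      throw new IllegalArgumentException("only the \"bytes\" schema is supported");
    }
  }

  public InputStream read(InputStream reuse, Decoder in) throws IOException {
    // Assumes a "direct" binary decoder wrapped around a cloneable stream, so
    // the caller can keep reading the record lazily after this returns.
    return ((BinaryDecoder) in).getInputStream();
  }
}
{code}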


