Posted to common-dev@hadoop.apache.org by "Hong Tang (JIRA)" <ji...@apache.org> on 2009/01/29 03:17:00 UTC

[jira] Issue Comment Edited: (HADOOP-3315) New binary file format

    [ https://issues.apache.org/jira/browse/HADOOP-3315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668286#action_12668286 ] 

hong.tang edited comment on HADOOP-3315 at 1/28/09 6:15 PM:
------------------------------------------------------------

bq.  Should probably be explicit about encoding in the below:
    public ByteArray(String str) {
        this(str.getBytes());
    }

Good catch. This looks like leftover code that was added to facilitate testing and never properly cleaned up. I should remove that constructor.
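For reference, the concern is that String.getBytes() with no argument uses the platform default charset, so the same string can produce different bytes on different machines. A minimal sketch of the explicit alternative follows; the class EncodedByteArray and its factory method are hypothetical stand-ins, not code from the patch, and the choice of UTF-8 is an assumption.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the ByteArray class under review; not the patch's code.
final class EncodedByteArray {
    private final byte[] buffer;

    EncodedByteArray(byte[] buffer) {
        this.buffer = buffer;
    }

    // Encode with an explicit charset so the resulting bytes do not depend on
    // the platform default, which is what String.getBytes() would use.
    static EncodedByteArray fromUtf8(String str) {
        return new EncodedByteArray(str.getBytes(StandardCharsets.UTF_8));
    }

    byte[] buffer() {
        return buffer;
    }
}
```

Naming the charset in the factory method makes the encoding visible at every call site, which is the property the reviewer asked for.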

bq. Would be nice if we could easily pass an alternate implementation of BCFile, say one that cached blocks.

Possible, but I think it is too early to make the BCFile API public. Can you also elaborate on the example you mentioned? Why do you need block caching instead of key-value caching (given that TFile is based on <key, value> pairs)?

bq. Do you want to fix the below:

  // TODO: remember the longest key in a TFile, and use it to replace
  // MAX_KEY_SIZE.
  keyBuffer = new byte[MAX_KEY_SIZE];

bq. Default buffers of 64k for keys are a bit on the extravagant side.

Yes, it is an easy fix. I intend to do it later, once we have gathered more information about which statistics we should collect at file creation time, so that they can all go into one meta block. I don't see it as an urgent issue, though, except for applications that open hundreds of files (or scanners) simultaneously.
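A rough sketch of what the TODO could amount to, assuming the writer records the longest key it sees and the reader gets that number back from a meta block. The class KeyLengthTracker and both of its methods are hypothetical names for illustration; how the statistic would actually be stored is not decided here.

```java
// Hypothetical sketch; none of these names come from the TFile patch.
final class KeyLengthTracker {
    private int longestKey = 0;

    // Writer side: called once for each key appended to the file.
    void observe(int keyLength) {
        if (keyLength > longestKey) {
            longestKey = keyLength;
        }
    }

    // The statistic that would be persisted at file creation time.
    int longestKey() {
        return longestKey;
    }

    // Reader side: size the key buffer from the recorded statistic
    // instead of a fixed MAX_KEY_SIZE.
    static byte[] allocateKeyBuffer(int recordedLongestKey) {
        return new byte[recordedLongestKey];
    }
}
```

With hundreds of scanners open, a file whose longest key is a few dozen bytes would then cost a few dozen bytes per key buffer rather than 64k each.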

bq. Below should be public so users don't have to define their own: 
  
 protected final static String JCLASS = "jclass:";

Certainly. The same is possibly true for the symbolic names of the various compression algorithms.

bq. API seems to have changed since last patch. There is no longer a #find method. What's the suggested way of accessing a random single key/value? (Open scanner using what would you suggest for start and end? Then seekTo? But I find I'm making double ByteArray instances of same byte array. Should there be a seekTo that takes a RawComparable that is public?).

Yes, the API has changed so that we do not have to scan through a compressed block twice (first to get a location object, then to use it with the scanner). I'd suggest doing random access as follows:

    Scanner scanner = reader.createScanner();
    ...
    if (scanner.seekTo(bytes, offset, length)) {
        Entry entry = scanner.entry();
        // access value through either entry.getValue or entry.writeValue 
    }
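
To make the seek-then-read pattern concrete, here is a self-contained sketch over an in-memory sorted key array. SortedKeyScanner is a hypothetical stand-in, not the TFile scanner; it assumes keys are ordered by unsigned byte comparison and that seekTo returns true only on an exact match, leaving the cursor at the first key greater than or equal to the probe.

```java
// Hypothetical in-memory stand-in for a scanner over sorted keys. It only
// illustrates the seek-then-read pattern and is not the TFile implementation.
final class SortedKeyScanner {
    private final byte[][] keys;    // sorted ascending by unsigned byte order
    private final byte[][] values;  // values[i] belongs to keys[i]
    private int cursor = 0;

    SortedKeyScanner(byte[][] sortedKeys, byte[][] values) {
        this.keys = sortedKeys;
        this.values = values;
    }

    // Lexicographic comparison of a with b[off, off + len), bytes treated as unsigned.
    private static int compare(byte[] a, byte[] b, int off, int len) {
        int n = Math.min(a.length, len);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[off + i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - len;
    }

    // Binary search for the first key >= the probe; true only on an exact match.
    boolean seekTo(byte[] buf, int offset, int length) {
        int lo = 0;
        int hi = keys.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (compare(keys[mid], buf, offset, length) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        cursor = lo;
        return lo < keys.length && compare(keys[lo], buf, offset, length) == 0;
    }

    // Value at the cursor; only meaningful after a seekTo that returned true.
    byte[] value() {
        return values[cursor];
    }
}
```

Because the probe is passed as (buffer, offset, length), the caller can seek on a slice of an existing array without wrapping it in a second object, which is the double-instance cost raised in the question.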


> New binary file format
> ----------------------
>
>                 Key: HADOOP-3315
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3315
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: io
>            Reporter: Owen O'Malley
>            Assignee: Amir Youssefi
>             Fix For: 0.21.0
>
>         Attachments: HADOOP-3315_20080908_TFILE_PREVIEW_WITH_LZO_TESTS.patch, HADOOP-3315_20080915_TFILE.patch, hadoop-trunk-tfile.patch, hadoop-trunk-tfile.patch, TFile Specification 20081217.pdf
>
>
> SequenceFile's block compression format is too complex and requires 4 codecs to compress or decompress. It would be good to have a file format that only needs 
