Posted to hdfs-user@hadoop.apache.org by maninder batth <ba...@gmail.com> on 2012/04/20 16:49:15 UTC

Clarification on T file

My requirements are to save variable-sized binary records and to be able to
query them later on. So I was looking at TFile and had some doubts.

1. Is the datablock in a TFile fixed size or variable size? If it is
fixed, what happens when a record cannot fit in the datablock? Would you
normally fill the empty space with zeros or spread the record over 2
datablocks?

2. Is there any downside to having variable-sized datablocks?

3. Are the records synced to the file at the boundary of a datablock, or are
they just written to the file system? The question is like the write() call
in Linux vs fsync().

Thank you,
-- Maninder Batth
No trees were killed in the creation of this message. However, many
electrons were terribly inconvenienced.

Re: Clarification on T file

Posted by Harsh J <ha...@cloudera.com>.
Hey Maninder,

In some ways the TFile is close to SequenceFiles.

On Fri, Apr 20, 2012 at 8:19 PM, maninder batth
<ba...@gmail.com> wrote:
> My requirements are to save variable-sized binary records and to be able to
> query them later on. So I was looking at TFile and had some doubts.
>
> 1. Is the datablock in a TFile fixed size or variable size? If it is
> fixed, what happens when a record cannot fit in the datablock? Would you
> normally fill the empty space with zeros or spread the record over 2
> datablocks?
>
> 2. Is there any downside to having variable-sized datablocks?

A data block is completed only when, at the end of an append, the
current size of the block is >= min-size-of-block.

Hence the data block isn't "fixed" in size. A record is never split to
fit: if there's still space, another record is written whole and then
the condition is checked again (which would then trigger a block
completion).
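
To make the rule above concrete, here is a minimal sketch that simulates
it. It is not the TFile implementation and does not use the Hadoop API:
the class BlockRuleSketch and the method blockSizes are hypothetical
names, record sizes are plain byte counts (key/value framing and
compression are ignored), and flushing a trailing partial block on close
is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Hadoop or the TFile API.
class BlockRuleSketch {

    // Returns the size of each logical block produced when records of the
    // given sizes are appended in order, closing a block only once its
    // size has reached minBlockSize.
    static List<Integer> blockSizes(int[] recordSizes, int minBlockSize) {
        List<Integer> blocks = new ArrayList<>();
        int current = 0;
        for (int size : recordSizes) {
            current += size;               // a record is always appended whole
            if (current >= minBlockSize) { // checked only after the append
                blocks.add(current);       // block completes, possibly oversized
                current = 0;
            }
        }
        if (current > 0) {
            blocks.add(current);           // assumption: leftover data is flushed on close
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Four records with a minimum block size of 100 bytes.
        System.out.println(blockSizes(new int[] {40, 40, 40, 10}, 100));
    }
}
```

With these inputs the first block ends up at 120 bytes, above the
minimum, because the third record is appended before the size check
runs. No padding with zeros and no splitting across blocks is needed
under this rule.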

> 3. Are the records synced to the file at the boundary of a datablock, or are
> they just written to the file system? The question is like the write() call
> in Linux vs fsync().

I'm unsure what you mean by a "datablock" here. TFiles don't work at
the FS level, and the "datablocks" in them are logical. Could you
clarify this question given (1) and (2)?

-- 
Harsh J