Posted to common-user@hadoop.apache.org by Mohit Anchlia <mo...@gmail.com> on 2012/04/24 00:27:12 UTC

Design question

I just wanted to ask how people design their storage directories for
data that is sent to the system continuously. For example: for a given
functionality we get a data feed continuously written to a sequence file, which is
then converted to a more structured format using MapReduce and stored in tab-separated
files. For such a continuous feed, what's the best way to organize
directories and their names? Should it be based only on timestamp, or is there
something better that helps in organizing the data?
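A common convention for continuously arriving feeds is a date-partitioned directory layout, so that downstream jobs can select a time range just by globbing paths. Here is a minimal sketch in Python; the base directory, feed name, and hour-level granularity are all illustrative assumptions, not anything prescribed by Hadoop:

```python
from datetime import datetime, timezone

def partitioned_path(base, feed_name, ts):
    """Build a date-partitioned HDFS-style path for one feed batch.

    `base` and `feed_name` are hypothetical; the layout
    base/feed/yyyy/MM/dd/HH mirrors a common convention for
    continuously arriving data, so a job can read one day with
    a glob like /data/raw/clickstream/2012/04/23/*.
    """
    return "{base}/{feed}/{y:04d}/{m:02d}/{d:02d}/{h:02d}".format(
        base=base.rstrip("/"), feed=feed_name,
        y=ts.year, m=ts.month, d=ts.day, h=ts.hour)

# Example: a batch that arrived 2012-04-23 15:27 UTC
ts = datetime(2012, 4, 23, 15, 27, tzinfo=timezone.utc)
print(partitioned_path("/data/raw", "clickstream", ts))
```

Partitioning on event time (rather than arrival time) also makes reprocessing a given window straightforward, since late data lands in the partition it belongs to.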

The second part of my question: is it better to store output in sequence files so
that we can take advantage of per-record compression? This seems to be
required, since gzip/snappy compression of the entire file would launch only one
map task.

And the last question: when compressing a flat file, should it first be
split into multiple files, so that we get multiple mappers if we need to run
another job on it? LZO is another alternative, but it requires
additional configuration; is it preferred?

Any articles or suggestions would be very helpful.

Re: Design question

Posted by Mohit Anchlia <mo...@gmail.com>.
Any suggestions or pointers would be helpful. Are there any best practices?

On Mon, Apr 23, 2012 at 3:27 PM, Mohit Anchlia <mo...@gmail.com> wrote:

> [original message quoted above]