Posted to mapreduce-user@hadoop.apache.org by Sean McNamara <Se...@Webtrends.com> on 2013/02/08 22:52:35 UTC

Splitting logs in hdfs by account

We have a use case that requires us to have the ability to:

  *   delete all of a customer's data as it sits in hdfs at a moment's notice
  *   Re-mapreduce all of a particular account's data, going way back in time

This is how we're thinking of storing the logs in hdfs:

/hdfs-path-to-data/accnt-1/YYYY-MM-DD.log
/hdfs-path-to-data/accnt-2/YYYY-MM-DD.log
...


I imagine we would need to tune the hdfs block size depending on the size of the logs, and the goal would be to have 1 log file per account, per day (so we don't have a zillion files burdening the namenode).
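
For example, something like this in the job driver (just a sketch; "dfs.block.size" is the Hadoop 1.x property name, 2.x renamed it "dfs.blocksize"):

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// Bigger blocks for large daily logs; applies to files this job writes.
conf.setLong("dfs.block.size", 256L * 1024 * 1024); // 256 MB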


We currently have large bz2 files with all account data mingled together flowing into hdfs.  So I'm thinking the best approach would be a daily MR job that uses MultipleOutputs and creates block-compressed SequenceFiles split by account.  Can MultipleOutputs specify a different output directory for each output file, so that the output files don't have to be copied into the proper account directory after the job completes?
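
Here's roughly what I'm picturing for the reduce side, assuming the new-API MultipleOutputs lets a baseOutputPath contain "/" to write into subdirectories of the job output dir (the class name and the "log.date" config key are made up):

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Keyed by account id, so each account's records funnel into one reducer
// and therefore one output file per day.
public class SplitByAccountReducer
        extends Reducer<Text, Text, NullWritable, Text> {

    private MultipleOutputs<NullWritable, Text> mos;
    private String date;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<NullWritable, Text>(context);
        date = context.getConfiguration().get("log.date"); // e.g. "2013-02-08"
    }

    @Override
    protected void reduce(Text accountId, Iterable<Text> records, Context context)
            throws IOException, InterruptedException {
        for (Text record : records) {
            // A "/" in baseOutputPath creates a subdirectory under the job
            // output dir, e.g. <output>/accnt-1/2013-02-08-r-00000.
            mos.write(NullWritable.get(), record,
                      "accnt-" + accountId + "/" + date);
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        mos.close(); // flush all the per-account writers
    }
}

In the driver I'd set the output format to SequenceFileOutputFormat, turn on block compression with SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK), and wrap it in LazyOutputFormat so we don't get empty default part files.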

Is this approach sound?  I thought it would be wise to solicit some feedback on here before starting to go down a path.

Thanks!

Sean


Re: Splitting logs in hdfs by account

Posted by Tom Brown <to...@gmail.com>.
It seems that moving the files around (if they're all output to a single
directory by MultipleOutputs) should be a lightweight operation for hdfs.

My thinking is that a physical copy of the data would not be required.
Rather, the namenode should be able to perform the move by adjusting its
directory metadata (i.e. the actual blocks stay where they are).
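
E.g. something like this after the job finishes (paths made up); rename()
should be a pure namenode metadata update:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

FileSystem fs = FileSystem.get(new Configuration());
Path src = new Path("/job-output/accnt-1/2013-02-08-r-00000");
Path dst = new Path("/hdfs-path-to-data/accnt-1/2013-02-08.log");
fs.mkdirs(dst.getParent());  // rename fails if the target dir doesn't exist
if (!fs.rename(src, dst)) {  // metadata-only: the blocks don't move
    throw new IOException("rename failed: " + src + " -> " + dst);
}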

If this is not how HDFS works, please correct me!

--Tom

On Friday, February 8, 2013, Sean McNamara wrote:

>  We have a use case that requires us to have the ability to:
>
>    - delete all of a customer's data as it sits in hdfs at a moment's
>    notice
>    - Re-mapreduce all of a particular account's data, going way back in
>    time
>
>
>  This is how we're thinking of storing the logs in hdfs:
>
>  /hdfs-path-to-data/accnt-1/YYYY-MM-DD.log
> /hdfs-path-to-data/accnt-2/YYYY-MM-DD.log
> ...
>
>
>  I imagine we would need to tune the hdfs block size depending on the
> size of the logs, and the goal would be to have 1 log file per account, per
> day (so we don't have a zillion files burdening the namenode).
>
>
>  We currently have large bz2 files with all account data mingled together
> flowing into hdfs.  So I'm thinking the best approach would be a daily MR
> job that uses MultipleOutputs and creates block-compressed SequenceFiles
> split by account.  Can MultipleOutputs specify a different output directory
> for each output file, so that the output files don't have to be copied into
> the proper account directory after the job completes?
>
>  Is this approach sound?  I thought it would be wise to solicit some
> feedback on here before starting to go down a path.
>
>  Thanks!
>
>  Sean
