Posted to user@hadoop.apache.org by Aaron Turner <sy...@gmail.com> on 2016/08/15 17:00:35 UTC
Hadoop archives (.har) are really really slow
Basically I want to list all the files in a .har file and compare the
file list/sizes to an existing directory in HDFS. The problem is that
running commands like: hdfs dfs -ls -R <path to har file> is orders of
magnitude slower than running the same command against a live HDFS
file system.
How much slower? I've calculated it will take ~19 days to list all
the files in 250TB worth of content spread between 2 .har files.
Is this normal? Can I do this faster (e.g. by writing a MapReduce job)?
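The comparison described above can be sketched in Python once both listings have been dumped to text, e.g. `hdfs dfs -ls -R /data > live.txt` and `hdfs dfs -ls -R har:///archive.har > archive.txt` (paths here are placeholders). The column layout assumed below matches typical `hdfs dfs -ls` output (permissions, replication, owner, group, size, date, time, path); verify against your Hadoop version:

```python
def parse_listing(lines, strip_prefix):
    """Return {relative_path: size} for file entries, skipping directories."""
    entries = {}
    for line in lines:
        fields = line.split(None, 7)
        # Skip malformed lines and directory entries (permissions start with 'd').
        if len(fields) != 8 or fields[0].startswith("d"):
            continue
        size, path = int(fields[4]), fields[7]
        # Strip the scheme/root prefix so the two listings are comparable.
        if path.startswith(strip_prefix):
            path = path[len(strip_prefix):]
        entries[path] = size
    return entries

def diff_listings(live, archived):
    """Report files missing from the archive or whose sizes differ."""
    missing = sorted(set(live) - set(archived))
    mismatched = sorted(p for p in live
                        if p in archived and live[p] != archived[p])
    return missing, mismatched
```

This only avoids the slow har:// listing if the archive listing was captured once up front; it does not speed up the listing itself.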
--
Aaron Turner
https://synfin.net/ Twitter: @synfinatic
Those who would give up essential Liberty, to purchase a little temporary
Safety, deserve neither Liberty nor Safety.
-- Benjamin Franklin
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@hadoop.apache.org
For additional commands, e-mail: user-help@hadoop.apache.org
Re: Hadoop archives (.har) are really really slow
Posted by Aaron Turner <sy...@gmail.com>.
Oh, I should mention that creating the archive took only a few hours, but copying the files out of the archive back to HDFS ran at only ~80MB/min. At that rate it would take years to copy everything back, which seems really surprising.
-Aaron
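The "years" figure checks out as a back-of-the-envelope calculation, assuming decimal units (1 TB = 1e6 MB) and a sustained 80MB/min:

```python
# Rough extraction-time estimate for the numbers quoted above.
total_mb = 250 * 1_000_000       # 250 TB of archived content
rate_mb_per_min = 80             # observed copy-out rate
minutes = total_mb / rate_mb_per_min
years = minutes / (60 * 24 * 365)
print(round(years, 1))           # roughly 6 years
```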
> On Aug 15, 2016, at 12:33 PM, Tsz Wo Sze <sz...@yahoo.com> wrote:
>
> ls over files in har:// may be ~10 times slower than ls over regular files. It does not sound normal unless it would take ~1 day to list all the 250TB of files when they are stored as regular files.
> Tsz-Wo
Re: Hadoop archives (.har) are really really slow
Posted by Aaron Turner <sy...@gmail.com>.
I can list all the files out of HDFS in a few hours, not a day. Listing the files in a single directory in the har takes ~50 min. Honestly I'd be happy with only a 10x performance hit. I'm seeing closer to 100-150x.
-Aaron