Posted to dev@tez.apache.org by Dharmesh Kakadia <dh...@gmail.com> on 2016/11/01 23:40:35 UTC

Hive+Tez staging dir and scratch dir

Hi,

I am trying to understand the meaning of, and the relation between, the following
configurations when running Hive on Tez. My default FS is the Azure store, and I am
trying to figure out where local disk is utilized, because I am running into the
disk filling up during a large ORC table conversion.

hive.exec.stagingdir
tez.staging-dir
hive.exec.scratchdir
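
(For reference, the effective values can be checked like this; a sketch, where the /etc/tez/conf path is an assumption about the install layout:)

# Print the values Hive is actually using:
hive -e "SET hive.exec.stagingdir; SET hive.exec.scratchdir;"
# tez.staging-dir typically lives in tez-site.xml:
grep -A1 'tez.staging-dir' /etc/tez/conf/tez-site.xml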

Any help?

Thanks,
Dharmesh

Re: Hive+Tez staging dir and scratch dir

Posted by Gopal Vijayaraghavan <go...@apache.org>.

> Thanks Gopal. Does ORC conversion have to see the entire data before it can
> write output? My source table is fairly large (~70 TB), which I am trying to
> convert to ORC.

Only for partitioned/bucketed tables.

In the case of partitioned tables, you do not want each task (and there are ~70,000 tasks here) opening new files in each partition.

The data load is shuffled to bring partitions together, to avoid ending up with a few million files per TB of data.

There are always faster ways to do this, if you control the inputs (insert 1 day at a time or 1 week) or can make assumptions about them (each file contains 1 day).
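
For illustration, a minimal sketch of the "insert 1 day at a time" approach, assuming a hypothetical text-format source table src_text and an ORC destination table dst_orc, both partitioned by a date column ds (all of these names are made up for the example):

# Load one day per run, so each run shuffles and writes only a single
# partition's worth of data instead of all 70 TB at once.
for day in 2016-11-01 2016-11-02; do
  hive -e "INSERT OVERWRITE TABLE dst_orc PARTITION (ds='${day}')
           SELECT col1, col2 FROM src_text WHERE ds='${day}';"
done

Each run then writes files under a single partition directory, instead of every task holding open files for every partition.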

But until you can confirm which of the YARN directories are big, I wouldn't start attempting that yet.

Cheers,
Gopal




Re: Hive+Tez staging dir and scratch dir

Posted by Dharmesh Kakadia <dh...@gmail.com>.
Thanks Gopal. Does ORC conversion have to see the entire data before it can
write output? My source table is fairly large (~70 TB), which I am trying to
convert to ORC. Both the source and destination tables are on the WASB remote
store, which has plenty of space. But the conversion job runs out of disk space
while running the reducer stage of the ORC conversion query. Are there
alternative ways to achieve the ORC conversion that do not fill up the local disk?

Thanks,
Dharmesh

On Tue, Nov 1, 2016 at 7:30 PM, Gopal Vijayaraghavan <go...@apache.org>
wrote:

> > The NameNode UI is reporting "non-DFS used" as 90% of the capacity.
>
> That space is unlikely to be related to the hive or tez scratch dir
> configs.
>
> If you inspect your disks with
>
> du -sh /grid/*/yarn/*
>
> (or wherever your disks are mounted), you will have some idea of what is
> occupying that space - whether it is logs, local data or shuffle data.
>
> Cheers,
> Gopal
>
>
>
>

Re: Hive+Tez staging dir and scratch dir

Posted by Gopal Vijayaraghavan <go...@apache.org>.
> The NameNode UI is reporting "non-DFS used" as 90% of the capacity.

That space is unlikely to be related to the hive or tez scratch dir configs.

If you inspect your disks with

du -sh /grid/*/yarn/*

(or wherever your disks are mounted), you will have some idea of what is occupying that space - whether it is logs, local data or shuffle data.
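
If the usage does turn out to be under the YARN directories, a rough way to split it by type, assuming the same /grid/*/yarn layout as above (adjust to your paths):

# Container logs kept by the NodeManager:
du -sh /grid/*/yarn/log 2>/dev/null
# Per-application local data, including intermediate/shuffle output,
# under the per-user appcache directories:
du -sh /grid/*/yarn/local/usercache/*/appcache 2>/dev/null | sort -h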

Cheers,
Gopal




Re: Hive+Tez staging dir and scratch dir

Posted by Dharmesh Kakadia <dh...@gmail.com>.
Thanks Hitesh.

We do run a local HDFS, but the disk space is not being used by the datanodes.
The NameNode UI is reporting "non-DFS used" as 90% of the capacity. The
intermediate output from the tasks seems to be filling up the disk.
I will follow your suggestion and post this to the Hive mailing list.
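
(As an aside, the same non-DFS number can be pulled from the command line instead of the UI; the grep pattern below matches the per-datanode lines in the report:)

# Per-datanode breakdown of DFS vs non-DFS usage, as reported by the NameNode:
hdfs dfsadmin -report | grep -i 'non dfs used'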

Thanks,
Dharmesh

On Tue, Nov 1, 2016 at 6:51 PM, Hitesh Shah <hi...@apache.org> wrote:

> There are multiple aspects to local disk usage. Is the disk usage being taken
> up by the NodeManager local dirs? Is it being taken up by the NodeManager log
> dirs? Are you running HDFS, which will also consume local disk space, i.e. the
> datanodes' data dirs? Could you clarify, in terms of the above, what is taking
> up a lot of space?
>
> The Tez staging dir and the Hive scratch dir are usually meant to be
> configured to point to a distributed FS. Have you configured them to use the
> Azure store FS? FWIW, in most cases, the Tez staging dir is not very large,
> as it stores metadata and not the real data being processed.
>
> Additionally, this might be better posted to the Hive mailing lists, in terms
> of how Hive manages intermediate data before the table is made visible to
> other users.
>
> thanks
> — Hitesh
>
>
> > On Nov 1, 2016, at 4:40 PM, Dharmesh Kakadia <dh...@gmail.com>
> > wrote:
> >
> > Hi,
> >
> > I am trying to understand the meaning of, and the relation between, the
> > following configurations when running Hive on Tez. My default FS is the
> > Azure store, and I am trying to figure out where local disk is utilized,
> > because I am running into the disk filling up during a large ORC table
> > conversion.
> >
> > hive.exec.stagingdir
> > tez.staging-dir
> > hive.exec.scratchdir
> >
> > Any help?
> >
> > Thanks,
> > Dharmesh
>
>

Re: Hive+Tez staging dir and scratch dir

Posted by Hitesh Shah <hi...@apache.org>.
There are multiple aspects to local disk usage. Is the disk usage being taken up by the NodeManager local dirs? Is it being taken up by the NodeManager log dirs? Are you running HDFS, which will also consume local disk space, i.e. the datanodes' data dirs? Could you clarify, in terms of the above, what is taking up a lot of space?
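
One way to check each of those categories is to look up the configured directories and measure them. A sketch, assuming the config files live under /etc/hadoop/conf and use the usual one-line-per-<name>/<value> XML layout (adjust to your install):

# Where the NodeManager keeps container working dirs and container logs:
grep -A1 -E 'yarn\.nodemanager\.(local|log)-dirs' /etc/hadoop/conf/yarn-site.xml
# Where the HDFS datanode keeps block data:
grep -A1 'dfs\.datanode\.data\.dir' /etc/hadoop/conf/hdfs-site.xml
# Then run du -sh against each directory those properties point at.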

The Tez staging dir and the Hive scratch dir are usually meant to be configured to point to a distributed FS. Have you configured them to use the Azure store FS? FWIW, in most cases, the Tez staging dir is not very large, as it stores metadata and not the real data being processed.
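
If they are not, a minimal sketch of pointing them at the Azure store (the container/account names and the script name are placeholders; tez.staging-dir can equally be set in tez-site.xml, and passing it via --hiveconf assumes Hive overlays its conf onto the Tez client config):

# Placeholder values throughout; substitute your container, account and script.
hive --hiveconf hive.exec.scratchdir=wasb://mycontainer@myaccount.blob.core.windows.net/tmp/hive \
     --hiveconf tez.staging-dir=wasb://mycontainer@myaccount.blob.core.windows.net/tmp/tez-staging \
     -f convert_to_orc.sql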

Additionally, this might be better posted to the Hive mailing lists, in terms of how Hive manages intermediate data before the table is made visible to other users.

thanks
— Hitesh


> On Nov 1, 2016, at 4:40 PM, Dharmesh Kakadia <dh...@gmail.com> wrote:
> 
> Hi,
> 
> I am trying to understand the meaning of, and the relation between, the
> following configurations when running Hive on Tez. My default FS is the
> Azure store, and I am trying to figure out where local disk is utilized,
> because I am running into the disk filling up during a large ORC table
> conversion.
> 
> hive.exec.stagingdir
> tez.staging-dir
> hive.exec.scratchdir
> 
> Any help?
> 
> Thanks,
> Dharmesh