Posted to user@hadoop.apache.org by Madhav Sharan <ms...@usc.edu> on 2016/08/08 23:19:33 UTC

All nodes are not used

Hi Hadoop users,

I am running an m/r job with an input file of 23 million records. I can
see that not all of our nodes are being used.

What can I change to utilize all nodes?


Containers   Mem Used   Mem Avail   VCores Used   VCores Avail
         8   11.25 GB        0 B              8              0
         0        0 B    11.25 GB             0              8
         0        0 B    11.25 GB             0              8
         8   11.25 GB        0 B              8              0
         8   11.25 GB        0 B              8              0
         7   11.25 GB        0 B              7              1
         5    7.03 GB     4.22 GB             5              3
         0        0 B    11.25 GB             0              8
         0        0 B    11.25 GB             0              8


My command looks like -

hadoop jar target/pooled-time-series-1.0-SNAPSHOT-jar-with-dependencies.jar
gov.nasa.jpl.memex.pooledtimeseries.MeanChiSquareDistanceCalculation
/user/pts/output/MeanChiSquareAndSimilarityInput
/user/pts/output/MeanChiSquaredCalcOutput

The directory */user/pts/output/MeanChiSquareAndSimilarityInput* has an
input file of 23 million records; the file size is ~3 GB.

Code -
https://github.com/smadha/pooled_time_series/blob/master/src/main/java/gov/nasa/jpl/memex/pooledtimeseries/MeanChiSquareDistanceCalculation.java#L135


--
Madhav Sharan

Re: All nodes are not used

Posted by Madhav Sharan <ms...@usc.edu>.
Thanks Mahesh

So far I have not been able to run the whole job within a limited time
period, so I am looking for optimizations and better resource utilization.
Maybe I can try tweaking the input split size to see if it helps.
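
Assuming the driver goes through ToolRunner/GenericOptionsParser (an
assumption; I have not checked this for MeanChiSquareDistanceCalculation),
the split size could even be tweaked from the command line without a code
change. A rough sketch with an illustrative 64 MB cap:

hadoop jar target/pooled-time-series-1.0-SNAPSHOT-jar-with-dependencies.jar \
    gov.nasa.jpl.memex.pooledtimeseries.MeanChiSquareDistanceCalculation \
    -D mapreduce.input.fileinputformat.split.maxsize=67108864 \
    /user/pts/output/MeanChiSquareAndSimilarityInput \
    /user/pts/output/MeanChiSquaredCalcOutput

Capping splits at 64 MB instead of the 128 MB default would roughly double
the number of map tasks; the older name of this property is
mapred.max.split.size.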

Thanks for your help; it explains the behaviour.

--
Madhav Sharan


On Tue, Aug 9, 2016 at 1:28 PM, Mahesh Balija <ba...@gmail.com>
wrote:

> Hi Madhav,
>
> The behaviour sounds normal to me.
> If the block size is 128 MB, there could be roughly ~24 mappers (i.e.,
> containers used).
> You cannot use the entire cluster, as the blocks may live only on the
> nodes currently being used.
>
> You should not try to use the entire cluster's resources, for the
> following reason:
>
> The time required to initialize a container versus the time required to
> process its share of the data should be balanced to maximize container
> utilization; that is why the 128 MB block size was chosen. In many cases
> the InputSplit size is increased to optimize container utilization,
> depending on the workload.
>
> Best,
> Mahesh.B.
>
>
>
> On Tue, Aug 9, 2016 at 12:19 AM, Madhav Sharan <ms...@usc.edu> wrote:
>
>> Hi Hadoop users,
>>
>> I am running an m/r job with an input file of 23 million records. I can
>> see that not all of our nodes are being used.
>>
>> What can I change to utilize all nodes?
>>
>>
>> Containers   Mem Used   Mem Avail   VCores Used   VCores Avail
>>          8   11.25 GB        0 B              8              0
>>          0        0 B    11.25 GB             0              8
>>          0        0 B    11.25 GB             0              8
>>          8   11.25 GB        0 B              8              0
>>          8   11.25 GB        0 B              8              0
>>          7   11.25 GB        0 B              7              1
>>          5    7.03 GB     4.22 GB             5              3
>>          0        0 B    11.25 GB             0              8
>>          0        0 B    11.25 GB             0              8
>>
>>
>> My command looks like -
>>
>> hadoop jar target/pooled-time-series-1.0-SNAPSHOT-jar-with-dependencies.jar
>> gov.nasa.jpl.memex.pooledtimeseries.MeanChiSquareDistanceCalculation
>> /user/pts/output/MeanChiSquareAndSimilarityInput
>> /user/pts/output/MeanChiSquaredCalcOutput
>>
>> The directory */user/pts/output/MeanChiSquareAndSimilarityInput* has an
>> input file of 23 million records; the file size is ~3 GB.
>>
>> Code - https://github.com/smadha/pooled_time_series/blob/master/src/main/java/gov/nasa/jpl/memex/pooledtimeseries/MeanChiSquareDistanceCalculation.java#L135
>>
>>
>> --
>> Madhav Sharan
>>
>>
>

Re: All nodes are not used

Posted by Mahesh Balija <ba...@gmail.com>.
Hi Madhav,

The behaviour sounds normal to me.
If the block size is 128 MB, there could be roughly ~24 mappers (i.e.,
containers used).
You cannot use the entire cluster, as the blocks may live only on the
nodes currently being used.

You should not try to use the entire cluster's resources, for the
following reason:

The time required to initialize a container versus the time required to
process its share of the data should be balanced to maximize container
utilization; that is why the 128 MB block size was chosen. In many cases
the InputSplit size is increased to optimize container utilization,
depending on the workload.
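
A back-of-the-envelope version of the arithmetic above, using the figures
from this thread (approximate, and assuming the default 128 MB block size
and the 8-vcore nodes shown in the utilization snapshot):

    ~3 GB input    / 128 MB per split   =  ~24 input splits = ~24 map containers
    ~24 containers / 8 vcores per node  =  work for only ~3 nodes at a time

which is roughly consistent with the snapshot earlier in the thread, where
a few nodes run at full capacity while the rest sit idle.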

Best,
Mahesh.B.



On Tue, Aug 9, 2016 at 12:19 AM, Madhav Sharan <ms...@usc.edu> wrote:

> Hi Hadoop users,
>
> I am running an m/r job with an input file of 23 million records. I can
> see that not all of our nodes are being used.
>
> What can I change to utilize all nodes?
>
>
> Containers   Mem Used   Mem Avail   VCores Used   VCores Avail
>          8   11.25 GB        0 B              8              0
>          0        0 B    11.25 GB             0              8
>          0        0 B    11.25 GB             0              8
>          8   11.25 GB        0 B              8              0
>          8   11.25 GB        0 B              8              0
>          7   11.25 GB        0 B              7              1
>          5    7.03 GB     4.22 GB             5              3
>          0        0 B    11.25 GB             0              8
>          0        0 B    11.25 GB             0              8
>
>
> My command looks like -
>
> hadoop jar target/pooled-time-series-1.0-SNAPSHOT-jar-with-dependencies.jar
> gov.nasa.jpl.memex.pooledtimeseries.MeanChiSquareDistanceCalculation
> /user/pts/output/MeanChiSquareAndSimilarityInput
> /user/pts/output/MeanChiSquaredCalcOutput
>
> The directory */user/pts/output/MeanChiSquareAndSimilarityInput* has an
> input file of 23 million records; the file size is ~3 GB.
>
> Code - https://github.com/smadha/pooled_time_series/blob/master/src/main/java/gov/nasa/jpl/memex/pooledtimeseries/MeanChiSquareDistanceCalculation.java#L135
>
>
> --
> Madhav Sharan
>
>

Re: All nodes are not used

Posted by Madhav Sharan <ms...@usc.edu>.
Hi Sunil - Thanks a lot for replying

For a single job run, yes, some nodes take no load at all, but if I rerun,
it is not always the same nodes.

One map task takes ~3 seconds to run, and so far I have not been able to
run my whole job on a bigger data set, so I can't yet say whether the
containers are short-lived.

I was experimenting, and if I split the input file into N files, where N =
number of cores, then my job starts running on all cores. So maybe I need
to look at the split size. Is there a trick to get the number of splits to
equal the number of cores?

Otherwise, I can try adjusting mapred.min.split.size manually.
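
One way to get roughly one split per vcore is to cap the split size at the
total input size divided by the cluster's vcore count in the job driver. A
minimal sketch, assuming the driver builds a standard Job; the vcore count
(72 = 9 nodes x 8 vcores, taken from the utilization table earlier in the
thread) and the variable names are illustrative, not from the project code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Inside the driver's main()/run(), which already declares "throws Exception".
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "MeanChiSquareDistanceCalculation");

Path input = new Path("/user/pts/output/MeanChiSquareAndSimilarityInput");
FileInputFormat.addInputPath(job, input);

// Total input bytes, and the number of vcores we would like to keep busy.
long totalBytes = FileSystem.get(conf).getContentSummary(input).getLength();
int totalVcores = 72;  // assumed: 9 nodes * 8 vcores; adjust to the real cluster

// Cap each split so that roughly totalVcores map tasks are created.
long targetSplitSize = Math.max(1L, totalBytes / totalVcores);
FileInputFormat.setMaxInputSplitSize(job, targetSplitSize);
// setMinInputSplitSize sets mapreduce.input.fileinputformat.split.minsize,
// the modern name of the mapred.min.split.size property mentioned above.
FileInputFormat.setMinInputSplitSize(job, 1L);

With ~3 GB of input and 72 vcores this caps splits at roughly 42 MB, so the
map tasks should spread across all nodes rather than filling only the nodes
that hold the HDFS blocks.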


--
Madhav Sharan


On Tue, Aug 9, 2016 at 8:27 AM, Sunil Govind <su...@gmail.com> wrote:

> Hi Madhav,
>
> Could you help share some more information here? When you say a few nodes
> are not utilized, is it always the same nodes that are not utilized?
>
> Also, how long does each of these containers run on average? Please make
> sure you have provided a large enough split size so that the containers
> are not short-running.
>
> Thanks
> Sunil
>
> On Tue, Aug 9, 2016 at 4:49 AM Madhav Sharan <ms...@usc.edu> wrote:
>
>> Hi Hadoop users,
>>
>> I am running an m/r job with an input file of 23 million records. I can
>> see that not all of our nodes are being used.
>>
>> What can I change to utilize all nodes?
>>
>>
>> Containers   Mem Used   Mem Avail   VCores Used   VCores Avail
>>          8   11.25 GB        0 B              8              0
>>          0        0 B    11.25 GB             0              8
>>          0        0 B    11.25 GB             0              8
>>          8   11.25 GB        0 B              8              0
>>          8   11.25 GB        0 B              8              0
>>          7   11.25 GB        0 B              7              1
>>          5    7.03 GB     4.22 GB             5              3
>>          0        0 B    11.25 GB             0              8
>>          0        0 B    11.25 GB             0              8
>>
>>
>> My command looks like -
>>
>> hadoop jar target/pooled-time-series-1.0-SNAPSHOT-jar-with-dependencies.jar
>> gov.nasa.jpl.memex.pooledtimeseries.MeanChiSquareDistanceCalculation
>> /user/pts/output/MeanChiSquareAndSimilarityInput
>> /user/pts/output/MeanChiSquaredCalcOutput
>>
>> The directory */user/pts/output/MeanChiSquareAndSimilarityInput* has an
>> input file of 23 million records; the file size is ~3 GB.
>>
>> Code - https://github.com/smadha/pooled_time_series/blob/master/src/main/java/gov/nasa/jpl/memex/pooledtimeseries/MeanChiSquareDistanceCalculation.java#L135
>>
>>
>> --
>> Madhav Sharan
>>
>>

Re: All nodes are not used

Posted by Sunil Govind <su...@gmail.com>.
Hi Madhav,

Could you help share some more information here? When you say a few nodes
are not utilized, is it always the same nodes that are not utilized?

Also, how long does each of these containers run on average? Please make
sure you have provided a large enough split size so that the containers
are not short-running.

Thanks
Sunil

On Tue, Aug 9, 2016 at 4:49 AM Madhav Sharan <ms...@usc.edu> wrote:

> Hi Hadoop users,
>
> I am running an m/r job with an input file of 23 million records. I can
> see that not all of our nodes are being used.
>
> What can I change to utilize all nodes?
>
>
> Containers   Mem Used   Mem Avail   VCores Used   VCores Avail
>          8   11.25 GB        0 B              8              0
>          0        0 B    11.25 GB             0              8
>          0        0 B    11.25 GB             0              8
>          8   11.25 GB        0 B              8              0
>          8   11.25 GB        0 B              8              0
>          7   11.25 GB        0 B              7              1
>          5    7.03 GB     4.22 GB             5              3
>          0        0 B    11.25 GB             0              8
>          0        0 B    11.25 GB             0              8
>
>
> My command looks like -
>
> hadoop jar
> target/pooled-time-series-1.0-SNAPSHOT-jar-with-dependencies.jar
> gov.nasa.jpl.memex.pooledtimeseries.MeanChiSquareDistanceCalculation /user/pts/output/MeanChiSquareAndSimilarityInput
> /user/pts/output/MeanChiSquaredCalcOutput
>
> The directory */user/pts/output/MeanChiSquareAndSimilarityInput* has an
> input file of 23 million records; the file size is ~3 GB.
>
> Code -
> https://github.com/smadha/pooled_time_series/blob/master/src/main/java/gov/nasa/jpl/memex/pooledtimeseries/MeanChiSquareDistanceCalculation.java#L135
>
>
> --
> Madhav Sharan
>
>