Posted to user@hadoop.apache.org by sindhu hosamane <si...@gmail.com> on 2014/08/01 12:36:26 UTC

Re: Ideal number of mappers and reducers to increase performance

Thanks a ton for your help, Harsh. I am a newbie in Hadoop.
If I have set
mapred.tasktracker.map.tasks.maximum = 4
mapred.tasktracker.reduce.tasks.maximum = 4
should I also bother to set the values below:
mapred.map.tasks and mapred.reduce.tasks?
If yes, then what is the ideal value?





On Fri, Aug 1, 2014 at 12:00 AM, Harsh J <ha...@cloudera.com> wrote:

> You can perhaps start with a generic 4+4 configuration (which matches
> your cores), and tune your way upwards or downwards from there based
> on your results.
>
> On Thu, Jul 31, 2014 at 8:35 PM, Sindhu Hosamane <si...@gmail.com>
> wrote:
> > Hello friends,
> >
> > I am running my experiment on a server with 2 processors (4 cores
> > each); that is, it has 2 processors and 8 cores in total.
> > What would be the ideal values for mapred.tasktracker.map.tasks.maximum
> > and mapred.tasktracker.reduce.tasks.maximum to get maximum performance?
> > I am running cascalog queries on data of size 280 MB.
> > I have multiple datanodes running on the same machine.
> >
> > Your help is very much appreciated.
> >
> >
> > Regards,
> > sindhu
> >
>
>
>
> --
> Harsh J
>

Re: Ideal number of mappers and reducers to increase performance

Posted by Harsh J <ha...@cloudera.com>.
Felix has already explained most of the characteristics that define
the parallelism of MR jobs.

How many mappers does your program run? Your parallel performance
depends on how much parallelism your job actually runs with, aside from
what the platform provides as a capability. Perhaps for your input
it only uses two map tasks (due to only 2 input splits), so it
wouldn't go any faster by default. Or perhaps your input is a single
non-splittable file, such as a gzip-compressed text file, which would
yield only one map task.

As to the reducer question: if you are using an expressive wrapper
such as cascalog, then it also depends on what you are doing in it. If
you are computing an operation such as a total count, or a global max
for example, then the wrapper may itself set the # of reducers to
1. I'm not aware of cascalog's internals, but that may be worth
looking into.

P.S. In case it was not merely a typo in your mail: the property is
mapred.reduce.tasks, not mapped.reduce.tasks.
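
For instance, here is a minimal sketch (not from the original thread) of
forcing the reducer count around a cascalog query. It assumes cascalog
1.x, where with-job-conf comes in via (use 'cascalog.api), and uses a
tiny in-memory generator just to stay self-contained:

(use 'cascalog.api)                       ; ?<-, stdout, with-job-conf, ...
(require '[cascalog.ops :as c])

(def pairs [["a" 1] ["b" 2] ["a" 3]])     ; in-memory generator of 2-tuples

(with-job-conf {"mapred.reduce.tasks" 4}  ; note: mapred, not mapped
  (?<- (stdout) [?k ?sum]
       (pairs ?k ?v)
       (c/sum ?v :> ?sum)))

Note that 4 is only an upper bound: with two distinct keys, at most two
reducers get real work, and a wrapper may still override the value for a
truly global aggregate.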

On Tue, Aug 5, 2014 at 1:26 AM, Sindhu Hosamane <si...@gmail.com> wrote:
> Thanks a lot for your explanation, Felix.
> My query is not using a global sort/count. But I am still unable to
> understand: even if I set mapped.reduce.tasks=4, when the Hadoop job
> runs I still see
> 14/08/03 15:01:48 INFO mapred.MapTask: numReduceTasks: 1
> 14/08/03 15:01:48 INFO mapred.MapTask: io.sort.mb = 100
>
> Does that look OK? numReduceTasks should be 4, right?
> Also, I am pasting my cascalog query below. Please point out where I am
> wrong. Why has the performance not improved?
>
> Cascalog code
> (def info
>       (hfs-delimited  "/users/si/File.txt"
>                        :delimiter ";"
>                        :outfields ["?timestamp" "?AIT" "?CET" "?BTT367"]
>                        :classes [String String String String  ]
>                        :skip-header? true))
>
>
>
> (defn convert-to-long [a]
>     (ct/to-long (f/parse custom-formatter a)))
>
> (def info-tap
>   (<- [?timestamp  ?BTT367 ]
>       ((select-fields info ["?timestamp"  "?BTT367"]) ?timestamp  ?BTT367)))
>
> (defn convert-to-float [a]
>   (try
>     (when (not= a " ")
>       (read-string a))
>     (catch Exception e nil)))
>
> (?<- (stdout) [?timestamp-out ?highest-value]
>      (info-tap ?timestamp ?BTT367)
>      (convert-to-float ?BTT367 :> ?converted-BTT367)
>      (convert-to-long ?timestamp :> ?converted-timestamp)
>      (>= ?converted-timestamp start-value)
>      (<= ?converted-timestamp end-value)
>      (:sort ?converted-BTT367) (:reverse true)
>      (c/limit [1] ?timestamp ?converted-BTT367 :> ?timestamp-out ?highest-value))
>
>
> Regards,
> Sindhu
>
>
>
>
>
> On 04 Aug 2014, at 19:10, Felix Chern <id...@gmail.com> wrote:
>
> The mapper and reducer numbers really depend on what your program is
> trying to do. Without your actual query it's really difficult to tell
> why you are having this problem.
>
> For example, if you try to perform a global sum or count, cascalog will
> only use one reducer, since that is the only way to do a global
> sum/count. To avoid this behavior you can set an output key that
> naturally splits the work across reducers; e.g., the word count example
> uses the word as the output key. With that word count output you can sum
> it up in a serial manner, or run the global map reduce job with this
> much smaller input.
>
> The mapper number is usually not a performance bottleneck. For the
> curious: if the file is splittable (i.e., uncompressed text or a
> sequence file), the number of mappers is controlled by the split size in
> the configuration. The smaller the split size, the more mappers are
> queued.
>
> In short, your problem is likely not a configuration problem but a
> misunderstanding of the MapReduce logic. To help solve it, can you paste
> your cascalog query and let people take a look?
>
> Felix
>
> On Aug 3, 2014, at 1:51 PM, Sindhu Hosamane <si...@gmail.com> wrote:
>
>
> I am not coding in MapReduce. I am running my cascalog queries on a
> Hadoop cluster (1 node) on data of size 280MB, so all the config
> settings have to be made on the Hadoop cluster itself.
> As you said, I set the values mapred.tasktracker.map.tasks.maximum = 4
> and mapred.tasktracker.reduce.tasks.maximum = 4,
> and then kept tuning them upwards and downwards, like below:
> (4+4) (5+3) (6+2) (2+6) (3+5) (3+3) (10+10)
>
> But the performance stays the same every time. Whatever combination of
> mapred.tasktracker.map.tasks.maximum and
> mapred.tasktracker.reduce.tasks.maximum I use produces the same
> execution time.
>
> When the above failed, I also tried mapred.reduce.tasks = 4; still the
> results are the same. No reduction in execution time.
>
> What other things should I set? Also, I made sure Hadoop was restarted
> every time after changing the config.
> I have attached my conf folder; please indicate what should be added
> where.
> I am really stuck. Your help would be much appreciated. Thank you.
> <(singlenodecuda)conf.zip>
>
> Regards,
> Sindhu
>
>
>



-- 
Harsh J

Re: Ideal number of mappers and reducers to increase performance

Posted by Sindhu Hosamane <si...@gmail.com>.
Thanks a lot for your explanation, Felix.
My query is not using a global sort/count. But I am still unable to understand:
even if I set mapped.reduce.tasks=4, when the Hadoop job runs I still see

14/08/03 15:01:48 INFO mapred.MapTask: numReduceTasks: 1
14/08/03 15:01:48 INFO mapred.MapTask: io.sort.mb = 100

Does that look OK? numReduceTasks should be 4, right?
Also, I am pasting my cascalog query below. Please point out where I am wrong. Why has the performance not improved?

Cascalog code
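;; Assumed context (not shown in the mail): the aliases below would come
;; from an ns form along these lines --
;;   (:require [cascalog.api :refer :all]
;;             [cascalog.more-taps :refer [hfs-delimited]]
;;             [cascalog.ops :as c]
;;             [clj-time.format :as f]
;;             [clj-time.coerce :as ct])
;; custom-formatter, start-value and end-value are defined elsewhere by
;; the author.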
(def info
      (hfs-delimited  "/users/si/File.txt"
                       :delimiter ";"
                       :outfields ["?timestamp" "?AIT" "?CET" "?BTT367"]
                       :classes [String String String String  ]
                       :skip-header? true))
       
(defn convert-to-long [a]
  (ct/to-long (f/parse custom-formatter a)))

(def info-tap
  (<- [?timestamp  ?BTT367 ]
      ((select-fields info ["?timestamp"  "?BTT367"]) ?timestamp  ?BTT367)))

(defn convert-to-float [a]
  (try
    (when (not= a " ")
      (read-string a))
    (catch Exception e nil)))

(?<- (stdout) [?timestamp-out ?highest-value]
     (info-tap ?timestamp ?BTT367)
     (convert-to-float ?BTT367 :> ?converted-BTT367)
     (convert-to-long ?timestamp :> ?converted-timestamp)
     (>= ?converted-timestamp start-value)
     (<= ?converted-timestamp end-value)
     (:sort ?converted-BTT367) (:reverse true)
     (c/limit [1] ?timestamp ?converted-BTT367 :> ?timestamp-out ?highest-value))


Regards,
Sindhu





On 04 Aug 2014, at 19:10, Felix Chern <id...@gmail.com> wrote:

> The mapper and reducer numbers really depend on what your program is trying to do. Without your actual query it's really difficult to tell why you are having this problem.
> 
> For example, if you try to perform a global sum or count, cascalog will only use one reducer, since that is the only way to do a global sum/count. To avoid this behavior you can set an output key that naturally splits the work across reducers; e.g., the word count example uses the word as the output key. With that word count output you can sum it up in a serial manner, or run the global map reduce job with this much smaller input.
> 
> The mapper number is usually not a performance bottleneck. For the curious: if the file is splittable (i.e., uncompressed text or a sequence file), the number of mappers is controlled by the split size in the configuration. The smaller the split size, the more mappers are queued.
> 
> In short, your problem is likely not a configuration problem but a misunderstanding of the MapReduce logic. To help solve it, can you paste your cascalog query and let people take a look?
> 
> Felix
> 
> On Aug 3, 2014, at 1:51 PM, Sindhu Hosamane <si...@gmail.com> wrote:
> 
>> 
>> I am not coding in MapReduce. I am running my cascalog queries on a Hadoop cluster (1 node) on data of size 280MB, so all the config settings have to be made on the Hadoop cluster itself.
>> As you said, I set the values mapred.tasktracker.map.tasks.maximum = 4
>> and mapred.tasktracker.reduce.tasks.maximum = 4,
>> and then kept tuning them upwards and downwards, like below:
>> (4+4) (5+3) (6+2) (2+6) (3+5) (3+3) (10+10)
>> 
>> But the performance stays the same every time. Whatever combination of mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum I use produces the same execution time.
>> 
>> When the above failed, I also tried mapred.reduce.tasks = 4; still the results are the same. No reduction in execution time.
>> 
>> What other things should I set? Also, I made sure Hadoop was restarted every time after changing the config.
>> I have attached my conf folder; please indicate what should be added where.
>> I am really stuck. Your help would be much appreciated. Thank you.
>> <(singlenodecuda)conf.zip>
>> 
>> Regards,
>> Sindhu
> 


Re: Ideal number of mappers and reducers to increase performance

Posted by Felix Chern <id...@gmail.com>.
The mapper and reducer numbers really depend on what your program is trying to do. Without your actual query it's really difficult to tell why you are having this problem.

For example, if you try to perform a global sum or count, cascalog will only use one reducer, since that is the only way to do a global sum/count. To avoid this behavior you can set an output key that naturally splits the work across reducers; e.g., the word count example uses the word as the output key. With that word count output you can sum it up in a serial manner, or run the global map reduce job with this much smaller input.
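
A minimal cascalog sketch of that idea (not from the original mail; the
names and the tiny in-memory generator are illustrative): grouping by the
word gives the shuffle many keys, so the counting can spread across
reducers instead of funnelling into a single one:

(use 'cascalog.api)                 ; ?<-, stdout, defmapcatop, ...
(require '[cascalog.ops :as c])

(defmapcatop split-words [line]
  (seq (.split line "\\s+")))       ; one output tuple per word

(def sentences [["the quick fox"] ["the lazy dog"]])  ; 1-tuples of text

;; Counting per ?word key: many distinct words means many reduce keys.
(?<- (stdout) [?word ?count]
     (sentences ?line)
     (split-words ?line :> ?word)
     (c/count ?count))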

The mapper number is usually not a performance bottleneck. For the curious: if the file is splittable (i.e., uncompressed text or a sequence file), the number of mappers is controlled by the split size in the configuration. The smaller the split size, the more mappers are queued.
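
As a sketch of the knobs involved (untested; the property names are the
MR1 ones, and newer Hadoop spells them mapreduce.input.fileinputformat.*):
for a splittable input, raising the mapred.map.tasks hint or lowering
mapred.min.split.size yields more, smaller splits and therefore more
queued mappers. The path reuses the file from the query in this thread:

(use 'cascalog.api)

(with-job-conf {"mapred.map.tasks" 8        ; a hint, not a hard count
                "mapred.min.split.size" 1}  ; allow splits below a block
  (?<- (stdout) [?line]
       ((hfs-textline "/users/si/File.txt") ?line)))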

In short, your problem is likely not a configuration problem but a misunderstanding of the MapReduce logic. To help solve it, can you paste your cascalog query and let people take a look?

Felix

On Aug 3, 2014, at 1:51 PM, Sindhu Hosamane <si...@gmail.com> wrote:

> 
> I am not coding in MapReduce. I am running my cascalog queries on a Hadoop cluster (1 node) on data of size 280MB, so all the config settings have to be made on the Hadoop cluster itself.
> As you said, I set the values mapred.tasktracker.map.tasks.maximum = 4
> and mapred.tasktracker.reduce.tasks.maximum = 4,
> and then kept tuning them upwards and downwards, like below:
> (4+4) (5+3) (6+2) (2+6) (3+5) (3+3) (10+10)
> 
> But the performance stays the same every time. Whatever combination of mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum I use produces the same execution time.
> 
> When the above failed, I also tried mapred.reduce.tasks = 4; still the results are the same. No reduction in execution time.
> 
> What other things should I set? Also, I made sure Hadoop was restarted every time after changing the config.
> I have attached my conf folder; please indicate what should be added where.
> I am really stuck. Your help would be much appreciated. Thank you.
> <(singlenodecuda)conf.zip>
> 
> Regards,
> Sindhu


Re: Ideal number of mappers and reducers to increase performance

Posted by Sindhu Hosamane <si...@gmail.com>.
I am not coding in MapReduce. I am running my cascalog queries on a Hadoop cluster (1 node) on data of size 280MB, so all the config settings have to be made on the Hadoop cluster itself.
As you said, I set the values mapred.tasktracker.map.tasks.maximum = 4
and mapred.tasktracker.reduce.tasks.maximum = 4,
and then kept tuning them upwards and downwards, like below:
(4+4) (5+3) (6+2) (2+6) (3+5) (3+3) (10+10)

But the performance stays the same every time. Whatever combination of mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum I use produces the same execution time.

When the above failed, I also tried mapred.reduce.tasks = 4; still the results are the same. No reduction in execution time.

What other things should I set? Also, I made sure Hadoop was restarted every time after changing the config.
I have attached my conf folder; please indicate what should be added where.
I am really stuck. Your help would be much appreciated. Thank you.


Regards,
Sindhu

Re: Ideal number of mappers and reducers to increase performance

Posted by Nitin Pawar <ni...@gmail.com>.
The mapred.tasktracker.* settings control the maximum number of maps or
reducers a tasktracker can run. They can vary across machines: if you
have multiple nodes, you can decide these values per machine depending
on its configuration. If you set them to 4, it basically means that at
any given point the tasktracker running on that machine will run a
maximum of 4 maps or 4 reducers.

The mapred.map.* settings are cluster-wide. They define how many tasks
(maps or reducers) a job should get by default. They are overridden by
the job when it is submitted to the jobtracker, or by the client itself.

It is not a must for you to set mapred.map.tasks or mapred.reduce.tasks,
as the default value in the config is 2.
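
As a sketch of that per-job override (plain Hadoop interop, assuming the
MR1 JobConf API): the value the job carries at submission time wins over
the cluster-wide default, while the tasktracker maxima above still cap
how many tasks run concurrently on each node:

(import 'org.apache.hadoop.mapred.JobConf)

(def conf
  (doto (JobConf.)
    (.setNumMapTasks 8)       ; sets mapred.map.tasks (a hint)
    (.setNumReduceTasks 4)))  ; sets mapred.reduce.tasks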




On Fri, Aug 1, 2014 at 4:06 PM, sindhu hosamane <si...@gmail.com> wrote:

> Thanks a ton for your help, Harsh. I am a newbie in Hadoop.
> If I have set
> mapred.tasktracker.map.tasks.maximum = 4
> mapred.tasktracker.reduce.tasks.maximum = 4
> should I also bother to set the values below:
> mapred.map.tasks and mapred.reduce.tasks?
> If yes, then what is the ideal value?
>
>
>
>
>
> On Fri, Aug 1, 2014 at 12:00 AM, Harsh J <ha...@cloudera.com> wrote:
>
>> You can perhaps start with a generic 4+4 configuration (which matches
>> your cores), and tune your way upwards or downwards from there based
>> on your results.
>>
>> On Thu, Jul 31, 2014 at 8:35 PM, Sindhu Hosamane <si...@gmail.com>
>> wrote:
>> > Hello friends,
>> >
>> > I am running my experiment on a server with 2 processors (4 cores
>> > each); that is, it has 2 processors and 8 cores in total.
>> > What would be the ideal values for mapred.tasktracker.map.tasks.maximum
>> > and mapred.tasktracker.reduce.tasks.maximum to get maximum performance?
>> > I am running cascalog queries on data of size 280 MB.
>> > I have multiple datanodes running on the same machine.
>> >
>> > Your help is very much appreciated.
>> >
>> >
>> > Regards,
>> > sindhu
>> >
>>
>>
>>
>> --
>> Harsh J
>>
>
>


-- 
Nitin Pawar
