Posted to mapreduce-user@hadoop.apache.org by parnab kumar <pa...@gmail.com> on 2013/07/06 09:50:57 UTC

Splitting input file - increasing number of mappers

Hi,

        I have an input file where each line is of the form:

           <URL> <A NUMBER>

      URLs whose numbers are within a threshold of each other are considered
similar. My task is to group together all similar URLs. For this I wrote a
*custom writable* where I implemented the threshold check in the *compareTo*
method. Therefore, when Hadoop sorts, the similar URLs are grouped together.
This seems to work fine.
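
A minimal sketch of the idea (class name and threshold value are
illustrative, not my actual code). The number is bucketed by the threshold
so that compareTo stays a consistent total order; a raw |a - b| <= threshold
check would not be transitive:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

public class UrlNumberWritable implements WritableComparable<UrlNumberWritable> {

    private static final long THRESHOLD = 10;  // illustrative value

    private Text url = new Text();
    private long number;

    public void set(String u, long n) {
        url.set(u);
        number = n;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        url.write(out);
        out.writeLong(number);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        url.readFields(in);
        number = in.readLong();
    }

    @Override
    public int compareTo(UrlNumberWritable other) {
        // Keys whose numbers fall in the same threshold-sized bucket compare
        // equal, so the sort/group phase brings them to the same reduce call.
        long a = number / THRESHOLD;
        long b = other.number / THRESHOLD;
        return (a < b) ? -1 : ((a == b) ? 0 : 1);
    }

    @Override
    public int hashCode() {
        // Must be consistent with compareTo so the default HashPartitioner
        // sends all keys of one bucket to the same reducer.
        return (int) (number / THRESHOLD);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof UrlNumberWritable
                && compareTo((UrlNumberWritable) o) == 0;
    }
}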
      I have the following queries:

   1>  Since I am relying heavily on the sort feature provided by Hadoop, am
I decreasing efficiency in any way, or am I doing the right thing by using
the sort that Hadoop does best? If this is the right approach, then my job
relies mostly on the map task. Therefore, will increasing the number of
mappers increase efficiency?

   2>  My file size is not more than 64 MB, i.e. one Hadoop block, which
means no more than one mapper will be invoked. Will splitting the file into
smaller pieces increase efficiency by invoking more mappers?

Can someone kindly provide some insight or advice regarding the above.

Thanks,
Parnab,
MS student, IIT Kharagpur

Re: Splitting input file - increasing number of mappers

Posted by Zhen Ren <re...@gmail.com>.
Hi, can you paste your code here?
In addition, Hadoop: The Definitive Guide, 2nd edition, introduces a tool
named HPROF for analyzing the performance of your mappers etc. (page 161).
Hope this helps!
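
For example, profiling can be switched on from the job configuration. A
sketch (property names are the old-API spellings from the book's era; newer
releases also accept the mapreduce.task.profile.* equivalents):

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// Turn on HPROF for a handful of tasks only; profiling every task of a
// large job would be far too slow.
conf.setBoolean("mapred.task.profile", true);
conf.set("mapred.task.profile.params",
    "-agentlib:hprof=cpu=samples,heap=sites,depth=6,force=n,thread=y,verbose=n,file=%s");
conf.set("mapred.task.profile.maps", "0-2");     // profile map tasks 0-2
conf.set("mapred.task.profile.reduces", "0-2");  // profile reduce tasks 0-2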


On 6 July 2013 15:50, parnab kumar <pa...@gmail.com> wrote:

> Hi,
>
>         I have an input file where each line is of the form:
>
>            <URL> <A NUMBER>
>
>       URLs whose numbers are within a threshold of each other are considered
> similar. My task is to group together all similar URLs. For this I wrote a
> *custom writable* where I implemented the threshold check in the *compareTo*
> method. Therefore, when Hadoop sorts, the similar URLs are grouped together.
> This seems to work fine.
>       I have the following queries:
>
>    1>  Since I am relying heavily on the sort feature provided by Hadoop, am
> I decreasing efficiency in any way, or am I doing the right thing by using
> the sort that Hadoop does best? If this is the right approach, then my job
> relies mostly on the map task. Therefore, will increasing the number of
> mappers increase efficiency?
>
>    2>  My file size is not more than 64 MB, i.e. one Hadoop block, which
> means no more than one mapper will be invoked. Will splitting the file into
> smaller pieces increase efficiency by invoking more mappers?
>
> Can someone kindly provide some insight or advice regarding the above.
>
> Thanks,
> Parnab,
> MS student, IIT Kharagpur
>



-- 
Ren Zhen

Re: Splitting input file - increasing number of mappers

Posted by Shumin Guo <gs...@gmail.com>.
You also need to pay attention to the split boundary, because you don't
want one line to be split across different mappers. Maybe you can think
about a multi-line input format.
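
For example, NLineInputFormat (built into Hadoop, new API) splits at line
boundaries, giving each mapper a fixed number of complete lines. A sketch,
assuming a Job object named job; the line count is illustrative:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

// Each input split carries at most 100,000 whole lines, so no line is
// ever cut in half and a small file still produces several mappers.
job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.setNumLinesPerSplit(job, 100000);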

Simon.
On Jul 6, 2013 10:18 AM, "Sanjay Subramanian" <Sanjay.Subramanian@wizecommerce.com> wrote:

>  More mappers will make it faster.
>      You can try this parameter:
>       mapreduce.input.fileinputformat.split.maxsize=<sizeinbytes>
>      This will control the input split size and force more mappers to run.
>
>
>  Also, your use case seems a good candidate for defining a Combiner,
> because you are grouping keys based on a criterion.
> But the only gotcha is that Combiners are not guaranteed to run.
>
>  Give these a shot.
>
>  Good luck
>
>  sanjay
>
>
>
>   From: parnab kumar <pa...@gmail.com>
> Reply-To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
> Date: Saturday, July 6, 2013 12:50 AM
> To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
> Subject: Splitting input file - increasing number of mappers
>
>  Hi,
>
>          I have an input file where each line is of the form:
>
>             <URL> <A NUMBER>
>
>        URLs whose numbers are within a threshold of each other are considered
> similar. My task is to group together all similar URLs. For this I wrote a
> *custom writable* where I implemented the threshold check in the *compareTo*
> method. Therefore, when Hadoop sorts, the similar URLs are grouped together.
> This seems to work fine.
>       I have the following queries:
>
>    1>  Since I am relying heavily on the sort feature provided by Hadoop, am
> I decreasing efficiency in any way, or am I doing the right thing by using
> the sort that Hadoop does best? If this is the right approach, then my job
> relies mostly on the map task. Therefore, will increasing the number of
> mappers increase efficiency?
>
>    2>  My file size is not more than 64 MB, i.e. one Hadoop block, which
> means no more than one mapper will be invoked. Will splitting the file into
> smaller pieces increase efficiency by invoking more mappers?
>
>  Can someone kindly provide some insight or advice regarding the above.
>
>  Thanks,
> Parnab,
> MS student, IIT Kharagpur
>

Re: Splitting input file - increasing number of mappers

Posted by Sanjay Subramanian <Sa...@wizecommerce.com>.
More mappers will make it faster.
     You can try this parameter:
      mapreduce.input.fileinputformat.split.maxsize=<sizeinbytes>
     This will control the input split size and force more mappers to run.
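
For example, with the new API (a sketch, assuming a Job named job; the
16 MB cap is illustrative):

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Cap each input split at 16 MB so a ~64 MB file yields about 4 map
// tasks; this sets mapreduce.input.fileinputformat.split.maxsize.
FileInputFormat.setMaxInputSplitSize(job, 16L * 1024 * 1024);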


Also, your use case seems a good candidate for defining a Combiner, because you are grouping keys based on a criterion.
But the only gotcha is that Combiners are not guaranteed to run.
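
For example (UrlGroupReducer is a hypothetical reducer class; only reuse a
reducer as a combiner if its logic is associative and commutative):

// Hadoop may run the combiner zero, one, or several times per key.
job.setCombinerClass(UrlGroupReducer.class);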

Give these a shot.

Good luck

sanjay



From: parnab kumar <pa...@gmail.com>
Reply-To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
Date: Saturday, July 6, 2013 12:50 AM
To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
Subject: Splitting input file - increasing number of mappers

Hi,

        I have an input file where each line is of the form:

           <URL> <A NUMBER>

      URLs whose numbers are within a threshold of each other are considered similar. My task is to group together all similar URLs. For this I wrote a custom writable where I implemented the threshold check in the compareTo method. Therefore, when Hadoop sorts, the similar URLs are grouped together. This seems to work fine.
      I have the following queries:

   1>  Since I am relying heavily on the sort feature provided by Hadoop, am I decreasing efficiency in any way, or am I doing the right thing by using the sort that Hadoop does best? If this is the right approach, then my job relies mostly on the map task. Therefore, will increasing the number of mappers increase efficiency?

   2>  My file size is not more than 64 MB, i.e. one Hadoop block, which means no more than one mapper will be invoked. Will splitting the file into smaller pieces increase efficiency by invoking more mappers?

Can someone kindly provide some insight or advice regarding the above.

Thanks,
Parnab,
MS student, IIT Kharagpur

