Posted to user@mahout.apache.org by Stuti Awasthi <st...@hcl.com> on 2013/01/16 08:48:10 UTC

MatrixMultiplicationJob runs with 1 mapper only ?

Hi,

I am trying to multiply a dense matrix of size [100 x 100k]. The file is 104MB, and with the default block size of 64MB only 2 blocks are created.
So I reduced the block size to 10MB, and now my file is divided into 11 blocks across the cluster. The cluster has 10 nodes: 1 NN/JT and 9 DN/TT.
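
The block counts quoted here follow from ceiling division of the file size by the block size. A minimal Java sketch, assuming the file is exactly 104 MB and that 1 MB = 1024 * 1024 bytes (the class and method names are illustrative, not part of Hadoop or Mahout):

```java
// Sketch only: counts HDFS blocks for a file, with all sizes given in bytes.
final class BlockCountSketch {
    static final long MB = 1024L * 1024L;

    // Ceiling division: a partially filled last block still counts as a block.
    static long blocks(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        long file = 104 * MB; // assumed exact size of the matrix file
        System.out.println(blocks(file, 64 * MB)); // prints 2
        System.out.println(blocks(file, 10 * MB)); // prints 11
    }
}
```

The number of blocks is only an upper hint for the number of map tasks; the input format decides how blocks are grouped into splits.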

Every time I run Mahout's MatrixMultiplicationJob from the command line, the JobTracker web UI shows that only 1 map task is launched. According to my understanding of InputSplit, 11 map tasks should be launched.
Apart from this, the map task stays at 0.99% completion, and in the task logs I can see that the map task is spilling its map output.

Mahout Command:

mahout matrixmult -Dmapred.child.java.opts=-Xmx1024M -Dfs.inmemory.size.mb=200 -Dio.sort.factor=100 -Dio.sort.mb=200 -Dio.file.buffer.size=131072 --inputPathA /test/matrixA --numRowsA 100 --numColsA 100000 --inputPathB /test/matrixA --numRowsB 100 --numColsB 100000 --tempDir /test/temp

I would like to know why only 1 map task is launched every time, and how I can tune the cluster so that I can perform dense matrix multiplication of the order [90K x 1 Million].
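
For a sense of scale, the raw size of a dense matrix can be estimated before any tuning. A back-of-the-envelope Java sketch, assuming 8-byte double entries and ignoring SequenceFile and vector overhead (both are assumptions; the thread does not state the storage format's overhead):

```java
// Sketch only: raw storage needed for a dense matrix of doubles.
final class DenseSizeSketch {
    // Bytes for a rows x cols matrix with 8 bytes per entry, no format overhead.
    static long denseBytes(long rows, long cols) {
        return rows * cols * 8L;
    }

    public static void main(String[] args) {
        // 100 x 100k test matrix: 80,000,000 bytes of raw values.
        System.out.println(denseBytes(100, 100_000));
        // 90K x 1 Million target matrix: 720,000,000,000 bytes of raw values.
        System.out.println(denseBytes(90_000, 1_000_000));
    }
}
```

Under these assumptions the test matrix holds about 80 MB of raw values, in the same range as the 104MB file, while the target matrix would need roughly 720 GB for its input alone.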

Thanks
Stuti

::DISCLAIMER::
----------------------------------------------------------------------------------------------------------------------------------------------------

The contents of this e-mail and any attachment(s) are confidential and intended for the named recipient(s) only.
E-mail transmission is not guaranteed to be secure or error-free as information could be intercepted, corrupted,
lost, destroyed, arrive late or incomplete, or may contain viruses in transmission. The e mail and its contents
(with or without referred errors) shall therefore not attach any liability on the originator or HCL or its affiliates.
Views or opinions, if any, presented in this email are solely those of the author and may not necessarily reflect the
views or opinions of HCL or its affiliates. Any form of reproduction, dissemination, copying, disclosure, modification,
distribution and / or publication of this message without the prior written consent of authorized representative of
HCL is strictly prohibited. If you have received this email in error please delete it and notify the sender immediately.
Before opening any email and/or attachments, please check them for viruses and other defects.

----------------------------------------------------------------------------------------------------------------------------------------------------

RE: MatrixMultiplicationJob runs with 1 mapper only ?

Posted by Stuti Awasthi <st...@hcl.com>.

Hey Sean,
Thanks for the response. The MatrixMultiplicationJob help shows the usage as:
usage: <command> [Generic Options] [Job-Specific Options]

Generic options can be provided with -D <property=value>, so I tried the -D options on the command line, but they do not seem to have any effect. This is also suggested in:
https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/common/AbstractJob.html

After your suggestion I noticed one thing: I am currently passing arguments as -D<property=value> rather than -D <property=value>. I also tried with a space between -D and property=value, but then it gives an error like:
13/01/16 14:21:47 ERROR common.AbstractJob: Unexpected /test/points/matrixA while processing Job-Specific Options:

No such error occurs if I pass the arguments without a space after -D.

From Hadoop: The Definitive Guide: "Do not confuse setting Hadoop properties using the -D property=value option to GenericOptionsParser (and ToolRunner) with setting JVM system properties using the -Dproperty=value option to the java command. The syntax for JVM system properties does not allow any whitespace between the D and the property name, whereas GenericOptionsParser requires them to be separated by whitespace."

Hence I suppose that generic options should be passed as -D property=value rather than -Dproperty=value.

Additionally, I tried -Dmapred.max.split.size=10485760 on the command line, but again only a single map task started.

Please suggest.

-----Original Message-----
From: Sean Owen [mailto:srowen@gmail.com] 
Sent: Wednesday, January 16, 2013 1:23 PM
To: Mahout User List
Subject: Re: MatrixMultiplicationJob runs with 1 mapper only ?

It's up to Hadoop in the end.

Try calling FileInputFormat.setMaxInputSplitSize() with a smallish value, like your 10MB (10000000).

I don't know if Hadoop params can be set as sys properties like that anyway?
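
To see what a maximum split size would do for a plain file-based input, here is a minimal Java sketch of the split arithmetic. It assumes the rule used by FileInputFormat in the newer org.apache.hadoop.mapreduce API, where the split size is max(minSize, min(maxSize, blockSize)) and a trailing remainder of up to 10% is folded into the last split. The class below is a stand-alone illustration and does not call Hadoop. Whether MatrixMultiplicationJob's own input format honors these settings is not established by this message.

```java
// Sketch only: re-implements the assumed FileInputFormat split arithmetic.
final class SplitSizeSketch {
    static final double SPLIT_SLOP = 1.1; // assumed 10% slack on the last split

    // Split size is the block size clamped between the min and max settings.
    static long splitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Counts the splits produced for one file of the given length.
    static int countSplits(long fileBytes, long splitBytes) {
        int splits = 0;
        long remaining = fileBytes;
        while ((double) remaining / splitBytes > SPLIT_SLOP) {
            splits++;
            remaining -= splitBytes;
        }
        if (remaining > 0) {
            splits++; // the tail becomes one final split
        }
        return splits;
    }

    public static void main(String[] args) {
        long file = 104L * 1024 * 1024;  // assumed exact file size
        long block = 64L * 1024 * 1024;  // default block size from the thread
        long size = splitSize(block, 1L, 10_000_000L);
        System.out.println(size + " bytes per split, " + countSplits(file, size) + " splits");
    }
}
```

With a 10000000-byte maximum this arithmetic gives 11 splits for a 104 MB file, and 2 splits when no maximum is set on 64MB blocks.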


RE: MatrixMultiplicationJob runs with 1 mapper only ?

Posted by Stuti Awasthi <st...@hcl.com>.

Hi,

Following are the stats:
Cluster size -> 8 datanodes, each configured with a capacity of 10 map tasks, so total map task capacity = 80.

Attempt 1
1. Created a file with a matrix of dimension 100 x 10000.
2. Split this file into 20 part files.
3. Submitted it to the Mahout MatrixMultiplicationJob. It launched 20 map tasks and completed the job in 1 hour, 7 min and 48 sec.

Attempt 2
1. Same file.
2. Split this file into 50 part files.
3. Submitted it to the Mahout MatrixMultiplicationJob. It launched 50 map tasks. The job failed.

Error :
13/01/30 11:44:54 INFO mapred.JobClient: Job Failed: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201301291845_0004_m_000008

On investigating further, there are many failed map tasks, each with the same error:
Task attempt_201301291845_0004_m_000000_0 failed to report status for 604 seconds. Killing!

What went wrong? How can I improve the performance? I increased the number of files so that the work would be distributed across more mappers and use my cluster capacity, but the job failed.
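
The "failed to report status" message is consistent with Hadoop's task timeout (mapred.task.timeout, 600000 ms by default in Hadoop 1.x as far as I know): a task that does not read input, write output, or report progress within that window is killed. A minimal Java sketch of the usual remedy in custom map code follows. The Progress interface is a hypothetical stand-in for Hadoop's Reporter, and the loop is only a placeholder for long-running per-record work. It is not Mahout's mapper.

```java
// Sketch only: reports progress periodically during a long computation.
final class ProgressSketch {
    // Hypothetical stand-in for Hadoop's Reporter; only progress() matters here.
    interface Progress {
        void progress();
    }

    // Sums all pairwise products of a and b, pinging the reporter every `every` rows.
    static double process(double[] a, double[] b, int every, Progress reporter) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < b.length; j++) {
                sum += a[i] * b[j];
            }
            if ((i + 1) % every == 0) {
                reporter.progress(); // tells the framework the task is still alive
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        double total = process(new double[] {1, 2}, new double[] {3, 4}, 1, () -> calls[0]++);
        System.out.println(total + " after " + calls[0] + " progress calls");
    }
}
```

When the mapper code cannot be changed, raising mapred.task.timeout for the job is the other common option, at the cost of detecting truly hung tasks later.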

Any pointers will be useful.

Thanks
Stuti




Re: MatrixMultiplicationJob runs with 1 mapper only ?

Posted by satish verma <sa...@gmail.com>.

I think I was able to get multiple reducers by setting the property
mapred.reduce.tasks = 10 in the MR code.


Try setting this. If it does not work, I will check my code and let you know.
But it is doable.

The multiplication was tricky in the mapper part. The reducer part was
easy.



RE: MatrixMultiplicationJob runs with 1 mapper only ?

Posted by Stuti Awasthi <st...@hcl.com>.

Hey Satish,
Thanks a ton. It worked for me too. Is there any way to increase the number of reducers as well? Currently only a single reducer is running.

Thanks
Stuti


Re: MatrixMultiplicationJob runs with 1 mapper only ?

Posted by satish verma <sa...@gmail.com>.

I faced this problem too.

Split the sequence file that holds your data into
multiple files. Then run the matrix multiplication with the folder as input.
If the folder contains N sequence files, N mappers will be created.




Re: MatrixMultiplicationJob runs with 1 mapper only ?

Posted by Sean Owen <sr...@gmail.com>.

These are settings to Hadoop, not Mahout. You may need to set them in
your cluster config. They are still only suggestions.

The question still remains why you think you need several mappers. Why?

On Mon, Jan 28, 2013 at 1:28 PM, Stuti Awasthi <st...@hcl.com> wrote:
> Hi,
> I would like to consolidate all the steps I performed.
>
> Issue: the MatrixMultiplication example runs with only 1 map task.
>
> Steps:
> 1. I created a 104MB file, which is divided into 11 blocks of 10MB each. The file contains a 200x100000 matrix.
> 2. I exported $MAHOUT_OPTS as follows:
>           $   echo $MAHOUT_OPTS
>           -Dmapred.min.split.size=10485760 -Dmapred.map.tasks=7
> 3. Tried to execute the matrix multiplication example from the command line:
> mahout matrixmult --inputPathA /test/points/matrixA --numRowsA 200 --numColsA 100000 --inputPathB /test/points/matrixA --numRowsB 200 --numColsB 100000 --tempDir /test/temp
>
> When I check the JobTracker UI, it shows the following for the running job:
> Running Map Tasks : 1
> Occupied Map Slots: 1
>
> How can I distribute the map work across different mappers for the MatrixMultiplication job dynamically?
> Is it even possible for MatrixMultiplication to run distributed across multiple mappers, given that it internally uses CompositeInputFormat?
>
> Please Suggest
>
> Thanks
> Stuti
>
>
> -----Original Message-----
> From: Sean Owen [mailto:srowen@gmail.com]
> Sent: Wednesday, January 23, 2013 6:42 PM
> To: Mahout User List
> Subject: Re: MatrixMultiplicationJob runs with 1 mapper only ?
>
> Mappers are usually extremely fast since they start themselves on top of the data and their job is usually just parsing and emitting key-value pairs. Hadoop's choices are usually fine.
>
> If not, it is usually because the mapper is emitting far more data than it ingests. Are you computing some kind of Cartesian product of the input?
>
> That's slow no matter what. More mappers may increase parallelism, but it's still a lot of I/O. Avoid it if you can by sampling or pruning unimportant values. Otherwise, try to implement a Combiner.
> On Jan 23, 2013 12:04 PM, "Jonas Grote" <jf...@gmail.com> wrote:
>
>> I'd play with the mapred.map.tasks option. Setting it to something
>> bigger than 1 gave me performance improvements for various hadoop jobs
>> on my cluster.
>>
>>
>> 2013/1/16 Ashish <pa...@gmail.com>
>>
>> > I am afraid I don't know the answer. Need to experiment a bit more.
>> > I
>> have
>> > not used CompositeInputFormat so cannot comment.
>> >
>> > Probably, someone else on the ML(Mailing List) would be able to
>> > guide
>> here.
>> >
>> >
>> > On Wed, Jan 16, 2013 at 6:01 PM, Stuti Awasthi
>> > <st...@hcl.com>
>> > wrote:
>> >
>> > > Thanks Ashish,
>> > >
>> > > So according to the link if one is using CompositeInputFormat then
>> > > it
>> > will
>> > > take entire file as Input to a mapper without considering
>> > > InputSplits/blocksize.
>> > > If I am understanding it correctly then it is asking to break
>> > > [Original Input File]->[flie1,file2,....] .
>> > >
>> > > So If my file is  [/test/MatrixA] --> [/test/smallfiles/file1,
>> > > [/test/smallfiles/file2, [/test/smallfiles/file3...............  ]
>> > >
>> > > Now will the input path in MatrixMultiplicationJob will be
>> > > directory
>> path
>> > > : /test/smallfiles  ??
>> > >
>> > > Will breaking file in such manner will cause problem in
>> > > algorithmic execution of MR job. Im not sure if output will be correct .
>> > >
>> > > -----Original Message-----
>> > > From: Ashish [mailto:paliwalashish@gmail.com]
>> > > Sent: Wednesday, January 16, 2013 5:44 PM
>> > > To: user@mahout.apache.org
>> > > Subject: Re: MatrixMultiplicationJob runs with 1 mapper only ?
>> > >
>> > > MatrixMultiplicationJob internally sets InputFormat as
>> > CompositeInputFormat
>> > >
>> > > JobConf conf = new JobConf(initialConf,
>> > > MatrixMultiplicationJob.class);
>> > > conf.setInputFormat(CompositeInputFormat.class);
>> > >
>> > > and AFAIK, CompositeInputFormat ignores the splits. See this
>> > >
>> >
>> http://stackoverflow.com/questions/8654200/hadoop-file-splits-composit
>> einputformat-inner-join
>> > >
>> > > Unfortunately, I don't know any other alternative as of now.
>> > >
>> > >
>> > > On Wed, Jan 16, 2013 at 5:05 PM, Stuti Awasthi
>> > > <st...@hcl.com>
>> > > wrote:
>> > >
>> > > > The issue is that currently my matrix is of dimension
>> > > > (100x100k), Later it can be (1MX10M) or big.
>> > > >
>> > > > Even now if my job is running with the single mapper for
>> > > > (100x100k) and it is not able to complete the Job. As I
>> > > > mentioned map task just proceed to 0.99% and started spilling
>> > > > the map output. Hence I wanted to tune my job so that Mahout is
>> > > > able to complete the job and I can utilize my cluster resources.
>> > > >
>> > > > As MatrixMultiplicationJob is a MR, so it should be able to
>> > > > handle parallel map tasks. I am not sure if there is any
>> > > > algorithmic constraints due to which it runs only with single mapper ?
>> > > > I have taken the reference of thread so that I can set
>> > > > Configuration myself rather by getting it with getConf() but did
>> > > > not got any
>> success
>> > > >
>> > > >
>> http://lucene.472066.n3.nabble.com/Setting-Number-of-Mappers-and-Reduc
>> > > > ers-in-DistributedRowMatrix-Jobs-td888980.html
>> > > >
>> > > > Stuti
>> > > >
>> > > > -----Original Message-----
>> > > > From: Sean Owen [mailto:srowen@gmail.com]
>> > > > Sent: Wednesday, January 16, 2013 4:46 PM
>> > > > To: Mahout User List
>> > > > Subject: RE: MatrixMultiplicationJob runs with 1 mapper only ?
>> > > >
>> > > > Why do you need multiple mappers? Is one too slow? Many are not
>> > > > necessarily faster for small input On Jan 16, 2013 10:46 AM,
>> > > > "Stuti Awasthi" <st...@hcl.com> wrote:
>> > > >
>> > > > > Hi,
>> > > > > I tried to call programmatically also but facing same issue :
>> > > > > Only single MapTask is running and that too spilling the map
>> > > > > output
>> > > >  continuously.
>> > > > > Hence im not able to generate the output for large matrix
>> > > multiplication.
>> > > > >
>> > > > > Code Snippet :
>> > > > >
>> > > > > DistributedRowMatrix a = new DistributedRowMatrix(new
>> > > > > Path("/test/points/matrixA"), new
>> > > > > Path("/test/temp"),Integer.parseInt("100"),
>> > > > > Integer.parseInt("100000")); DistributedRowMatrix b = new
>> > > > > DistributedRowMatrix(new Path("/test/points/matrixA"),new
>> > > > > Path("tempDir"),Integer.parseInt("100"),
>> > > > > Integer.parseInt("100000"));
>> > > > > Configuration conf = new Configuration();
>> > > > > conf.set("fs.default.name", "hdfs://DS-1078D24B4736:10818");
>> > > > > conf.set("mapred.child.java.opts",
>> > > > > "-Xmx2048m"); conf.set("mapred.max.split.size","10485760");
>> > > > > a.setConf(conf);
>> > > > > b.setConf(conf);
>> > > > > a.times(b);
>> > > > >
>> > > > > Where Im going wrong. Any idea ?
>> > > > >
>> > > > > Thanks
>> > > > > Stuti
>> > > > > -----Original Message-----
>> > > > > From: Stuti Awasthi
>> > > > > Sent: Wednesday, January 16, 2013 2:55 PM
>> > > > > To: Mahout User List
>> > > > > Subject: RE: MatrixMultiplicationJob runs with 1 mapper only ?
>> > > > >
>> > > > > Hey Sean,
>> > > > > Thanks for response. MatrixMultiplicationJob help shows the
>> > > > > usage
>> > like
>> > > :
>> > > > > usage: <command> [Generic Options] [Job-Specific Options]
>> > > > >
>> > > > > Here Generic Option can be provided by -D <property=value>.
>> > > > > Hence I tried with commandline -D options but it seems like
>> > > > > that it is not making any effect.  It is also suggested in :
>> > > > >
>> > > > >
>> https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/maho
>> > > > > ut
>> > > > > /common/AbstractJob.html
>> > > > >
>> > > > > Here I have noted 1 thing after your suggestion  that
>> > > > > currently Im passing arguments like -D<property=value> rather
>> > > > > than -D <property=value>. I tried with space between -D and
>> > > > > property=value also but then its giving error
>> > > > > like:
>> > > > > 13/01/16 14:21:47 ERROR common.AbstractJob: Unexpected
>> > > > > /test/points/matrixA while processing Job-Specific Options:
>> > > > >
>> > > > > No such error comes if im passing the arguments without space
>> between
>> > > -D.
>> > > > >
>> > > > > By reference of Hadoop Definite Guide : "Do not confuse
>> > > > > setting Hadoop properties using the -D property=value option
>> > > > > to GenericOptionsParser (and
>> > > > > ToolRunner) with setting JVM system properties using the
>> > > > > -Dproperty=value option to the java command. The syntax for
>> > > > > JVM system properties does not allow any whitespace between
>> > > > > the D and the property name, whereas GenericOptionsParser
>> > > > > requires them to be separated by whitespace."
>> > > > >
>> > > > > Hence I suppose that GenericOptions should be parsed by -D
>> > > > > property=value rather than -Dproperty=value.
>> > > > >
>> > > > > Additionally I tried -Dmapred.max.split.size=10485760 also
>> > > > > through commandline but again only single MapTask started.
>> > > > >
>> > > > > Please Suggest
>> > > > >
>> > > > >
>> > > > > -----Original Message-----
>> > > > > From: Sean Owen [mailto:srowen@gmail.com]
>> > > > > Sent: Wednesday, January 16, 2013 1:23 PM
>> > > > > To: Mahout User List
>> > > > > Subject: Re: MatrixMultiplicationJob runs with 1 mapper only ?
>> > > > >
>> > > > > It's up to Hadoop in the end.
>> > > > >
>> > > > > Try calling FileInputFormat.setMaxInputSplitSize() with a
>> > > > > smallish value, like your 10MB (10000000).
>> > > > >
>> > > > > I don't know if Hadoop params can be set as sys properties
>> > > > > like
>> that
>> > > > > anyway?
>> > > > >
>> > > > > On Wed, Jan 16, 2013 at 7:48 AM, Stuti Awasthi
>> > > > > <st...@hcl.com>
>> > > > > wrote:
>> > > > > > Hi,
>> > > > > >
>> > > > > > I am trying to multiple dense matrix of size [100 x 100k].
>> > > > > > The size of
>> > > > > the file is 104MB and with default block sizeof 64MB only 2
>> > > > > blocks are getting created.
>> > > > > > So I reduced the block size to 10MB and now my file divided
>> > > > > > into
>> > > > > > 11
>> > > > > blocks across the cluster. Cluster size is 10 nodes with 1
>> > > > > NN/JT
>> and
>> > > > > 9 DN/TT.
>> > > > > >
>> > > > > > Everytime Im running Mahout MatrixMultiplicationJob through
>> > > > > > commandline,
>> > > > > I can see on JobTracker WebUI that only 1 map task is launched.
>> > > > > According to my understanding of Inputsplit, there should be
>> > > > > 11 map
>> > > > tasks launched.
>> > > > > > Apart from this Map task stays at 0.99% completion and in
>> > > > > > the Tasks Logs
>> > > > > , I can see that map task is spilling the map output.
>> > > > > >
>> > > > > > Mahout Command:
>> > > > > >
>> > > > > > mahout matrixmult -Dmapred.child.java.opts=-Xmx1024M
>> > > > > > -Dfs.inmemory.size.mb=200 -Dio.sort.factor=100
>> > > > > > -Dio.sort.mb=200
>> > > > > > -Dio.file.buffer.size=131072 --inputPathA /test/matrixA
>> --numRowsA
>> > > > > > 100 --numColsA 100000 --inputPathB /test/matrixA --numRowsB
>> > > > > > 100 --numColsB
>> > > > > > 100000 --tempDir /test/temp
>> > > > > >
>> > > > > > Now here I want to know that why only 1 map task is launched
>> > > > > > everytime
>> > > > > and how can I performance tune the cluster so that I can
>> > > > > perform
>> the
>> > > > > dense matrix multiplication of the order [90K x 1 Million] .
>> > > > > >
>> > > > > > Thanks
>> > > > > > Stuti
>> > > > > >
>> > > > > >
>> > > > > > ::DISCLAIMER::
>> > > > > >
>> ------------------------------------------------------------------
>> > > > > > --
>> > > > > > --
>> > > > > >
>> ------------------------------------------------------------------
>> > > > > > --
>> > > > > > --
>> > > > > > --------
>> > > > > >
>> > > > > > The contents of this e-mail and any attachment(s) are
>> confidential
>> > > > > > and
>> > > > > intended for the named recipient(s) only.
>> > > > > > E-mail transmission is not guaranteed to be secure or
>> > > > > > error-free as information could be intercepted, corrupted,
>> > > > > > lost, destroyed, arrive late or incomplete, or may contain
>> > > > > > viruses in
>> transmission.
>> > > > > > The e mail
>> > > > > and its contents (with or without referred errors) shall
>> > > > > therefore not attach any liability on the originator or HCL or
>> > > > > its
>> affiliates.
>> > > > > > Views or opinions, if any, presented in this email are
>> > > > > > solely those of the author and may not necessarily reflect
>> > > > > > the views or opinions of HCL or its affiliates. Any form of
>> > > > > > reproduction, dissemination, copying, disclosure,
>> > > > > > modification, distribution
>> and
>> > > > > > / or publication of
>> > > > > this message without the prior written consent of authorized
>> > > > > representative of HCL is strictly prohibited. If you have
>> > > > > received this email in error please delete it and notify the
>> > > > > sender
>> > immediately.
>> > > > > > Before opening any email and/or attachments, please check
>> > > > > > them
>> for
>> > > > > viruses and other defects.
>> > > > > >
>> > > > > >
>> ------------------------------------------------------------------
>> > > > > > --
>> > > > > > --
>> > > > > >
>> ------------------------------------------------------------------
>> > > > > > --
>> > > > > > --
>> > > > > > --------
>> > > > >
>> > > >
>> > >
>> > >
>> > >
>> > > --
>> > > thanks
>> > > ashish
>> > >
>> > > Blog: http://www.ashishpaliwal.com/blog My Photo Galleries:
>> > > http://www.pbase.com/ashishpaliwal
>> > >
>> >
>> >
>> >
>> > --
>> > thanks
>> > ashish
>> >
>> > Blog: http://www.ashishpaliwal.com/blog My Photo Galleries:
>> > http://www.pbase.com/ashishpaliwal
>> >
>>
>
>
> ::DISCLAIMER::
> ----------------------------------------------------------------------------------------------------------------------------------------------------
>
> The contents of this e-mail and any attachment(s) are confidential and intended for the named recipient(s) only.
> E-mail transmission is not guaranteed to be secure or error-free as information could be intercepted, corrupted,
> lost, destroyed, arrive late or incomplete, or may contain viruses in transmission. The e mail and its contents
> (with or without referred errors) shall therefore not attach any liability on the originator or HCL or its affiliates.
> Views or opinions, if any, presented in this email are solely those of the author and may not necessarily reflect the
> views or opinions of HCL or its affiliates. Any form of reproduction, dissemination, copying, disclosure, modification,
> distribution and / or publication of this message without the prior written consent of authorized representative of
> HCL is strictly prohibited. If you have received this email in error please delete it and notify the sender immediately.
> Before opening any email and/or attachments, please check them for viruses and other defects.
>
> ----------------------------------------------------------------------------------------------------------------------------------------------------

RE: MatrixMultiplicationJob runs with 1 mapper only ?

Posted by Stuti Awasthi <st...@hcl.com>.

Hi,
Let me consolidate all the steps I have performed so far.

Issue: the MatrixMultiplication example runs with only 1 map task.

Steps:
1. I created a 104MB file, which is divided into 11 blocks of 10MB each. The file contains a 200x100000 matrix.
2. I exported $MAHOUT_OPTS as follows:
          $   echo $MAHOUT_OPTS
          -Dmapred.min.split.size=10485760 -Dmapred.map.tasks=7
3. I ran the matrix multiplication example from the command line:
mahout matrixmult --inputPathA /test/points/matrixA --numRowsA 200 --numColsA 100000 --inputPathB /test/points/matrixA --numRowsB 200 --numColsB 100000 --tempDir /test/temp

When I check the JobTracker UI, it shows the following for the running job:
Running Map Tasks : 1
Occupied Map Slots: 1

How can I distribute the map work across several mappers for the MatrixMultiplication job dynamically?
Is it even possible for MatrixMultiplication to run distributed across multiple mappers, given that it internally uses CompositeInputFormat?

Please suggest.

Thanks
Stuti
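
A possible explanation for the single map task, sketched under assumptions: as far as I can tell from the Hadoop 1.x source, CompositeInputFormat raises mapred.min.split.size to Long.MAX_VALUE before asking its child input formats for splits, so each input file becomes exactly one split regardless of block size. The class and method names below are hypothetical, and the arithmetic ignores the 10% slop Hadoop allows on the last split.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical illustration of "one split per file"; not Hadoop or Mahout code.
final class CompositeSplitSketch {
    // Assumption: the join input format forces this minimum split size.
    static final long FORCED_MIN_SPLIT = Long.MAX_VALUE;

    // Counts map tasks when every file is split independently.
    static int mapTasks(List<Long> fileSizes, long minSplitSize, long blockSize) {
        int tasks = 0;
        for (long size : fileSizes) {
            long splitSize = Math.max(minSplitSize, Math.min(size, blockSize));
            // Ceiling division written to avoid overflow when splitSize is huge.
            tasks += (int) (size / splitSize) + (size % splitSize == 0 ? 0 : 1);
        }
        return tasks;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        List<Long> oneBigFile = Collections.singletonList(104 * mb);
        List<Long> elevenParts = Collections.nCopies(11, 10 * mb);

        // One 104MB file: a single map task, whatever the block size.
        System.out.println(mapTasks(oneBigFile, FORCED_MIN_SPLIT, 10 * mb));
        // Eleven part files: one map task per file.
        System.out.println(mapTasks(elevenParts, FORCED_MIN_SPLIT, 10 * mb));
    }
}
```

If that reading is right, the number of mappers follows the number of part files rather than the number of blocks, so writing each matrix as several part files (partitioned identically for both inputs) would be the thing to try. I have not verified this against MatrixMultiplicationJob.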


-----Original Message-----
From: Sean Owen [mailto:srowen@gmail.com] 
Sent: Wednesday, January 23, 2013 6:42 PM
To: Mahout User List
Subject: Re: MatrixMultiplicationJob runs with 1 mapper only ?

Mappers are usually extremely fast since they start themselves on top of the data and their job is usually just parsing and emitting key value pairs. Hadoop's choices are usually fine.

If not, it is usually because the mapper is emitting far more data than it ingests. Are you computing some kind of Cartesian product of input?

That's slow no matter what. More mappers may increase parallelism but it's still a lot of I/O. Avoid it if you can by sampling or pruning unimportant values. Otherwise, try to implement a Combiner.
On Jan 23, 2013 12:04 PM, "Jonas Grote" <jf...@gmail.com> wrote:

> I'd play with the mapred.map.tasks option. Setting it to something 
> bigger than 1 gave me performance improvements for various hadoop jobs 
> on my cluster.
>
>
> 2013/1/16 Ashish <pa...@gmail.com>
>
> > I am afraid I don't know the answer. Need to experiment a bit more. I have
> > not used CompositeInputFormat so cannot comment.
> >
> > Probably, someone else on the ML (Mailing List) would be able to guide here.
> >
> >
> > On Wed, Jan 16, 2013 at 6:01 PM, Stuti Awasthi 
> > <st...@hcl.com>
> > wrote:
> >
> > > Thanks Ashish,
> > >
> > > So according to the link, if one is using CompositeInputFormat then it will
> > > take the entire file as input to a mapper without considering
> > > InputSplits/block size.
> > > If I understand it correctly, it is asking to break
> > > [Original Input File] -> [file1, file2, ....].
> > >
> > > So if my file is [/test/MatrixA] --> [/test/smallfiles/file1,
> > > /test/smallfiles/file2, /test/smallfiles/file3, ...]
> > >
> > > Will the input path in MatrixMultiplicationJob then be the directory path
> > > /test/smallfiles?
> > >
> > > Will breaking the file up in this manner cause a problem in the
> > > algorithmic execution of the MR job? I'm not sure the output will be correct.
> > >
> > > -----Original Message-----
> > > From: Ashish [mailto:paliwalashish@gmail.com]
> > > Sent: Wednesday, January 16, 2013 5:44 PM
> > > To: user@mahout.apache.org
> > > Subject: Re: MatrixMultiplicationJob runs with 1 mapper only ?
> > >
> > > MatrixMultiplicationJob internally sets InputFormat as CompositeInputFormat
> > >
> > > JobConf conf = new JobConf(initialConf, MatrixMultiplicationJob.class);
> > > conf.setInputFormat(CompositeInputFormat.class);
> > >
> > > and AFAIK, CompositeInputFormat ignores the splits. See this
> > >
> > > http://stackoverflow.com/questions/8654200/hadoop-file-splits-compositeinputformat-inner-join
> > >
> > > Unfortunately, I don't know any other alternative as of now.
> > >
> > >
> > > On Wed, Jan 16, 2013 at 5:05 PM, Stuti Awasthi 
> > > <st...@hcl.com>
> > > wrote:
> > >
> > > > The issue is that currently my matrix is of dimension
> > > > (100x100k); later it can be (1Mx10M) or bigger.
> > > >
> > > > Even now, my job runs with a single mapper for
> > > > (100x100k) and it is not able to complete. As I
> > > > mentioned, the map task just proceeds to 0.99% and starts spilling
> > > > the map output. Hence I want to tune my job so that Mahout is
> > > > able to complete it and I can utilize my cluster resources.
> > > >
> > > > As MatrixMultiplicationJob is an MR job, it should be able to
> > > > handle parallel map tasks. I am not sure if there is any
> > > > algorithmic constraint due to which it runs with only a single mapper?
> > > > I followed the thread below so that I can set the
> > > > Configuration myself rather than getting it with getConf(), but did
> > > > not have any success:
> > > >
> > > > http://lucene.472066.n3.nabble.com/Setting-Number-of-Mappers-and-Reducers-in-DistributedRowMatrix-Jobs-td888980.html
> > > >
> > > > Stuti
> > > >
> > > > -----Original Message-----
> > > > From: Sean Owen [mailto:srowen@gmail.com]
> > > > Sent: Wednesday, January 16, 2013 4:46 PM
> > > > To: Mahout User List
> > > > Subject: RE: MatrixMultiplicationJob runs with 1 mapper only ?
> > > >
> > > > Why do you need multiple mappers? Is one too slow? Many are not
> > > > necessarily faster for small input.
> > > >
> > > > On Jan 16, 2013 10:46 AM, "Stuti Awasthi" <st...@hcl.com> wrote:
> > > >
> > > > > Hi,
> > > > > I tried to call it programmatically as well but am facing the same issue:
> > > > > only a single MapTask is running, and it too keeps spilling the map
> > > > > output continuously.
> > > > > Hence I'm not able to generate the output for large matrix multiplication.
> > > > >
> > > > > Code Snippet :
> > > > >
> > > > > DistributedRowMatrix a = new DistributedRowMatrix(
> > > > >     new Path("/test/points/matrixA"), new Path("/test/temp"), 100, 100000);
> > > > > DistributedRowMatrix b = new DistributedRowMatrix(
> > > > >     new Path("/test/points/matrixA"), new Path("tempDir"), 100, 100000);
> > > > > Configuration conf = new Configuration();
> > > > > conf.set("fs.default.name", "hdfs://DS-1078D24B4736:10818");
> > > > > conf.set("mapred.child.java.opts", "-Xmx2048m");
> > > > > conf.set("mapred.max.split.size", "10485760");
> > > > > a.setConf(conf);
> > > > > b.setConf(conf);
> > > > > a.times(b);
> > > > >
> > > > > Where Im going wrong. Any idea ?
> > > > >
> > > > > Thanks
> > > > > Stuti
> > > > > -----Original Message-----
> > > > > From: Stuti Awasthi
> > > > > Sent: Wednesday, January 16, 2013 2:55 PM
> > > > > To: Mahout User List
> > > > > Subject: RE: MatrixMultiplicationJob runs with 1 mapper only ?
> > > > >
> > > > > Hey Sean,
> > > > > Thanks for the response. MatrixMultiplicationJob help shows the usage like:
> > > > > usage: <command> [Generic Options] [Job-Specific Options]
> > > > >
> > > > > Here Generic Options can be provided with -D <property=value>.
> > > > > Hence I tried the command line -D options, but it seems
> > > > > that they are not having any effect. It is also suggested in:
> > > > >
> > > > > https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/common/AbstractJob.html
> > > > >
> > > > > After your suggestion I noticed one thing:
> > > > > currently I'm passing arguments like -D<property=value> rather
> > > > > than -D <property=value>. I also tried with a space between -D and
> > > > > property=value, but then it gives an error like:
> > > > > 13/01/16 14:21:47 ERROR common.AbstractJob: Unexpected
> > > > > /test/points/matrixA while processing Job-Specific Options:
> > > > >
> > > > > No such error comes if I pass the arguments without a space after -D.
> > > > >
> > > > > Per Hadoop: The Definitive Guide: "Do not confuse
> > > > > setting Hadoop properties using the -D property=value option
> > > > > to GenericOptionsParser (and ToolRunner) with setting JVM
> > > > > system properties using the -Dproperty=value option to the
> > > > > java command. The syntax for JVM system properties does not
> > > > > allow any whitespace between the D and the property name,
> > > > > whereas GenericOptionsParser requires them to be separated
> > > > > by whitespace."
> > > > >
> > > > > Hence I suppose that generic options should be passed as -D
> > > > > property=value rather than -Dproperty=value.
> > > > >
> > > > > Additionally, I tried -Dmapred.max.split.size=10485760
> > > > > through the command line, but again only a single MapTask started.
> > > > >
> > > > > Please Suggest
> > > > >
> > > > >
> > > > > -----Original Message-----
> > > > > From: Sean Owen [mailto:srowen@gmail.com]
> > > > > Sent: Wednesday, January 16, 2013 1:23 PM
> > > > > To: Mahout User List
> > > > > Subject: Re: MatrixMultiplicationJob runs with 1 mapper only ?
> > > > >
> > > > > It's up to Hadoop in the end.
> > > > >
> > > > > Try calling FileInputFormat.setMaxInputSplitSize() with a 
> > > > > smallish value, like your 10MB (10000000).
> > > > >
> > > > > I don't know if Hadoop params can be set as sys properties
> > > > > like that anyway?
> > > > >
> > > > > On Wed, Jan 16, 2013 at 7:48 AM, Stuti Awasthi 
> > > > > <st...@hcl.com>
> > > > > wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I am trying to multiply a dense matrix of size [100 x 100k].
> > > > > > The size of the file is 104MB and with the default block size of
> > > > > > 64MB only 2 blocks are created.
> > > > > > So I reduced the block size to 10MB and now my file is divided
> > > > > > into 11 blocks across the cluster. Cluster size is 10 nodes with
> > > > > > 1 NN/JT and 9 DN/TT.
> > > > > >
> > > > > > Every time I run Mahout MatrixMultiplicationJob through the
> > > > > > command line, I can see on the JobTracker web UI that only 1 map
> > > > > > task is launched. According to my understanding of InputSplit,
> > > > > > there should be 11 map tasks launched.
> > > > > > Apart from this, the map task stays at 0.99% completion and in
> > > > > > the task logs I can see that the map task is spilling the map output.
> > > > > >
> > > > > > Mahout Command:
> > > > > >
> > > > > > mahout matrixmult -Dmapred.child.java.opts=-Xmx1024M -Dfs.inmemory.size.mb=200 -Dio.sort.factor=100 -Dio.sort.mb=200 -Dio.file.buffer.size=131072 --inputPathA /test/matrixA --numRowsA 100 --numColsA 100000 --inputPathB /test/matrixA --numRowsB 100 --numColsB 100000 --tempDir /test/temp
> > > > > >
> > > > > > I want to know why only 1 map task is launched every time,
> > > > > > and how I can performance-tune the cluster so that I can
> > > > > > perform dense matrix multiplication of the order [90K x 1 Million].
> > > > > >
> > > > > > Thanks
> > > > > > Stuti
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > thanks
> > > ashish
> > >
> > > Blog: http://www.ashishpaliwal.com/blog My Photo Galleries: 
> > > http://www.pbase.com/ashishpaliwal
> > >
> >
> >
> >
> > --
> > thanks
> > ashish
> >
> > Blog: http://www.ashishpaliwal.com/blog My Photo Galleries: 
> > http://www.pbase.com/ashishpaliwal
> >
>



Re: MatrixMultiplicationJob runs with 1 mapper only ?

Posted by Sean Owen <sr...@gmail.com>.

Mappers are usually extremely fast since they start themselves on top of
the data and their job is usually just parsing and emitting key value
pairs. Hadoop's choices are usually fine.

If not, it is usually because the mapper is emitting far more data than it
ingests. Are you computing some kind of Cartesian product of input?

That's slow no matter what. More mappers may increase parallelism but it's
still a lot of I/O. Avoid it if you can by sampling or pruning unimportant
values. Otherwise, try to implement a Combiner.
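
To put rough numbers on "emitting far more data than it ingests", here is a back-of-the-envelope sketch. It assumes that each map call emits one scaled copy of the B row for every non-zero entry of the matching A row, which is how I read MatrixMultiplicationJob, and that every value is an 8-byte double with no framing overhead. The class and method names are hypothetical.

```java
// Back-of-the-envelope estimate only; not Mahout code.
final class MapOutputVolumeSketch {
    static final long BYTES_PER_DOUBLE = 8L;

    // Raw size of one dense rows x cols matrix.
    static long inputBytes(long rows, long cols) {
        return Math.multiplyExact(Math.multiplyExact(rows, cols), BYTES_PER_DOUBLE);
    }

    // Assumed map output: for each row, colsA vectors of colsB doubles each.
    static long mapOutputBytes(long rows, long colsA, long colsB) {
        long perRow = Math.multiplyExact(Math.multiplyExact(colsA, colsB), BYTES_PER_DOUBLE);
        return Math.multiplyExact(rows, perRow);
    }

    public static void main(String[] args) {
        long rows = 100L;
        long cols = 100_000L;
        long in = inputBytes(rows, cols);
        long out = mapOutputBytes(rows, cols, cols);
        System.out.println("input bytes:      " + in);
        System.out.println("map output bytes: " + out);
        System.out.println("blow-up factor:   " + (out / in));
    }
}
```

Under those assumptions a dense 100 x 100k input of about 80MB turns into about 8TB of map output, a factor of 100000, which would be consistent with a map task that sits near completion while it keeps spilling.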
On Jan 23, 2013 12:04 PM, "Jonas Grote" <jf...@gmail.com> wrote:

> I'd play with the mapred.map.tasks option. Setting it to something bigger
> than 1 gave me performance improvements for various hadoop jobs on my
> cluster.
>
>
> 2013/1/16 Ashish <pa...@gmail.com>
>
> > I am afraid I don't know the answer. Need to experiment a bit more. I have
> > not used CompositeInputFormat so cannot comment.
> >
> > Probably, someone else on the ML (Mailing List) would be able to guide here.
> >
> >
> > On Wed, Jan 16, 2013 at 6:01 PM, Stuti Awasthi <st...@hcl.com>
> > wrote:
> >
> > > Thanks Ashish,
> > >
> > > So according to the link, if one is using CompositeInputFormat then it will
> > > take the entire file as input to a mapper without considering
> > > InputSplits/block size.
> > > If I understand it correctly, it is asking to break
> > > [Original Input File] -> [file1, file2, ....].
> > >
> > > So if my file is [/test/MatrixA] --> [/test/smallfiles/file1,
> > > /test/smallfiles/file2, /test/smallfiles/file3, ...]
> > >
> > > Will the input path in MatrixMultiplicationJob then be the directory path
> > > /test/smallfiles?
> > >
> > > Will breaking the file up in this manner cause a problem in the
> > > algorithmic execution of the MR job? I'm not sure the output will be correct.
> > >
> > > -----Original Message-----
> > > From: Ashish [mailto:paliwalashish@gmail.com]
> > > Sent: Wednesday, January 16, 2013 5:44 PM
> > > To: user@mahout.apache.org
> > > Subject: Re: MatrixMultiplicationJob runs with 1 mapper only ?
> > >
> > > MatrixMultiplicationJob internally sets InputFormat as CompositeInputFormat
> > >
> > > JobConf conf = new JobConf(initialConf, MatrixMultiplicationJob.class);
> > > conf.setInputFormat(CompositeInputFormat.class);
> > >
> > > and AFAIK, CompositeInputFormat ignores the splits. See this
> > >
> > > http://stackoverflow.com/questions/8654200/hadoop-file-splits-compositeinputformat-inner-join
> > >
> > > Unfortunately, I don't know any other alternative as of now.
> > >
> > >
> > > On Wed, Jan 16, 2013 at 5:05 PM, Stuti Awasthi <st...@hcl.com>
> > > wrote:
> > >
> > > > The issue is that currently my matrix is of dimension (100x100k);
> > > > later it can be (1Mx10M) or bigger.
> > > >
> > > > Even now, my job runs with a single mapper for (100x100k)
> > > > and it is not able to complete. As I mentioned, the map task just
> > > > proceeds to 0.99% and starts spilling the map output. Hence I want
> > > > to tune my job so that Mahout is able to complete it and I can
> > > > utilize my cluster resources.
> > > >
> > > > As MatrixMultiplicationJob is an MR job, it should be able to handle
> > > > parallel map tasks. I am not sure if there is any algorithmic
> > > > constraint due to which it runs with only a single mapper?
> > > > I followed the thread below so that I can set the Configuration
> > > > myself rather than getting it with getConf(), but did not have any
> > > > success:
> > > >
> > > > http://lucene.472066.n3.nabble.com/Setting-Number-of-Mappers-and-Reducers-in-DistributedRowMatrix-Jobs-td888980.html
> > > >
> > > > Stuti
> > > >
> > > > -----Original Message-----
> > > > From: Sean Owen [mailto:srowen@gmail.com]
> > > > Sent: Wednesday, January 16, 2013 4:46 PM
> > > > To: Mahout User List
> > > > Subject: RE: MatrixMultiplicationJob runs with 1 mapper only ?
> > > >
> > > > Why do you need multiple mappers? Is one too slow? Many are not
> > > > necessarily faster for small input On Jan 16, 2013 10:46 AM, "Stuti
> > > > Awasthi" <st...@hcl.com> wrote:
> > > >
> > > > > Hi,
> > > > > I tried to call programmatically also but facing same issue : Only
> > > > > single MapTask is running and that too spilling the map output
> > > >  continuously.
> > > > > Hence im not able to generate the output for large matrix
> > > multiplication.
> > > > >
> > > > > Code Snippet :
> > > > >
> > > > > DistributedRowMatrix a = new DistributedRowMatrix(new
> > > > > Path("/test/points/matrixA"), new
> > > > > Path("/test/temp"),Integer.parseInt("100"),
> > > > > Integer.parseInt("100000")); DistributedRowMatrix b = new
> > > > > DistributedRowMatrix(new Path("/test/points/matrixA"),new
> > > > > Path("tempDir"),Integer.parseInt("100"),
> > > > > Integer.parseInt("100000"));
> > > > > Configuration conf = new Configuration();
> > > > > conf.set("fs.default.name", "hdfs://DS-1078D24B4736:10818");
> > > > > conf.set("mapred.child.java.opts",
> > > > > "-Xmx2048m"); conf.set("mapred.max.split.size","10485760");
> > > > > a.setConf(conf);
> > > > > b.setConf(conf);
> > > > > a.times(b);
> > > > >
> > > > > Where Im going wrong. Any idea ?
> > > > >
> > > > > Thanks
> > > > > Stuti
> > > > > -----Original Message-----
> > > > > From: Stuti Awasthi
> > > > > Sent: Wednesday, January 16, 2013 2:55 PM
> > > > > To: Mahout User List
> > > > > Subject: RE: MatrixMultiplicationJob runs with 1 mapper only ?
> > > > >
> > > > > Hey Sean,
> > > > > Thanks for response. MatrixMultiplicationJob help shows the usage
> > like
> > > :
> > > > > usage: <command> [Generic Options] [Job-Specific Options]
> > > > >
> > > > > Here Generic Option can be provided by -D <property=value>. Hence I
> > > > > tried with commandline -D options but it seems like that it is not
> > > > > making any effect.  It is also suggested in :
> > > > >
> > > > >
> https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/maho
> > > > > ut
> > > > > /common/AbstractJob.html
> > > > >
> > > > > Here I have noted 1 thing after your suggestion  that currently Im
> > > > > passing arguments like -D<property=value> rather than -D
> > > > > <property=value>. I tried with space between -D and property=value
> > > > > also but then its giving error
> > > > > like:
> > > > > 13/01/16 14:21:47 ERROR common.AbstractJob: Unexpected
> > > > > /test/points/matrixA while processing Job-Specific Options:
> > > > >
> > > > > No such error comes if im passing the arguments without space
> between
> > > -D.
> > > > >
> > > > > By reference of Hadoop Definite Guide : "Do not confuse setting
> > > > > Hadoop properties using the -D property=value option to
> > > > > GenericOptionsParser (and
> > > > > ToolRunner) with setting JVM system properties using the
> > > > > -Dproperty=value option to the java command. The syntax for JVM
> > > > > system properties does not allow any whitespace between the D and
> > > > > the property name, whereas GenericOptionsParser requires them to be
> > > > > separated by whitespace."
> > > > >
> > > > > Hence I suppose that GenericOptions should be parsed by -D
> > > > > property=value rather than -Dproperty=value.
> > > > >
> > > > > Additionally I tried -Dmapred.max.split.size=10485760 also through
> > > > > commandline but again only single MapTask started.
> > > > >
> > > > > Please Suggest
> > > > >
> > > > >
> > > > > -----Original Message-----
> > > > > From: Sean Owen [mailto:srowen@gmail.com]
> > > > > Sent: Wednesday, January 16, 2013 1:23 PM
> > > > > To: Mahout User List
> > > > > Subject: Re: MatrixMultiplicationJob runs with 1 mapper only ?
> > > > >
> > > > > It's up to Hadoop in the end.
> > > > >
> > > > > Try calling FileInputFormat.setMaxInputSplitSize() with a smallish
> > > > > value, like your 10MB (10000000).
> > > > >
> > > > > I don't know if Hadoop params can be set as sys properties like
> that
> > > > > anyway?
> > > > >
> > > > > On Wed, Jan 16, 2013 at 7:48 AM, Stuti Awasthi
> > > > > <st...@hcl.com>
> > > > > wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I am trying to multiple dense matrix of size [100 x 100k]. The
> > > > > > size of
> > > > > the file is 104MB and with default block sizeof 64MB only 2 blocks
> > > > > are getting created.
> > > > > > So I reduced the block size to 10MB and now my file divided into
> > > > > > 11
> > > > > blocks across the cluster. Cluster size is 10 nodes with 1 NN/JT
> and
> > > > > 9 DN/TT.
> > > > > >
> > > > > > Everytime Im running Mahout MatrixMultiplicationJob through
> > > > > > commandline,
> > > > > I can see on JobTracker WebUI that only 1 map task is launched.
> > > > > According to my understanding of Inputsplit, there should be 11 map
> > > > tasks launched.
> > > > > > Apart from this Map task stays at 0.99% completion and in the
> > > > > > Tasks Logs
> > > > > , I can see that map task is spilling the map output.
> > > > > >
> > > > > > Mahout Command:
> > > > > >
> > > > > > mahout matrixmult -Dmapred.child.java.opts=-Xmx1024M
> > > > > > -Dfs.inmemory.size.mb=200 -Dio.sort.factor=100 -Dio.sort.mb=200
> > > > > > -Dio.file.buffer.size=131072 --inputPathA /test/matrixA
> --numRowsA
> > > > > > 100 --numColsA 100000 --inputPathB /test/matrixA --numRowsB 100
> > > > > > --numColsB
> > > > > > 100000 --tempDir /test/temp
> > > > > >
> > > > > > Now here I want to know that why only 1 map task is launched
> > > > > > everytime
> > > > > and how can I performance tune the cluster so that I can perform
> the
> > > > > dense matrix multiplication of the order [90K x 1 Million] .
> > > > > >
> > > > > > Thanks
> > > > > > Stuti
> > > > > >
> > > > > >
> > > > > > ::DISCLAIMER::
> > > > > >
> ------------------------------------------------------------------
> > > > > > --
> > > > > > --
> > > > > >
> ------------------------------------------------------------------
> > > > > > --
> > > > > > --
> > > > > > --------
> > > > > >
> > > > > > The contents of this e-mail and any attachment(s) are
> confidential
> > > > > > and
> > > > > intended for the named recipient(s) only.
> > > > > > E-mail transmission is not guaranteed to be secure or error-free
> > > > > > as information could be intercepted, corrupted, lost, destroyed,
> > > > > > arrive late or incomplete, or may contain viruses in
> transmission.
> > > > > > The e mail
> > > > > and its contents (with or without referred errors) shall therefore
> > > > > not attach any liability on the originator or HCL or its
> affiliates.
> > > > > > Views or opinions, if any, presented in this email are solely
> > > > > > those of the author and may not necessarily reflect the views or
> > > > > > opinions of HCL or its affiliates. Any form of reproduction,
> > > > > > dissemination, copying, disclosure, modification, distribution
> and
> > > > > > / or publication of
> > > > > this message without the prior written consent of authorized
> > > > > representative of HCL is strictly prohibited. If you have received
> > > > > this email in error please delete it and notify the sender
> > immediately.
> > > > > > Before opening any email and/or attachments, please check them
> for
> > > > > viruses and other defects.
> > > > > >
> > > > > >
> ------------------------------------------------------------------
> > > > > > --
> > > > > > --
> > > > > >
> ------------------------------------------------------------------
> > > > > > --
> > > > > > --
> > > > > > --------
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > thanks
> > > ashish
> > >
> > > Blog: http://www.ashishpaliwal.com/blog
> > > My Photo Galleries: http://www.pbase.com/ashishpaliwal
> > >
> >
> >
> >
> > --
> > thanks
> > ashish
> >
> > Blog: http://www.ashishpaliwal.com/blog
> > My Photo Galleries: http://www.pbase.com/ashishpaliwal
> >
>

Re: MatrixMultiplicationJob runs with 1 mapper only ?

Posted by Jonas Grote <jf...@gmail.com>.

I'd play with the mapred.map.tasks option. Setting it to something bigger
than 1 gave me performance improvements for various Hadoop jobs on my
cluster.
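
A note on why mapred.map.tasks acts only as a hint. The following is a minimal Java sketch, assuming the split-size rule of the old-API FileInputFormat (org.apache.hadoop.mapred) is max(minSize, min(totalSize / requestedMaps, blockSize)), with a 10% slack allowed on the last split. The class and method names are illustrative and are not Hadoop APIs. The last case in main additionally assumes that CompositeInputFormat raises the minimum split size to Long.MAX_VALUE, which would be consistent with the single map task reported in this thread; that detail is from memory of the Hadoop 1.x source and should be checked against your version.

```java
final class SplitEstimate {
    // Assumed slack factor: the final split may be up to 10% larger than splitSize.
    static final double SPLIT_SLOP = 1.1;

    /** Split size under the assumed rule: max(minSize, min(totalSize / requestedMaps, blockSize)). */
    static long splitSize(long totalSize, int requestedMaps, long minSize, long blockSize) {
        long goalSize = totalSize / Math.max(1, requestedMaps);
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    /** Number of splits produced for a single file of totalSize bytes. */
    static int countSplits(long totalSize, long splitSize) {
        int splits = 0;
        long remaining = totalSize;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining > 0) {
            splits++; // whatever is left becomes the last split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileSize = 104 * mb; // file size mentioned in the thread

        // Plain FileInputFormat, 10MB blocks, no hint.
        System.out.println(countSplits(fileSize, splitSize(fileSize, 1, 1, 10 * mb)));
        // Plain FileInputFormat, default 64MB blocks, no hint.
        System.out.println(countSplits(fileSize, splitSize(fileSize, 1, 1, 64 * mb)));
        // 64MB blocks with mapred.map.tasks=20 as a hint.
        System.out.println(countSplits(fileSize, splitSize(fileSize, 20, 1, 64 * mb)));
        // Assumed CompositeInputFormat behaviour: minimum split size forced to Long.MAX_VALUE.
        System.out.println(countSplits(fileSize, splitSize(fileSize, 20, Long.MAX_VALUE, 10 * mb)));
    }
}
```

Under these assumptions, the 104MB file with 10MB blocks yields 11 splits with a plain FileInputFormat, but only one split per file once the minimum split size is forced up, regardless of the hint.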


2013/1/16 Ashish <pa...@gmail.com>

> I am afraid I don't know the answer. I need to experiment a bit more. I have
> not used CompositeInputFormat, so I cannot comment.
>
> Probably someone else on the ML (mailing list) will be able to guide you here.

Re: MatrixMultiplicationJob runs with 1 mapper only ?

Posted by Ashish <pa...@gmail.com>.

I am afraid I don't know the answer. I need to experiment a bit more. I have
not used CompositeInputFormat, so I cannot comment.

Probably someone else on the ML (mailing list) will be able to guide you here.


On Wed, Jan 16, 2013 at 6:01 PM, Stuti Awasthi <st...@hcl.com> wrote:

> Thanks Ashish,
>
> So according to the link, if one is using CompositeInputFormat then it will
> take the entire file as input to a mapper without considering
> InputSplits/block size.
> If I understand it correctly, it suggests breaking up the file:
> [Original Input File] -> [file1, file2, ...].
>
> So my file would become [/test/MatrixA] --> [/test/smallfiles/file1,
> /test/smallfiles/file2, /test/smallfiles/file3, ...]
>
> Will the input path for MatrixMultiplicationJob then be the directory path
> /test/smallfiles?
>
> Will breaking up the file in this manner cause problems in the algorithmic
> execution of the MR job? I'm not sure the output will be correct.



-- 
thanks
ashish

Blog: http://www.ashishpaliwal.com/blog
My Photo Galleries: http://www.pbase.com/ashishpaliwal

RE: MatrixMultiplicationJob runs with 1 mapper only ?

Posted by Stuti Awasthi <st...@hcl.com>.

Thanks Ashish,

So according to the link, if one is using CompositeInputFormat then it will take the entire file as input to a mapper without considering InputSplits/block size.
If I understand it correctly, it suggests breaking up the file: [Original Input File] -> [file1, file2, ...].

So my file would become [/test/MatrixA] --> [/test/smallfiles/file1, /test/smallfiles/file2, /test/smallfiles/file3, ...]

Will the input path for MatrixMultiplicationJob then be the directory path /test/smallfiles?

Will breaking up the file in this manner cause problems in the algorithmic execution of the MR job? I'm not sure the output will be correct.
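
On whether splitting the input into several files changes the result: below is a minimal Java sketch, assuming the job computes the transpose of A times B as a sum, over row index i, of the outer product of row i of A with row i of B. That is my understanding of DistributedRowMatrix.times(), not something confirmed in this thread. It uses plain arrays rather than Mahout classes, and the "part files" are only modelled as contiguous row ranges (hypothetical). Under that assumption the grouping of rows should not affect the sum, provided both matrices are partitioned identically so that each part of A lines up with the part of B holding the same row keys.

```java
import java.util.Arrays;

final class PartitionedProduct {
    /** Adds the outer products of a[i] and b[i] for rows i in [from, to) into acc. */
    static void accumulate(double[][] a, double[][] b, int from, int to, double[][] acc) {
        for (int i = from; i < to; i++) {
            for (int j = 0; j < a[i].length; j++) {
                for (int k = 0; k < b[i].length; k++) {
                    acc[j][k] += a[i][j] * b[i][k];
                }
            }
        }
    }

    /** Transpose(A) * B computed in one pass over all rows. */
    static double[][] whole(double[][] a, double[][] b) {
        double[][] acc = new double[a[0].length][b[0].length];
        accumulate(a, b, 0, a.length, acc);
        return acc;
    }

    /** Same product with the rows grouped into contiguous chunks, one per hypothetical part file. */
    static double[][] inParts(double[][] a, double[][] b, int parts) {
        double[][] acc = new double[a[0].length][b[0].length];
        int chunk = (a.length + parts - 1) / parts; // rows per part, rounded up
        for (int start = 0; start < a.length; start += chunk) {
            accumulate(a, b, start, Math.min(a.length, start + chunk), acc);
        }
        return acc;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}, {5, 6}, {7, 8}};
        double[][] b = {{1, 0, 2}, {0, 1, 3}, {4, 5, 6}, {1, 1, 1}};
        System.out.println(Arrays.deepToString(whole(a, b)));
        System.out.println(Arrays.deepEquals(whole(a, b), inParts(a, b, 3)));
    }
}
```

This only illustrates the arithmetic. Whether MatrixMultiplicationJob accepts a directory of part files as its input path is a separate question that the sketch does not answer.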

-----Original Message-----
From: Ashish [mailto:paliwalashish@gmail.com] 
Sent: Wednesday, January 16, 2013 5:44 PM
To: user@mahout.apache.org
Subject: Re: MatrixMultiplicationJob runs with 1 mapper only ?

MatrixMultiplicationJob internally sets the InputFormat to CompositeInputFormat:

JobConf conf = new JobConf(initialConf, MatrixMultiplicationJob.class);
conf.setInputFormat(CompositeInputFormat.class);

and AFAIK, CompositeInputFormat ignores the splits. See http://stackoverflow.com/questions/8654200/hadoop-file-splits-compositeinputformat-inner-join

Unfortunately, I don't know any other alternative as of now.
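
Separately from the split question, the map task that stalls and keeps spilling may simply be producing a very large amount of output. The following is a rough Java estimate, assuming (from my reading of the Mahout mapper, not confirmed in this thread) that for each row index the mapper emits one scaled copy of B's row for every nonzero entry in A's row. The method name and the 8 bytes per value are illustrative assumptions, and serialization overhead is ignored.

```java
final class MapOutputEstimate {
    /**
     * Bytes of map output if every nonzero entry of a row of A
     * causes one dense copy of the matching row of B to be emitted.
     */
    static long estimateBytes(long rows, long colsA, long colsB, int bytesPerValue) {
        long recordsPerRow = colsA;                  // dense input: every entry is nonzero
        long bytesPerRecord = colsB * bytesPerValue; // one dense vector per record
        return rows * recordsPerRow * bytesPerRecord;
    }

    public static void main(String[] args) {
        // Dimensions from the thread: both inputs are 100 x 100000 and dense.
        long bytes = estimateBytes(100, 100_000, 100_000, 8);
        System.out.printf("approx. %.1f TB of map output%n", bytes / 1e12);
    }
}
```

If that assumption holds, the map output is on the order of terabytes even for the 100 x 100k case, so adding mappers alone may not be enough to make the job finish.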


On Wed, Jan 16, 2013 at 5:05 PM, Stuti Awasthi <st...@hcl.com> wrote:

> The issue is that currently my matrix is of dimension (100 x 100k);
> later it can be (1M x 10M) or bigger.
>
> Even now, running with a single mapper for (100 x 100k), the job
> is not able to complete. As I mentioned, the map task just
> proceeds to 0.99% and starts spilling the map output. Hence I want
> to tune my job so that Mahout is able to complete it and I can
> utilize my cluster resources.
>
> As MatrixMultiplicationJob is an MR job, it should be able to handle
> parallel map tasks. I am not sure if there are any algorithmic
> constraints due to which it runs with only a single mapper?
> I followed the thread below so that I could set the Configuration
> myself rather than getting it with getConf(), but did not have any success:
>
> http://lucene.472066.n3.nabble.com/Setting-Number-of-Mappers-and-Reducers-in-DistributedRowMatrix-Jobs-td888980.html
>
> Stuti
>
> -----Original Message-----
> From: Sean Owen [mailto:srowen@gmail.com]
> Sent: Wednesday, January 16, 2013 4:46 PM
> To: Mahout User List
> Subject: RE: MatrixMultiplicationJob runs with 1 mapper only ?
>
> Why do you need multiple mappers? Is one too slow? Many are not
> necessarily faster for small input.
>
> On Jan 16, 2013 10:46 AM, "Stuti Awasthi" <st...@hcl.com> wrote:
>
> > Hi,
> > I also tried calling it programmatically but am facing the same issue: only
> > a single MapTask runs, and it spills the map output continuously.
> > Hence I'm not able to generate the output for large matrix multiplication.
> >
> > Code snippet:
> >
> > DistributedRowMatrix a = new DistributedRowMatrix(
> >     new Path("/test/points/matrixA"), new Path("/test/temp"),
> >     Integer.parseInt("100"), Integer.parseInt("100000"));
> > DistributedRowMatrix b = new DistributedRowMatrix(
> >     new Path("/test/points/matrixA"), new Path("tempDir"),
> >     Integer.parseInt("100"), Integer.parseInt("100000"));
> > Configuration conf = new Configuration();
> > conf.set("fs.default.name", "hdfs://DS-1078D24B4736:10818");
> > conf.set("mapred.child.java.opts", "-Xmx2048m");
> > conf.set("mapred.max.split.size", "10485760");
> > a.setConf(conf);
> > b.setConf(conf);
> > a.times(b);
> >
> > Where Im going wrong. Any idea ?
> >
> > Thanks
> > Stuti
> > -----Original Message-----
> > From: Stuti Awasthi
> > Sent: Wednesday, January 16, 2013 2:55 PM
> > To: Mahout User List
> > Subject: RE: MatrixMultiplicationJob runs with 1 mapper only ?
> >
> > Hey Sean,
> > Thanks for response. MatrixMultiplicationJob help shows the usage like :
> > usage: <command> [Generic Options] [Job-Specific Options]
> >
> > Here Generic Option can be provided by -D <property=value>. Hence I 
> > tried with commandline -D options but it seems like that it is not 
> > making any effect.  It is also suggested in :
> >
> > https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/maho
> > ut
> > /common/AbstractJob.html
> >
> > Here I have noted 1 thing after your suggestion  that currently Im 
> > passing arguments like -D<property=value> rather than -D 
> > <property=value>. I tried with space between -D and property=value 
> > also but then its giving error
> > like:
> > 13/01/16 14:21:47 ERROR common.AbstractJob: Unexpected 
> > /test/points/matrixA while processing Job-Specific Options:
> >
> > No such error comes if im passing the arguments without space between -D.
> >
> > By reference of Hadoop Definite Guide : "Do not confuse setting 
> > Hadoop properties using the -D property=value option to 
> > GenericOptionsParser (and
> > ToolRunner) with setting JVM system properties using the 
> > -Dproperty=value option to the java command. The syntax for JVM 
> > system properties does not allow any whitespace between the D and 
> > the property name, whereas GenericOptionsParser requires them to be 
> > separated by whitespace."
> >
> > Hence I suppose that GenericOptions should be parsed by -D 
> > property=value rather than -Dproperty=value.
> >
> > Additionally I tried -Dmapred.max.split.size=10485760 also through 
> > commandline but again only single MapTask started.
> >
> > Please Suggest
> >
> >
> > -----Original Message-----
> > From: Sean Owen [mailto:srowen@gmail.com]
> > Sent: Wednesday, January 16, 2013 1:23 PM
> > To: Mahout User List
> > Subject: Re: MatrixMultiplicationJob runs with 1 mapper only ?
> >
> > It's up to Hadoop in the end.
> >
> > Try calling FileInputFormat.setMaxInputSplitSize() with a smallish 
> > value, like your 10MB (10000000).
> >
> > I don't know if Hadoop params can be set as sys properties like that 
> > anyway?
> >
> > On Wed, Jan 16, 2013 at 7:48 AM, Stuti Awasthi 
> > <st...@hcl.com>
> > wrote:
> > > Hi,
> > >
> > > I am trying to multiple dense matrix of size [100 x 100k]. The 
> > > size of
> > the file is 104MB and with default block sizeof 64MB only 2 blocks 
> > are getting created.
> > > So I reduced the block size to 10MB and now my file divided into 
> > > 11
> > blocks across the cluster. Cluster size is 10 nodes with 1 NN/JT and 
> > 9 DN/TT.
> > >
> > > Everytime Im running Mahout MatrixMultiplicationJob through 
> > > commandline,
> > I can see on JobTracker WebUI that only 1 map task is launched.
> > According to my understanding of Inputsplit, there should be 11 map
> tasks launched.
> > > Apart from this Map task stays at 0.99% completion and in the 
> > > Tasks Logs
> > , I can see that map task is spilling the map output.
> > >
> > > Mahout Command:
> > >
> > > mahout matrixmult -Dmapred.child.java.opts=-Xmx1024M
> > > -Dfs.inmemory.size.mb=200 -Dio.sort.factor=100 -Dio.sort.mb=200
> > > -Dio.file.buffer.size=131072 --inputPathA /test/matrixA --numRowsA
> > > 100 --numColsA 100000 --inputPathB /test/matrixA --numRowsB 100 
> > > --numColsB
> > > 100000 --tempDir /test/temp
> > >
> > > Now here I want to know that why only 1 map task is launched 
> > > everytime
> > and how can I performance tune the cluster so that I can perform the 
> > dense matrix multiplication of the order [90K x 1 Million] .
> > >
> > > Thanks
> > > Stuti
> > >
> > >
> > > ::DISCLAIMER::
> > > ------------------------------------------------------------------
> > > --
> > > --
> > > ------------------------------------------------------------------
> > > --
> > > --
> > > --------
> > >
> > > The contents of this e-mail and any attachment(s) are confidential 
> > > and
> > intended for the named recipient(s) only.
> > > E-mail transmission is not guaranteed to be secure or error-free 
> > > as information could be intercepted, corrupted, lost, destroyed, 
> > > arrive late or incomplete, or may contain viruses in transmission. 
> > > The e mail
> > and its contents (with or without referred errors) shall therefore 
> > not attach any liability on the originator or HCL or its affiliates.
> > > Views or opinions, if any, presented in this email are solely 
> > > those of the author and may not necessarily reflect the views or 
> > > opinions of HCL or its affiliates. Any form of reproduction, 
> > > dissemination, copying, disclosure, modification, distribution and 
> > > / or publication of
> > this message without the prior written consent of authorized 
> > representative of HCL is strictly prohibited. If you have received 
> > this email in error please delete it and notify the sender immediately.
> > > Before opening any email and/or attachments, please check them for
> > viruses and other defects.
> > >
> > > ------------------------------------------------------------------
> > > --
> > > --
> > > ------------------------------------------------------------------
> > > --
> > > --
> > > --------
> >
>



--
thanks
ashish

Blog: http://www.ashishpaliwal.com/blog
My Photo Galleries: http://www.pbase.com/ashishpaliwal

Re: MatrixMultiplicationJob runs with 1 mapper only ?

Posted by Ashish <pa...@gmail.com>.

MatrixMultiplicationJob internally sets the InputFormat to CompositeInputFormat:

JobConf conf = new JobConf(initialConf, MatrixMultiplicationJob.class);
conf.setInputFormat(CompositeInputFormat.class);

AFAIK, CompositeInputFormat ignores the configured split sizes, so the input files are not split further. See:
http://stackoverflow.com/questions/8654200/hadoop-file-splits-compositeinputformat-inner-join

Unfortunately, I don't know of an alternative as of now.
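
To illustrate why the split-size settings may stop mattering, here is a minimal sketch of the split-size rule used by Hadoop's FileInputFormat, splitSize = max(minSize, min(maxSize, blockSize)). In the older mapred API the middle term is a goal size derived from the requested number of maps rather than a maximum. The sketch assumes, from my reading of the Hadoop source (please verify against your version), that CompositeInputFormat raises the minimum split size to Long.MAX_VALUE before asking its child formats for splits. The class and method names are illustrative, not Hadoop API.

```java
public class SplitSizeSketch {
    // Same shape as the FileInputFormat rule: the minimum wins over the maximum.
    public static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 10L * 1024 * 1024; // 10MB HDFS block, as in this thread
        long maxSplit = 10485760L;          // the mapred.max.split.size value tried in this thread

        // Ordinary job: the minimum split size defaults to 1 byte, so splits follow the block size.
        System.out.println(computeSplitSize(blockSize, 1L, maxSplit));
        // Assumed CompositeInputFormat behaviour: minimum split size forced to Long.MAX_VALUE.
        System.out.println(computeSplitSize(blockSize, Long.MAX_VALUE, maxSplit));
    }
}
```

If that assumption holds, every input file becomes exactly one split, so the number of map tasks would follow the number of part files in the matrix directories rather than the block count.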



-- 
thanks
ashish

Blog: http://www.ashishpaliwal.com/blog
My Photo Galleries: http://www.pbase.com/ashishpaliwal

RE: MatrixMultiplicationJob runs with 1 mapper only ?

Posted by Stuti Awasthi <st...@hcl.com>.

The issue is that my matrix currently has dimension (100 x 100K); later it may be (1M x 10M) or bigger.

Even now, the job runs with a single mapper for (100 x 100K) and is not able to complete. As I mentioned, the map task just proceeds to 0.99% and starts spilling the map output. Hence I want to tune my job so that Mahout can complete it and I can utilize my cluster resources.

As MatrixMultiplicationJob is a MapReduce job, it should be able to handle parallel map tasks. I am not sure whether there is an algorithmic constraint due to which it runs with only a single mapper.
I followed the thread below so that I could set the Configuration myself rather than getting it with getConf(), but did not have any success:
http://lucene.472066.n3.nabble.com/Setting-Number-of-Mappers-and-Reducers-in-DistributedRowMatrix-Jobs-td888980.html

Stuti


RE: MatrixMultiplicationJob runs with 1 mapper only ?

Posted by Sean Owen <sr...@gmail.com>.

Why do you need multiple mappers? Is one too slow? Many are not necessarily
faster for small input.

RE: MatrixMultiplicationJob runs with 1 mapper only ?

Posted by Stuti Awasthi <st...@hcl.com>.

Hi,
I also tried calling it programmatically but am facing the same issue: only a single map task runs, and it keeps spilling the map output. Hence I'm not able to generate the output for large matrix multiplication.

Code snippet:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.math.hadoop.DistributedRowMatrix;

// Both operands read the same 100 x 100000 matrix from HDFS.
DistributedRowMatrix a = new DistributedRowMatrix(new Path("/test/points/matrixA"), new Path("/test/temp"), 100, 100000);
DistributedRowMatrix b = new DistributedRowMatrix(new Path("/test/points/matrixA"), new Path("tempDir"), 100, 100000);
Configuration conf = new Configuration();
conf.set("fs.default.name", "hdfs://DS-1078D24B4736:10818");
conf.set("mapred.child.java.opts", "-Xmx2048m");
// Ask for splits of at most 10MB, hoping for one map task per block.
conf.set("mapred.max.split.size", "10485760");
a.setConf(conf);
b.setConf(conf);
a.times(b);

Where am I going wrong? Any ideas?

Thanks
Stuti
-----Original Message-----
From: Stuti Awasthi 
Sent: Wednesday, January 16, 2013 2:55 PM
To: Mahout User List
Subject: RE: MatrixMultiplicationJob runs with 1 mapper only ?

Hey Sean,
Thanks for the response. The MatrixMultiplicationJob help shows the usage as:
usage: <command> [Generic Options] [Job-Specific Options]

Here generic options can be provided with -D <property=value>. Hence I tried the -D options on the command line, but they do not seem to have any effect. This is also suggested in:
https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/common/AbstractJob.html

After your suggestion I noticed one thing: I am currently passing arguments as -D<property=value> rather than -D <property=value>. I also tried with a space between -D and property=value, but then it gives an error like:
13/01/16 14:21:47 ERROR common.AbstractJob: Unexpected /test/points/matrixA while processing Job-Specific Options:

No such error occurs if I pass the arguments without a space after -D.

From Hadoop: The Definitive Guide: "Do not confuse setting Hadoop properties using the -D property=value option to GenericOptionsParser (and ToolRunner) with setting JVM system properties using the -Dproperty=value option to the java command. The syntax for JVM system properties does not allow any whitespace between the D and the property name, whereas GenericOptionsParser requires them to be separated by whitespace."

Hence I suppose that generic options should be passed as -D property=value rather than -Dproperty=value.
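
To make the two spellings concrete, here is a small stand-alone sketch that collects -D properties written either as "-D key=value" or as "-Dkey=value". It is a hypothetical helper for illustration only: it is not Hadoop's GenericOptionsParser, and it says nothing about which form the Mahout driver actually accepts.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DashDSketch {
    // Collects -D properties into the returned map; all other arguments go to rest.
    public static Map<String, String> parse(String[] args, List<String> rest) {
        Map<String, String> props = new LinkedHashMap<>();
        for (int i = 0; i < args.length; i++) {
            String arg = args[i];
            String pair = null;
            if (arg.equals("-D") && i + 1 < args.length) {
                pair = args[++i];        // spaced form: -D key=value
            } else if (arg.startsWith("-D") && arg.length() > 2) {
                pair = arg.substring(2); // joined form: -Dkey=value
            }
            if (pair == null) {
                rest.add(arg);           // not a -D option: leave it for the job-specific parser
                continue;
            }
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Expected key=value after -D but got: " + pair);
            }
            props.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
        return props;
    }

    public static void main(String[] args) {
        List<String> rest = new ArrayList<>();
        Map<String, String> props = parse(new String[] {
            "-D", "mapred.max.split.size=10485760",
            "-Dio.sort.mb=200",
            "--numRowsA", "100"
        }, rest);
        System.out.println(props);
        System.out.println(rest);
    }
}
```

In this sketch both spellings end up in the same property map, and the remaining arguments are passed through untouched for job-specific parsing.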

Additionally, I tried -Dmapred.max.split.size=10485760 on the command line, but again only a single map task started.

Please suggest.


Re: MatrixMultiplicationJob runs with 1 mapper only ?

Posted by Sean Owen <sr...@gmail.com>.

In the end, Hadoop decides how many map tasks to launch.

Try calling FileInputFormat.setMaxInputSplitSize() with a smallish
value, like your 10MB (10000000).

I'm also not sure whether Hadoop params can be set as system properties like that.
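
To illustrate why a smaller maximum split size should yield more splits, here is a rough sketch of the arithmetic that the new-API FileInputFormat (org.apache.hadoop.mapreduce.lib.input) applies: the split size is max(minSize, min(maxSize, blockSize)), and the last split may be up to 10% larger than the rest. It is written in Python only to keep it short. The function names are made up for this example, the file size assumes "104MB" means exactly 104 * 1024 * 1024 bytes, and nothing here confirms that the input format used by MatrixMultiplicationJob honors this setting.

```python
# Sketch of the split-size rule in Hadoop's new-API FileInputFormat.
# Function names are hypothetical; only the arithmetic mirrors Hadoop.

SPLIT_SLOP = 1.1  # the last split may be up to 10% larger than split_size


def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    """Split size = max(min_size, min(max_size, block_size))."""
    return max(min_size, min(max_size, block_size))


def count_splits(file_size, split_size):
    """Count the splits produced for a single splittable file."""
    splits = 0
    remaining = file_size
    # Carve off full-size splits while clearly more than one split remains.
    while remaining / split_size > SPLIT_SLOP:
        splits += 1
        remaining -= split_size
    # Whatever is left over becomes the final split.
    if remaining > 0:
        splits += 1
    return splits


MB = 1024 * 1024
FILE_SIZE = 104 * MB   # assumption: "104MB" taken as exactly 104 MiB
BLOCK_64MB = 64 * MB   # the default block size mentioned in the question

if __name__ == "__main__":
    default_size = compute_split_size(BLOCK_64MB)
    capped_size = compute_split_size(BLOCK_64MB, max_size=10_000_000)
    print("default max split size:", count_splits(FILE_SIZE, default_size), "splits")
    print("max split size 10000000:", count_splits(FILE_SIZE, capped_size), "splits")
```

Under these assumptions the default settings give 2 splits and a 10000000-byte cap gives 11, which matches the block counts reported in the question. If the job still launches a single map task with the cap in place, the input format it uses is probably computing splits some other way.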

On Wed, Jan 16, 2013 at 7:48 AM, Stuti Awasthi <st...@hcl.com> wrote:
> Hi,
>
> I am trying to multiple dense matrix of size [100 x 100k]. The size of the file is 104MB and with default block sizeof 64MB only 2 blocks are getting created.
> So I reduced the block size to 10MB and now my file divided into 11 blocks across the cluster. Cluster size is 10 nodes with 1 NN/JT and 9 DN/TT.
>
> Everytime Im running Mahout MatrixMultiplicationJob through commandline, I can see on JobTracker WebUI that only 1 map task is launched. According to my understanding of Inputsplit, there should be 11 map tasks launched.
> Apart from this Map task stays at 0.99% completion and in the Tasks Logs , I can see that map task is spilling the map output.
>
> Mahout Command:
>
> mahout matrixmult -Dmapred.child.java.opts=-Xmx1024M -Dfs.inmemory.size.mb=200 -Dio.sort.factor=100 -Dio.sort.mb=200 -Dio.file.buffer.size=131072 --inputPathA /test/matrixA --numRowsA 100 --numColsA 100000 --inputPathB /test/matrixA --numRowsB 100 --numColsB 100000 --tempDir /test/temp
>
> Now here I want to know that why only 1 map task is launched everytime and how can I performance tune the cluster so that I can perform the dense matrix multiplication of the order [90K x 1 Million] .
>
> Thanks
> Stuti
>