Posted to user@mahout.apache.org by Abhijat Vatsyayan <ab...@gmail.com> on 2010/09/12 22:25:26 UTC

DistributedRowMatrix transpose method problem

I isolated a bug in my program to a place where I am using DistributedRowMatrix.transpose(). When I send a "transpose" message to a DistributedRowMatrix object, I see the mapper and reducer start, and the method finishes without any errors, but my attempt to read the contents of the (transposed) matrix fails. It seems like I am missing something really basic here, but any help would be appreciated.

Here is the test case code (imports, package statement and comments not shown): 
public class TestMatrixIO {
	@Test
	public void testDistributedTranspose() throws Exception
	{
		Configuration cfg = new Configuration();
		// 3x4 input matrix; "input/tmp_1" is the temp path used by the matrix's MR jobs
		DistributedRowMatrix matrix = new DistributedRowMatrix(
				TestWriteMatrix.INPUT_TEST_MATRIX_FILE, "input/tmp_1", 3, 4);
		matrix.configure(new JobConf(cfg));
		int count = printMatrix(matrix); // prints OK
		System.out.println("[testDistributedTranspose()]..NumElements=" + count);
		DistributedRowMatrix matrix_t = matrix.transpose();
		System.out.println("[testDistributedTranspose()]..Transpose done");
		printMatrix(matrix_t); // fails here
	}

	// Iterates over all row slices, printing every element; returns the total element count.
	private static int printMatrix(DistributedRowMatrix matrix) {
		Iterator<MatrixSlice> iterator = matrix.iterateAll();
		int count = 0;
		while (iterator.hasNext()) {
			MatrixSlice slice = iterator.next();
			Vector v = slice.vector();
			int size = v.size();
			for (int i = 0; i < size; i++) {
				Element e = v.getElement(i);
				count++;
				System.out.print(e.get() + " ");
			}
			System.out.println();
		}
		return count;
	}
}

The stack trace when I try to print the matrix on the last line of the testDistributedTranspose method is:
java.lang.IllegalStateException: java.io.IOException: Cannot open filename /user/abhijat/input/transpose-104/_logs
	at org.apache.mahout.math.hadoop.DistributedRowMatrix.iterateAll(DistributedRowMatrix.java:118)
	at net.abhijat.hadoop.mr.testexec.TestMatrixIO.printMatrix(TestMatrixIO.java:28)
	at net.abhijat.hadoop.mr.testexec.TestMatrixIO.testDistributedTranspose(TestMatrixIO.java:25)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:76)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
	at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:45)
	at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196)
Caused by: java.io.IOException: Cannot open filename /user/abhijat/input/transpose-104/_logs
	at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1497)
	at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1488)
	at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:376)
	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:178)
	at org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1437)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1424)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
	at org.apache.mahout.math.hadoop.DistributedRowMatrix$DistributedMatrixIterator.<init>(DistributedRowMatrix.java:216)
	at org.apache.mahout.math.hadoop.DistributedRowMatrix.iterateAll(DistributedRowMatrix.java:116)
	... 24 more


"hadoop fs -ls input" shows that the transpose job did create the directory and output files. I created the matrix file using following code (imports, package statement and comments not shown): 
public class TestWriteMatrix {
	public static final String INPUT_TEST_MATRIX_FILE = "input/test.matrix.file";
	public static final double[][] matrix_dat =
	{
		{  1, 3, -2,  0 },
		{  2, 3,  2, -9 },
		{ -1, 1, -4, 10 }
	};

	@Test
	public void testWritingMatrix() throws Exception
	{
		Configuration cfg = new Configuration();
		FileSystem fs = FileSystem.get(cfg);
		// Write the 3x4 matrix as a SequenceFile of (row index, row vector) pairs
		SequenceFile.Writer writer = SequenceFile.createWriter(fs, cfg, new Path(INPUT_TEST_MATRIX_FILE),
				IntWritable.class, VectorWritable.class);
		for (int i = 0; i < matrix_dat.length; i++)
		{
			DenseVector row = new DenseVector(matrix_dat[i]);
			VectorWritable vwritable = new VectorWritable(row);
			writer.append(new IntWritable(i), vwritable);
		}
		writer.close();
	}
}
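
For reference, a standalone sketch of reading that file back directly; the class name here is made up for illustration, and this mirrors roughly what iterateAll() does per file, which is why trying to open the _logs directory as a SequenceFile (as in the stack trace above) fails:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.VectorWritable;

public class ReadMatrixFile {
	public static void main(String[] args) throws Exception {
		Configuration cfg = new Configuration();
		FileSystem fs = FileSystem.get(cfg);
		// Open the matrix file written by TestWriteMatrix and read back
		// the (row index, row vector) pairs one at a time.
		SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path("input/test.matrix.file"), cfg);
		IntWritable key = new IntWritable();
		VectorWritable value = new VectorWritable();
		while (reader.next(key, value)) {
			System.out.println(key.get() + ": " + value.get());
		}
		reader.close();
	}
}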

Re: DistributedRowMatrix transpose method problem

Posted by Jake Mannix <ja...@gmail.com>.
On Sun, Sep 12, 2010 at 3:37 PM, Abhijat Vatsyayan <abhijat.vatsyayan@gmail.com> wrote:

> Thanks Jake. Wouldn't using FileSystem.globStatus(Path pathPattern,
> PathFilter filter) along with a filter for ignoring directories be easier?
> transpose() could then additionally set the filter in the transposed matrix
> (to ignore the directories). I have very little understanding/knowledge of
> the interface contracts so am not sure if it will break something else but
> will try a few things and see what works.
>

I'm sure you're right about that; the PathFilter for ignoring
subdirectories seems to be the right way to go when doing
iterate()/iterateAll(). Those are really pretty fragile, debug-oriented
methods, as you will typically want to access all of the rows of a
DistributedRowMatrix in an MR job, not via a local HDFS SequenceFile
iteration.

Feel free to open a JIRA ticket and post a patch; I'm sure we can get it
reviewed and committed quickly once we can verify it fixes this.

  -jake


Re: DistributedRowMatrix transpose method problem

Posted by Abhijat Vatsyayan <ab...@gmail.com>.
Thanks Jake. Wouldn't using FileSystem.globStatus(Path pathPattern, PathFilter filter) along with a filter for ignoring directories be easier? transpose() could then additionally set the filter in the transposed matrix (to ignore the directories). I have very little understanding of the interface contracts, so I am not sure whether it will break something else, but I will try a few things and see what works.
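
For concreteness, a minimal sketch of that approach (the class and method names below are illustrative, not from the Mahout source; it assumes the Hadoop 0.20-era FileSystem API):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class NonDirectoryGlob {
	// Globs everything under dir but drops subdirectories such as _logs,
	// so only the actual SequenceFile parts are returned.
	public static FileStatus[] listDataFiles(Configuration conf, Path dir) throws IOException {
		final FileSystem fs = FileSystem.get(conf);
		PathFilter noDirs = new PathFilter() {
			public boolean accept(Path p) {
				try {
					return !fs.getFileStatus(p).isDir();
				} catch (IOException e) {
					return false; // skip anything we cannot stat
				}
			}
		};
		return fs.globStatus(new Path(dir, "*"), noDirs);
	}
}

The iterator could then open a SequenceFile.Reader only on the paths such a call returns.
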
Abhijat 


Re: DistributedRowMatrix transpose method problem

Posted by Jake Mannix <ja...@gmail.com>.
Hi Abhijat,

  It looks like you've found a bug not in transpose(), but in iterateAll()
(and probably iterate() too): the file globbing of the contents of the
sequence file directory is grabbing the "_logs" subdirectory automatically
created by Hadoop and trying to treat it as part of the SequenceFile,
which it is not.

  Yep, line 207 of DistributedRowMatrix globs together anything in the
matrix row directory; that "*" should be more restrictive (maybe you can
try "part*", recompile, and see if your code works?).

  -jake
