Posted to user@mahout.apache.org by Abhijat Vatsyayan <ab...@gmail.com> on 2010/09/12 22:25:26 UTC
DistributedRowMatrix transpose method problem
I isolated a bug in my program to a place where I am using DistributedRowMatrix.transpose(). When I call transpose() on a DistributedRowMatrix object, I see the mapper and reducer start, and the method finishes without any errors, but my attempt to read the contents of the (transposed) matrix fails. It seems like I am missing something really basic here, but any help would be appreciated.
Here is the test case code (imports, package statement and comments not shown):
public class TestMatrixIO {

    @Test
    public void testDistributedTranspose() throws Exception {
        Configuration cfg = new Configuration();
        DistributedRowMatrix matrix = new DistributedRowMatrix(
            TestWriteMatrix.INPUT_TEST_MATRIX_FILE, "input/tmp_1", 3, 4);
        matrix.configure(new JobConf(cfg));
        int count = printMatrix(matrix); // prints OK ..
        System.out.println("[testReadingDistributedMatrix()]..NumElements=" + count);
        DistributedRowMatrix matrix_t = matrix.transpose();
        System.out.println("[testReadingDistributedMatrix()]..Transpose done");
        printMatrix(matrix_t); // Fails
    }

    private static int printMatrix(DistributedRowMatrix matrix) {
        Iterator<MatrixSlice> iterator = matrix.iterateAll();
        int count = 0;
        while (iterator.hasNext()) {
            MatrixSlice slice = iterator.next();
            Vector v = slice.vector();
            int size = v.size();
            for (int i = 0; i < size; i++) {
                Element e = v.getElement(i);
                count++;
                System.out.print(e.get() + " ");
            }
            System.out.println();
        }
        return count;
    }
}
The stack trace when I try to print the matrix on the last line of the testDistributedTranspose method is:
java.lang.IllegalStateException: java.io.IOException: Cannot open filename /user/abhijat/input/transpose-104/_logs
at org.apache.mahout.math.hadoop.DistributedRowMatrix.iterateAll(DistributedRowMatrix.java:118)
at net.abhijat.hadoop.mr.testexec.TestMatrixIO.printMatrix(TestMatrixIO.java:28)
at net.abhijat.hadoop.mr.testexec.TestMatrixIO.testDistributedTranspose(TestMatrixIO.java:25)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:76)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:45)
at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196)
Caused by: java.io.IOException: Cannot open filename /user/abhijat/input/transpose-104/_logs
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1497)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1488)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:376)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:178)
at org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1437)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1424)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
at org.apache.mahout.math.hadoop.DistributedRowMatrix$DistributedMatrixIterator.<init>(DistributedRowMatrix.java:216)
at org.apache.mahout.math.hadoop.DistributedRowMatrix.iterateAll(DistributedRowMatrix.java:116)
... 24 more
"hadoop fs -ls input" shows that the transpose job did create the directory and output files. I created the matrix file using the following code (imports, package statement and comments not shown):
public class TestWriteMatrix {

    public static final String INPUT_TEST_MATRIX_FILE = "input/test.matrix.file";

    public static final double[][] matrix_dat = {
        { 1, 3, -2,  0},
        { 2, 3,  2, -9},
        {-1, 1, -4, 10}
    };

    @Test
    public void testWritingMatrix() throws Exception {
        Configuration cfg = new Configuration();
        FileSystem fs = FileSystem.get(cfg);
        SequenceFile.Writer writer = SequenceFile.createWriter(fs, cfg,
            new Path(INPUT_TEST_MATRIX_FILE), IntWritable.class, VectorWritable.class);
        for (int i = 0; i < matrix_dat.length; i++) {
            DenseVector row = new DenseVector(matrix_dat[i]);
            VectorWritable vwritable = new VectorWritable(row);
            writer.append(new IntWritable(i), vwritable);
        }
        writer.close();
    }
}
Re: DistributedRowMatrix transpose method problem
Posted by Jake Mannix <ja...@gmail.com>.
On Sun, Sep 12, 2010 at 3:37 PM, Abhijat Vatsyayan <
abhijat.vatsyayan@gmail.com> wrote:
> Thanks Jake. Wouldn't using FileSystem.globStatus(Path pathPattern,
> PathFilter filter) along with a filter for ignoring directories be easier?
> transpose() could then additionally set the filter in the transposed matrix
> (to ignore the directories). I have very little understanding/knowledge of
> the interface contracts so am not sure if it will break something else but
> will try a few things and see what works.
>
I'm sure you're right about that; a PathFilter that ignores subdirectories seems to be the right way to go when doing iterate()/iterateAll(). Those are really pretty fragile, debugging-oriented methods, as you will typically want to access all of the rows of a DistributedRowMatrix via a MapReduce job, not a local HDFS SequenceFile iteration.
Feel free to open a JIRA ticket / post a patch; I'm sure we can get it reviewed and committed quickly if we can verify it fixes this.
-jake
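As a rough, Hadoop-free sketch of the filtering idea discussed above: in the real fix, this name test would live inside an org.apache.hadoop.fs.PathFilter's accept(Path) method passed to FileSystem.globStatus. The class and helper names below are made up for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

public class PartFileFilter {
    // Mirrors the PathFilter idea: skip Hadoop bookkeeping entries such as
    // _logs and _SUCCESS so that only real SequenceFile parts are opened.
    static boolean accept(String name) {
        return !name.startsWith("_") && !name.startsWith(".");
    }

    // Filter a directory listing down to the data files.
    static List<String> keepParts(String[] entries) {
        List<String> parts = new ArrayList<String>();
        for (String e : entries) {
            if (accept(e)) {
                parts.add(e);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String[] listing = {"_logs", "_SUCCESS", "part-00000", "part-00001"};
        System.out.println(keepParts(listing)); // -> [part-00000, part-00001]
    }
}
```

A filter like this is more robust than tightening the glob pattern alone, because it also excludes any other bookkeeping entries Hadoop may add later.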
Re: DistributedRowMatrix transpose method problem
Posted by Abhijat Vatsyayan <ab...@gmail.com>.
Thanks Jake. Wouldn't using FileSystem.globStatus(Path pathPattern, PathFilter filter) along with a filter for ignoring directories be easier? transpose() could then additionally set the filter in the transposed matrix (to ignore the directories). I have very little understanding of the interface contracts, so I am not sure whether it will break something else, but I will try a few things and see what works.
Abhijat
Re: DistributedRowMatrix transpose method problem
Posted by Jake Mannix <ja...@gmail.com>.
Hi Abhijat,
It looks like you've found a bug not in transpose(), but in iterateAll() (and probably iterate() too): the file globbing of the contents of the sequence file directory is grabbing the "_logs" subdirectory automatically created by Hadoop and trying to treat it as part of the SequenceFile, which it is not.
Yep, line 207 of DistributedRowMatrix globs together anything in the matrix row directory; that "*" should be more restrictive (maybe you can try "part*", recompile, and see if your code works?).
-jake
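The glob behavior described above can be illustrated with the JDK's own glob matcher. This is only an analogy, not Mahout or Hadoop code, but Hadoop's globStatus uses a similar glob syntax: "*" matches every entry in the row directory, including _logs, while "part*" matches only the data files.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class GlobDemo {
    public static void main(String[] args) {
        PathMatcher star = FileSystems.getDefault().getPathMatcher("glob:*");
        PathMatcher partStar = FileSystems.getDefault().getPathMatcher("glob:part*");

        // "*" also matches the _logs subdirectory, which is what triggers
        // the "Cannot open filename .../_logs" IOException when it is
        // opened as a SequenceFile.
        System.out.println(star.matches(Paths.get("_logs")));      // true
        // "part*" skips _logs but still matches the real output files.
        System.out.println(partStar.matches(Paths.get("_logs")));  // false
        System.out.println(partStar.matches(Paths.get("part-00000"))); // true
    }
}
```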