Posted to dev@mahout.apache.org by Shannon Quinn <sq...@gatech.edu> on 2010/06/16 00:58:02 UTC

M/R capturing line numbers in text files

Hi all,

I have a few questions on the specifics of map/reduce:

1) I've made the assumption so far that the input to my clustering 
algorithm will be a single CSV file containing the entire affinity 
matrix, where each line in the file is a row in the matrix. Is there 
another input approach that would work better for reading this affinity 
matrix?
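To make that assumed format concrete, here is a minimal, Hadoop-free sketch of parsing one such CSV line into a row of doubles (the class and method names here are mine, not from the patch):

```java
// Sketch of the assumed input: each CSV line is one row of the affinity
// matrix. Plain Java, no Hadoop, just to pin down the parsing.
public class CsvRowParser {
    static double[] parseRow(String line) {
        String[] fields = line.split(",");
        double[] row = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            row[i] = Double.parseDouble(fields[i].trim());
        }
        return row;
    }

    public static void main(String[] args) {
        double[] row = parseRow("0.0, 0.8, 0.1");
        System.out.println(row.length); // 3
        System.out.println(row[1]);     // 0.8
    }
}
```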

2) I've committed a patch for what the M/R task of creating a 
DistributedRowMatrix out of the input data might look like, but it's 
unfinished. There isn't a straightforward way of determining which row 
of the CSV file is currently being processed (the keys are byte offsets 
into the file, rather than line numbers), and it's crucial that lines in 
the CSV file correspond to rows in the DistributedRowMatrix. I've found 
a few ways to handle this, but they're either too hacky (adding a column 
to the CSV file) or fairly involved (subclassing RecordReader), so I 
thought I'd ask if anyone else has thoughts on this.

3) Once I am able to track which rows are which, how can I make sure the 
SequenceFiles are written so that the ensuing DistributedRowMatrix 
accurately reflects the arrangement of data in the original CSV file? 
I've been using TransposeJob as a model for this, but it has the 
advantage that the keys in the Map step already correspond to rows. The 
syntheticcontrol InputMapper has also been useful, but in that case the 
clustering algorithms don't need to keep the rows in any particular 
order.

Thanks again for all the assistance :)

Regards,
Shannon

Re: M/R capturing line numbers in text files

Posted by Shannon Quinn <sq...@gatech.edu>.
Perfect. Thank you.

Unfortunately, now I receive this exception:

java.io.IOException: wrong value class: 
org.apache.mahout.math.hadoop.DistributedRowMatrix$MatrixEntryWritable 
is not class org.apache.mahout.math.VectorWritable

My Mapper's output value and the Reducer's input value are both a 
DRM.MatrixEntryWritable, and they're specified as such in the Conf 
object. The Reducer's output is a VectorWritable. The stack trace 
doesn't mention any code of mine, so I'm not sure how to approach this.
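For what it's worth, a common trigger for this particular exception (a guess, since the full stack trace and job setup aren't shown) is either not declaring the map output classes separately from the final output classes, or registering the Reducer as a combiner: a combiner's output must match the map output value class, not the job's final one. A hypothetical job wiring for the classes described above might look like:

```java
// Hypothetical job setup; class names follow the ones mentioned above.
Job job = new Job(conf, "eigencuts-input");
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(DistributedRowMatrix.MatrixEntryWritable.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(VectorWritable.class);
// Do NOT also do this: job.setCombinerClass(MyReducer.class);
// a combiner emitting VectorWritable would raise exactly this
// "wrong value class" IOException during the map-side sort.
```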

> The basic problem is that something has produced data that uses a long as an
> ID and your mapper is expecting an int.  Have you posted your code as a
> patch on the jira or a git link?

I attached a patch to my project's ticket on jira (363).

Thanks again!

Regards,
Shannon

Re: M/R capturing line numbers in text files

Posted by Jake Mannix <ja...@gmail.com>.
Shannon,

  When you use TextInputFormat, the keys in your mapper will be the byte
offsets into the file, and in particular they will be LongWritable.  If you
change your mapper to have LongWritable keys, it will all "just work".

  -jake
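To illustrate Jake's point without any Hadoop dependencies, here is a small plain-Java sketch of the keys TextInputFormat hands a mapper: the byte offset at which each line starts (hence LongWritable), not the line number:

```java
// Plain-Java illustration (no Hadoop) of TextInputFormat's keys: the byte
// offset of the start of each line, which is why they are LongWritable.
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class OffsetDemo {
    // Returns the byte offset at which each line starts, i.e. the keys a
    // mapper would receive for this file's contents.
    static List<Long> lineOffsets(String fileContents) {
        List<Long> offsets = new ArrayList<>();
        long offset = 0;
        for (String line : fileContents.split("\n", -1)) {
            offsets.add(offset);
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for '\n'
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Two CSV rows; the second row starts at byte 12, not "line 1".
        List<Long> keys = lineOffsets("0.0,0.8,0.1\n0.8,0.0,0.5");
        System.out.println(keys); // [0, 12]
    }
}
```

Within a single split a mapper could count lines itself, but a mapper handed a later split has no way to know its absolute line number, which is why byte offsets are awkward for row indexing.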

On Jun 15, 2010 9:59 PM, "Ted Dunning" <te...@gmail.com> wrote:

This has come up before and can be a bit tricky to diagnose without looking
through the code carefully.

The basic problem is that something has produced data that uses a long as an
ID and your mapper is expecting an int.  Have you posted your code as a
patch on the jira or a git link?

On Tue, Jun 15, 2010 at 9:55 PM, Shannon Quinn <sq...@gatech.edu> wrote:

> Hi Ted,
>
> Thank you ...

Re: M/R capturing line numbers in text files

Posted by Ted Dunning <te...@gmail.com>.
This has come up before and can be a bit tricky to diagnose without looking
through the code carefully.

The basic problem is that something has produced data that uses a long as an
ID and your mapper is expecting an int.  Have you posted your code as a
patch on the jira or a git link?

On Tue, Jun 15, 2010 at 9:55 PM, Shannon Quinn <sq...@gatech.edu> wrote:

> Hi Ted,
>
> Thank you very much - very valuable insight into a more robust input
> format. I've already started implementing it.
>
> I finished the new M/R process to reflect the new assumed input format
> (submitted the patch), but I'm getting an exception I can't seem to
> diagnose. When I start the program, and the INFO lines start rolling from
> the process, right before the M/R task begins I get the following:
>
> java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be
> cast to org.apache.hadoop.io.IntWritable
>    at
> org.apache.mahout.clustering.eigencuts.EigencutsInputMapper.map(EigencutsInputMapper.java:22)
>
> The line 22 referred to in the message is:
>
> public class EigencutsInputMapper extends Mapper<IntWritable, Text,
> IntWritable, DistributedRowMatrix.MatrixEntryWritable> {
>
> I did a search in all my source files; no mention anywhere (except one
> commented-out line) of LongWritable. It was in my previous implementation,
> but I performed mvn clean multiple times. Any thoughts would be appreciated.
>
> Thank you again!
>
> Regards,
> Shannon
>
>
> On 6/15/2010 7:03 PM, Ted Dunning wrote:
>
>> Shannon,
>>
>> Nice work so far.
>>
>> I think it is a bit more customary to enter a graph by giving the integer
>> pairs that represent the starting and ending nodes for each arc.  That
>> avoids the memory allocation problem you hit if one node is connected to
>> millions of others.  It also may solve your problem of the distributed row
>> matrix since you could write a reducer to gather everything to the right
>> place for writing a row.  In doing that, you would inherently have the row
>> number available because that would be the grouping key.
>>
>> If you keep the current one matrix row per csv line, I would recommend
>> putting the source node at the beginning of the line.
>>
>>
>> On Tue, Jun 15, 2010 at 3:58 PM, Shannon Quinn<sq...@gatech.edu>  wrote:
>>
>>> 1) I've made the assumption so far that the input to my clustering
>>> algorithm will be a single CSV file containing the entire affinity
>>> matrix,
>>> where each line in the file is a row in the matrix. Is there another
>>> input
>>> approach that would work better for reading this affinity matrix?

Re: M/R capturing line numbers in text files

Posted by Shannon Quinn <sq...@gatech.edu>.
Hi Ted,

Thank you very much - very valuable insight into a more robust input 
format. I've already started implementing it.

I finished the new M/R process to reflect the new assumed input format 
(submitted the patch), but I'm getting an exception I can't seem to 
diagnose. When I start the program, and the INFO lines start rolling 
from the process, right before the M/R task begins I get the following:

java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot 
be cast to org.apache.hadoop.io.IntWritable
     at 
org.apache.mahout.clustering.eigencuts.EigencutsInputMapper.map(EigencutsInputMapper.java:22)

The line 22 referred to in the message is:

public class EigencutsInputMapper extends Mapper<IntWritable, Text, 
IntWritable, DistributedRowMatrix.MatrixEntryWritable> {

I did a search in all my source files; no mention anywhere (except one 
commented-out line) of LongWritable. It was in my previous 
implementation, but I performed mvn clean multiple times. Any thoughts 
would be appreciated.

Thank you again!

Regards,
Shannon

On 6/15/2010 7:03 PM, Ted Dunning wrote:
> Shannon,
>
> Nice work so far.
>
> I think it is a bit more customary to enter a graph by giving the integer
> pairs that represent the starting and ending nodes for each arc.  That
> avoids the memory allocation problem you hit if one node is connected to
> millions of others.  It also may solve your problem of the distributed row
> matrix since you could write a reducer to gather everything to the right
> place for writing a row.  In doing that, you would inherently have the row
> number available because that would be the grouping key.
>
> If you keep the current one matrix row per csv line, I would recommend
> putting the source node at the beginning of the line.
>
>
> On Tue, Jun 15, 2010 at 3:58 PM, Shannon Quinn<sq...@gatech.edu>  wrote:
>
>    
>> 1) I've made the assumption so far that the input to my clustering
>> algorithm will be a single CSV file containing the entire affinity matrix,
>> where each line in the file is a row in the matrix. Is there another input
>> approach that would work better for reading this affinity matrix?
>>
>>


Re: M/R capturing line numbers in text files

Posted by Ted Dunning <te...@gmail.com>.
Shannon,

Nice work so far.

I think it is a bit more customary to enter a graph by giving the integer
pairs that represent the starting and ending nodes for each arc.  That
avoids the memory allocation problem you hit if one node is connected to
millions of others.  It also may solve your problem of the distributed row
matrix since you could write a reducer to gather everything to the right
place for writing a row.  In doing that, you would inherently have the row
number available because that would be the grouping key.

If you keep the current one matrix row per csv line, I would recommend
putting the source node at the beginning of the line.
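Ted's triple format can be sketched in plain Java (the names are mine; in the real job the grouping below would be done by the shuffle, with the source node as the map output key):

```java
import java.util.Map;
import java.util.TreeMap;

public class TripleGrouper {
    // Groups "source,target,weight" triples by source node, mimicking what
    // the shuffle/reduce would do: each group becomes one sparse matrix
    // row, and the row number is simply the grouping key.
    static Map<Integer, Map<Integer, Double>> groupRows(String[] triples) {
        Map<Integer, Map<Integer, Double>> rows = new TreeMap<>();
        for (String t : triples) {
            String[] f = t.split(",");
            int source = Integer.parseInt(f[0].trim());
            int target = Integer.parseInt(f[1].trim());
            double weight = Double.parseDouble(f[2].trim());
            rows.computeIfAbsent(source, k -> new TreeMap<>())
                .put(target, weight);
        }
        return rows;
    }

    public static void main(String[] args) {
        String[] triples = {"0,1,0.8", "1,0,0.8", "0,2,0.1"};
        Map<Integer, Map<Integer, Double>> rows = groupRows(triples);
        System.out.println(rows.get(0)); // {1=0.8, 2=0.1}
    }
}
```

Because the source node is the grouping key, the reducer receives the row number for free; no line numbers are needed anywhere.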


On Tue, Jun 15, 2010 at 3:58 PM, Shannon Quinn <sq...@gatech.edu> wrote:

>
> 1) I've made the assumption so far that the input to my clustering
> algorithm will be a single CSV file containing the entire affinity matrix,
> where each line in the file is a row in the matrix. Is there another input
> approach that would work better for reading this affinity matrix?