Posted to dev@mahout.apache.org by Shannon Quinn <sq...@gatech.edu> on 2010/06/16 00:58:02 UTC
M/R capturing line numbers in text files
Hi all,
I have a few questions on the specifics of map/reduce:
1) I've made the assumption so far that the input to my clustering
algorithm will be a single CSV file containing the entire affinity
matrix, where each line in the file is a row in the matrix. Is there
another input approach that would work better for reading this affinity
matrix?
2) I've submitted a patch showing what the M/R task of creating a
DistributedRowMatrix out of the input data might look like, but it's
unfinished. There isn't a straightforward way of determining which row
in the CSV file is currently being processed (since the keys are byte
offsets into the file, rather than line numbers), and it's crucial that
lines in the CSV file correspond to rows in the DistributedRowMatrix. I've found
a few ways to handle this, but they're either too hacky (adding a column
to the CSV file) or very in-depth (subclassing RecordReader), so I
thought I'd ask if anyone else has thoughts on this?
3) Once I am able to track which rows are which, how can I make sure the
SequenceFiles are written in such a way that the resulting
DistributedRowMatrix accurately reflects the arrangement of data in the
original CSV file? I've been using TransposeJob as a model for this, but
it seems to work with the advantage that the keys in the Map step
already correspond to rows. The syntheticcontrol InputMapper has also
been useful, but in this case the clustering algorithms don't need to
keep the rows in any particular orientation.
Thanks again for all the assistance :)
Regards,
Shannon
Re: M/R capturing line numbers in text files
Posted by Shannon Quinn <sq...@gatech.edu>.
Perfect. Thank you.
Unfortunately, now I receive this exception:
java.io.IOException: wrong value class:
org.apache.mahout.math.hadoop.DistributedRowMatrix$MatrixEntryWritable
is not class org.apache.mahout.math.VectorWritable
My Mapper's output value and Reducer's input value are both
DRM.MatrixEntryWritable, and are specified as such in the Conf object.
The Reducer's output is a VectorWritable. The stack trace doesn't
mention any code of mine, so I'm not sure how to approach this.
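[Editor's note: one common source of this exception, offered here only as a guess since the stack trace isn't shown, is that Hadoop configures the map-output value class separately from the job's final output value class, and additionally requires a Combiner, if one is registered, to emit the map-output types. A hedged configuration sketch using the standard org.apache.hadoop.mapreduce.Job API (the job name and wiring are assumptions, not from the thread):]

```java
// Sketch only: class names follow the thread; the job wiring is an assumption.
Job job = new Job(conf, "affinity-to-drm");
// What the mapper emits (checked during the shuffle):
job.setMapOutputKeyClass(IntWritable.class);
job.setMapOutputValueClass(DistributedRowMatrix.MatrixEntryWritable.class);
// What the reducer writes to the output SequenceFile:
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(VectorWritable.class);
// Caution: if the Reducer class is also set as the Combiner, it runs on the
// map side and must emit the *map* output types, or Hadoop raises exactly a
// "wrong value class" IOException.
```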
> The basic problem is that something has produced data that uses a long as an
> ID and your mapper is expecting an int. Have you posted your code as a
> patch on the jira or a git link?
>
I attached a patch to my project's ticket on jira (363).
Thanks again!
Regards,
Shannon
Re: M/R capturing line numbers in text files
Posted by Jake Mannix <ja...@gmail.com>.
Shannon,
When you use TextInputFormat, the keys in your mapper will be the byte
offsets into the file, and in particular they will be LongWritable. If you
change your mapper to have LongWritable keys, it will all "just work".
-jake
On Jun 15, 2010 9:59 PM, "Ted Dunning" <te...@gmail.com> wrote:
This has come up before and can be a bit tricky to diagnose without looking
through the code carefully.
The basic problem is that something has produced data that uses a long as an
ID and your mapper is expecting an int. Have you posted your code as a
patch on the jira or a git link?
On Tue, Jun 15, 2010 at 9:55 PM, Shannon Quinn <sq...@gatech.edu> wrote:
> Hi Ted,
>
> Thank you ...
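[Editor's note: Jake's point can be made concrete with a small self-contained sketch. This is plain Java with no Hadoop dependency; the helper below merely simulates the (key, value) pairs TextInputFormat hands to a mapper, showing why the keys are LongWritable byte offsets rather than line numbers.]

```java
import java.util.ArrayList;
import java.util.List;

/** Simulates TextInputFormat's keys: the key for each line is its byte
 *  offset into the file (a long), not its line number. */
public class TextInputFormatKeys {
    /** Returns {byteOffset, lineNumber} pairs for each line of the input. */
    static List<long[]> keysFor(String file) {
        List<long[]> pairs = new ArrayList<>();
        long offset = 0;
        long line = 0;
        for (String row : file.split("\n")) {
            pairs.add(new long[] { offset, line });
            offset += row.getBytes().length + 1; // +1 for the '\n' terminator
            line++;
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Two rows of a tiny affinity matrix in CSV form.
        String csv = "1.0,0.5\n0.5,1.0\n";
        for (long[] p : keysFor(csv)) {
            System.out.println("key (byte offset) = " + p[0]
                    + ", line number = " + p[1]);
        }
        // The keys grow with line *length*, so they cannot be used directly
        // as matrix row indices -- which is why the mapper's key type must
        // be LongWritable, and why row numbers need separate tracking.
    }
}
```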
Re: M/R capturing line numbers in text files
Posted by Ted Dunning <te...@gmail.com>.
This has come up before and can be a bit tricky to diagnose without looking
through the code carefully.
The basic problem is that something has produced data that uses a long as an
ID and your mapper is expecting an int. Have you posted your code as a
patch on the jira or a git link?
On Tue, Jun 15, 2010 at 9:55 PM, Shannon Quinn <sq...@gatech.edu> wrote:
> Hi Ted,
>
> Thank you very much - very valuable insight into a more robust input
> format. I've already started implementing it.
>
> I finished the new M/R process to reflect the new assumed input format
> (submitted the patch), but I'm getting an exception I can't seem to
> diagnose. When I start the program, and the INFO lines start rolling from
> the process, right before the M/R task begins I get the following:
>
> java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be
> cast to org.apache.hadoop.io.IntWritable
> at
> org.apache.mahout.clustering.eigencuts.EigencutsInputMapper.map(EigencutsInputMapper.java:22)
>
> The line 22 referred to in the message is:
>
> public class EigencutsInputMapper extends Mapper<IntWritable, Text,
> IntWritable, DistributedRowMatrix.MatrixEntryWritable> {
>
> I did a search in all my source files; no mention anywhere (except one
> commented-out line) of LongWritable. It was in my previous implementation,
> but I performed mvn clean multiple times. Any thoughts would be appreciated.
>
> Thank you again!
>
> Regards,
> Shannon
>
>
> On 6/15/2010 7:03 PM, Ted Dunning wrote:
>
>> Shannon,
>>
>> Nice work so far.
>>
>> I think it is a bit more customary to enter a graph by giving the integer
>> pairs that represent the starting and ending nodes for each arc. That
>> avoids the memory allocation problem you hit if one node is connected to
>> millions of others. It also may solve your problem of the distributed row
>> matrix since you could write a reducer to gather everything to the right
>> place for writing a row. In doing that, you would inherently have the row
>> number available because that would be the grouping key.
>>
>> If you keep the current one matrix row per csv line, I would recommend
>> putting the source node at the beginning of the line.
>>
>>
>> On Tue, Jun 15, 2010 at 3:58 PM, Shannon Quinn<sq...@gatech.edu> wrote:
>>
>>> 1) I've made the assumption so far that the input to my clustering
>>> algorithm will be a single CSV file containing the entire affinity
>>> matrix,
>>> where each line in the file is a row in the matrix. Is there another
>>> input
>>> approach that would work better for reading this affinity matrix?
Re: M/R capturing line numbers in text files
Posted by Shannon Quinn <sq...@gatech.edu>.
Hi Ted,
Thank you very much - very valuable insight into a more robust input
format. I've already started implementing it.
I finished the new M/R process to reflect the new assumed input format
(submitted the patch), but I'm getting an exception I can't seem to
diagnose. When I start the program, and the INFO lines start rolling
from the process, right before the M/R task begins I get the following:
java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot
be cast to org.apache.hadoop.io.IntWritable
at
org.apache.mahout.clustering.eigencuts.EigencutsInputMapper.map(EigencutsInputMapper.java:22)
The line 22 referred to in the message is:
public class EigencutsInputMapper extends Mapper<IntWritable, Text,
IntWritable, DistributedRowMatrix.MatrixEntryWritable> {
I did a search in all my source files; no mention anywhere (except one
commented-out line) of LongWritable. It was in my previous
implementation, but I performed mvn clean multiple times. Any thoughts
would be appreciated.
Thank you again!
Regards,
Shannon
On 6/15/2010 7:03 PM, Ted Dunning wrote:
> Shannon,
>
> Nice work so far.
>
> I think it is a bit more customary to enter a graph by giving the integer
> pairs that represent the starting and ending nodes for each arc. That
> avoids the memory allocation problem you hit if one node is connected to
> millions of others. It also may solve your problem of the distributed row
> matrix since you could write a reducer to gather everything to the right
> place for writing a row. In doing that, you would inherently have the row
> number available because that would be the grouping key.
>
> If you keep the current one matrix row per csv line, I would recommend
> putting the source node at the beginning of the line.
>
>
> On Tue, Jun 15, 2010 at 3:58 PM, Shannon Quinn<sq...@gatech.edu> wrote:
>
>
>> 1) I've made the assumption so far that the input to my clustering
>> algorithm will be a single CSV file containing the entire affinity matrix,
>> where each line in the file is a row in the matrix. Is there another input
>> approach that would work better for reading this affinity matrix?
Re: M/R capturing line numbers in text files
Posted by Ted Dunning <te...@gmail.com>.
Shannon,
Nice work so far.
I think it is a bit more customary to enter a graph by giving the integer
pairs that represent the starting and ending nodes for each arc. That
avoids the memory allocation problem you hit if one node is connected to
millions of others. It also may solve your problem of the distributed row
matrix since you could write a reducer to gather everything to the right
place for writing a row. In doing that, you would inherently have the row
number available because that would be the grouping key.
If you keep the current one matrix row per csv line, I would recommend
putting the source node at the beginning of the line.
On Tue, Jun 15, 2010 at 3:58 PM, Shannon Quinn <sq...@gatech.edu> wrote:
>
> 1) I've made the assumption so far that the input to my clustering
> algorithm will be a single CSV file containing the entire affinity matrix,
> where each line in the file is a row in the matrix. Is there another input
> approach that would work better for reading this affinity matrix?
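[Editor's note: Ted's suggestion can be sketched without Hadoop. The helper below is hypothetical; it simulates what the shuffle phase would do with an edge-list input of one (row, col, weight) triple per line: grouping by the source-node index yields each matrix row, with the row number inherently available as the grouping key.]

```java
import java.util.Map;
import java.util.TreeMap;

/** Groups (row, col, weight) triples by row index, mimicking the reducer
 *  grouping Ted describes for assembling DistributedRowMatrix rows. */
public class EdgeListGrouping {
    static Map<Integer, Map<Integer, Double>> groupByRow(String[] triples) {
        Map<Integer, Map<Integer, Double>> rows = new TreeMap<>();
        for (String t : triples) {
            String[] f = t.split(",");
            int row = Integer.parseInt(f[0]);    // source node = matrix row
            int col = Integer.parseInt(f[1]);    // destination node = column
            double w = Double.parseDouble(f[2]); // affinity value
            rows.computeIfAbsent(row, k -> new TreeMap<>()).put(col, w);
        }
        return rows;
    }

    public static void main(String[] args) {
        // A 2x2 affinity matrix as one triple per "line" of input.
        String[] input = { "0,0,1.0", "0,1,0.5", "1,0,0.5", "1,1,1.0" };
        // Each entry of the result is one matrix row, keyed by its row
        // index -- no byte-offset-to-line-number mapping needed.
        System.out.println(groupByRow(input));
    }
}
```

Note that this also avoids building one huge in-memory row per CSV line, since a node connected to millions of others is spread across millions of small triples.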