Posted to user@mahout.apache.org by Kris Jack <mr...@gmail.com> on 2010/07/02 18:22:21 UTC

Re: Generating a Document Similarity Matrix

Hi Sebastian,

I am currently using your code with NamedVectors in my input.  In the
output, however, the names seem to be missing.  Would there be a way to
include them?

Thanks,
Kris



2010/6/29 Sebastian Schelter <ss...@googlemail.com>

> Hi Kris,
>
> I'm glad I could help you and it's really cool that you are testing my
> patches on real data. I'm looking forward to hearing more!
>
> -sebastian
>
> On 29.06.2010 11:25, Kris Jack wrote:
> > Hi Sebastian,
> >
> > You really are very kind!  I have taken your code and run it to print
> > out the contents of the output file.  There are indeed only 37,952
> > results, so that gives me more confidence in the vector dumper.  I'm not
> > sure why there was a memory problem though, seeing as it seems to have
> > output the results correctly.  Now I just have to match them up with my
> > original Lucene ids and see how it is performing.  I'll keep you posted
> > with the results.
> >
> > Thanks,
> > Kris
> >
> >
> >
> > 2010/6/28 Sebastian Schelter <ss...@googlemail.com>
> >
> >
> >> Hi Kris,
> >>
> >> Unfortunately I'm not familiar with the VectorDumper code (and a quick
> >> look didn't help either), so I can't help you with the OutOfMemoryError.
> >>
> >> It is possible that only 37,952 results are found for an input of
> >> 500,000 vectors; it really depends on the actual data. If you're sure
> >> that there should be more results, you could provide me with a sample
> >> input file and I'll try to find out why there aren't more results.
> >>
> >> I wrote a small class for you that dumps the output file of the job to
> >> the console (I tested it with the output of my unit tests); maybe that
> >> can help us find the source of the problem.
> >>
> >> -sebastian
> >>
> >> import java.util.Iterator;
> >> import java.util.Map;
> >>
> >> import org.apache.hadoop.conf.Configuration;
> >> import org.apache.hadoop.fs.FileSystem;
> >> import org.apache.hadoop.fs.Path;
> >> import org.apache.hadoop.io.IntWritable;
> >> import org.apache.hadoop.io.SequenceFile;
> >> import org.apache.hadoop.util.ToolRunner;
> >> import org.apache.mahout.cf.taste.hadoop.TasteHadoopUtils;
> >> import org.apache.mahout.common.AbstractJob;
> >> import org.apache.mahout.math.Vector.Element;
> >> import org.apache.mahout.math.VectorWritable;
> >>
> >> public class MatrixReader extends AbstractJob {
> >>
> >>  public static void main(String[] args) throws Exception {
> >>    ToolRunner.run(new MatrixReader(), args);
> >>  }
> >>
> >>  @Override
> >>  public int run(String[] args) throws Exception {
> >>
> >>    addInputOption();
> >>
> >>    Map<String,String> parsedArgs = parseArguments(args);
> >>    if (parsedArgs == null) {
> >>      return -1;
> >>    }
> >>
> >>    Configuration conf = getConf();
> >>    FileSystem fs = FileSystem.get(conf);
> >>
> >>    // pick the first part-* file from the job's output directory
> >>    Path vectorFile = fs.listStatus(getInputPath(),
> >>        TasteHadoopUtils.PARTS_FILTER)[0].getPath();
> >>
> >>    SequenceFile.Reader reader = null;
> >>    try {
> >>      reader = new SequenceFile.Reader(fs, vectorFile, conf);
> >>      IntWritable key = new IntWritable();
> >>      VectorWritable value = new VectorWritable();
> >>
> >>      // print every row as "row: index,value;index,value;..."
> >>      while (reader.next(key, value)) {
> >>        int row = key.get();
> >>        System.out.print(row + ": ");
> >>        Iterator<Element> elementsIterator = value.get().iterateNonZero();
> >>        String separator = "";
> >>        while (elementsIterator.hasNext()) {
> >>          Element element = elementsIterator.next();
> >>          System.out.print(separator + element.index() + "," + element.get());
> >>          separator = ";";
> >>        }
> >>        System.out.println();
> >>      }
> >>    } finally {
> >>      if (reader != null) {
> >>        reader.close();
> >>      }
> >>    }
> >>    return 0;
> >>  }
> >> }
> >>
> >> On 28.06.2010 17:18, Kris Jack wrote:
> >>
> >>> Hi,
> >>>
> >>> I am now using the version of
> >>> org.apache.mahout.math.hadoop.similarity.RowSimilarityJob that
> >>> Sebastian has written and that has been added to the trunk.  Thanks
> >>> again for that!  I can generate an output file that should contain a
> >>> list of documents with their top 100 most similar documents.  I am
> >>> having problems, however, in converting the output file into a
> >>> readable format using mahout's vectordump:
> >>>
> >>> $ ./mahout vectordump --seqFile similarRows --output results.out --printKey
> >>
> >>> no HADOOP_CONF_DIR or HADOOP_HOME set, running locally
> >>> Input Path: /home/kris/similarRows
> >>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> >>>     at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:59)
> >>>     at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
> >>>     at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1930)
> >>>     at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1830)
> >>>     at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1876)
> >>>     at org.apache.mahout.utils.vectors.SequenceFileVectorIterable$SeqFileIterator.hasNext(SequenceFileVectorIterable.java:77)
> >>>     at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:138)
> >>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>>     at java.lang.reflect.Method.invoke(Method.java:597)
> >>>     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> >>>     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> >>>     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:174)
> >>>
> >>> What is this doing that takes up so much memory?  A file is produced
> >>> with 37,952 readable rows but I'm expecting more like 500,000 results,
> >>> since I have this number of documents.  Should I be using something
> >>> else to read the output file of the RowSimilarityJob?
> >>>
> >>> Thanks,
> >>> Kris
> >>>
> >>>
> >>>
> >>> 2010/6/18 Sebastian Schelter <ss...@googlemail.com>
> >>>
> >>>
> >>>
> >>>> Hi Kris,
> >>>>
> >>>> maybe you want to give the patch from
> >>>> https://issues.apache.org/jira/browse/MAHOUT-418 a try? I have not yet
> >>>> tested it with larger data, but I would be happy to get some feedback
> >>>> for it, and maybe it helps you with your use case.
> >>>>
> >>>> -sebastian
> >>>>
> >>>> On 18.06.2010 18:46, Kris Jack wrote:
> >>>>
> >>>>
> >>>>> Thanks Ted,
> >>>>>
> >>>>> I got that working.  Unfortunately, the matrix multiplication job is
> >>>>> taking far longer than I hoped.  With just over 10 million documents,
> >>>>> 10 mappers and 10 reducers, I can't get it to complete the job in
> >>>>> under 48 hours.
> >>>>>
> >>>>> Perhaps you have an idea for speeding it up?  I have already been
> >>>>> quite ruthless about making the vectors sparse.  I did not include
> >>>>> terms that appeared in over 1% of the corpus and only kept terms that
> >>>>> appeared at least 50 times.  Is it normal that the matrix
> >>>>> multiplication map-reduce task should take so long with this quantity
> >>>>> of data and these resources available, or do you think that my system
> >>>>> is not configured properly?
> >>>>>
> >>>>> Thanks,
> >>>>> Kris
> >>>>>
> >>>>>
> >>>>>
> >>>>> 2010/6/15 Ted Dunning <te...@gmail.com>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>> Thresholds are generally dangerous.  It is usually preferable to
> >>>>>> specify the sparseness you want (1%, 0.2%, whatever), sort the
> >>>>>> results in descending score order using Hadoop's built-in
> >>>>>> capabilities, and just drop the rest.
> >>>>>>
> >>>>>> On Tue, Jun 15, 2010 at 9:32 AM, Kris Jack <mr...@gmail.com> wrote:
> >>>>>>
> >>>>>>> I was wondering if there was an interesting way to do this with
> >>>>>>> the current mahout code, such as requesting that the Vector
> >>>>>>> accumulator return only elements that have values greater than a
> >>>>>>> given threshold, sorting the vector by value rather than key, or
> >>>>>>> something else?
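
As a rough illustration of Ted's suggestion above (keep a fixed number of
top-scoring entries per row rather than applying a score threshold), a
minimal sketch against the Mahout Vector API might look like the following.
TopNFilter is a hypothetical helper written for this thread, not an
existing Mahout class:

import java.util.Comparator;
import java.util.Iterator;
import java.util.PriorityQueue;

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.Vector.Element;

public class TopNFilter {

  /** Returns a copy of a row that keeps only its n largest values. */
  public static Vector topN(Vector row, int n) {
    // smallest-score-on-top queue of {index, value} pairs; the Element
    // instances returned by iterateNonZero() may be reused, so copy the
    // index and value out instead of storing the Elements themselves
    PriorityQueue<double[]> queue = new PriorityQueue<double[]>(n + 1,
        new Comparator<double[]>() {
          public int compare(double[] a, double[] b) {
            return Double.compare(a[1], b[1]);
          }
        });
    Iterator<Element> elements = row.iterateNonZero();
    while (elements.hasNext()) {
      Element e = elements.next();
      queue.add(new double[] { e.index(), e.get() });
      if (queue.size() > n) {
        queue.poll(); // evict the currently smallest score
      }
    }
    Vector result = new RandomAccessSparseVector(row.size());
    for (double[] pair : queue) {
      result.set((int) pair[0], pair[1]);
    }
    return result;
  }
}

In a map-reduce setting the same idea can be applied per key in a reducer,
so no global threshold has to be chosen in advance.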


-- 
Dr Kris Jack,
http://www.mendeley.com/profiles/kris-jack/

Re: Generating a Document Similarity Matrix

Posted by Sebastian Schelter <ss...@googlemail.com>.
Hi Kris,

I think the best way would be to manually join the names to the result
after executing the job.
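
For example, a minimal sketch of such a join, assuming the job's input is a
sequence file of IntWritable row ids and NamedVectors, could collect the
id-to-name mapping in a single pass over the input. NameJoiner is a
hypothetical helper, not a Mahout class:

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.VectorWritable;

public class NameJoiner {

  /** Reads the job's input once and records which name belongs to which row id. */
  public static Map<Integer,String> readNames(Path input, Configuration conf)
      throws Exception {
    Map<Integer,String> names = new HashMap<Integer,String>();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, input, conf);
    try {
      IntWritable key = new IntWritable();
      VectorWritable value = new VectorWritable();
      while (reader.next(key, value)) {
        // the name is only available if the vector really is a NamedVector
        if (value.get() instanceof NamedVector) {
          names.put(key.get(), ((NamedVector) value.get()).getName());
        }
      }
    } finally {
      reader.close();
    }
    return names;
  }
}

The returned map can then be used when printing the similarity output, for
instance in the MatrixReader loop from earlier in this thread, to replace
each row id and each neighbour's column id with its document name.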

--sebastian

On 02.07.2010 18:22, Kris Jack wrote:
> Hi Sebastian,
>
> I am currently using your code with NamedVectors in my input.  In the
> output, however, the names seem to be missing.  Would there be a way to
> include them?
>
> Thanks,
> Kris