Posted to user@mahout.apache.org by Kris Jack <mr...@gmail.com> on 2010/07/02 18:22:21 UTC
Re: Generating a Document Similarity Matrix
Hi Sebastian,
I am currently using your code with NamedVectors in my input. In the
output, however, the names seem to be missing. Would there be a way to
include them?
Thanks,
Kris
2010/6/29 Sebastian Schelter <ss...@googlemail.com>
> Hi Kris,
>
> I'm glad I could help you and it's really cool that you are testing my
> patches on real data. I'm looking forward to hearing more!
>
> -sebastian
>
> Am 29.06.2010 11:25, schrieb Kris Jack:
> > Hi Sebastian,
> >
> > You really are very kind! I have taken your code and run it to print out
> > the contents of the output file. There are indeed only 37,952 results so
> > that gives me more confidence in the vector dumper. I'm not sure why there
> > was a memory problem though, seeing as it seems to have output the results
> > correctly. Now I just have to match them up with my original lucene ids and
> > see how it is performing. I'll keep you posted with the results.
> >
> > Thanks,
> > Kris
> >
> >
> >
> > 2010/6/28 Sebastian Schelter <ss...@googlemail.com>
> >
> >
> >> Hi Kris,
> >>
> >> Unfortunately I'm not familiar with the VectorDumper code (and a quick
> >> look didn't help either), so I can't help you with the OutOfMemoryError.
> >>
> >> It is possible that only 37,952 results are found for an input of
> >> 500,000 vectors; it really depends on the actual data. If you're sure
> >> that there should be more results, you could provide me with a sample
> >> input file and I'll try to find out why there aren't more results.
> >>
> >> I wrote a small class for you that dumps the output file of the job to
> >> the console (I tested it with the output of my unit tests); maybe that
> >> can help us find the source of the problem.
> >>
> >> -sebastian
> >>
> >> public class MatrixReader extends AbstractJob {
> >>
> >>   public static void main(String[] args) throws Exception {
> >>     ToolRunner.run(new MatrixReader(), args);
> >>   }
> >>
> >>   @Override
> >>   public int run(String[] args) throws Exception {
> >>
> >>     addInputOption();
> >>
> >>     Map<String,String> parsedArgs = parseArguments(args);
> >>     if (parsedArgs == null) {
> >>       return -1;
> >>     }
> >>
> >>     Configuration conf = getConf();
> >>     FileSystem fs = FileSystem.get(conf);
> >>
> >>     Path vectorFile = fs.listStatus(getInputPath(),
> >>         TasteHadoopUtils.PARTS_FILTER)[0].getPath();
> >>
> >>     SequenceFile.Reader reader = null;
> >>     try {
> >>       reader = new SequenceFile.Reader(fs, vectorFile, conf);
> >>       IntWritable key = new IntWritable();
> >>       VectorWritable value = new VectorWritable();
> >>
> >>       while (reader.next(key, value)) {
> >>         int row = key.get();
> >>         System.out.print(row + ": ");
> >>         Iterator<Element> elementsIterator = value.get().iterateNonZero();
> >>         String separator = "";
> >>         while (elementsIterator.hasNext()) {
> >>           Element element = elementsIterator.next();
> >>           System.out.print(separator + element.index() + "," + element.get());
> >>           separator = ";";
> >>         }
> >>         System.out.print("\n");
> >>       }
> >>     } finally {
> >>       reader.close();
> >>     }
> >>     return 0;
> >>   }
> >> }
> >>
> >> Am 28.06.2010 17:18, schrieb Kris Jack:
> >>
> >>> Hi,
> >>>
> >>> I am now using the version of
> >>> org.apache.mahout.math.hadoop.similarity.RowSimilarityJob that Sebastian
> >>> has written and has been added to the trunk. Thanks again for that! I can
> >>> generate an output file that should contain a list of documents with their
> >>> top 100 most similar documents. I am having problems, however, in
> >>> converting the output file into a readable format using mahout's
> >>> vectordump:
> >>>
> >>> $ ./mahout vectordump --seqFile similarRows --output results.out --printKey
> >>> no HADOOP_CONF_DIR or HADOOP_HOME set, running locally
> >>> Input Path: /home/kris/similarRows
> >>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> >>>   at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:59)
> >>>   at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
> >>>   at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1930)
> >>>   at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1830)
> >>>   at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1876)
> >>>   at org.apache.mahout.utils.vectors.SequenceFileVectorIterable$SeqFileIterator.hasNext(SequenceFileVectorIterable.java:77)
> >>>   at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:138)
> >>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>>   at java.lang.reflect.Method.invoke(Method.java:597)
> >>>   at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> >>>   at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> >>>   at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:174)
> >>>
> >>> What is this doing that takes up so much memory? A file is produced with
> >>> 37,952 readable rows but I'm expecting more like 500,000 results, since I
> >>> have this number of documents. Should I be using something else to read the
> >>> output file of the RowSimilarityJob?
> >>>
> >>> Thanks,
> >>> Kris
> >>>
> >>>
> >>>
> >>> 2010/6/18 Sebastian Schelter <ss...@googlemail.com>
> >>>
> >>>
> >>>
> >>>> Hi Kris,
> >>>>
> >>>> maybe you want to give the patch from
> >>>> https://issues.apache.org/jira/browse/MAHOUT-418 a try? I have not yet
> >>>> tested it with larger data, but I would be happy to get some
> >>>> feedback for it and maybe it helps you with your use case.
> >>>>
> >>>> -sebastian
> >>>>
> >>>> Am 18.06.2010 18:46, schrieb Kris Jack:
> >>>>
> >>>>> Thanks Ted,
> >>>>>
> >>>>> I got that working. Unfortunately, the matrix multiplication job is taking
> >>>>> far longer than I hoped. With just over 10 million documents, 10 mappers
> >>>>> and 10 reducers, I can't get it to complete the job in under 48 hours.
> >>>>>
> >>>>> Perhaps you have an idea for speeding it up? I have already been quite
> >>>>> ruthless with making the vectors sparse. I did not include terms that
> >>>>> appeared in over 1% of the corpus and only kept terms that appeared at
> >>>>> least 50 times. Is it normal that the matrix multiplication map reduce
> >>>>> task should take so long to process with this quantity of data and
> >>>>> resources available or do you think that my system is not configured
> >>>>> properly?
> >>>>>
> >>>>> Thanks,
> >>>>> Kris
> >>>>>
> >>>>>
> >>>>>
> >>>>> 2010/6/15 Ted Dunning <te...@gmail.com>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>> Thresholds are generally dangerous. It is usually preferable to specify
> >>>>>> the sparseness you want (1%, 0.2%, whatever), sort the results in
> >>>>>> descending score order using Hadoop's built-in capabilities and just
> >>>>>> drop the rest.
> >>>>>>
> >>>>>> On Tue, Jun 15, 2010 at 9:32 AM, Kris Jack <mr...@gmail.com> wrote:
> >>>>>>
> >>>>>>> I was wondering if there was an
> >>>>>>> interesting way to do this with the current mahout code such as requesting
> >>>>>>> that the Vector accumulator returns only elements that have values greater
> >>>>>>> than a given threshold, sorting the vector by value rather than key, or
> >>>>>>> something else?
--
Dr Kris Jack,
http://www.mendeley.com/profiles/kris-jack/
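Ted's earlier suggestion (keep only the highest-scoring entries per row and drop the rest) can be sketched outside Hadoop with a bounded min-heap. This is an illustrative standalone class, not Mahout code: the name TopNSimilarities and the double[]{docId, score} pair encoding are made up for the example, and a real job would apply the same logic per row in a reducer.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Illustrative only: keeps the n highest-scoring (docId, score) pairs. */
public class TopNSimilarities {

  public static List<double[]> topN(List<double[]> pairs, int n) {
    // Min-heap on score: once the heap holds n entries, the lowest
    // score is evicted whenever a better pair arrives.
    PriorityQueue<double[]> heap =
        new PriorityQueue<>(n, Comparator.comparingDouble((double[] p) -> p[1]));
    for (double[] pair : pairs) {
      heap.offer(pair);
      if (heap.size() > n) {
        heap.poll(); // drop the current lowest score
      }
    }
    List<double[]> result = new ArrayList<>(heap);
    result.sort((a, b) -> Double.compare(b[1], a[1])); // descending by score
    return result;
  }
}
```

The bounded heap keeps memory at O(n) per row regardless of how many similarity pairs are seen, which is the point of dropping by rank rather than by threshold.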
Re: Generating a Document Similarity Matrix
Posted by Sebastian Schelter <ss...@googlemail.com>.
Hi Kris,
I think the best way would be to manually join the names to the result
after executing the job.
--sebastian
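A minimal sketch of the manual join Sebastian describes, assuming you kept the row-index-to-name mapping from when the NamedVectors were turned into matrix rows. The NameJoiner class is hypothetical and the in-memory maps stand in for the job's SequenceFile output; a real join over large data would itself be a map-reduce step.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: replaces integer row/column ids in similarity
 *  output with document names, using a separately kept id -> name map. */
public class NameJoiner {

  public static Map<String, Map<String, Double>> join(
      Map<Integer, Map<Integer, Double>> similarities,
      Map<Integer, String> names) {
    Map<String, Map<String, Double>> named = new LinkedHashMap<>();
    for (Map.Entry<Integer, Map<Integer, Double>> row : similarities.entrySet()) {
      Map<String, Double> namedRow = new LinkedHashMap<>();
      for (Map.Entry<Integer, Double> cell : row.getValue().entrySet()) {
        // Look up the column's document name and keep its score.
        namedRow.put(names.get(cell.getKey()), cell.getValue());
      }
      named.put(names.get(row.getKey()), namedRow);
    }
    return named;
  }
}
```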