Posted to user@mahout.apache.org by Peter Leonard <pl...@gmail.com> on 2012/10/25 17:39:31 UTC

Preserving named vectors during rowsimilarity

I'm following the walkthrough at:

https://cwiki.apache.org/MAHOUT/quick-tour-of-text-analysis-using-the-mahout-command-line

…to do some vector-space similarity comparisons on text documents with Mahout 0.7. Everything's working well, but I'm losing my named vectors at the rowsimilarity stage. I can reconnect the texts to their original names via a post-processing script that reads docIndex, but I wonder if there isn't a missing step or other hiccup in the instructions.  

Obviously I'm importing the files with -nv and have verified that it's working correctly via seqdumper. The labels are also kept through the transformation from vectors to matrix via rowid (inspecting the matrix file confirms this). They disappear, however, when I execute:

mahout rowsimilarity \
   -i named-matrix/matrix \
   -o named-similarity \
   -r [column number here] \
   --similarityClassname SIMILARITY_COSINE \
   -m 10 \
   -ess

… having been replaced with sequential integers. What's curious is that there's a bug in the walkthrough related to this issue: the rowid command outputs "reuters-matrix" but the next step, rowsimilarity, specifies "reuters-named-matrix" as its input. So I'm wondering if there might have been an interstitial step that involved invoking docIndex somehow and reconnecting the numeric vectors to their labels? I can't find any command-line argument in rowsimilarity that would preserve the labels.

Thanks in advance for any pointers!

Re: Preserving named vectors during rowsimilarity

Posted by Pat Ferrel <pa...@gmail.com>.
I wrote that doc and, AFAIK, you have to use the docIndex. The names are preserved in the matrix file and duplicated in the docIndex file, so the issue is not with the rowid job. But the rowsimilarity job strips the names from the vectors it writes to named-similarity (using the directory names from your command). As you note, you can re-associate the names based on the IDs in named-matrix/docIndex, and I agree that this is less than ideal.
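
For what it's worth, the re-association step can be sketched roughly as follows. This is only a sketch under assumptions: it presumes both named-matrix/docIndex and the named-similarity output were first dumped to text with seqdumper, and that each dumped line has the shape "Key: <id>: Value: <payload>", with similarity rows rendered as {id:score,...}. The function names and the sample data are hypothetical, so check them against your own dump before relying on it.

```python
import re

# Assumed shape of one dumped line: "Key: <id>: Value: <payload>"
LINE = re.compile(r"^Key: (\d+): Value: (.*)$")
# Assumed shape of one entry inside a dumped similarity row: "<id>:<score>"
PAIR = re.compile(r"(\d+):([-+0-9.eE]+)")


def parse_doc_index(lines):
    """Map each integer row id to its original document name."""
    index = {}
    for line in lines:
        match = LINE.match(line.strip())
        if match:
            index[int(match.group(1))] = match.group(2)
    return index


def relabel_similarities(lines, index):
    """Replace integer ids with names; neighbours sorted by descending score."""
    labelled = {}
    for line in lines:
        match = LINE.match(line.strip())
        if not match:
            continue  # skip header or summary lines in the dump
        row_name = index[int(match.group(1))]
        pairs = [(index[int(i)], float(s)) for i, s in PAIR.findall(match.group(2))]
        labelled[row_name] = sorted(pairs, key=lambda p: p[1], reverse=True)
    return labelled


# Made-up sample data standing in for the two text dumps.
doc_index_dump = [
    "Key: 0: Value: /doc-a.txt",
    "Key: 1: Value: /doc-b.txt",
    "Key: 2: Value: /doc-c.txt",
]
similarity_dump = [
    "Key: 0: Value: {1:0.25,2:0.75}",
    "Key: 1: Value: {0:0.25}",
]

names = parse_doc_index(doc_index_dump)
labelled = relabel_similarities(similarity_dump, names)
for name, neighbours in labelled.items():
    print(name, neighbours)
```

Working from text dumps keeps the script independent of Hadoop, at the cost of depending on the dump format staying stable.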

I have run into a couple of places where the names of named vectors are treated differently and would love to see those places cleaned up (this was corrected recently in SSVD). Actually, I'd vote to use something like WeightedPropertyVectorWritable to store things associated with vectors, so that they get preserved through any job that works on Vector or DistributedRowMatrix.
