Posted to user@mahout.apache.org by nikos <ni...@gmail.com> on 2012/08/18 16:25:44 UTC

lucene.vector driver creates wrong results

I am using Mahout to cluster a large collection of documents. Most of 
them have four fields: id, title, description and tags. Some of them may 
be missing one or two of the fields title, description or tags, but they 
all have an id. I create a Lucene index for the documents and then use 
the lucene.vector driver to create the vectors. When I tested a sample 
of the documents, I noticed that the lucene.vector driver produces wrong 
vectors. Here is the situation:

-------------------- Sample documents --------------------
Document 1:
id: 0-9Vse287Mc
title: DIESEL RADZI: CHRON' OCZY!
description:
tags: Obrazki, Reklama, Diesel, Diesel Protect Your Eyes, Diesel Sister Yes

Document 2:
id: 0B8~9yRJLQY
title:
description: Ex boyfriends whom I never want to see again.
tags:

Document 3:
id: 0-GHzVR2aRI
title: 10 Surprising Health Benefits of Sex
description: When you re in the mood, it s a sure bet that the last 
thing on your mind is boosting your immune system or maintaining a 
healthy weight. Yet good sex offers those health benefits and more
tags: insomnia

Document 4:
id: 0BO18uXlatI
title:
description:
tags: only, tags
----------------------------------------------------------

So I run lucene.vector once for each of the fields title, description 
and tags, each time with --idField id and -err 0.9.

For each run I get the expected messages:

title:
12/08/17 12:34:00 WARN lucene.LuceneIterator: 0B8~9yRJLQY does not have a term vector for title
12/08/17 12:34:00 WARN lucene.LuceneIterator: 0BO18uXlatI does not have a term vector for title
12/08/17 12:34:00 INFO lucene.Driver: Wrote: 2 vectors

description:
12/08/17 12:35:12 WARN lucene.LuceneIterator: 0BO18uXlatI does not have a term vector for description
12/08/17 12:35:12 WARN lucene.LuceneIterator: 0-9Vse287Mc does not have a term vector for description
12/08/17 12:35:12 INFO lucene.Driver: Wrote: 2 vectors

tags:
12/08/17 14:42:45 WARN lucene.LuceneIterator: 0B8~9yRJLQY does not have a term vector for tags
12/08/17 14:42:45 INFO lucene.Driver: Wrote: 3 vectors

But when I read the sequence files that were produced, using 
SequenceFile.Reader, I get these results:

For title I get:
0: 0B8~9yRJLQY:{6:0.4472136002199899,5:0.6324555220209759,2:0.4472136002199899,0:0.4472136002199899}
1: 0BO18uXlatI:{4:0.5773502691896258,3:0.5773502691896258,1:0.5773502691896258}

For description I get:
0: 0B8~9yRJLQY:{3:1.0}
1: 0-GHzVR2aRI:{14:0.2672612419124244,13:0.2672612419124244,12:0.2672612419124244,11:0.2672612419124244,10:0.2672612419124244,9:0.2672612419124244,8:0.2672612419124244,7:0.2672612419124244}

For tags I get:
0: 0B8~9yRJLQY:{5:1.0}
1: 0BO18uXlatI:{14:0.7071067811865475,9:0.7071067811865475}
2: 0-9Vse287Mc:{13:0.2886751345948129,12:0.2886751345948129,11:0.2886751345948129,10:0.2886751345948129,8:0.2886751345948129,7:0.2886751345948129,6:0.2886751345948129}
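
For anyone who wants to inspect these dumps programmatically, here is a 
minimal Python sketch that parses one printed "id:{index:weight,...}" 
entry into an ID and a sparse dictionary, then reports the number of 
non-zero entries and the L2 norm. The helper names parse_vector_line and 
l2_norm are my own, and the text format is only assumed from the output 
above; none of this is part of Mahout.

```python
import math


def parse_vector_line(line):
    """Split 'id:{i:w,j:w}' into (id, {i: w, ...}).

    The format is an assumption based on the printed output in this post.
    """
    doc_id, _, body = line.strip().partition(":{")
    entries = {}
    for pair in body.rstrip("}").split(","):
        if pair:  # skip the empty string produced by an empty vector
            index, weight = pair.split(":")
            entries[int(index)] = float(weight)
    return doc_id, entries


def l2_norm(entries):
    """Euclidean length of a sparse vector stored as {index: weight}."""
    return math.sqrt(sum(w * w for w in entries.values()))


# One entry copied from the tags output above.
sample = "0BO18uXlatI:{14:0.7071067811865475,9:0.7071067811865475}"
doc_id, entries = parse_vector_line(sample)
print(doc_id, len(entries), round(l2_norm(entries), 6))
```

The number of entries per vector is a quick way to check whether a 
vector plausibly belongs to the document it is labelled with.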

As you can see, although the document with id 0B8~9yRJLQY has a value 
only for the description field, the Reader shows vectors under that id 
for the title and tags fields too. The document with id 0-GHzVR2aRI has 
all of the fields, so it should have a vector in each of the three 
outputs, but it only appears in the description output.
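
To make the mismatch concrete, here is a minimal Python sketch that 
compares, for each field, the ids that should have a vector with the ids 
that actually label the vectors in the output. Both dictionaries are 
transcribed by hand from the sample documents and results in this post, 
and the function name compare is hypothetical; it only checks ids, not 
the vector contents.

```python
# Fields that have a value in each sample document (transcribed from the post).
docs = {
    "0-9Vse287Mc": {"title", "tags"},
    "0B8~9yRJLQY": {"description"},
    "0-GHzVR2aRI": {"title", "description", "tags"},
    "0BO18uXlatI": {"tags"},
}

# Ids that label the vectors read back from each sequence file (from the post).
observed = {
    "title": ["0B8~9yRJLQY", "0BO18uXlatI"],
    "description": ["0B8~9yRJLQY", "0-GHzVR2aRI"],
    "tags": ["0B8~9yRJLQY", "0BO18uXlatI", "0-9Vse287Mc"],
}


def compare(docs, observed):
    """Return, per field, ids that appear unexpectedly and ids that are missing."""
    report = {}
    for field, ids in observed.items():
        expected = {doc_id for doc_id, fields in docs.items() if field in fields}
        report[field] = {
            "unexpected": sorted(set(ids) - expected),
            "missing": sorted(expected - set(ids)),
        }
    return report


for field, diff in compare(docs, observed).items():
    print(field, diff)
```

With this data, the title output is labelled with exactly the two ids 
that have no title, and the tags output contains 0B8~9yRJLQY in place of 
0-GHzVR2aRI, while the number of vectors per field matches the "Wrote" 
lines in the log.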

Why does this confusion happen and how can I correct this?

PS: Sorry if this is a double post; I am not sure whether the first 
email was sent successfully.
PS2: Please tell me if this is a bug in the lucene.vector driver.