
Posted to general@lucene.apache.org by Lionel Duboeuf <li...@boozter.com> on 2009/06/15 23:06:20 UTC

index per-user basis and document frequency

Hi,

I use Lucene to index users' documents. I potentially have 2 million or
more users, so I think a per-user index will not be a scalable
solution. All my searches are filtered on a user UID field.
As far as I know, the default similarity calculates Inverse Document
Frequency as follows:
 (float)(Math.log(numDocs/(double)(docFreq+1)) + 1.0)
where numDocs stands for the number of documents in the whole
collection and docFreq for the number of documents in the whole
collection that contain term t.
My problem is that this formula does not seem reliable for my
system, because numDocs should correspond to the number of documents in
the user's collection and docFreq to the number of documents in the
user's collection that contain term t.
Because terms are stored as single tokens, I was thinking of
concatenating each term with a UID to keep them separate, because the
term "car" for user1 is different from the term "car" for user2. My
solution would index "carUSERUID1" and "carUSERUID2".
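
To make the difference concrete, here is a minimal sketch in plain Java (no Lucene dependency) that applies the formula above to collection-wide and to per-user statistics. The class name IdfSketch and all of the counts are made up for illustration; only the formula itself comes from the default similarity.

```java
// Illustrative only: IdfSketch is a hypothetical helper, not part of Lucene.
class IdfSketch {

    // Same formula as quoted above: log(numDocs / (docFreq + 1)) + 1.
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        // Made-up statistics for the term "car".
        int globalNumDocs = 1_000_000; // documents in the whole index
        int globalDocFreq = 50_000;    // documents in the index containing "car"
        int userNumDocs = 200;         // documents belonging to one user
        int userDocFreq = 150;         // that user's documents containing "car"

        System.out.println("global idf:   " + idf(globalDocFreq, globalNumDocs));
        System.out.println("per-user idf: " + idf(userDocFreq, userNumDocs));
    }
}
```

With these made-up numbers the term is fairly rare across the whole collection but common for this one user, so the per-user idf comes out lower than the global one.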

What would you suggest ?

Regards,

Lionel

Re: index per-user basis and document frequency

Posted by lionel duboeuf <li...@boozter.com>.

Ted Dunning wrote:
> I don't think that this would be such a great idea.
>
> Better to use a custom similarity
> <http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/Similarity.html>
> data structure.  Before you do that, though, you might try just using the
> overall corpus statistics and not worry about this per user indexing with
> specialized statistics.  If users are no more different from each other
> than sub-corpora in a normal retrieval system then you are liable to get
> much better results using corpus wide stats than with user level stats.
>
OK, even if I modify the similarity measure, I will face a polysemy problem:
e.g. the term "car" in English is different from the term "car" in French.
Also, what is the best approach to calculating numDocs for a given user
easily (and quickly)?
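
One possible approach, sketched below in plain Java, is to keep a per-user document counter that is updated whenever a document is added or deleted, so that numDocs for a user is a simple map lookup at search time. The class UserDocCounter and its method names are hypothetical and not part of Lucene; the sketch also assumes the counter is kept in sync with the index by the application.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: UserDocCounter is a hypothetical helper, not a Lucene class.
class UserDocCounter {
    private final Map<String, Integer> docsPerUser = new HashMap<>();

    // Call once for every document indexed for the given user.
    void onDocumentAdded(String userUid) {
        docsPerUser.merge(userUid, 1, Integer::sum);
    }

    // Call once for every document deleted for the given user.
    // The entry is dropped when the count reaches zero.
    void onDocumentDeleted(String userUid) {
        docsPerUser.computeIfPresent(userUid, (uid, n) -> n > 1 ? n - 1 : null);
    }

    // Number of documents currently counted for the user (0 if unknown).
    int numDocs(String userUid) {
        return docsPerUser.getOrDefault(userUid, 0);
    }
}
```

Since every document carries exactly one UID term, another option may be to ask the index for the document frequency of that UID term (IndexReader.docFreq(Term)), which should give the same count as long as deleted documents are accounted for.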

thanks for your answer.

lionel

Re: index per-user basis and document frequency

Posted by Ted Dunning <te...@gmail.com>.

I don't think that this would be such a great idea.

Better to use a custom similarity
<http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/Similarity.html>
data structure.  Before you do that, though, you might try just using the
overall corpus statistics and not worry about this per user indexing with
specialized statistics.  If users are no more different from each other
than sub-corpora in a normal retrieval system then you are liable to get
much better results using corpus wide stats than with user level stats.


-- 
Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
http://www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)