You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by Shai Erera <se...@gmail.com> on 2013/06/25 14:54:49 UTC

Allow TfIdfSimilarity to encode norms in more than a byte

Hi

Today TfIdfSimilarity forces the encoding of norms into a single byte, and
there's no way to override it. E.g. if I don't want to lose precision, the
only thing I can do is write a different Similarity while copying most of
the code from TfIdfSimilarity.

I looked at the code and I was wondering if we cannot move this byte
encoding to DefaultSimilarity, while making encodeNorm and decodeNorm
take/return long values, as eventually Sim.computeNorm returns a long value.

That way, DefaultSim will document the byte-encoding it uses (move the
respective jdocs from TfIdfSim to Default), while allowing to implement
TfIdfSim without losing that much precision, as well as not copying the
entire class.

While we're at it, I read LUCENE-1907 about queryNorm and I think this is
important too. So whether we introduce a queryNorm or make computeWeight
non-final, to allow returning other SimWeight impls, I think we should
allow that extension.

I was about to open a JIRA issue to handle that until I noticed the
explicit documentation in TfIdfSim (re the byte encoding), so thought I'd
ask first if anybody objects.

Back-compat wise, we will need to break it (by changing encode/decodeNorm
signatures), but I think that's ok in this case. Overriding TfIdfSim is
expert enough IMO.

Shai

Re: Allow TfIdfSimilarity to encode norms in more than a byte

Posted by Shai Erera <se...@gmail.com>.
Wait, lengthNorm is already abstract, left to whoever implements
TfIdfSimilarity to return a float. So we're already letting people mess
with length norm.

All I want to remove is the enforcement of a single-byte encoding, to let
you encode a full float, int or even long, as it's a NumericDV under the
covers. You then decode it back to a float. So lengthNorm will still be
float, I don't propose to change it. Only how this float is encoded.

Perhaps a patch would clarify it?

Shai


On Tue, Jun 25, 2013 at 4:34 PM, Robert Muir <rc...@gmail.com> wrote:

>
>
> On Tue, Jun 25, 2013 at 9:20 AM, Shai Erera <se...@gmail.com> wrote:
>
>>
>>
>> Right now, I need to copy most of the "tf-idf" code into my Sim, and I
>> don't think that's good software engineering. How many people really extend
>> Tf-Idf that the API can get complicated?
>>
>
> And we have to do the same thing if we want to modify MmapDirectory to
> call mappedbytebuffer.load, but everyone is ok with it there!
>
> I just mentioned my concern: if we can come up with a patch that isn't
> trappy, I'll be happy. but past experience shows fucking around with
> lengthnorm has caused confusion, non-obvious back compat breaks, bugs,
> etc... hence my concerns.
>

Re: Allow TfIdfSimilarity to encode norms in more than a byte

Posted by Robert Muir <rc...@gmail.com>.
On Tue, Jun 25, 2013 at 9:20 AM, Shai Erera <se...@gmail.com> wrote:

>
>
> Right now, I need to copy most of the "tf-idf" code into my Sim, and I
> don't think that's good software engineering. How many people really extend
> Tf-Idf that the API can get complicated?
>

And we have to do the same thing if we want to modify MmapDirectory to call
mappedbytebuffer.load, but everyone is ok with it there!

I just mentioned my concern: if we can come up with a patch that isn't
trappy, I'll be happy. but past experience shows fucking around with
lengthnorm has caused confusion, non-obvious back compat breaks, bugs,
etc... hence my concerns.

Re: Allow TfIdfSimilarity to encode norms in more than a byte

Posted by Shai Erera <se...@gmail.com>.
The problem as I see it is that if you want to implement a "true" Tf-Idf
similarity, e.g. as specified in the books, you have no way to do that by
extending TfIdfSimilarity which is odd.

I think that we can make TfIdfSim implement the core parts of Tf-Idf,
letting extensions worry about the details. And whoever wants to change a
single marginal thing, should extend DefaultSim. If he wants to gain finer
control e.g. over norms, queryNorm etc., he can extend TfIdfSim. And if he
wants to implement something else completely, well, he can extend
Similarity.

Right now, I need to copy most of the "tf-idf" code into my Sim, and I
don't think that's good software engineering. How many people really extend
Tf-Idf that the API can get complicated?

Shai


On Tue, Jun 25, 2013 at 4:11 PM, Robert Muir <rc...@gmail.com> wrote:

>
>
> On Tue, Jun 25, 2013 at 8:54 AM, Shai Erera <se...@gmail.com> wrote:
>
>> Hi
>>
>> Today TfIdfSimilarity forces the encoding of norms into a single byte,
>> and there's no way to override it. E.g. if I don't want to lose precision,
>> the only thing I can do is write a different Similarity while copying most
>> of the code from TfIdfSimilarity.
>>
>
> But as you said, its expert enough :)
>
> I'm a little worried about how complex this would make the API. Today
> TFIDFSimilarity hides all of this stuff and only provides a simple API with
> tf(), idf(), etc for tuning. I think thats really how they all should
> work...
>

Re: Allow TfIdfSimilarity to encode norms in more than a byte

Posted by Robert Muir <rc...@gmail.com>.
On Tue, Jun 25, 2013 at 8:54 AM, Shai Erera <se...@gmail.com> wrote:

> Hi
>
> Today TfIdfSimilarity forces the encoding of norms into a single byte, and
> there's no way to override it. E.g. if I don't want to lose precision, the
> only thing I can do is write a different Similarity while copying most of
> the code from TfIdfSimilarity.
>

But as you said, its expert enough :)

I'm a little worried about how complex this would make the API. Today
TFIDFSimilarity hides all of this stuff and only provides a simple API with
tf(), idf(), etc for tuning. I think thats really how they all should
work...