Posted to java-user@lucene.apache.org by Scott Smith <ss...@mainstreamdata.com> on 2011/09/22 02:17:27 UTC

MoreLikeThis Interface changes

I'm updating my Lucene code from 3.0 to 3.4.  There's a change in the MLT interface I'm confused about.  I used the MLT.like(InputStream) method.  It now appears I should change to the MLT.like(InputStreamReader, fieldname) method.  It's easy enough to create an InputStreamReader from an InputStream.

So, my question is about the new fieldname parameter.  There's also a method, MLT.setFieldNames(String[]).  The two would seem to be redundant, except that setFieldNames() lets you specify multiple fields and like() doesn't.  Am I allowed to pass null as the fieldname in like()?  (The documentation doesn't say you can.)  It seems like you shouldn't need to do both, but there's a difference in functionality between the two, since one allows multiple fields and the other doesn't.

As a second question, is it possible to boost the MLT words that come from the subject, as opposed to the body, of the document you are looking for similar items to?

Thanks

Scott

RE: MoreLikeThis Interface changes

Posted by Scott Smith <ss...@mainstreamdata.com>.

OK.  Thanks

Re: MoreLikeThis Interface changes

Posted by Robert Muir <rc...@gmail.com>.

On Mon, Sep 26, 2011 at 2:06 PM, Scott Smith <ss...@mainstreamdata.com> wrote:

> "is" is the input stream.  Did I miss something in your response?
>

Yes, this is totally unrelated to fields[].

It has to do with which fieldname is passed to the analyzer to
analyze the reader into tokens (and there can be only one for this).



-- 
lucidimagination.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

RE: MoreLikeThis Interface changes

Posted by Scott Smith <ss...@mainstreamdata.com>.

So, I thought your response meant that I could eliminate my code:

        String[] fields = new String[1];
        fields[0] = "EVERYTHING";         // use the single "big" field in the index
        mlt.setFieldNames(fields);

But, if I comment out that code, my unit test fails.  If I include it, it passes.

I'm using MLT as follows:

            _query = new BooleanClause(mlt.like(new InputStreamReader(is), "EVERYTHING"), BooleanClause.Occur.MUST);

"is" is the input stream.  Did I miss something in your response?

Scott

RE: MoreLikeThis Interface changes

Posted by Scott Smith <ss...@mainstreamdata.com>.

Understand.  Thanks for the information.

Re: MoreLikeThis Interface changes

Posted by Robert Muir <rc...@gmail.com>.

On Wed, Sep 21, 2011 at 5:17 PM, Scott Smith <ss...@mainstreamdata.com> wrote:
> I'm updating my lucene code from 3.0 to 3.4.  There's a change in the MLT interface I'm confused about.  I used the MLT.like(InputStream) method.  It now appears I should change to the MLT.like(InputStreamReader, fieldname) method.  Easy enough to create an InputStreamReader from an InputStream.

Yes, requiring a Reader ensures that MLT uses the encoding you want.
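
To illustrate the encoding point, here is a minimal JDK-only sketch (no Lucene involved). The class name ReaderEncodingSketch and the helper decode() are made up for this example. It shows that the same bytes come out as different text depending on the charset given to the InputStreamReader, which is why choosing the charset yourself matters before handing a Reader to like().

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReaderEncodingSketch {

    // Hypothetical helper: decode the whole stream with an explicitly chosen charset.
    public static String decode(InputStream in, Charset charset) {
        try (Reader reader = new InputStreamReader(in, charset)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[256];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "caf\u00e9" encoded as UTF-8 is five bytes; the last character takes two.
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        // Decoded with the charset the bytes were written in.
        System.out.println(decode(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8));

        // Decoded with a different charset: the two-byte character turns into two wrong ones.
        System.out.println(decode(new ByteArrayInputStream(utf8), StandardCharsets.ISO_8859_1));
    }
}
```

The one-argument constructor, new InputStreamReader(is), falls back to the platform default charset, so passing the charset explicitly is the safer habit.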

>
> So, my question is regarding the addition of the fieldname parameter.  There's also a call called MLT.setFieldNames(String[]).  This would seem to be redundant except the setFieldNames() allows you to specify multiple fields and like() doesn't.  Am I allowed to specify null as the fieldname in like() (documentation doesn't say you can).  It seems like you shouldn't need to do both.  But there's a difference in functionality between the two (since one allows multiple fields and the other doesn't).

A Reader has no fields :)
The fieldName is only for passing to the Analyzer (@param fieldName
field passed to the analyzer to use when analyzing the content).
This is because some Analyzers (e.g. PerFieldAnalyzerWrapper) analyze
content differently according to different fields.
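
As a rough illustration of why the field name matters to an analyzer, here is a JDK-only stand-in. It is not Lucene API: the class PerFieldSketch and its register(), analyze() and example() methods are all hypothetical, and real analyzers produce token streams rather than lists. The idea is only that the same text yields different tokens depending on which field name is passed along with it.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for a per-field analyzer; NOT the Lucene class.
public class PerFieldSketch {
    private final Map<String, Function<String, List<String>>> perField = new HashMap<>();
    private final Function<String, List<String>> fallback;

    public PerFieldSketch(Function<String, List<String>> fallback) {
        this.fallback = fallback;
    }

    // Use a dedicated tokenizer for one field name.
    public void register(String fieldName, Function<String, List<String>> tokenizer) {
        perField.put(fieldName, tokenizer);
    }

    // The field name only selects HOW the text is tokenized.
    public List<String> analyze(String fieldName, String text) {
        return perField.getOrDefault(fieldName, fallback).apply(text);
    }

    // Example setup: lowercase and split on whitespace by default, keep "id" values whole.
    public static PerFieldSketch example() {
        PerFieldSketch analyzer = new PerFieldSketch(
                text -> Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\s+")));
        analyzer.register("id", text -> Collections.singletonList(text));
        return analyzer;
    }

    public static void main(String[] args) {
        PerFieldSketch analyzer = example();
        System.out.println(analyzer.analyze("body", "Lucene In Action"));
        System.out.println(analyzer.analyze("id", "Lucene In Action"));
    }
}
```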

Previously, MoreLikeThis would use what was in the setFieldNames
parameter, iteratively like this:
for (String field : fieldNames) {
  analyzer.tokenStream(field, reader);  // same reader reused for every field
}

However, MoreLikeThis also had a bug where it would never close() the
reader. As you can see, this logic was completely bogus, as you can only
consume the reader once.

Effectively the reader would be analyzed by fieldNames[0], then MLT
would analyze an exhausted reader with fieldNames[1]...fieldNames[n].

When we fixed MLT to close its resources correctly (around 3.2), it
exposed this second bug: if you tried to pass a reader with multiple
values in fieldNames, you would get an IOException because it tried to
re-consume a closed reader.
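
Both failure modes can be reproduced with plain java.io, independent of Lucene. In this small sketch the class name ReaderReuseSketch and the helpers drain() and failsWhenRead() are made up for illustration. A second pass over an already consumed reader yields nothing, and reading after close() raises an IOException.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class ReaderReuseSketch {

    // Hypothetical helper: read everything that is left in the reader.
    public static String drain(Reader reader) {
        try {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical helper: true if a read attempt fails with an IOException.
    public static boolean failsWhenRead(Reader reader) {
        try {
            reader.read();
            return false;
        } catch (IOException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        StringReader reader = new StringReader("some document text");

        // First pass sees the content; this is the fieldNames[0] case.
        System.out.println("first pass:  '" + drain(reader) + "'");

        // Second pass over the same reader finds it exhausted.
        System.out.println("second pass: '" + drain(reader) + "'");

        // Once the reader is closed, any further read throws.
        reader.close();
        System.out.println("read after close fails: " + failsWhenRead(reader));
    }
}
```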

Now, when supplying a reader, you should instead pass in this
fieldName explicitly so that it analyzes the content the way you want.
For backwards compatibility, the deprecated method uses
fieldNames[0] only.

-- 
lucidimagination.com
