
Posted to dev@lucene.apache.org by Alessandro Benedetti <ab...@apache.org> on 2015/12/29 11:56:25 UTC

[More Like This] Query building

Hi guys,
While exploring the way we build the More Like This query, I found a part
I am not convinced by:

Let's see how we build the query:
org.apache.lucene.queries.mlt.MoreLikeThis#retrieveTerms(int)

1) We extract the terms from the fields of interest and add them to a map:

Map<String, Int> termFreqMap = new HashMap<>();

*(Here we lose the field -> term relation: we no longer know which field a
term came from!)*

org.apache.lucene.queries.mlt.MoreLikeThis#createQueue

2) We build the queue that will contain the query terms. At this point we
connect these terms to a field again, but:

> ...
> // go through all the fields and find the largest document frequency
> String topField = fieldNames[0];
> int docFreq = 0;
> for (String fieldName : fieldNames) {
>   int freq = ir.docFreq(new Term(fieldName, word));
>   topField = (freq > docFreq) ? fieldName : topField;
>   docFreq = (freq > docFreq) ? freq : docFreq;
> }
> ...


We pick topField as the field with the highest document frequency for the
term t.
Then we build the term query:

queue.add(new ScoreTerm(word, *topField*, score, idf, docFreq, tf));
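
To make the effect of the loop and the queue.add call above concrete, here is
a minimal self-contained sketch in plain Java. It is not the Lucene code: the
class name TopFieldSketch and the nested docFreqs map (a stand-in for
ir.docFreq) are assumptions made only for illustration. It shows that a term
is always bound to whichever field has the highest document frequency,
regardless of the field it was extracted from.

```java
import java.util.HashMap;
import java.util.Map;

class TopFieldSketch {
    // Hypothetical stand-in for ir.docFreq(new Term(field, word)):
    // docFreqs maps field -> (term -> document frequency).
    static int docFreq(Map<String, Map<String, Integer>> docFreqs, String field, String word) {
        Map<String, Integer> perField = docFreqs.get(field);
        if (perField == null) {
            return 0;
        }
        Integer df = perField.get(word);
        return df == null ? 0 : df;
    }

    // Same selection rule as the quoted loop: the field with the largest
    // document frequency wins; the first field is the fallback.
    static String topField(Map<String, Map<String, Integer>> docFreqs,
                           String[] fieldNames, String word) {
        String topField = fieldNames[0];
        int best = 0;
        for (String fieldName : fieldNames) {
            int freq = docFreq(docFreqs, fieldName, word);
            if (freq > best) {
                topField = fieldName;
                best = freq;
            }
        }
        return topField;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> docFreqs = new HashMap<>();
        Map<String, Integer> description = new HashMap<>();
        description.put("pool", 120);
        Map<String, Integer> facilities = new HashMap<>();
        facilities.put("pool", 40);
        docFreqs.put("description", description);
        docFreqs.put("facilities", facilities);

        // Even if "pool" was extracted from the facilities field of the seed
        // document, the selection only looks at index-wide document frequency.
        String[] fields = {"description", "facilities"};
        System.out.println(topField(docFreqs, fields, "pool"));
    }
}
```

With these made-up frequencies, a "pool" term taken from facilities would end
up being queried against description.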

This way we lose a lot of precision, and I am not sure why we do it.
I would prefer to keep the relation between terms and fields; I think the
quality of the MLT query could improve a lot.
For example, if I run MLT on two fields, *description* and *facilities*,
I most likely want to find documents with similar terms in the description
and similar terms in the facilities, without mixing the two up and
losing the semantics of the terms.
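
As a rough illustration of what keeping the field -> term relation could look
like, here is a minimal sketch in plain Java. It is only an assumption about
one possible shape, not the actual MoreLikeThis code: the class
PerFieldTermsSketch, its methods and the "field:term" clause strings are all
hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PerFieldTermsSketch {
    // Builds field -> (term -> frequency in the seed document) instead of a
    // single term -> frequency map, so the source field of a term is kept.
    static Map<String, Map<String, Integer>> collect(Map<String, List<String>> tokensByField) {
        Map<String, Map<String, Integer>> perField = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : tokensByField.entrySet()) {
            Map<String, Integer> freqs =
                perField.computeIfAbsent(entry.getKey(), k -> new LinkedHashMap<>());
            for (String token : entry.getValue()) {
                freqs.merge(token, 1, Integer::sum);
            }
        }
        return perField;
    }

    // Emits one "field:term" clause per (field, term) pair, so each term is
    // only matched against the field it was extracted from.
    static List<String> clauses(Map<String, Map<String, Integer>> perField) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Map<String, Integer>> field : perField.entrySet()) {
            for (String term : field.getValue().keySet()) {
                out.add(field.getKey() + ":" + term);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical, already-analyzed tokens of the seed document.
        Map<String, List<String>> doc = new LinkedHashMap<>();
        doc.put("description", Arrays.asList("quiet", "hotel", "quiet"));
        doc.put("facilities", Arrays.asList("pool", "wifi"));
        System.out.println(clauses(collect(doc)));
    }
}
```

Scoring and the interesting-terms cut-off are left out on purpose; the sketch
only shows that description terms and facilities terms stay separate.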

Let me know your opinion,

Cheers


-- 
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England

Re: [More Like This] Query building

Posted by Alessandro Benedetti <ab...@apache.org>.

Hi Anshum,
my complaint was not meant as a polemic, just a sad observation :(
I know perfectly well that the issue has been lack of time rather than lack
of intent!
Hopefully I will get some feedback and we can fix/improve the MLT
together!

Cheers

Re: [More Like This] Query building

Posted by Anshum Gupta <an...@anshumgupta.net>.

Hi Alessandro,

I've updated the JIRA. The committers try and review code whenever they get
time and in this case, like other such times, I think we were all just
lacking time, rather than the intent.

Also, not all committers work on all parts of the code, so that narrows
down the people who could potentially help you.

-- 
Anshum Gupta

Re: [More Like This] Query building

Posted by Scott Stults <ss...@opensourceconnections.com>.

Hi Alessandro,

It's not uncommon for Solr patches to remain uncommitted for months, even
years. In fact some never get merged. Don't let that discourage you!


k/r,
Scott

-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com

Re: [More Like This] Query building

Posted by Alessandro Benedetti <ab...@apache.org>.

I am starting to feel that it is not that easy to contribute improvements or
small fixes to Solr (if they are not of broad interest).
I think this one could be a good improvement to the MLT, but I would love to
discuss it with a committer.
The patch is attached; it has been there for months...
Any feedback would be appreciated. I want to contribute, but I need some
second opinions...

Cheers

On 11 February 2016 at 13:48, Alessandro Benedetti <ab...@apache.org>
wrote:

> Hi guys,
> is it possible to get some feedback?
> Is there any process to speed up bug resolution / discussions?
> I just want to understand whether the patch is not good enough, whether I
> need to improve it, or whether simply no one has taken a look...
>
> https://issues.apache.org/jira/browse/LUCENE-6954
>
> Cheers
>
> On 11 January 2016 at 15:25, Alessandro Benedetti <ab...@apache.org>
> wrote:
>
>> Hi guys,
>> the patch seems fine to me.
>> I didn't spend much more time on the code, but I checked the tests and the
>> pre-commit checks.
>> Let me know,
>>
>> Cheers
>>
>> On 31 December 2015 at 18:40, Alessandro Benedetti <abenedetti@apache.org>
>> wrote:
>>
>>> https://issues.apache.org/jira/browse/LUCENE-6954
>>>
>>> First draft patch available; I will check the tests more thoroughly in the new year!
>>>
>>> On 29 December 2015 at 13:43, Alessandro Benedetti <
>>> abenedetti@apache.org> wrote:
>>>
>>>> Sure, I will proceed tomorrow with the Jira and the simple patch +
>>>> tests.
>>>>
>>>> In the meantime let's try to collect some additional feedback.
>>>>
>>>> Cheers
>>>>
>>>> On 29 December 2015 at 12:43, Anshum Gupta <an...@anshumgupta.net>
>>>> wrote:
>>>>
>>>>> Feel free to create a JIRA and put up a patch if you can.
>>>>>

>>>
>>> "Tyger, tyger burning bright
>>> In the forests of the night,
>>> What immortal hand or eye
>>> Could frame thy fearful symmetry?"
>>>
>>> William Blake - Songs of Experience -1794 England
>>>
>>
>>
>>
>> --
>> --------------------------
>>
>> Benedetti Alessandro
>> Visiting card : http://about.me/alessandro_benedetti
>>
>> "Tyger, tyger burning bright
>> In the forests of the night,
>> What immortal hand or eye
>> Could frame thy fearful symmetry?"
>>
>> William Blake - Songs of Experience -1794 England
>>
>
>
>
> --
> --------------------------
>
> Benedetti Alessandro
> Visiting card : http://about.me/alessandro_benedetti
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
>



-- 
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England

Re: [More Like This] Query building

Posted by Alessandro Benedetti <ab...@apache.org>.

Hi guys,
would it be possible to get some feedback?
Is there a process for speeding up bug resolution and discussion?
I just want to understand whether the patch is not good enough, whether I
need to improve it, or whether simply no one has taken a look yet.

https://issues.apache.org/jira/browse/LUCENE-6954

Cheers

On 11 January 2016 at 15:25, Alessandro Benedetti <ab...@apache.org>
wrote:

> Hi guys,
> the patch seems fine to me.
> I didn't spend much more time on the code, but I checked the tests and the
> pre-commit checks.
> Let me know what you think.
>
> Cheers



-- 
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England

Re: [More Like This] Query building

Posted by Alessandro Benedetti <ab...@apache.org>.

Hi guys,
the patch seems fine to me.
I didn't spend much more time on the code, but I checked the tests and the
pre-commit checks.
Let me know what you think.

Cheers

On 31 December 2015 at 18:40, Alessandro Benedetti <ab...@apache.org>
wrote:

> https://issues.apache.org/jira/browse/LUCENE-6954
>
> First draft patch available; I will check the tests more thoroughly in the new year!



-- 
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England

Re: [More Like This] Query building

Posted by Alessandro Benedetti <ab...@apache.org>.

https://issues.apache.org/jira/browse/LUCENE-6954

First draft patch available; I will check the tests more thoroughly in the new year!

On 29 December 2015 at 13:43, Alessandro Benedetti <ab...@apache.org>
wrote:

> Sure, I will proceed tomorrow with the Jira and the simple patch + tests.
>
> In the meantime let's try to collect some additional feedback.
>
> Cheers



-- 
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England

Re: [More Like This] Query building

Posted by Alessandro Benedetti <ab...@apache.org>.

Sure, I will proceed tomorrow with the Jira and the simple patch + tests.

In the meantime let's try to collect some additional feedback.

Cheers

On 29 December 2015 at 12:43, Anshum Gupta <an...@anshumgupta.net> wrote:

> Feel free to create a JIRA and put up a patch if you can.
>
> On Tue, Dec 29, 2015 at 4:26 PM, Alessandro Benedetti <ab...@apache.org> wrote:
>
> > Hi guys,
> > While exploring the way we build the More Like This query, I found a
> > part I am not convinced by.
> >
> > Let's see how we build the query:
> > org.apache.lucene.queries.mlt.MoreLikeThis#retrieveTerms(int)
> >
> > 1) We extract the terms from the interesting fields and add them to a
> > map:
> >
> > Map<String, Int> termFreqMap = new HashMap<>();
> >
> > *(We lose the field -> term relation: we no longer know which field the
> > term came from!)*
> >
> > org.apache.lucene.queries.mlt.MoreLikeThis#createQueue
> >
> > 2) We build the queue that will contain the query terms. At this point we
> > connect these terms to a field again, but:
> >
> > ...
> >> // go through all the fields and find the largest document frequency
> >> String topField = fieldNames[0];
> >> int docFreq = 0;
> >> for (String fieldName : fieldNames) {
> >>   int freq = ir.docFreq(new Term(fieldName, word));
> >>   topField = (freq > docFreq) ? fieldName : topField;
> >>   docFreq = (freq > docFreq) ? freq : docFreq;
> >> }
> >> ...
> >
> > We identify topField as the field with the highest document frequency
> > for the term.
> > Then we build the term query:
> >
> > queue.add(new ScoreTerm(word, *topField*, score, idf, docFreq, tf));
> >
> > In this way we lose a lot of precision, and I am not sure why we do that.
> > I would prefer to keep the relation between terms and fields; I think this
> > could improve the quality of the MLT query considerably.
> > For example, if I run MLT on two fields, *description* and *facilities*,
> > I most likely want to find documents with similar terms in the description
> > and similar terms in the facilities, without mixing things up and
> > losing the semantics of the terms.
> >
> > Let me know your opinion,
> >
> > Cheers
>
>
>
> --
> Anshum Gupta
>
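
The idea in the quoted message, keeping the relation between terms and
fields, can be sketched roughly as follows. This is only an illustration in
plain Java with no Lucene dependency: the class PerFieldTermFreq, the method
countPerField and the sample tokens are hypothetical, and none of this is
taken from MoreLikeThis or from the LUCENE-6954 patch. Instead of one flat
term -> frequency map, it keeps one map per field, so a term seen in
*description* is never confused with the same term seen in *facilities*.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PerFieldTermFreq {

  /**
   * Counts term occurrences separately for each field.
   * Result shape: field name -> (term -> frequency in that field).
   * The input (field name -> already analyzed tokens) is an assumption
   * made for this sketch.
   */
  public static Map<String, Map<String, Integer>> countPerField(
      Map<String, List<String>> fieldToTokens) {
    Map<String, Map<String, Integer>> perField = new LinkedHashMap<>();
    for (Map.Entry<String, List<String>> entry : fieldToTokens.entrySet()) {
      // One independent frequency map per field, created on first use.
      Map<String, Integer> termFreq =
          perField.computeIfAbsent(entry.getKey(), f -> new LinkedHashMap<>());
      for (String token : entry.getValue()) {
        termFreq.merge(token, 1, Integer::sum);
      }
    }
    return perField;
  }

  public static void main(String[] args) {
    Map<String, List<String>> doc = new LinkedHashMap<>();
    doc.put("description", Arrays.asList("pool", "quiet", "pool"));
    doc.put("facilities", Arrays.asList("pool", "gym"));

    // Every (field, term) pair keeps its own count, so a later step could
    // build one term query per pair instead of picking a single top field.
    countPerField(doc).forEach((field, terms) ->
        terms.forEach((term, tf) ->
            System.out.println(field + ":" + term + " tf=" + tf)));
  }
}
```

With a structure like this, the document frequency of each term could be
looked up against the field it actually came from, rather than against
whichever field happens to have the largest document frequency.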



-- 
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England

Re: [More Like This] Query building

Posted by Alessandro Benedetti <ab...@apache.org>.

Sure, I will proceed tomorrow with the Jira and the simple patch + tests.

In the meantime let's try to collect some additional feedback.

Cheers

On 29 December 2015 at 12:43, Anshum Gupta <an...@anshumgupta.net> wrote:

> Feel free to create a JIRA and put up a patch if you can.
>
> On Tue, Dec 29, 2015 at 4:26 PM, Alessandro Benedetti <
> abenedetti@apache.org
> > wrote:
>
> > Hi guys,
> > While I was exploring the way we build the More Like This query, I
> > discovered a part I am not convinced of :
> >
> >
> >
> > Let's see how we build the query :
> > org.apache.lucene.queries.mlt.MoreLikeThis#retrieveTerms(int)
> >
> > 1) we extract the terms from the interesting fields, adding them to a
> > map :
> >
> > Map<String, Int> termFreqMap = new HashMap<>();
> >
> > *( we lose the relation field -> term, we no longer know which field the
> > term was coming from ! )*
> >
> > org.apache.lucene.queries.mlt.MoreLikeThis#createQueue
> >
> > 2) we build the queue that will contain the query terms; at this point we
> > connect these terms to some field again, but :
> >
> > ...
> >> // go through all the fields and find the largest document frequency
> >> String topField = fieldNames[0];
> >> int docFreq = 0;
> >> for (String fieldName : fieldNames) {
> >>   int freq = ir.docFreq(new Term(fieldName, word));
> >>   topField = (freq > docFreq) ? fieldName : topField;
> >>   docFreq = (freq > docFreq) ? freq : docFreq;
> >> }
> >> ...
> >
> >
> > We identify the topField as the field with the highest document frequency
> > for the term t .
> > Then we build the termQuery :
> >
> > queue.add(new ScoreTerm(word, *topField*, score, idf, docFreq, tf));
> >
> > In this way we lose a lot of precision.
> > Not sure why we do that.
> > I would prefer to keep the relation between terms and fields.
> > The quality of the MLT query could improve a lot.
> > If I run the MLT on 2 fields, *description* and *facilities* for
> > example, it is likely I want to find documents with similar terms in the
> > description and similar terms in the facilities, without mixing things
> > up and losing the semantics of the terms.
> >
> > Let me know your opinion,
> >
> > Cheers
> >
> >
> > --
> > --------------------------
> >
> > Benedetti Alessandro
> > Visiting card : http://about.me/alessandro_benedetti
> >
> > "Tyger, tyger burning bright
> > In the forests of the night,
> > What immortal hand or eye
> > Could frame thy fearful symmetry?"
> >
> > William Blake - Songs of Experience -1794 England
> >
>
>
>
> --
> Anshum Gupta
>



-- 
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England

Re: [More Like This] Query building

Posted by Anshum Gupta <an...@anshumgupta.net>.

Feel free to create a JIRA and put up a patch if you can.

On Tue, Dec 29, 2015 at 4:26 PM, Alessandro Benedetti <abenedetti@apache.org>
wrote:

> Hi guys,
> While I was exploring the way we build the More Like This query, I
> discovered a part I am not convinced of :
>
>
>
> Let's see how we build the query :
> org.apache.lucene.queries.mlt.MoreLikeThis#retrieveTerms(int)
>
> 1) we extract the terms from the interesting fields, adding them to a map :
>
> Map<String, Int> termFreqMap = new HashMap<>();
>
> *( we lose the relation field -> term, we no longer know which field the
> term was coming from ! )*
>
> org.apache.lucene.queries.mlt.MoreLikeThis#createQueue
>
> 2) we build the queue that will contain the query terms; at this point we
> connect these terms to some field again, but :
>
> ...
>> // go through all the fields and find the largest document frequency
>> String topField = fieldNames[0];
>> int docFreq = 0;
>> for (String fieldName : fieldNames) {
>>   int freq = ir.docFreq(new Term(fieldName, word));
>>   topField = (freq > docFreq) ? fieldName : topField;
>>   docFreq = (freq > docFreq) ? freq : docFreq;
>> }
>> ...
>
>
> We identify the topField as the field with the highest document frequency
> for the term t .
> Then we build the termQuery :
>
> queue.add(new ScoreTerm(word, *topField*, score, idf, docFreq, tf));
>
> In this way we lose a lot of precision.
> Not sure why we do that.
> I would prefer to keep the relation between terms and fields.
> The quality of the MLT query could improve a lot.
> If I run the MLT on 2 fields, *description* and *facilities* for example,
> it is likely I want to find documents with similar terms in the
> description and similar terms in the facilities, without mixing things
> up and losing the semantics of the terms.
>
> Let me know your opinion,
>
> Cheers
>
>
> --
> --------------------------
>
> Benedetti Alessandro
> Visiting card : http://about.me/alessandro_benedetti
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
>



-- 
Anshum Gupta
