Posted to dev@lucene.apache.org by Mark Bennett <mb...@ideaeng.com> on 2010/09/10 02:42:11 UTC

Relevancy, Phrase Boosting, Shingles and Long Tail Curves

I want to boost the relevancy of some Question and Answer content. I'm using
stop words, Dismax, and I'm already a fan of Phrase Boosting and have
cranked that up a bit. But I'm considering using long Shingles to make use
of some of the normally stopped out "junk words" in the content to help
relevancy further.
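
(For concreteness, roughly what that setup looks like from SolrJ; this is a
sketch only, and the field names, boosts, and the "body_shingles" copyField
are made up for illustration:)

    import org.apache.solr.client.solrj.SolrQuery;

    public class QnaQuery {
        static SolrQuery build(String userQuery) {
            SolrQuery q = new SolrQuery(userQuery);
            q.set("defType", "dismax");
            q.set("qf", "title^2 body body_shingles^3"); // body_shingles: hypothetical shingled field
            q.set("pf", "body^4"); // phrase boost: reward the whole query as a phrase
            q.set("ps", 2);        // allow a little slop in the boosting phrase
            return q;
        }
    }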

Reminder: "Shingles" are artificial tokens created by gluing together
adjacent words.
    Input text: This is a sentence
    Normal tokens: this, is, a, sentence  (without removing stop words)
    2+3 word shingles: this-is, is-a, a-sentence, this-is-a, is-a-sentence
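
(A minimal sketch of producing those tokens with Lucene's ShingleFilter,
assuming a recent analyzers-common jar; older releases pass the Reader to the
tokenizer constructor instead:)

    import java.io.StringReader;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.shingle.ShingleFilter;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class ShingleDemo {
        public static void main(String[] args) throws Exception {
            WhitespaceTokenizer words = new WhitespaceTokenizer();
            words.setReader(new StringReader("this is a sentence"));

            ShingleFilter shingles = new ShingleFilter(words, 2, 3); // 2- and 3-word shingles
            shingles.setOutputUnigrams(false); // drop the single-word tokens
            shingles.setTokenSeparator("-");   // "this-is" rather than "this is"

            CharTermAttribute term = shingles.addAttribute(CharTermAttribute.class);
            shingles.reset();
            while (shingles.incrementToken()) {
                System.out.println(term); // this-is, this-is-a, is-a, is-a-sentence, a-sentence
            }
            shingles.end();
            shingles.close();
        }
    }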

A few questions on relevance and shingles:

1: How do the relevancy calculations compare between Shingles and exact
phrases?

I've seen material saying that shingles can give better performance than
normal phrase searching, and I'm assuming this means exact phrases (vs.
allowing for phrase slop).

But do the relevancy calculations for normal exact phrase and Shingles wind
up being *identical*, for the same documents and searches?  That would seem
an unlikely coincidence, but possibly it could have been engineered to
intentionally behave that way.

2: What's the latest on Shingles and Dismax?

The low-level, front-end tokenization in Dismax would seem to be a problem,
but does the new query parser work help with this?

3: I'm thinking of a minimum 3-word shingle; does anybody have comments on
shingle length?

Eyeballing the 2-word shingles, they don't seem much better than stop
words.  Obviously my shingle field bypasses stop word removal.

But the 3-word shingles start to look more useful, expressing more intent,
such as "how do i", "do i need" and "it work with", etc.

Has there been any Lucene/Solr studies specifically on shingle length?

and finally...

4: Is it useful to examine your token occurrences against a Power-Law
log-log curve?

So, with either single words or shingles, you build a histogram, and then plot
the histogram in an X-Y graph, with both axes being logarithmic. Then see if
the resulting graph follows (or diverges from) a straight line.  This "Long
Tail" / Pareto / power-law mathematics was very popular a few years ago for
looking at histograms of DVD rentals and human activities, and prior to the
web, the power law and 80/20 rules had been observed in many other
situations, both man-made and natural.
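
(A quick sketch of that straight-line check, assuming you've already dumped
per-term frequencies from the index, e.g. with Luke: a least-squares fit of
log(freq) against log(rank), where Zipf-like data should give a slope near -1:)

    import java.util.Arrays;

    public class LogLogSlope {
        // Least-squares slope of log(freq) vs log(rank) over all terms.
        static double slope(long[] termFreqs) {
            long[] f = termFreqs.clone();
            Arrays.sort(f); // ascending; read back-to-front for rank order
            int n = f.length;
            double sx = 0, sy = 0, sxx = 0, sxy = 0;
            for (int rank = 1; rank <= n; rank++) {
                double x = Math.log(rank);
                double y = Math.log(f[n - rank]); // freq of the rank-th most common term
                sx += x; sy += y; sxx += x * x; sxy += x * y;
            }
            return (n * sxy - sx * sy) / (n * sxx - sx * sx);
        }
    }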

Also of interest: when a distribution is expected to follow a power-law line,
but the actual data deviates from that theoretical line, this might
indicate some other factors at work, or so the theory goes.

So if users' searches follow any type of histogram with a hidden power-law
line, then it makes sense to me that the source content might also follow a
similar distribution.  Is the normal IDF ranking inspired by that type of
curve?
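
(Partly answering my own question: the classic Lucene similarity damps
document frequency logarithmically,

    idf(t) = 1 + log( numDocs / (docFreq(t) + 1) )

so a term's reward grows only with the log of its rarity, which at least sits
comfortably with frequency distributions that look straight on log-log axes.)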

And *if* word occurrences, in either searches or source documents, were
expected to follow a power-law distribution, then possibly shingles would
follow such a curve as well.

Thinking that document text, like many other things in nature, might follow
such a curve, I used the Lucene index to generate one for single-word tokens.
And I did the same thing for 3-word tokens. The two curves do have different
slopes, but neither is very straight.

So I was wondering if anybody else has looked at IDF curves (actually
non-inverted document frequency curves) or raw word instance counts and
power law graphs?  I haven't found a smoking gun in my online searches, but
I'm thinking some of you would know this.


--
Mark Bennett / New Idea Engineering, Inc. / mbennett@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513

Re: Relevancy, Phrase Boosting, Shingles and Long Tail Curves

Posted by Lance Norskog <go...@gmail.com>.
This sounds like a Mahout problem, not a Lucene problem; there are some
text analysis tools there that might help you.



Re: Relevancy, Phrase Boosting, Shingles and Long Tail Curves

Posted by mark harwood <ma...@yahoo.co.uk>.
>>What is the "best practices" formula for determining above average correlations 
>>of adjacent terms

I gave this some thought in https://issues.apache.org/jira/browse/LUCENE-474
I found the Jaccard coefficient favoured rare words too strongly and so went
for a blend as shown below:


    // Fields assumed from the enclosing class in the LUCENE-474 patch:
    //   coIncidenceDocCount - number of docs where termA and termB co-occur
    //   termADocFreq, termBDocFreq - document frequencies of the two terms
    public float getScore()
    {
        float overallIntersectionPercent = coIncidenceDocCount
                / (float) (termADocFreq + termBDocFreq);
        float termBIntersectionPercent = coIncidenceDocCount
                / (float) (termBDocFreq);

        // Using just the termB intersection favours common words as
        // coincidents, e.g. "new" food:
        //     return termBIntersectionPercent;
        // Using just the overall intersection favours rare words as
        // coincidents, e.g. "szechuan" food:
        //     return overallIntersectionPercent;
        // So here we take an average of the two:
        return (termBIntersectionPercent + overallIntersectionPercent) / 2;
    }





Re: Relevancy, Phrase Boosting, Shingles and Long Tail Curves

Posted by Mark Bennett <mb...@ideaeng.com>.
Thanks Mark H,

Maybe I'll look at MLT (More Like This) again.  I'll also check out Zipf.

It's claimed that Question and Answer wording is different enough from
generic text content that different techniques might be indicated. From what
I remember:
1: Though nouns normally convey 60% of relevancy in general text, Q&A
content is skewed a bit more towards verbs.
2: Questions may contain more noise words (though perhaps in useful
groupings)
3: Vocabulary mismatch of Interrogative vs. declarative / narrative (Q vs A)
4: Vocabulary mismatch of novices vs experts (Q vs A)

It was item 2 that I was hoping to capitalize on with NGrams / Shingles.

Still waiting for the relevancy math nerds to chime in about the log-log and
IDF stuff ... ;-)

I was thinking a bit more about the math involved here....

What is the "best practices" formula for determining above average
correlations of adjacent terms, beyond what random chance would give. So you
notice that "white" and "house" appear next to each other more than what
chance distribution would explain, so you decide it's an important NGram.

The "noise floor" isn't too bad for the typical shopping cart items
calculation.
You analyze the items present or not present in 1,000 shopping cart
receipts.
    If grocery items were completely independent then "random" level is just
the odds of the 2 items multiplied together:
        1,000 shopping carts
        200 have cereal
        250 have milk
    chance of
        cereal = 200/1,000 = 20%
        milk = 250/1,000 = 25%
    IF independent then
        P(cereal AND milk) = P(cereal) * P(milk)
        20% * 25% = 5%
        So 50 carts likely to have both cereal and milk
        And if MORE than 50 carts have cereal and milk, then it's worth
noting.
The classic example is diapers and beer, which is a bit apocryphal and NOT
expected, but I like the breakfast cereal and milk example better because it
IS expected.
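
(The same arithmetic in a few lines, just to make the baseline explicit:)

    public class BasketBaseline {
        public static void main(String[] args) {
            double baskets = 1000, cereal = 200, milk = 250;
            // If independent: P(cereal AND milk) = P(cereal) * P(milk)
            double expectedBoth = baskets * (cereal / baskets) * (milk / baskets);
            System.out.println(expectedBoth); // 50.0 carts; more than that is interesting
        }
    }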

Now back to word-A appearing directly before word-B, and finding the base
level number of times you'd expect just from random chance.

Although Lucene/Luke gives you total word instances and document counts,
what you'd really want is the number of possible N-Grams, which is affected
by document boundaries, so it gets a little weird.

Some other differences between the word-A word-B calculation vs milk and
cereal:
1: I want ordered pairs, "white" before "house"
2: A document is NOT like a shopping cart in that I DO care how many times
"white" appears before "house", whereas in the shopping carts I only cared
about present or not present, so document count is less helpful here.
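
(Here's the back-of-envelope version I have in mind, with the document-boundary
correction; all the counts below are made up for illustration:)

    public class BigramBaseline {
        // Expected count of termA immediately before termB if words were
        // placed independently. totalTokens is the field's token count over
        // the whole index; subtracting numDocs drops the "pairs" that would
        // straddle a document boundary (a doc of length L has L-1 pairs).
        static double expectedOrderedPairs(long freqA, long freqB,
                                           long totalTokens, long numDocs) {
            long pairSlots = totalTokens - numDocs;
            double pA = freqA / (double) totalTokens;
            double pB = freqB / (double) totalTokens;
            return pairSlots * pA * pB; // ordered pair, so no factor of 2
        }

        public static void main(String[] args) {
            // "white" 5,000x and "house" 4,000x in a 10M-token field over 100k docs
            double expected = expectedOrderedPairs(5000, 4000, 10000000, 100000);
            long observed = 1200; // what you'd actually count from the index
            System.out.printf("expected=%.2f observed=%d lift=%.0f%n",
                    expected, observed, observed / expected);
        }
    }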

I'm sure some companies and PhDs have super-secret formulas for this, but
I'd be content to just compare it to baseline random chance.

Mark B

--
Mark Bennett / New Idea Engineering, Inc. / mbennett@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513



Re: Relevancy, Phrase Boosting, Shingles and Long Tail Curves

Posted by mark harwood <ma...@yahoo.co.uk>.
Hi Mark
I've played with Shingles recently in some auto-categorisation work where my
starting assumption was that multi-word terms will hold more information value
than individual words, and that phrase queries on separate terms will not give
these term combos their true reward (in terms of IDF) - or, if they did compute
the true IDF, would require lots of disk IO to do so. Shingles present a
conveniently pre-aggregated score for these combos.
Looking at the results of MoreLikeThis queries based on a shingling analyzer,
the results I saw generally seemed good, but I did not formally benchmark this
against non-shingled indexes. Not everything was rosy, in that I did see some
tendency to over-reward certain rare shingles (e.g. a shared mention of "New
Years Eve Party" pulled otherwise mostly unrelated news articles together). This
led me to look at using the links in resulting documents to help identify
clusters of on-topic and potentially off-topic results to tune these
discrepancies out, but that's another topic.
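
(Roughly the shape of that MoreLikeThis setup, written against the current
package layout - in 2010 the class lived in contrib as
org.apache.lucene.search.similar.MoreLikeThis. The index path, field name, and
analyzer below are placeholders:)

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.queries.mlt.MoreLikeThis;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class ShingleMlt {
        static TopDocs similarTo(int seedDoc, Analyzer shingleAnalyzer) throws Exception {
            IndexReader reader = DirectoryReader.open(
                    FSDirectory.open(Paths.get("/path/to/index"))); // placeholder path
            MoreLikeThis mlt = new MoreLikeThis(reader);
            mlt.setFieldNames(new String[] { "body_shingles" }); // placeholder shingled field
            mlt.setAnalyzer(shingleAnalyzer); // must match the index-time analyzer
            mlt.setMinTermFreq(1);
            mlt.setMinDocFreq(2); // a shingle seen in one doc can't link documents
            Query like = mlt.like(seedDoc);
            return new IndexSearcher(reader).search(like, 10);
        }
    }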
BTW, the Luke tool has a "Zipf" plugin that you may find useful in examining
term distributions in Lucene indexes.

Cheers
Mark

