Posted to user@nutch.apache.org by Markus Jelsma <ma...@openindex.io> on 2011/07/25 23:23:19 UTC

TF in wide internet crawls

Hi,

I've done several projects where term frequency yields bad result sets and
even worse relevancy. These projects all had one thing in common: user-generated
content with a competitive edge, i.e. classifieds web sites such as eBay and
the like. The internet at large is similar: it contains edited content,
classifieds, and spam or other garbage.

What do you do with tf in your wide internet index? Do you impose a threshold,
or do you emit 1.0f for each match?
For now I emit 1.0f for each match and rely on matches in multiple fields with
varying boosts, among various other methods, to improve relevancy.
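
For illustration, one way to emit 1.0f per match with Lucene 3.x is to override
tf() in a custom Similarity. A minimal sketch; the class name is made up, while
DefaultSimilarity and IndexSearcher.setSimilarity() are stock Lucene:

    import org.apache.lucene.search.DefaultSimilarity;

    // Flattens term frequency: any number of occurrences of a term in a
    // document contributes the same 1.0f, so relevancy is driven by idf,
    // norms and boosts rather than by keyword repetition.
    public class FlatTfSimilarity extends DefaultSimilarity {
        @Override
        public float tf(float freq) {
            return freq > 0 ? 1.0f : 0.0f;
        }
    }

Wire it in with searcher.setSimilarity(new FlatTfSimilarity()). A capped
variant, e.g. Math.min(freq, threshold), would be the "threshold" option.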

Can tf*idf cope with non-edited (and untrusted) documents at all? I've seen
great relevancy with good content, but really bad relevancy in several cases.
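
For reference, the classic per-term weighting in Lucene's DefaultSimilarity
(standard Lucene behaviour, not anything specific to this setup) is roughly

    \[
      \mathrm{tf}(t,d) = \sqrt{\mathrm{freq}(t,d)}, \qquad
      \mathrm{idf}(t) = 1 + \log\frac{N}{\mathrm{df}(t) + 1}
    \]

with N the number of documents in the index and df(t) the number of documents
containing t. Emitting 1.0f amounts to replacing tf(t,d) with 1 whenever
freq > 0, so only idf, norms and boosts differentiate matches.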

Thanks!

Re: TF in wide internet crawls

Posted by Markus Jelsma <ma...@openindex.io>.

On Wednesday 27 July 2011 18:31:57 lewis john mcgibbney wrote:
> Hi Markus,
> 
> I follow you up until the last part of your comments.
> 
> "cope with non-edited..." edited by whom? And for what purpose? To give a
> better relative tf score...

With edited content I mean content written by editors and other people
creating proper content.

> 
> To comment on the first part, and please ignore or correct me if I am
> wrong, but do we not give each page, and therefore each document, an
> initial score of 1.0 which is then used by whichever scoring algorithm we
> plug in? If that is the case, then how are we specifying the score for a
> page versus the tf of some term within a document, or the tf-idf of that
> term over the entire document collection, when determining relevance? How
> can we accurately disambiguate between these entities?

Link score is only a small part of the math; it's multiplied by tf, idf,
norms, boosts, functions, etc.
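
For reference, the classic Lucene formula that multiplication refers to is
roughly (standard Lucene 3.x DefaultSimilarity; the exact form varies by
version):

    \[
      \mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot
      \sum_{t \in q} \mathrm{tf}(t,d) \cdot \mathrm{idf}(t)^{2} \cdot
      \mathrm{boost}(t) \cdot \mathrm{norm}(t,d)
    \]

In a typical Nutch setup the page/link score is applied as an index-time
document boost, which Lucene folds into norm(t,d), so it scales the tf-idf
weight of every matching term rather than replacing it.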

> 
> As I said, I'm losing you towards the end, but it would be a good
> discussion to explore what lies behind the surface architecture.
> 
> 
> On Mon, Jul 25, 2011 at 10:23 PM, Markus Jelsma <ma...@openindex.io> wrote:
> > Hi,
> > 
> > I've done several projects where term frequency yields bad result sets
> > and even worse relevancy. These projects all had one thing in common:
> > user-generated content with a competitive edge, i.e. classifieds web
> > sites such as eBay and the like. The internet at large is similar: it
> > contains edited content, classifieds, and spam or other garbage.
> > 
> > What do you do with tf in your wide internet index? Do you impose a
> > threshold, or do you emit 1.0f for each match?
> > For now I emit 1.0f for each match and rely on matches in multiple
> > fields with varying boosts, among various other methods, to improve
> > relevancy.
> > 
> > Can tf*idf cope with non-edited (and untrusted) documents at all? I've
> > seen great relevancy with good content, but really bad relevancy in
> > several cases.
> > 
> > Thanks!

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: TF in wide internet crawls

Posted by lewis john mcgibbney <le...@gmail.com>.
Hi Markus,

I follow you up until the last part of your comments.

"cope with non-edited..." edited by whom? and for what purpose? To give a
better relative tf score...

To comment on the first part, and please ignore or correct me if I am wrong,
but do we not give each page, and therefore each document, an initial score of
1.0 which is then used by whichever scoring algorithm we plug in? If that is
the case, then how are we specifying the score for a page versus the tf of
some term within a document, or the tf-idf of that term over the entire
document collection, when determining relevance? How can we accurately
disambiguate between these entities?

As I said, I'm losing you towards the end, but it would be a good discussion
to explore what lies behind the surface architecture.
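
To make the distinction concrete, here is a minimal sketch of how the two
signals are typically combined in a Nutch/Lucene 3.x style index. The field
names and the helper class are illustrative; Document.setBoost() and the
Field API are stock Lucene 3.x:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class BoostExample {
        // Builds a document whose crawl-time page score is applied as an
        // index-time document boost. Lucene folds that boost into the field
        // norms, so at query time it multiplies the tf-idf weight of each
        // matching term rather than replacing tf or idf.
        public static Document build(String title, String content, float pageScore) {
            Document doc = new Document();
            doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("content", content, Field.Store.NO, Field.Index.ANALYZED));
            doc.setBoost(pageScore); // page/link score -> norm(t,d), not tf or idf
            return doc;
        }
    }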


On Mon, Jul 25, 2011 at 10:23 PM, Markus Jelsma <ma...@openindex.io> wrote:

> Hi,
>
> I've done several projects where term frequency yields bad result sets and
> even worse relevancy. These projects all had one thing in common:
> user-generated content with a competitive edge, i.e. classifieds web sites
> such as eBay and the like. The internet at large is similar: it contains
> edited content, classifieds, and spam or other garbage.
>
> What do you do with tf in your wide internet index? Do you impose a
> threshold, or do you emit 1.0f for each match?
> For now I emit 1.0f for each match and rely on matches in multiple fields
> with varying boosts, among various other methods, to improve relevancy.
>
> Can tf*idf cope with non-edited (and untrusted) documents at all? I've seen
> great relevancy with good content, but really bad relevancy in several
> cases.
>
> Thanks!
>



-- 
*Lewis*