Posted to user@nutch.apache.org by Markus Jelsma <ma...@openindex.io> on 2011/08/01 13:16:21 UTC

Re: TF in wide internet crawls


On Wednesday 27 July 2011 18:31:57 lewis john mcgibbney wrote:
> Hi Markus,
> 
> I follow you up to the last part of your comments.
> 
> "cope with non-edited..." edited by whom? and for what purpose? To give a
> better relative tf score...

By edited content I mean content written by editors and other people
creating proper content.

> 
> To comment on the first part, and please ignore or correct me if I am
> wrong, but do we not give each page and therefore each document an initial
> score of 1.0 which is then subsequently used by whichever scoring
> algorithm we plug in? If this is the case then how are we specifying score
> for a page and tf of some term within a document or tf-idf of that term over
> the entire document collection to determine relevance? How can we
> accurately disambiguate between these entities?

Link score is only a small part of the math. It is multiplied by tf, idf,
norms, boosts, functions, etc.
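
To make the multiplication concrete, here is a minimal sketch in Python. It
is not Nutch or Lucene code. The formulas are assumptions modelled on the
classic tf-idf defaults (tf as the square root of the frequency, a
logarithmic idf, a 1/sqrt(length) norm), and every name in it (score,
tf_flat, tf_capped, link_score, field_boost) is hypothetical. It also shows
the two tf options discussed in this thread: emitting 1.0 for each match,
and imposing a threshold.

```python
import math

# Assumed formulas, modelled on classic tf-idf defaults; not an actual API.

def tf_sqrt(freq):
    # Sub-linear tf: repeated terms still raise the score.
    return math.sqrt(freq)

def tf_flat(freq):
    # Emit 1.0 for any match, so repetition no longer pays off.
    return 1.0 if freq > 0 else 0.0

def tf_capped(freq, cap=3.0):
    # Threshold variant: tf stops growing once freq reaches the cap.
    # The cap value is an arbitrary illustration.
    return math.sqrt(min(freq, cap))

def idf(doc_freq, num_docs):
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def length_norm(num_terms):
    return 1.0 / math.sqrt(num_terms)

def score(freq, doc_freq, num_docs, num_terms,
          link_score=1.0, field_boost=1.0, tf=tf_sqrt):
    # The link score is just one factor in the product.
    return (link_score * field_boost * tf(freq)
            * idf(doc_freq, num_docs) ** 2 * length_norm(num_terms))

# Two documents of equal length; one repeats the term 50 times, one twice.
corpus = dict(doc_freq=1000, num_docs=1_000_000, num_terms=100)
stuffed = score(freq=50, **corpus)
honest = score(freq=2, **corpus)
stuffed_flat = score(freq=50, tf=tf_flat, **corpus)
honest_flat = score(freq=2, tf=tf_flat, **corpus)
```

With tf_sqrt the stuffed document outscores the honest one by a factor of
sqrt(50 / 2) = 5. With tf_flat both documents score the same, so ranking
falls back on the other factors such as link score and field boosts.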

> 
> As I said, I'm losing you towards the end; however, it would be a good
> discussion to explore what lies behind the surface architecture.
> 
> 
> On Mon, Jul 25, 2011 at 10:23 PM, Markus Jelsma <ma...@openindex.io> wrote:
> > Hi,
> > 
> > I've done several projects where term frequency yields bad result sets
> > and worse relevancy. These projects all had one thing in common:
> > user-generated content with a competitive edge. The latter means
> > classifieds web sites such as eBay. The internet is similar: it
> > contains edited content, classifieds, and spam or other garbage.
> > 
> > What do you do with tf in your wide internet index? Do you impose a
> > threshold or are you emitting 1.0f for each match?
> > For now I emit 1.0f for each match and rely on matches in multiple
> > fields with varying boosts, plus various other methods, to improve
> > relevancy.
> > 
> > Can tf*idf cope with non-edited (and untrusted) documents at all? I've
> > seen great relevancy with good content but really bad relevance in
> > several cases.
> > 
> > Thanks!

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350