Posted to user@nutch.apache.org by Markus Jelsma <ma...@openindex.io> on 2011/07/25 23:23:19 UTC
TF in wide internet crawls
Hi,
I've done several projects where term frequency yields bad result sets and
worse relevancy. These projects all had one thing in common: user-generated
content with a competitive edge, meaning classifieds sites such as eBay.
The internet at large is similar: it contains edited content, classifieds,
and spam or other garbage.
What do you do with tf in your wide internet index? Do you impose a threshold,
or are you emitting 1.0f for each match?
For now I emit 1.0f for each match and rely on matches in multiple fields with
varying boosts, among various other methods, to improve relevancy.
Can tf*idf cope with non-edited (and untrusted) documents at all? I've seen
great relevancy with good content but really bad relevancy in several cases.
Thanks!
Re: TF in wide internet crawls
Posted by Markus Jelsma <ma...@openindex.io>.
On Wednesday 27 July 2011 18:31:57 lewis john mcgibbney wrote:
> Hi Markus,
>
> I follow you up until the last part of your comments.
>
> "cope with non-edited..." Edited by whom, and for what purpose? To give a
> better relative tf score...
With edited content I mean content written by editors and other people
creating proper content.
>
> To comment on the first part, and please ignore or correct me if I am
> wrong, but do we not give each page, and therefore each document, an initial
> score of 1.0 which is then subsequently used by whichever scoring
> algorithm we plug in? If this is the case, then how are we separating the
> score for a page from the tf of some term within a document, or the tf-idf
> of that term over the entire document collection, to determine relevance?
> How can we accurately disambiguate between these entities?
Link score is only a small part of the math. It's multiplied with tf, idf,
norms, boosts, functions etc.
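To make that multiplication concrete, here is a rough sketch in the shape of the classic Lucene practical scoring function (circa Lucene 3.x, which Nutch used at the time). All numeric values are hypothetical, and this is an illustration of the factor structure, not the actual Nutch/Lucene implementation:

```python
import math

# Shape of the classic Lucene practical scoring function for one term:
#   score(q,d) ~ tf(t,d) * idf(t)^2 * queryBoost * norm(t,d)
# where norm(t,d) folds together the field boost, the document boost
# (in Nutch, the injected link score), and the length normalization.

tf_ = 1.0             # flat tf, emitting 1.0 per match as discussed
idf_ = 4.2            # hypothetical rarity of the term in the collection
query_boost = 1.0     # boost on the query term
field_boost = 2.0     # e.g. a boosted title field
link_score = 1.3      # hypothetical Nutch-injected document boost
length_norm = 1.0 / math.sqrt(8)   # field with 8 terms

# The link score is just one factor among several multiplied together:
norm = field_boost * link_score * length_norm
score = tf_ * idf_ ** 2 * query_boost * norm
```

Doubling the link score only doubles one factor; tf, idf, boosts, and norms still dominate or dilute it, which is why link score alone is "only a small part of the math."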
>
> As I said, I'm losing you towards the end; however, it would be a good
> discussion to explore behind the surface architecture.
>
>
> On Mon, Jul 25, 2011 at 10:23 PM, Markus Jelsma
>
> <ma...@openindex.io> wrote:
> > Hi,
> >
> > I've done several projects where term frequency yields bad result sets
> > and worse relevancy. These projects all had one thing in common:
> > user-generated content with a competitive edge, meaning classifieds
> > sites such as eBay. The internet at large is similar: it contains edited
> > content, classifieds, and spam or other garbage.
> >
> > What do you do with tf in your wide internet index? Do you impose a
> > threshold, or are you emitting 1.0f for each match?
> > For now I emit 1.0f for each match and rely on matches in multiple
> > fields with varying boosts, among various other methods, to improve
> > relevancy.
> >
> > Can tf*idf cope with non-edited (and untrusted) documents at all? I've
> > seen great relevancy with good content but really bad relevancy in
> > several cases.
> >
> > Thanks!
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: TF in wide internet crawls
Posted by lewis john mcgibbney <le...@gmail.com>.
Hi Markus,
I follow you up until the last part of your comments.
"cope with non-edited..." Edited by whom, and for what purpose? To give a
better relative tf score...
To comment on the first part, and please ignore or correct me if I am wrong,
but do we not give each page, and therefore each document, an initial score
of 1.0 which is then subsequently used by whichever scoring algorithm we
plug in? If this is the case, then how are we separating the score for a page
from the tf of some term within a document, or the tf-idf of that term over
the entire document collection, to determine relevance? How can we accurately
disambiguate between these entities?
As I said, I'm losing you towards the end; however, it would be a good
discussion to explore behind the surface architecture.
On Mon, Jul 25, 2011 at 10:23 PM, Markus Jelsma
<ma...@openindex.io> wrote:
> Hi,
>
> I've done several projects where term frequency yields bad result sets and
> worse relevancy. These projects all had one thing in common: user-generated
> content with a competitive edge, meaning classifieds sites such as eBay.
> The internet at large is similar: it contains edited content, classifieds,
> and spam or other garbage.
>
> What do you do with tf in your wide internet index? Do you impose a
> threshold, or are you emitting 1.0f for each match?
> For now I emit 1.0f for each match and rely on matches in multiple fields
> with varying boosts, among various other methods, to improve relevancy.
>
> Can tf*idf cope with non-edited (and untrusted) documents at all? I've
> seen great relevancy with good content but really bad relevancy in
> several cases.
>
> Thanks!
>
--
*Lewis*