Posted to dev@nutch.apache.org by Mladen Adamovic <ad...@blic.net> on 2006/08/29 17:26:43 UTC
books (and articles) about search engine algorithms
Hi!
I want to get more insight into various search engine algorithms. I have
broad knowledge of standard data structures & algorithms (hash values,
trees, graphs, etc.). I thought that Lucene would be a good place to
start looking for information, and indeed I've found some decent
information on the Nutch website. However, I decided to post some
personal opinions on this issue here, in the hope that someone might give
me even more information.
As far as I understand, I should read books about Information Retrieval
(e.g. Modern Information Retrieval by Baeza-Yates, Ribeiro-Neto). Any update?
I also found, via one article about link spam and CiteSeer, several
articles about link spam techniques, namely:
1. Undue Influence: Eliminating the Impact of Link Plagiarism on Web
Search Rankings
2. Using Rank Propagation and Probabilistic Counting for Link-Based Spam
Detection
3. SpamRank: Fully Automatic Link Spam Detection
4. Identifying Link Farm Spam Pages
5. Thwarting the Nigritude Ultramarine: Learning to Identify Link Spam
If you have more suggestions for valuable literature on search
engine algorithms (primarily books, but good articles would work too),
let me know.
Thanks, and keep up the good work.
--
Mladen Adamovic
http://www.online-utility.org http://www.cheapvps.info
http://www.vpsreview.com http://www.vpsdeal.com
Re: books (and articles) about search engine algorithms
Posted by Incze Lajos <in...@axelero.hu>.
> If you have more suggestions for valuable literature on search
> engine algorithms (primarily books, but good articles would work too),
> let me know.
http://www-csli.stanford.edu/~schuetze/information-retrieval.html
incze
Re: books (and articles) about search engine algorithms
Posted by Thomas Delnoij <di...@gmail.com>.
I found "Mining the Web: Discovering Knowledge from Hypertext Data"
by Soumen Chakrabarti a useful reference.
http://www.amazon.com/gp/product/1558607544/103-9548474-1631829?v=glance&n=283155
Rgrds, Thomas
Re: books (and articles) about search engine algorithms
Posted by Andrzej Bialecki <ab...@getopt.org>.
Mladen Adamovic wrote:
> Hi!
>
> I want to get more insight into various search engine algorithms. I
> have broad knowledge of standard data structures & algorithms
> (hash values, trees, graphs, etc.). I thought that Lucene would be a
> good place to start looking for information, and indeed I've found some
> decent information on the Nutch website. However, I decided to post
> some personal opinions on this issue here, in the hope that someone
> might give me even more information.
>
> As far as I understand, I should read books about Information
> Retrieval (e.g. Modern Information Retrieval by Baeza-Yates,
> Ribeiro-Neto). Any update?
>
> I also found, via one article about link spam and CiteSeer, several
> articles about link spam techniques, namely:
> 1. Undue Influence: Eliminating the Impact of Link Plagiarism on Web
> Search Rankings
> 2. Using Rank Propagation and Probabilistic Counting for Link-Based
> Spam Detection
> 3. SpamRank: Fully Automatic Link Spam Detection
> 4. Identifying Link Farm Spam Pages
> 5. Thwarting the Nigritude Ultramarine: Learning to Identify Link Spam
Yes, good references. At this moment most of my working knowledge about
search engines comes either from the book you cited above, or from
papers found on Citeseer - play around with IR-related terms, you will
find a LOT of papers to read... ;). And then follow references from
those papers ...
I also found that other printed books are either too outdated or not so
relevant to web-scale IR.
In the end (as usual) the best way to really dig into the subject is
to try and solve a real-life problem, combining the tools you already
have and what you have learned.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com