Posted to user@nutch.apache.org by Jon Shoberg <jo...@shoberg.net> on 2005/08/24 21:17:30 UTC

Fetcher, Query Strings, and Duplicate Hashes (Nutch 0.7)

Has anyone looked at modifying the Fetcher code to check for duplicate 
content?  Not surprisingly, when query strings are allowed in URLs 
there is a ton of duplicate content and re-fetching going on.

The wiki provides a brief overview of the Fetcher and what calls are 
made.  I modified the outputPage function in Fetcher.java to use a 
MySQL DB to track MD5 hashes of URLs and of the content from 
ParseText.getText().  This works "OK" and is nothing more than an 
obvious hack.
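
The shape of the hack is roughly this (a simplified sketch, not the 
actual patch; the table layout and class name are made up for 
illustration and are not part of Nutch):

  import java.security.MessageDigest;
  import java.sql.Connection;
  import java.sql.PreparedStatement;
  import java.sql.ResultSet;

  public class ContentHashTracker {

    // Hex-encoded MD5 of a string (a URL, or the text from ParseText.getText()).
    static String md5(String text) throws Exception {
      byte[] raw = MessageDigest.getInstance("MD5").digest(text.getBytes("UTF-8"));
      StringBuilder hex = new StringBuilder();
      for (int i = 0; i < raw.length; i++) {
        hex.append(String.format("%02x", raw[i]));
      }
      return hex.toString();
    }

    // Returns true if this parse text was seen before; otherwise records it.
    // Assumes: CREATE TABLE content_hashes (hash CHAR(32) PRIMARY KEY, url TEXT)
    static boolean isDuplicate(Connection db, String url, String parseText)
        throws Exception {
      String hash = md5(parseText);
      PreparedStatement select =
          db.prepareStatement("SELECT 1 FROM content_hashes WHERE hash = ?");
      select.setString(1, hash);
      ResultSet rs = select.executeQuery();
      if (rs.next()) {
        return true;   // identical text already fetched under some URL
      }
      PreparedStatement insert =
          db.prepareStatement("INSERT INTO content_hashes (hash, url) VALUES (?, ?)");
      insert.setString(1, hash);
      insert.setString(2, url);
      insert.executeUpdate();
      return false;
    }
  }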

Has anyone had significant success with modifying the Fetcher or plugins 
to actively manage content duplication and fetcher performance in a 
better way?

Thoughts? Ideas?

-j


Re: Fetcher, Query Strings, and Duplicate Hashes (Nutch 0.7)

Posted by Jon Shoberg <jo...@shoberg.net>.
Jon Shoberg wrote:
> Has anyone looked at modifying the Fetcher code to check for duplicate
> content?  Not surprisingly, when query strings are allowed in URLs
> there is a ton of duplicate content and re-fetching going on.
>
> Has anyone had significant success with modifying the Fetcher or plugins
> to actively manage content duplication and fetcher performance in a
> better way?

Nutch will dedup at merge time, but I am talking about managing 
deduplication of content during the fetching process.

-j



Re: Fetcher, Query Strings, and Duplicate Hashes (Nutch 0.7)

Posted by Lukas Vlcek <lu...@gmail.com>.
Hi,
I need to solve a related problem. I have URLs with dynamic query
strings and need to filter out specific parameters that only affect the
order of items on the result page (so from the HTML point of view the
page is not a duplicate, but from the information point of view it is).

Is there an easy way to filter specific parameters (like
"&orderBy=name") out of the URL before indexing?

Lukas

On 8/25/05, Michael Ji <fj...@yahoo.com> wrote:
> Hi Jon:
> 
> You have an interesting approach.
> 
> We are making a similar effort to avoid unnecessary indexing and data
> duplication for pages whose content has not changed since the last
> successful fetch.
> 
> I am thinking of adding an extra data field to the "fetchlist" data
> structure that holds the MD5 hash of the content from the previous
> fetch.
> 
> If the current fetch gets the same content, I will skip the parsing
> and indexing steps.
> 
> Any comments?
> 
> Michael Ji,
> 
> --- Jon Shoberg <jo...@shoberg.net> wrote:
> 
> > Has anyone looked at modifying the Fetcher code to check for duplicate
> > content?  Not surprisingly, when query strings are allowed in URLs
> > there is a ton of duplicate content and re-fetching going on.
> >
> > The wiki provides a brief overview of the Fetcher and what calls are
> > made.  I modified the outputPage function in Fetcher.java to use a
> > MySQL DB to track MD5 hashes of URLs and of the content from
> > ParseText.getText().  This works "OK" and is nothing more than an
> > obvious hack.
> >
> > Has anyone had significant success with modifying the Fetcher or plugins
> > to actively manage content duplication and fetcher performance in a
> > better way?
> >
> > Thoughts? Ideas?
> >
> > -j
> 

Re: Fetcher, Query Strings, and Duplicate Hashes (Nutch 0.7)

Posted by Michael Ji <fj...@yahoo.com>.
Hi Andrzej:

That is exactly what I am trying to implement! I guess the patch is not
included in the new Nutch 0.7, right? At least, I didn't find
"src/java/org/apache/nutch/db/FetchSchedule.java"
in the SVN source code.

I will try to apply the patch code myself and test it.

thanks,

Michael Ji,


--- Andrzej Bialecki <ab...@getopt.org> wrote:

> Michael Ji wrote:
> > Hi Jon:
> > 
> > You have an interesting approach.
> > 
> > We are making a similar effort to avoid unnecessary indexing and data
> > duplication for pages whose content has not changed since the last
> > successful fetch.
> > 
> > I am thinking of adding an extra data field to the "fetchlist" data
> > structure that holds the MD5 hash of the content from the previous
> > fetch.
> > 
> > If the current fetch gets the same content, I will skip the parsing
> > and indexing steps.
> 
> Please see the patches in http://issues.apache.org/jira/browse/NUTCH-61 .
> 



		

Re: Fetcher, Query Strings, and Duplicate Hashes (Nutch 0.7)

Posted by Andrzej Bialecki <ab...@getopt.org>.
Michael Ji wrote:
> Hi Jon:
> 
> You have an interesting approach. 
> 
> We are making a similar effort to avoid unnecessary indexing and data
> duplication for pages whose content has not changed since the last
> successful fetch.
> 
> I am thinking of adding an extra data field to the "fetchlist" data
> structure that holds the MD5 hash of the content from the previous
> fetch.
> 
> If the current fetch gets the same content, I will skip the parsing
> and indexing steps.

Please see the patches in http://issues.apache.org/jira/browse/NUTCH-61 .


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Fetcher, Query Strings, and Duplicate Hashes (Nutch 0.7)

Posted by Michael Ji <fj...@yahoo.com>.
Hi Jon:

You have an interesting approach. 

We are making a similar effort to avoid unnecessary indexing and data
duplication for pages whose content has not changed since the last
successful fetch.

I am thinking of adding an extra data field to the "fetchlist" data
structure that holds the MD5 hash of the content from the previous
fetch.

If the current fetch gets the same content, I will skip the parsing
and indexing steps.
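
Roughly, the check I have in mind looks like this (just a sketch with
plain MessageDigest; the shouldSkip() helper and the idea that the
previous hash comes from the extra fetchlist field are illustrative,
not existing Nutch code):

  import java.security.MessageDigest;
  import java.util.Arrays;

  public class RefetchCheck {

    // MD5 of the freshly fetched content bytes.
    static byte[] md5(byte[] content) throws Exception {
      return MessageDigest.getInstance("MD5").digest(content);
    }

    // True if the page content is identical (by MD5) to the hash recorded
    // at the previous successful fetch, so parsing and indexing can be
    // skipped for this fetchlist entry.
    static boolean shouldSkip(byte[] previousHash, byte[] fetchedContent)
        throws Exception {
      if (previousHash == null) {
        return false;   // first fetch: nothing to compare against
      }
      return Arrays.equals(previousHash, md5(fetchedContent));
    }
  }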

Any comments?

Michael Ji,

--- Jon Shoberg <jo...@shoberg.net> wrote:

> Has anyone looked at modifying the Fetcher code to check for duplicate
> content?  Not surprisingly, when query strings are allowed in URLs
> there is a ton of duplicate content and re-fetching going on.
> 
> The wiki provides a brief overview of the Fetcher and what calls are
> made.  I modified the outputPage function in Fetcher.java to use a
> MySQL DB to track MD5 hashes of URLs and of the content from
> ParseText.getText().  This works "OK" and is nothing more than an
> obvious hack.
> 
> Has anyone had significant success with modifying the Fetcher or plugins
> to actively manage content duplication and fetcher performance in a
> better way?
> 
> Thoughts? Ideas?
> 
> -j
> 
> 



		