Posted to user@nutch.apache.org by "Insurance Squared Inc." <gc...@insurancesquared.com> on 2006/03/22 18:39:39 UTC

Removing urls from webdb

We've got a website that is causing our crawler to slow down (from 
20 Mbit/s down to 3-5 Mbit/s) - 400K pages that are basically not available, 
we're just getting 404s.  I'd like to remove them from the DB to get 
our crawl speed back up again.

Here's what our developer told me - I'm stumped, that seems really odd.  
Is there a better way to remove a URL so that it doesn't get crawled?

Running Nutch 0.7.1 on a dual Xeon with 8 GB of RAM. 

-------------------------
There are more than 400,000  urls in the webdb.  It takes  ~4 hours 
to remove a url from the webdb. That means that it'll take  ~1,600,000 
hours (~66,666 days, or ~2222 months, ~185 years) to remove 400,000 CAA 
urls from the webdb. Do you really want to remove them in this way?



Re: Removing urls from webdb

Posted by keren nutch <ke...@yahoo.ca>.
Hi sudhendra,
 
 Thanks for the reply. It's src/java/org/apache/nutch/tools.PruneDB, not src/java/org/apache/nutch/toos.PruneDB.
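
So, assuming the freshly built classes are on the classpath, the corrected command would presumably be:

   nutch org.apache.nutch.tools.PruneDB db -s

(i.e. 'tools', not 'toos', in the package name).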
 
 Best regards,
 
 Keren
 
sudhendra seshachala <su...@yahoo.com> wrote: I guess the problem is with the package name 
  src/java/org/apache/nutch/tools.PruneDB and
  src/java/org/apache/nutch/toos.PruneDB...
   
  Can you please verify again? It looks like a typo....
   
  Thanks 

keren nutch  wrote:
  Hi Matt,

Thanks for the reply. I put PruneDB.java in src/java/org/apache/nutch/tools and ran ant. But when I run 'nutch org.apache.nutch.toos.PruneDB db -s', I get this error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/nutch/tools/PruneDB

Please let me know where I'm wrong.

Keren

Matt Kangas wrote: I'm puzzled by the claim that "It takes ~4 hours to remove a url from 
the webdb.". If you're removing them one at a time, yes, because you 
have to rewrite the entire webdb for any change. But you want to 
process them in bulk. So it should only take:
= (time to rewrite webdb) + (time to process 11M urls through 
URLFilter chain)
= 4 hrs + X

X depends on the complexity of your URLFilter chain. You only need 
RegexURLFilter with two patterns defined. (a minus for a bad site, 
and a plus for all else).
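
For example, a minimal regex-urlfilter.txt for this case might look something like the following (the host name below is only a placeholder for the bad site):

  # skip every url on the site that only returns 404s (placeholder host)
  -^http://([a-z0-9-]+\.)*badsite\.example\.com/
  # accept everything else
  +.

With urlfilter-regex enabled in plugin.includes, the same two patterns should also keep that site's urls out of future fetch lists.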

Using my PruneDBTool, as discussed earlier, you can eliminate all of 
those urls in a single pass over the webdb.

http://blog.busytonight.com/2006/03/nutch_07_prunedb_tool.html

HTH,
--Matt

On Mar 22, 2006, at 12:55 PM, keren nutch wrote:

> Actually, we have 11,000,000 urls in the webdb.
>
> Keren
>
> "Insurance Squared Inc." wrote: We've 
> got a website that is causing our crawler to slow down (from
> 20mbits down to 3-5) - 400K pages that are basically not available,
> we're just getting 404's. I'd like to remove them from the DB to get
> our crawl speed back up again.
>
> Here's what our developer told me - I'm stumped, that seems really 
> odd.
> Is there a better way to remove a URL so that it doesn't get crawled?
>
> Running nutch 0.71 on a dual xeon with 8 gigs of ram.
>
> -------------------------
> There are more than 400,000 urls in the webdb. It takes ~4 hours
> to remove a url from the webdb. That means that it'll take ~1,600,000
> hours (~66,666 days, or ~2222 months, ~185 years) to remove 400,000 
> CAA
> urls from the webdb. Do you really want to remove them in this way?
>

--
Matt Kangas / kangas@gmail.com








  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   


  

		

Re: Removing urls from webdb

Posted by sudhendra seshachala <su...@yahoo.com>.
I guess the problem is with the package name 
  src/java/org/apache/nutch/tools.PruneDB and
  src/java/org/apache/nutch/toos.PruneDB...
   
  Can you please verify again? It looks like a typo....
   
  Thanks 

keren nutch <ke...@yahoo.ca> wrote:
  Hi Matt,

Thanks for the reply. I put PruneDB.java in src/java/org/apache/nutch/tools and ran ant. But when I run 'nutch org.apache.nutch.toos.PruneDB db -s', I get this error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/nutch/tools/PruneDB

Please let me know where I'm wrong.

Keren

Matt Kangas wrote: I'm puzzled by the claim that "It takes ~4 hours to remove a url from 
the webdb.". If you're removing them one at a time, yes, because you 
have to rewrite the entire webdb for any change. But you want to 
process them in bulk. So it should only take:
= (time to rewrite webdb) + (time to process 11M urls through 
URLFilter chain)
= 4 hrs + X

X depends on the complexity of your URLFilter chain. You only need 
RegexURLFilter with two patterns defined. (a minus for a bad site, 
and a plus for all else).

Using my PruneDBTool, as discussed earlier, you can eliminate all of 
those urls in a single pass over the webdb.

http://blog.busytonight.com/2006/03/nutch_07_prunedb_tool.html

HTH,
--Matt

On Mar 22, 2006, at 12:55 PM, keren nutch wrote:

> Actually, we have 11,000,000 urls in the webdb.
>
> Keren
>
> "Insurance Squared Inc." wrote: We've 
> got a website that is causing our crawler to slow down (from
> 20mbits down to 3-5) - 400K pages that are basically not available,
> we're just getting 404's. I'd like to remove them from the DB to get
> our crawl speed back up again.
>
> Here's what our developer told me - I'm stumped, that seems really 
> odd.
> Is there a better way to remove a URL so that it doesn't get crawled?
>
> Running nutch 0.71 on a dual xeon with 8 gigs of ram.
>
> -------------------------
> There are more than 400,000 urls in the webdb. It takes ~4 hours
> to remove a url from the webdb. That means that it'll take ~1,600,000
> hours (~66,666 days, or ~2222 months, ~185 years) to remove 400,000 
> CAA
> urls from the webdb. Do you really want to remove them in this way?
>

--
Matt Kangas / kangas@gmail.com








  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   


		

Re: parsing pdf file

Posted by Ravi Chintakunta <ra...@gmail.com>.
Hi Michael,

The default value for the content limit in nutch-default.xml is 65536.
This is set in these properties:

http.content.limit
file.content.limit
ftp.content.limit

So irrespective of the file size, the download is truncated at this value.

To allow parsing of files that exceed this limit, copy the above three
properties into nutch-site.xml and increase their values as needed.
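
For example, a nutch-site.xml override that raises all three limits to 1 MB (1048576 bytes is only an illustrative value; pick a ceiling that covers your largest documents) might look like:

<property>
  <!-- allow downloads up to 1 MB instead of the 65536-byte default -->
  <name>http.content.limit</name>
  <value>1048576</value>
</property>
<property>
  <name>file.content.limit</name>
  <value>1048576</value>
</property>
<property>
  <name>ftp.content.limit</name>
  <value>1048576</value>
</property>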


- Ravi Chintakunta



On 3/24/06, Michael Ji <fj...@yahoo.com> wrote:
> Hi there,
>
> I got the following errors;
>
> 060324 095216 http.max.delays = 10000
> 060324 095217 fetch okay, but can't parse
> http://www.ucis.pitt.edu/cwes/papers/work_papers/wp6_2005.pdf,
> reason: failed(2,202): Content truncated at 69266
> bytes. Parser can't handle incomplete pdf file.
>
> It seems the fetch succeeds but the parse does not; I
> already expanded http.max.delays to 10000, is that still not enough?
>
> thanks,
>
> Michael
>
>
>

Re: fetching https pages

Posted by kauu <ba...@gmail.com>.
I think you need a protocol plugin to fetch https pages,
so you need to change this in your nutch-site.xml if you have the
protocol-https plugin:


<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|protocol-https|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>

On 3/27/06, Michael Ji <fj...@yahoo.com> wrote:
>
> hi there:
>
> Will the following line in nutch-site.xml let
> Nutch fetch https pages?
>
> "protocol-(http|https)"
>
> I tried that, but it gives me this error message:
>
> "
> failed with:
> org.apache.nutch.protocol.ProtocolNotFound: protocol
> not found for url=https
> "
>
> Any idea how to fix it?
>
> thanks,
>
> Michael
>
>
>
>
>



--
www.babatu.com

Re: fetching https pages

Posted by Andrzej Bialecki <ab...@getopt.org>.
Michael Ji wrote:
> hi there:
>
> Will the following line in nutch-site.xml let
> Nutch fetch https pages?
>
> "protocol-(http|https)"
>   

No. There is no plugin named "protocol-https". In order to handle HTTPS 
you need to use the "protocol-httpclient" plugin, which handles both 
HTTP and HTTPS - and then you should remove "protocol-http" from your 
config.
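
So, assuming the protocol-httpclient plugin is present in your build, the plugin.includes from the earlier message would presumably become something like:

<property>
  <name>plugin.includes</name>
  <!-- protocol-httpclient fetches both http and https urls; protocol-http is dropped -->
  <value>nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin directory names to include.</description>
</property>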

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



fetching https pages

Posted by Michael Ji <fj...@yahoo.com>.
hi there:

Will the following line in nutch-site.xml let
Nutch fetch https pages?

"protocol-(http|https)"

I tried that, but it gives me this error message:

"
failed with:
org.apache.nutch.protocol.ProtocolNotFound: protocol
not found for url=https
"

Any idea how to fix it?

thanks,

Michael





a way to fetch, parse, index and query pdf/msword

Posted by Michael Ji <fj...@yahoo.com>.
hi there,

Within nutch-site.xml, I added pdf|msword to the
parse-, index-, and query- plugin lists.

I wonder if this is the proper way to tell Nutch to
fetch, index and query these two file formats?

thanks,

Michael,

---------------------------------------------------

<property>
<name>plugin.includes</name>

<value>

nutch-extensionpoints|protocol-http|
urlfilter-regex|
parse-(text|html|pdf|msword)|
index-(basic|pdf|msword)|
query-(basic|site|url|pdf|msword)

</value>
  <description> </description>
</property>
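
(For comparison, the plugin.includes suggested elsewhere in this thread for PDF and Word files keeps index-basic and the basic query plugins, and only adds the two formats to the parse- group, along the lines of:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url)</value>
  <description> </description>
</property>

Whether the extra index-* and query-* entries above are needed may depend on which plugins actually exist in your build.)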



Re: search word file

Posted by Michael Ji <fj...@yahoo.com>.
I found that my index.done file has 0 size; is that wrong?

But I can't find any error in the indexing log:

"060324 095226 * Moving index to NFS if needed...
060324 095226 DONE indexing segment 20060324095213:
total 1 records in 0.688 s (Infinity rec/s).
060324 095226 done indexing
"

thanks,

Michael,

--- Michael Ji <fj...@yahoo.com> wrote:

> hi there,
> 
> I can fetch the Word file and parse it
> successfully:
> 
> "060324 094040 fetching
>
http://www.ala.org/ala/rusa/rusaprotools/referenceguide/illformprint.doc
> 060324 094040 http.proxy.host = null
> 060324 094040 http.proxy.port = 8080
> 060324 094040 http.timeout = 10000
> "
> 
> I can use lukeAll to check the content of the
> segment
> and can see the letter.
> 
> But I can't find the letter through the Nutch search page.
> Do
> I need more configuration to make Word files
> searchable?
> 
> thanks,
> 
> Michael,
> 
> 



search word file

Posted by Michael Ji <fj...@yahoo.com>.
hi there,

I can fetch the Word file and parse it successfully:

"060324 094040 fetching
http://www.ala.org/ala/rusa/rusaprotools/referenceguide/illformprint.doc
060324 094040 http.proxy.host = null
060324 094040 http.proxy.port = 8080
060324 094040 http.timeout = 10000
"

I can use lukeAll to check the content of the segment
and can see the letter.

But I can't find the letter through the Nutch search page. Do
I need more configuration to make Word files
searchable?

thanks,

Michael,


parsing pdf file

Posted by Michael Ji <fj...@yahoo.com>.
Hi there,

I got the following errors;

060324 095216 http.max.delays = 10000
060324 095217 fetch okay, but can't parse
http://www.ucis.pitt.edu/cwes/papers/work_papers/wp6_2005.pdf,
reason: failed(2,202): Content truncated at 69266
bytes. Parser can't handle incomplete pdf file.

It seems the fetch succeeds but the parse does not; I
already expanded http.max.delays to 10000, is that still not enough?

thanks,

Michael



Re: crawling pdf and word file

Posted by Michael Ji <fj...@yahoo.com>.
hi Sudhendra:

I used the same configuration you suggested in
nutch-site.xml.

I did a test and, after looking at the fetch log, found
the following error message:

"
fetch okay, but can't parse
http://www.ucis.pitt.edu/cwes/papers/work_papers/wp6_2005.pdf,
reason: failed(2,203): Content-Type not text/html:
application/pdf
"

Does that mean the PDF is downloaded but doesn't parse
successfully? So we can't search the words in the PDF
file directly?

thanks,

Michael,

By the way, I am using Nutch 0.7 for this testing.



--- sudhendra seshachala <su...@yahoo.com> wrote:

> In nutch-default.xml,
> include the plugins for Word and PDF as below.
> 
> <property>
>   <name>plugin.includes</name>
>  
>
<value>protocol-http|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url|jobs)</value>
>   <description>Regular expression naming plugin
> directory names to
>   include.  Any plugin not matching this expression
> is excluded.
>   In any case you need at least include the
> nutch-extensionpoints plugin. By
>   default Nutch includes crawling just HTML and
> plain text via HTTP,
>   and basic indexing and search plugins.
>   </description>
> </property>
> But the recommendation is to include the property in
> nutch-site.xml
> 
> Hope this helps.
> 
> Michael Ji <fj...@yahoo.com> wrote: 
> hi there,
> 
> Is there any specific setting that needs to be added to the
> configuration file in order to crawl and index PDF
> and
> Word files?
> 
> thanks,
> 
> Michael,
> 
> 
> 
> 
>   Sudhi Seshachala
>   http://sudhilogs.blogspot.com/
>    
> 
> 
> 		



Re: crawling pdf and word file

Posted by sudhendra seshachala <su...@yahoo.com>.
In nutch-default.xml,
include the plugins for Word and PDF as below.

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url|jobs)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>
But the recommendation is to include the property in nutch-site.xml.

Hope this helps.

Michael Ji <fj...@yahoo.com> wrote: 
hi there,

Is there any specific setting that needs to be added to the
configuration file in order to crawl and index PDF and
Word files?

thanks,

Michael,




  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   


		

crawling pdf and word file

Posted by Michael Ji <fj...@yahoo.com>.
hi there,

Is there any specific setting that needs to be added to the
configuration file in order to crawl and index PDF and
Word files?

thanks,

Michael,


Re: Removing urls from webdb

Posted by keren nutch <ke...@yahoo.ca>.
Hi Matt,
 
 Thanks for the reply. I put PruneDB.java in src/java/org/apache/nutch/tools and ran ant. But when I run 'nutch org.apache.nutch.toos.PruneDB db -s', I get this error:
 Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/nutch/tools/PruneDB
 
 Please let me know where I'm wrong.
 
 Keren

Matt Kangas <ka...@gmail.com> wrote: I'm puzzled by the claim that "It takes ~4 hours to remove a url from  
the webdb.". If you're removing them one at a time, yes, because you  
have to rewrite the entire webdb for any change. But you want to  
process them in bulk. So it should only take:
  = (time to rewrite webdb) + (time to process 11M urls through  
URLFilter chain)
  = 4 hrs + X

X depends on the complexity of your URLFilter chain. You only need  
RegexURLFilter with two patterns defined. (a minus for a bad site,  
and a plus for all else).

Using my PruneDBTool, as discussed earlier, you can eliminate all of  
those urls in a single pass over the webdb.

http://blog.busytonight.com/2006/03/nutch_07_prunedb_tool.html

HTH,
--Matt

On Mar 22, 2006, at 12:55 PM, keren nutch wrote:

> Actually, we have 11,000,000 urls in the webdb.
>
>  Keren
>
> "Insurance Squared Inc."  wrote: We've  
> got a website that is causing our crawler to slow down (from
> 20mbits down to 3-5) - 400K pages that are basically not available,
> we're just getting 404's.  I'd like to remove them from the DB to get
> our crawl speed back up again.
>
> Here's what our developer told me - I'm stumped, that seems really  
> odd.
> Is there a better way to remove a URL so that it doesn't get crawled?
>
> Running nutch 0.71 on a dual xeon with 8 gigs of ram.
>
> -------------------------
> There are more than 400,000  urls in the webdb.  It takes  ~4 hours
> to remove a url from the webdb. That means that it'll take  ~1,600,000
> hours (~66,666 days, or ~2222 months, ~185 years) to remove 400,000  
> CAA
> urls from the webdb. Do you really want to remove them in this way?
>

--
Matt Kangas / kangas@gmail.com





		

Re: Removing urls from webdb

Posted by Matt Kangas <ka...@gmail.com>.
I'm puzzled by the claim that "It takes ~4 hours to remove a url from  
the webdb.". If you're removing them one at a time, yes, because you  
have to rewrite the entire webdb for any change. But you want to  
process them in bulk. So it should only take:
  = (time to rewrite webdb) + (time to process 11M urls through  
URLFilter chain)
  = 4 hrs + X

X depends on the complexity of your URLFilter chain. You only need  
RegexURLFilter with two patterns defined. (a minus for a bad site,  
and a plus for all else).

Using my PruneDBTool, as discussed earlier, you can eliminate all of  
those urls in a single pass over the webdb.

http://blog.busytonight.com/2006/03/nutch_07_prunedb_tool.html

HTH,
--Matt

On Mar 22, 2006, at 12:55 PM, keren nutch wrote:

> Actually, we have 11,000,000 urls in the webdb.
>
>  Keren
>
> "Insurance Squared Inc." <gc...@insurancesquared.com> wrote: We've  
> got a website that is causing our crawler to slow down (from
> 20mbits down to 3-5) - 400K pages that are basically not available,
> we're just getting 404's.  I'd like to remove them from the DB to get
> our crawl speed back up again.
>
> Here's what our developer told me - I'm stumped, that seems really  
> odd.
> Is there a better way to remove a URL so that it doesn't get crawled?
>
> Running nutch 0.71 on a dual xeon with 8 gigs of ram.
>
> -------------------------
> There are more than 400,000  urls in the webdb.  It takes  ~4 hours
> to remove a url from the webdb. That means that it'll take  ~1,600,000
> hours (~66,666 days, or ~2222 months, ~185 years) to remove 400,000  
> CAA
> urls from the webdb. Do you really want to remove them in this way?
>

--
Matt Kangas / kangas@gmail.com




Re: Removing urls from webdb

Posted by keren nutch <ke...@yahoo.ca>.
Actually, we have 11,000,000 urls in the webdb. 
 
 Keren

"Insurance Squared Inc." <gc...@insurancesquared.com> wrote: We've got a website that is causing our crawler to slow down (from 
20mbits down to 3-5) - 400K pages that are basically not available, 
we're just getting 404's.  I'd like to remove them from the DB to get 
our crawl speed back up again.

Here's what our developer told me - I'm stumped, that seems really odd.  
Is there a better way to remove a URL so that it doesn't get crawled?

Running nutch 0.71 on a dual xeon with 8 gigs of ram. 

-------------------------
There are more than 400,000  urls in the webdb.  It takes  ~4 hours 
to remove a url from the webdb. That means that it'll take  ~1,600,000 
hours (~66,666 days, or ~2222 months, ~185 years) to remove 400,000 CAA 
urls from the webdb. Do you really want to remove them in this way?




		