Posted to user@nutch.apache.org by "Insurance Squared Inc." <gc...@insurancesquared.com> on 2006/03/22 18:39:39 UTC

Removing urls from webdb

We've got a website that is causing our crawler to slow down (from 
20 Mbit/s down to 3-5 Mbit/s) - 400K pages that are basically not available, 
we're just getting 404s.  I'd like to remove them from the DB to get 
our crawl speed back up again.

Here's what our developer told me - I'm stumped, that seems really odd.  
Is there a better way to remove a URL so that it doesn't get crawled?

Running Nutch 0.7.1 on a dual Xeon with 8 GB of RAM. 

-------------------------
There are more than 400,000  urls in the webdb.  It takes  ~4 hours 
to remove a url from the webdb. That means that it'll take  ~1,600,000 
hours (~66,666 days, or ~2222 months, ~185 years) to remove 400,000 CAA 
urls from the webdb. Do you really want to remove them in this way?



Re: Removing urls from webdb

Posted by keren nutch <ke...@yahoo.ca>.
Hi sudhendra,
 
 Thanks for the reply. It's src/java/org/apache/nutch/tools.PruneDB, not src/java/org/apache/nutch/toos.PruneDB.
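
So, assuming the freshly built classes are on the classpath, the corrected command would presumably be:

   nutch org.apache.nutch.tools.PruneDB db -s

(i.e. 'tools', not 'toos', in the package name).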
 
 Best regards,
 
 Keren
 
sudhendra seshachala <su...@yahoo.com> wrote: I guess the problem is with the package name 
  src/java/org/apache/nutch/tools.PruneDB and
  src/java/org/apache/nutch/toos.PruneDB...
   
  Can you please verify again? It looks like a typo....
   
  Thanks 

keren nutch  wrote:
  Hi Matt,

Thanks for the reply. I put PruneDB.java in src/java/org/apache/nutch/tools and ran ant. But when I run 'nutch org.apache.nutch.toos.PruneDB db -s', I get this error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/nutch/tools/PruneDB

Please let me know where I'm wrong.

Keren

Matt Kangas wrote: I'm puzzled by the claim that "It takes ~4 hours to remove a url from 
the webdb.". If you're removing them one at a time, yes, because you 
have to rewrite the entire webdb for any change. But you want to 
process them in bulk. So it should only take:
= (time to rewrite webdb) + (time to process 11M urls through 
URLFilter chain)
= 4 hrs + X

X depends on the complexity of your URLFilter chain. You only need 
RegexURLFilter with two patterns defined. (a minus for a bad site, 
and a plus for all else).
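
For example, a minimal regex-urlfilter.txt for this case might look something like the following (the host name below is only a placeholder for the bad site):

  # skip every url on the site that only returns 404s (placeholder host)
  -^http://([a-z0-9-]+\.)*badsite\.example\.com/
  # accept everything else
  +.

With urlfilter-regex enabled in plugin.includes, the same two patterns should also keep that site's urls out of future fetch lists.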

Using my PruneDBTool, as discussed earlier, you can eliminate all of 
those urls in a single pass over the webdb.

http://blog.busytonight.com/2006/03/nutch_07_prunedb_tool.html

HTH,
--Matt

On Mar 22, 2006, at 12:55 PM, keren nutch wrote:

> Actually, we have 11,000,000 urls in the webdb.
>
> Keren
>
> "Insurance Squared Inc." wrote: We've 
> got a website that is causing our crawler to slow down (from
> 20mbits down to 3-5) - 400K pages that are basically not available,
> we're just getting 404's. I'd like to remove them from the DB to get
> our crawl speed back up again.
>
> Here's what our developer told me - I'm stumped, that seems really 
> odd.
> Is there a better way to remove a URL so that it doesn't get crawled?
>
> Running nutch 0.71 on a dual xeon with 8 gigs of ram.
>
> -------------------------
> There are more than 400,000 urls in the webdb. It takes ~4 hours
> to remove a url from the webdb. That means that it'll take ~1,600,000
> hours (~66,666 days, or ~2222 months, ~185 years) to remove 400,000 
> CAA
> urls from the webdb. Do you really want to remove them in this way?
>

--
Matt Kangas / kangas@gmail.com








  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   


  

		

Re: Removing urls from webdb

Posted by sudhendra seshachala <su...@yahoo.com>.
I guess the problem is with the package name 
  src/java/org/apache/nutch/tools.PruneDB and
  src/java/org/apache/nutch/toos.PruneDB...
   
  Can you please verify again? It looks like a typo....
   
  Thanks 

keren nutch <ke...@yahoo.ca> wrote:
  Hi Matt,

Thanks for the reply. I put PruneDB.java in src/java/org/apache/nutch/tools and ran ant. But when I run 'nutch org.apache.nutch.toos.PruneDB db -s', I get this error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/nutch/tools/PruneDB

Please let me know where I'm wrong.

Keren

Matt Kangas wrote: I'm puzzled by the claim that "It takes ~4 hours to remove a url from 
the webdb.". If you're removing them one at a time, yes, because you 
have to rewrite the entire webdb for any change. But you want to 
process them in bulk. So it should only take:
= (time to rewrite webdb) + (time to process 11M urls through 
URLFilter chain)
= 4 hrs + X

X depends on the complexity of your URLFilter chain. You only need 
RegexURLFilter with two patterns defined. (a minus for a bad site, 
and a plus for all else).

Using my PruneDBTool, as discussed earlier, you can eliminate all of 
those urls in a single pass over the webdb.

http://blog.busytonight.com/2006/03/nutch_07_prunedb_tool.html

HTH,
--Matt

On Mar 22, 2006, at 12:55 PM, keren nutch wrote:

> Actually, we have 11,000,000 urls in the webdb.
>
> Keren
>
> "Insurance Squared Inc." wrote: We've 
> got a website that is causing our crawler to slow down (from
> 20mbits down to 3-5) - 400K pages that are basically not available,
> we're just getting 404's. I'd like to remove them from the DB to get
> our crawl speed back up again.
>
> Here's what our developer told me - I'm stumped, that seems really 
> odd.
> Is there a better way to remove a URL so that it doesn't get crawled?
>
> Running nutch 0.71 on a dual xeon with 8 gigs of ram.
>
> -------------------------
> There are more than 400,000 urls in the webdb. It takes ~4 hours
> to remove a url from the webdb. That means that it'll take ~1,600,000
> hours (~66,666 days, or ~2222 months, ~185 years) to remove 400,000 
> CAA
> urls from the webdb. Do you really want to remove them in this way?
>

--
Matt Kangas / kangas@gmail.com








  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   


		

Re: parsing pdf file

Posted by Ravi Chintakunta <ra...@gmail.com>.
Hi Michael,

The default value for the content limit in nutch-default.xml is 65536.
This is set in these properties:

http.content.limit
file.content.limit
ftp.content.limit

So irrespective of the file size, the download is truncated at this value.

To allow parsing of files that exceed this limit, copy the above three
properties into nutch-site.xml and increase their values as needed.
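
For example, a nutch-site.xml override that raises all three limits to 1 MB (1048576 bytes is only an illustrative value; pick a ceiling that covers your largest documents) might look like:

<property>
  <!-- allow downloads up to 1 MB instead of the 65536-byte default -->
  <name>http.content.limit</name>
  <value>1048576</value>
</property>
<property>
  <name>file.content.limit</name>
  <value>1048576</value>
</property>
<property>
  <name>ftp.content.limit</name>
  <value>1048576</value>
</property>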


- Ravi Chintakunta



On 3/24/06, Michael Ji <fj...@yahoo.com> wrote:
> Hi there,
>
> I got the following errors;
>
> 060324 095216 http.max.delays = 10000
> 060324 095217 fetch okay, but can't parse
> http://www.ucis.pitt.edu/cwes/papers/work_papers/wp6_2005.pdf,
> reason: failed(2,202): Content truncated at 69266
> bytes. Parser can't handle incomplete pdf file.
>
> It seems the fetch succeeds but the parse does not; I
> already expanded http.max.delays to 10000, is that still not enough?
>
> thanks,
>
> Michael
>
>
>

Re: fetching https pages

Posted by kauu <ba...@gmail.com>.
I think you need a protocol plugin to fetch https pages,
so you need to change this in your nutch-site.xml if you have the
protocol-https plugin:


<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|protocol-https|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>

On 3/27/06, Michael Ji <fj...@yahoo.com> wrote:
>
> hi there:
>
> Will the following line in nutch-site.xml let
> Nutch fetch https pages?
>
> "protocol-(http|https)"
>
> I tried that, but it gives me this error message:
>
> "
> failed with:
> org.apache.nutch.protocol.ProtocolNotFound: protocol
> not found for url=https
> "
>
> Any idea how to fix it?
>
> thanks,
>
> Michael
>
>
>
>
>



--
www.babatu.com

Re: fetching https pages

Posted by Andrzej Bialecki <ab...@getopt.org>.
Michael Ji wrote:
> hi there:
>
> Will the following line in nutch-site.xml let
> Nutch fetch https pages?
>
> "protocol-(http|https)"
>   

No. There is no plugin named "protocol-https". In order to handle HTTPS 
you need to use the "protocol-httpclient" plugin, which handles both 
HTTP and HTTPS - and then you should remove "protocol-http" from your 
config.
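
So, assuming the protocol-httpclient plugin is present in your build, the plugin.includes from the earlier message would presumably become something like:

<property>
  <name>plugin.includes</name>
  <!-- protocol-httpclient fetches both http and https urls; protocol-http is dropped -->
  <value>nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin directory names to include.</description>
</property>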

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



fetching https pages

Posted by Michael Ji <fj...@yahoo.com>.
hi there:

Will the following line in nutch-site.xml let
Nutch fetch https pages?

"protocol-(http|https)"

I tried that, but it gives me this error message:

"
failed with:
org.apache.nutch.protocol.ProtocolNotFound: protocol
not found for url=https
"

Any idea how to fix it?

thanks,

Michael





a way to fetch, parse, index and query pdf/msword

Posted by Michael Ji <fj...@yahoo.com>.
hi there,

Within nutch-site.xml, I added pdf|msword to the
parse-, index-, and query- plugin lists.

I wonder if this is the proper way to tell Nutch to
fetch, index and query these two file formats?

thanks,

Michael,

---------------------------------------------------

<property>
<name>plugin.includes</name>

<value>

nutch-extensionpoints|protocol-http|
urlfilter-regex|
parse-(text|html|pdf|msword)|
index-(basic|pdf|msword)|
query-(basic|site|url|pdf|msword)

</value>
  <description> </description>
</property>
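
(For comparison, the plugin.includes suggested elsewhere in this thread for PDF and Word files keeps index-basic and the basic query plugins, and only adds the two formats to the parse- group, along the lines of:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url)</value>
  <description> </description>
</property>

Whether the extra index-* and query-* entries above are needed may depend on which plugins actually exist in your build.)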



Re: search word file

Posted by Michael Ji <fj...@yahoo.com>.
I found that my index.done file has 0 size; is that wrong?

But I can't find any error in the indexing log:

"060324 095226 * Moving index to NFS if needed...
060324 095226 DONE indexing segment 20060324095213:
total 1 records in 0.688 s (Infinity rec/s).
060324 095226 done indexing
"

thanks,

Michael,

--- Michael Ji <fj...@yahoo.com> wrote:

> hi there,
> 
> I can fetch the Word file and parse it
> successfully:
> 
> "060324 094040 fetching
>
http://www.ala.org/ala/rusa/rusaprotools/referenceguide/illformprint.doc
> 060324 094040 http.proxy.host = null
> 060324 094040 http.proxy.port = 8080
> 060324 094040 http.timeout = 10000
> "
> 
> I can use lukeAll to check the content of the
> segment
> and can see the letter.
> 
> But I can't find the letter through the Nutch search page.
> Do
> I need more configuration to make Word files
> searchable?
> 
> thanks,
> 
> Michael,
> 
> 



search word file

Posted by Michael Ji <fj...@yahoo.com>.
hi there,

I can fetch the Word file and parse it successfully:

"060324 094040 fetching
http://www.ala.org/ala/rusa/rusaprotools/referenceguide/illformprint.doc
060324 094040 http.proxy.host = null
060324 094040 http.proxy.port = 8080
060324 094040 http.timeout = 10000
"

I can use lukeAll to check the content of the segment
and can see the letter.

But I can't find the letter through the Nutch search page. Do
I need more configuration to make Word files
searchable?

thanks,

Michael,


parsing pdf file

Posted by Michael Ji <fj...@yahoo.com>.
Hi there,

I got the following errors;

060324 095216 http.max.delays = 10000
060324 095217 fetch okay, but can't parse
http://www.ucis.pitt.edu/cwes/papers/work_papers/wp6_2005.pdf,
reason: failed(2,202): Content truncated at 69266
bytes. Parser can't handle incomplete pdf file.

It seems the fetch succeeds but the parse does not; I
already expanded http.max.delays to 10000, is that still not enough?

thanks,

Michael



Re: crawling pdf and word file

Posted by Michael Ji <fj...@yahoo.com>.
hi Sudhendra:

I used the same configuration you suggested in
nutch-site.xml.

I did a test and, after looking at the fetch log, found
the following error message:

"
fetch okay, but can't parse
http://www.ucis.pitt.edu/cwes/papers/work_papers/wp6_2005.pdf,
reason: failed(2,203): Content-Type not text/html:
application/pdf
"

Does that mean the PDF is downloaded but doesn't parse
successfully? So we can't search the words in the PDF
file directly?

thanks,

Michael,

By the way, I am using Nutch 0.7 for this testing.



--- sudhendra seshachala <su...@yahoo.com> wrote:

> In nutch-default.xml,
> include the plugins for Word and PDF as below.
> 
> <property>
>   <name>plugin.includes</name>
>  
>
<value>protocol-http|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url|jobs)</value>
>   <description>Regular expression naming plugin
> directory names to
>   include.  Any plugin not matching this expression
> is excluded.
>   In any case you need at least include the
> nutch-extensionpoints plugin. By
>   default Nutch includes crawling just HTML and
> plain text via HTTP,
>   and basic indexing and search plugins.
>   </description>
> </property>
> But the recommendation is to include the property in
> nutch-site.xml
> 
> Hope this helps.
> 
> Michael Ji <fj...@yahoo.com> wrote: 
> hi there,
> 
> Is there any specific setting that needs to be added to the
> configuration file in order to crawl and index PDF
> and
> Word files?
> 
> thanks,
> 
> Michael,
> 
> 
> 
> 
>   Sudhi Seshachala
>   http://sudhilogs.blogspot.com/
>    
> 
> 
> 		



Re: crawling pdf and word file

Posted by sudhendra seshachala <su...@yahoo.com>.
In nutch-default.xml,
include the plugins for Word and PDF as below.

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url|jobs)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>
But the recommendation is to include the property in nutch-site.xml.

Hope this helps.

Michael Ji <fj...@yahoo.com> wrote: 
hi there,

Is there any specific setting that needs to be added to the
configuration file in order to crawl and index PDF and
Word files?

thanks,

Michael,




  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   


		

crawling pdf and word file

Posted by Michael Ji <fj...@yahoo.com>.
hi there,

Is there any specific setting that needs to be added to the
configuration file in order to crawl and index PDF and
Word files?

thanks,

Michael,


Re: Removing urls from webdb

Posted by keren nutch <ke...@yahoo.ca>.
Hi Matt,
 
 Thanks for the reply. I put PruneDB.java in src/java/org/apache/nutch/tools and ran ant. But when I run 'nutch org.apache.nutch.toos.PruneDB db -s', I get this error:
 Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/nutch/tools/PruneDB
 
 Please let me know where I'm wrong.
 
 Keren

Matt Kangas <ka...@gmail.com> wrote: I'm puzzled by the claim that "It takes ~4 hours to remove a url from  
the webdb.". If you're removing them one at a time, yes, because you  
have to rewrite the entire webdb for any change. But you want to  
process them in bulk. So it should only take:
  = (time to rewrite webdb) + (time to process 11M urls through  
URLFilter chain)
  = 4 hrs + X

X depends on the complexity of your URLFilter chain. You only need  
RegexURLFilter with two patterns defined. (a minus for a bad site,  
and a plus for all else).

Using my PruneDBTool, as discussed earlier, you can eliminate all of  
those urls in a single pass over the webdb.

http://blog.busytonight.com/2006/03/nutch_07_prunedb_tool.html

HTH,
--Matt

On Mar 22, 2006, at 12:55 PM, keren nutch wrote:

> Actually, we have 11,000,000 urls in the webdb.
>
>  Keren
>
> "Insurance Squared Inc."  wrote: We've  
> got a website that is causing our crawler to slow down (from
> 20mbits down to 3-5) - 400K pages that are basically not available,
> we're just getting 404's.  I'd like to remove them from the DB to get
> our crawl speed back up again.
>
> Here's what our developer told me - I'm stumped, that seems really  
> odd.
> Is there a better way to remove a URL so that it doesn't get crawled?
>
> Running nutch 0.71 on a dual xeon with 8 gigs of ram.
>
> -------------------------
> There are more than 400,000  urls in the webdb.  It takes  ~4 hours
> to remove a url from the webdb. That means that it'll take  ~1,600,000
> hours (~66,666 days, or ~2222 months, ~185 years) to remove 400,000  
> CAA
> urls from the webdb. Do you really want to remove them in this way?
>

--
Matt Kangas / kangas@gmail.com





		

Re: Removing urls from webdb

Posted by Matt Kangas <ka...@gmail.com>.
I'm puzzled by the claim that "It takes ~4 hours to remove a url from  
the webdb.". If you're removing them one at a time, yes, because you  
have to rewrite the entire webdb for any change. But you want to  
process them in bulk. So it should only take:
  = (time to rewrite webdb) + (time to process 11M urls through  
URLFilter chain)
  = 4 hrs + X

X depends on the complexity of your URLFilter chain. You only need  
RegexURLFilter with two patterns defined. (a minus for a bad site,  
and a plus for all else).

Using my PruneDBTool, as discussed earlier, you can eliminate all of  
those urls in a single pass over the webdb.

http://blog.busytonight.com/2006/03/nutch_07_prunedb_tool.html

HTH,
--Matt

On Mar 22, 2006, at 12:55 PM, keren nutch wrote:

> Actually, we have 11,000,000 urls in the webdb.
>
>  Keren
>
> "Insurance Squared Inc." <gc...@insurancesquared.com> wrote: We've  
> got a website that is causing our crawler to slow down (from
> 20mbits down to 3-5) - 400K pages that are basically not available,
> we're just getting 404's.  I'd like to remove them from the DB to get
> our crawl speed back up again.
>
> Here's what our developer told me - I'm stumped, that seems really  
> odd.
> Is there a better way to remove a URL so that it doesn't get crawled?
>
> Running nutch 0.71 on a dual xeon with 8 gigs of ram.
>
> -------------------------
> There are more than 400,000  urls in the webdb.  It takes  ~4 hours
> to remove a url from the webdb. That means that it'll take  ~1,600,000
> hours (~66,666 days, or ~2222 months, ~185 years) to remove 400,000  
> CAA
> urls from the webdb. Do you really want to remove them in this way?
>

--
Matt Kangas / kangas@gmail.com




Re: Removing urls from webdb

Posted by keren nutch <ke...@yahoo.ca>.
Actually, we have 11,000,000 urls in the webdb. 
 
 Keren

"Insurance Squared Inc." <gc...@insurancesquared.com> wrote: We've got a website that is causing our crawler to slow down (from 
20mbits down to 3-5) - 400K pages that are basically not available, 
we're just getting 404's.  I'd like to remove them from the DB to get 
our crawl speed back up again.

Here's what our developer told me - I'm stumped, that seems really odd.  
Is there a better way to remove a URL so that it doesn't get crawled?

Running nutch 0.71 on a dual xeon with 8 gigs of ram. 

-------------------------
There are more than 400,000  urls in the webdb.  It takes  ~4 hours 
to remove a url from the webdb. That means that it'll take  ~1,600,000 
hours (~66,666 days, or ~2222 months, ~185 years) to remove 400,000 CAA 
urls from the webdb. Do you really want to remove them in this way?




		