Posted to dev@nutch.apache.org by Earl Cahill <ca...@yahoo.com> on 2005/09/17 01:49:50 UTC

HTTP 1.1

Maybe you're way ahead of me here, but it just hit me
that it would be pretty cool to group URLs to fetch by
host and then perhaps use HTTP 1.1 to reuse the
connection and save the initial handshaking overhead.
Not a huge deal for a couple of hits, but I think it
would make sense for large crawls.

Or maybe keep a pool of open HTTP connections to the last x
sites somewhere and check there first.
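For what it's worth, the grouping step is cheap to sketch. Here's a hypothetical illustration in plain Java (nothing here is Nutch API; `HostGrouper` and `groupByHost` are made-up names): bucket the fetch list by host, so each host's bucket could then be fetched over a single kept-alive connection.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Nutch code: group a fetch list by host so that
// all URLs for one host can share a persistent HTTP 1.1 connection.
public class HostGrouper {

    public static Map<String, List<String>> groupByHost(List<String> urls) {
        Map<String, List<String>> byHost = new LinkedHashMap<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            byHost.computeIfAbsent(host, h -> new ArrayList<>()).add(url);
        }
        return byHost;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
            "http://example.com/a",
            "http://other.org/x",
            "http://example.com/b");
        // example.com's two URLs land in one bucket, so one connection
        // could serve both fetches instead of two handshakes.
        Map<String, List<String>> grouped = groupByHost(urls);
        System.out.println(grouped.get("example.com").size()); // prints 2
    }
}
```

The per-host fetch loop would then walk each bucket with a connection held open, which is where the handshake savings would come from.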

Sound reasonable?  Already doing it?  I would be
willing to help.

Just a thought.

Earl


		
__________________________________ 
Yahoo! Mail - PC Magazine Editors' Choice 2005 
http://mail.yahoo.com

crawl-urlfilter.txt vs. regex-urlfilter.txt

Posted by Michael Ji <fj...@yahoo.com>.
Hi,

I found I can use crawl-urlfilter.txt to define the
domain limitation by 
"
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
"

But I found that when I don't use bin/nutch crawl...,
crawl-urlfilter.txt doesn't filter out the
domains I don't want.

Can I use regex-urlfilter.txt to define the domain the way
crawl-urlfilter.txt does?
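If I understand the setup right, the one-step bin/nutch crawl command is what reads crawl-urlfilter.txt, while the individual tools read regex-urlfilter.txt instead; both appear to take the same +/- regex rule syntax, so the same rule should carry over, e.g.:

"
# accept hosts in MY.DOMAIN.NAME (same rule as in crawl-urlfilter.txt)
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

# reject everything else
-.
"

(The trailing "-." line matters: without a catch-all reject, URLs that match no rule may still get through.)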

thanks,

Michael Ji


		

possibility of adding a customized data field in the nutch Page class

Posted by Michael Ji <fj...@yahoo.com>.
hi there,

I am trying to add a new data field to the Page class, a
simple String.

I followed the URL field in the Page class as a template,
but when I run WebDBInjector, it gives me the following error
messages. It seems readFields() is not reading from the
right position.

I wonder if it is feasible to make a change to the Page
class, as I understand the nutch webdb has an advanced
structure and operations. From an OO view, all the Page
fields should be accessed through the Page class interface, but
I just ran into something weird.

thanks,

Michael Ji,

--------------------------------------------- 


Exception in thread "main" java.io.EOFException
	at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:310)
	at org.apache.nutch.io.UTF8.readFields(UTF8.java:101)
	at org.apache.nutch.db.Page.readFields(Page.java:146)
	at org.apache.nutch.io.SequenceFile$Reader.next(SequenceFile.java:278)
	at org.apache.nutch.io.MapFile$Reader.next(MapFile.java:349)
	at org.apache.nutch.db.WebDBWriter$PagesByURLProcessor.mergeEdits(WebDBWriter.java:618)
	at org.apache.nutch.db.WebDBWriter$CloseProcessor.closeDown(WebDBWriter.java:557)
	at org.apache.nutch.db.WebDBWriter.close(WebDBWriter.java:1544)
	at org.apache.nutch.db.WebDBInjector.close(WebDBInjector.java:336)
	at org.apache.nutch.db.WebDBInjector.main(WebDBInjector.java:581)
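The usual cause of an EOFException like this is that write() and readFields() have fallen out of step: with Hadoop/Nutch-style Writables, readFields() must consume exactly the bytes write() produced, field by field, in the same order. Adding a field to one method but not the other, or reading old db records that were written before the field existed, shifts every subsequent read. Here's a toy sketch of the invariant (not the real Page class; field names are made up):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Illustrative sketch of the Writable contract: readFields() must mirror
// write() exactly, or deserialization reads from the wrong offsets.
public class PageSketch {
    public String url = "";
    public String customField = "";    // the new field being added

    public void write(DataOutput out) throws IOException {
        out.writeUTF(url);
        out.writeUTF(customField);     // new field must be written here...
    }

    public void readFields(DataInput in) throws IOException {
        url = in.readUTF();
        customField = in.readUTF();    // ...and read back in the same order
    }

    public static void main(String[] args) throws IOException {
        PageSketch p = new PageSketch();
        p.url = "http://example.com/";
        p.customField = "extra";

        // Round-trip through a byte buffer, the way webdb records move.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        p.write(new DataOutputStream(buf));

        PageSketch q = new PageSketch();
        q.readFields(new DataInputStream(
            new ByteArrayInputStream(buf.toByteArray())));
        System.out.println(q.customField); // prints "extra"
    }
}
```

Note that even with both methods in sync, records already sitting in an existing webdb were written without the new field, so reading them back with the new readFields() would still hit EOF; that may be what WebDBInjector is tripping over at close time.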



		

Re: HTTP 1.1

Posted by Kelvin Tan <ke...@relevanz.com>.
Hey Earl, the Nutch-84 enhancement suggestion in JIRA does just this. There is also support for request pipelining, which, rather unfortunately, isn't a good idea when working with dynamic sites.

Check out a previous post on this: http://marc.theaimsgroup.com/?l=nutch-developers&m=112476980602585&w=2

kelvin

On Fri, 16 Sep 2005 16:49:50 -0700 (PDT), Earl Cahill wrote:
> Maybe you're way ahead of me here, but it just hit me that it
> would be pretty cool to group URLs to fetch by host and then
> perhaps use HTTP 1.1 to reuse the connection and save the initial
> handshaking overhead. Not a huge deal for a couple of hits, but I
> think it would make sense for large crawls.
>
> Or maybe keep a pool of open HTTP connections to the last x sites
> somewhere and check there first.
>
> Sound reasonable?  Already doing it?  I would be willing to help.
>
> Just a thought.
>
> Earl
>
>