Posted to dev@nutch.apache.org by Earl Cahill <ca...@yahoo.com> on 2005/09/09 11:06:27 UTC

bug in bin/nutch?

I'm trying to get the mapred stuff to work, and I find
it hard to believe that this is a bug, but just walking
through the tutorial, I enter

bin/nutch admin db -create

and get

Exception in thread "main"
java.lang.NoClassDefFoundError: admin

Looking through bin/nutch, sure enough there isn't a
chunk for admin.  But there is in trunk.  If I add it
back in as per my patch below, then it seems to work.

But that sure seems like it would be broken for
everyone who walks through the mapred tutorial.

Earl

 ~/nutch/branches/mapred $ svn diff bin/nutch
Index: bin/nutch
===================================================================
--- bin/nutch   (revision 279726)
+++ bin/nutch   (working copy)
@@ -124,6 +124,8 @@
 # figure out which class to run
 if [ "$COMMAND" = "crawl" ] ; then
   CLASS=org.apache.nutch.crawl.Crawl
+elif [ "$COMMAND" = "admin" ] ; then
+  CLASS=org.apache.nutch.tools.WebDBAdminTool
 elif [ "$COMMAND" = "inject" ] ; then
   CLASS=org.apache.nutch.crawl.Injector
 elif [ "$COMMAND" = "generate" ] ; then



crawl-urlfilter.txt vs. regex-urlfilter.txt

Posted by Michael Ji <fj...@yahoo.com>.
Hi,

I found I can use crawl-urlfilter.txt to define a
domain restriction with
"
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
"

But I found that when I don't use bin/nutch crawl...,
crawl-urlfilter.txt doesn't filter out the domains I
don't want.

Can I use regex-urlfilter.txt to define the domain
restriction the way crawl-urlfilter.txt does?
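That is, would the same one-pattern-per-line syntax
work there, something like this (MY.DOMAIN.NAME being a
placeholder as above, plus a final catch-all reject)?

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
# reject everything else
-.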

thanks,

Michael Ji


		

possibility of adding a customized data field to the nutch Page class

Posted by Michael Ji <fj...@yahoo.com>.
hi there,

I am trying to add a new data field to the Page class,
a simple String.

I followed the URL field in the Page class as a
template, but when I run WebDBInjector, it gives me the
error below. It seems readFields() is not reading from
the right position.

I wonder if it is feasible to make a change like this
to the Page class, as I understand the nutch webdb has
a fairly involved structure and operations. From an OO
view, all the Page fields should be accessed through
the Page class interface, but I just ran into something
weird.

thanks,

Michael Ji,

--------------------------------------------- 


Exception in thread "main" java.io.EOFException
	at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:310)
	at org.apache.nutch.io.UTF8.readFields(UTF8.java:101)
	at org.apache.nutch.db.Page.readFields(Page.java:146)
	at org.apache.nutch.io.SequenceFile$Reader.next(SequenceFile.java:278)
	at org.apache.nutch.io.MapFile$Reader.next(MapFile.java:349)
	at org.apache.nutch.db.WebDBWriter$PagesByURLProcessor.mergeEdits(WebDBWriter.java:618)
	at org.apache.nutch.db.WebDBWriter$CloseProcessor.closeDown(WebDBWriter.java:557)
	at org.apache.nutch.db.WebDBWriter.close(WebDBWriter.java:1544)
	at org.apache.nutch.db.WebDBInjector.close(WebDBInjector.java:336)
	at org.apache.nutch.db.WebDBInjector.main(WebDBInjector.java:581)
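
Roughly, the change I made follows this pattern (a
sketch only - my real field name differs, and the real
Page class has more fields than shown; assumes
java.io.DataInput/DataOutput/IOException and
org.apache.nutch.io.UTF8 are imported):

private String myNewField = "";

public void write(DataOutput out) throws IOException {
  url.write(out);                    // existing fields, in their original order
  // ... the other existing fields, unchanged ...
  new UTF8(myNewField).write(out);   // new field appended last
}

public void readFields(DataInput in) throws IOException {
  url.readFields(in);                // must mirror write() field for field
  // ... the other existing fields, unchanged ...
  UTF8 tmp = new UTF8();
  tmp.readFields(in);
  myNewField = tmp.toString();
}

Could the EOFException just mean that records already
written to the webdb predate the new field, so
readFields() runs off the end of the old records?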



		

Re: HTTP 1.1

Posted by Kelvin Tan <ke...@relevanz.com>.
Hey Earl, the Nutch-84 enhancement suggestion in JIRA does just this. There is also support for request pipelining, which, rather unfortunately, isn't a good idea when working with dynamic sites. 

Check out a previous post on this: http://marc.theaimsgroup.com/?l=nutch-developers&m=112476980602585&w=2

kelvin

On Fri, 16 Sep 2005 16:49:50 -0700 (PDT), Earl Cahill wrote:
> Maybe way ahead of me here, but it was just hitting me that it
> would be pretty cool to group urls to fetch by host and then
> perhaps use http 1.1 to reuse the connection and save the initial
> handshaking overhead. Not a huge deal for a couple of hits, but I
> think it would make sense for large crawls.
>
> Or maybe keep a pool of http connections to the last x sites open
> somewhere and check there first.
>
> Sound reasonable?  Already doing it?  I would be willing to help.
>
> Just a thought.
>
> Earl



HTTP 1.1

Posted by Earl Cahill <ca...@yahoo.com>.
Maybe way ahead of me here, but it was just hitting me
that it would be pretty cool to group urls to fetch by
host and then perhaps use http 1.1 to reuse the
connection and save the initial handshaking overhead.
Not a huge deal for a couple of hits, but I think it
would make sense for large crawls.

Or maybe keep a pool of http connections to the last x
sites open somewhere and check there first.
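
Something like this rough sketch is what I have in mind
(plain HttpURLConnection, not Nutch code - it already
does transparent http 1.1 keep-alive, as long as each
response body is read to the end):

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HostGroupedFetch {
  public static void main(String[] args) throws Exception {
    // group urls by host so consecutive requests hit the same server
    Map<String, List<URL>> byHost = new HashMap<String, List<URL>>();
    for (String arg : args) {
      URL u = new URL(arg);
      List<URL> group = byHost.get(u.getHost());
      if (group == null) {
        group = new ArrayList<URL>();
        byHost.put(u.getHost(), group);
      }
      group.add(u);
    }
    // fetch each group back-to-back; the JDK reuses the kept-alive
    // socket for consecutive requests to the same host
    byte[] buf = new byte[4096];
    for (List<URL> group : byHost.values()) {
      for (URL u : group) {
        HttpURLConnection conn = (HttpURLConnection) u.openConnection();
        InputStream in = conn.getInputStream();
        while (in.read(buf) != -1) {
          // drain the body so the connection can be reused
        }
        in.close();
      }
    }
  }
}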

Sound reasonable?  Already doing it?  I would be
willing to help.

Just a thought.

Earl


		

Re: bug in bin/nutch?

Posted by Andrzej Bialecki <ab...@getopt.org>.
Earl Cahill wrote:

> Guess I figured as much.  Can I suggest that someone
> typing 
> 
> bin/nutch admin ...
> 
> in the mapred branch should get pointed to the
> proper command, or at least a message saying that

There is no separate command - for now the DB is created when you run 
Injector or Crawl (which calls Injector as the first step). Other 
commands from the script should work very similarly, even though they 
now use different implementations (a full local sequence is sketched 
below):

* inject - runs Injector to add urls from a plaintext file (one url per 
line, there may be many input files, and they must be placed inside a 
directory). This creates the CrawlDB in the destination directory if it 
didn't exist before, or updates the existing one. Note that the new 
CrawlDB does NOT contain links - they are stored separately in a LinkDB, 
and CrawlDB just stores the equivalents of Page in the former WebDB.

* generate - runs Generate to create new fetchlists to be fetched

* fetch - runs the modified Fetcher to fetch segments

* updatedb - runs CrawlDB.update() to update the CrawlDB with new page 
information, and to add new unfetched pages.

* invertlinks - creates or updates a LinkDB, containing incoming link 
information. Note that it takes as an argument the top level dir, where 
the new segments are contained, and not the dir names of segments...

* index - runs the new modified Indexer to create an index of the 
fetched segments.

The above commands read the mapred configuration, and for now it 
defaults to "local", which means that all Jobs execute within the same 
JVM, and NDFS also defaults to local. The rest of the commands in 
bin/nutch have to do with a distributed setup.
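
For example, a complete local run might look more or less like this 
(directory names are just examples, and the exact arguments may still 
change - check the usage message of each command):

mkdir urls
echo 'http://lucene.apache.org/' > urls/seeds   # one url per line
bin/nutch inject crawldb urls            # creates crawldb if it doesn't exist
bin/nutch generate crawldb segments      # writes a new fetchlist under segments/
s=`ls -d segments/2* | tail -1`          # pick the newest segment
bin/nutch fetch $s
bin/nutch updatedb crawldb $s            # add new/updated page info to crawldb
bin/nutch invertlinks linkdb segments    # note: the top-level segments dir
bin/nutch index indexes crawldb linkdb segments/*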

> admin doesn't exist in the mapred branch, just to save
> some confusion.  There is a dumb patch below that
> would change the usage line.
> 
> I think such differences are all the more reason to
> have a nice mapred tutorial, which I would be more
> than willing to help with.  I thought I was close, but

Yes, I agree. But there are still some command-line tools missing, or 
not yet ported to use mapred. At this point a general tutorial would be 
difficult... unless it were simply "you need to run ./nutch crawl" ...

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: bug in bin/nutch?

Posted by Earl Cahill <ca...@yahoo.com>.
> The DB format in the mapred branch is completely
> different. So, what you 
> create with "admin db -create" is the old DB format,
> not used in the 
> mapred branch.
> 
> Please study the code of the Crawl command; this
> should help... Mapred 
> stuff is powerful, but it is also very different
> from the current way of 
> doing things, so there will be a lot to learn...

Guess I figured as much.  Can I suggest that someone
typing 

bin/nutch admin ...

in the mapred branch should get pointed to the
proper command, or at least a message saying that
admin doesn't exist in the mapred branch, just to save
some confusion.  There is a dumb patch below that
would change the usage line.

I think such differences are all the more reason to
have a nice mapred tutorial, which I would be more
than willing to help with.  I thought I was close, but
I have yet to get a mapred crawl/index/search
completed.  Your comment makes me think I am still
a ways off.

Thanks,
Earl

Index: bin/nutch
===================================================================
--- bin/nutch   (revision 279734)
+++ bin/nutch   (working copy)
@@ -29,7 +29,7 @@
   echo "Usage: nutch COMMAND"
   echo "where COMMAND is one of:"
   echo "  crawl             one-step crawler for
intranets"
-  echo "  admin             database administration,
including creation"
+  echo "  admin             not used in mapred"
   echo "  inject            inject new urls into the
database"
   echo "  generate          generate new segments to
fetch"
   echo "  fetch             fetch a segment's pages"



Re: bug in bin/nutch?

Posted by Andrzej Bialecki <ab...@getopt.org>.
Earl Cahill wrote:
> I'm trying to get the mapred stuff to work, and I find it hard
> to believe that this is a bug, but just walking through the
> tutorial, I enter
> 
> bin/nutch admin db -create

The DB format in the mapred branch is completely different. So, what you 
create with "admin db -create" is the old DB format, not used in the 
mapred branch.

Please study the code of the Crawl command; this should help... Mapred 
stuff is powerful, but it is also very different from the current way of 
doing things, so there will be a lot to learn...


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com