Posted to user@nutch.apache.org by Hasan Diwan <ha...@gmail.com> on 2006/03/06 02:31:23 UTC

NullPointerException

I've followed the Nutch crawling tutorial and started Tomcat from
the crawl directory. When I run the crawl, it ends with:
060305 171044 crawl finished: crawl
which looks as it should. I then start Tomcat, access Nutch via
index.jsp in the root context, and whatever searches I do yield 0
results. I know for a fact that my blog, which is my test dataset,
contains at least one instance of my name, which is what I'm searching
for. Thanks a bunch for the help!
--
Cheers,
Hasan Diwan <ha...@gmail.com>

Re: NullPointerException

Posted by Howie Wang <ho...@hotmail.com>.
Hi, Hasan,

Looking more carefully at the query-more plugin, it seems that it
only adds functionality for date queries and type queries. I think
you need to add query-basic to the list also to get it to search
the default content. Can you try adding query-basic and running:

bin/nutch search http

Howie
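
For reference, a sketch of what the plugin.includes value would look like with
query-basic added to the set Hasan posted below (an illustration of the
suggestion above, not a setting confirmed in the thread):

 <property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-more|query-(basic|more|site|url)</value>
 </property>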

>On 06/03/06, Howie Wang <ho...@hotmail.com> wrote:
> > Is query-basic or query-more included in your nutch-default.xml?
>
>It is indeed included in my nutch-site.xml :-
>
>  <property>
>   <name>plugin.includes</name>
>   
><value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-more|query-(more|site|url)</value>
>  </property>
>Thanks for the help!
>--
>Cheers,
>Hasan Diwan <ha...@gmail.com>



Re: NullPointerException

Posted by Hasan Diwan <ha...@gmail.com>.
On 06/03/06, Howie Wang <ho...@hotmail.com> wrote:
> Is query-basic or query-more included in your nutch-default.xml?

It is indeed included in my nutch-site.xml :-

 <property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-more|query-(more|site|url)</value>
 </property>
Thanks for the help!
--
Cheers,
Hasan Diwan <ha...@gmail.com>

Re: NullPointerException

Posted by Hasan Diwan <ha...@gmail.com>.
Right then... I compiled the svn version of Nutch, tried running the
crawl with it, and this is the log:
server: 11:32pm % ./bin/nutch crawl ../SpectraSearch/urls -dir
../SpectraSearch/crawl -depth 2 -threads 20
060305 233255 parsing
jar:file:/home/hdiwan/nutch/lib/hadoop-0.1-dev.jar!/hadoop-default.xml
060305 233255 parsing file:/home/hdiwan/nutch/conf/nutch-default.xml
060305 233255 parsing file:/home/hdiwan/nutch/conf/crawl-tool.xml
060305 233255 parsing
jar:file:/home/hdiwan/nutch/lib/hadoop-0.1-dev.jar!/mapred-default.xml
060305 233255 parsing file:/home/hdiwan/nutch/conf/nutch-site.xml
060305 233255 parsing file:/home/hdiwan/nutch/conf/hadoop-site.xml
060305 233256 crawl started in: ../SpectraSearch/crawl
060305 233256 rootUrlDir = ../SpectraSearch/urls
060305 233256 threads = 20
060305 233256 depth = 2
060305 233256 Injector: starting
060305 233256 Injector: crawlDb: ../SpectraSearch/crawl/crawldb
060305 233256 Injector: urlDir: ../SpectraSearch/urls
060305 233256 Injector: Converting injected urls to crawl db entries.
060305 233256 parsing
jar:file:/home/hdiwan/nutch/lib/hadoop-0.1-dev.jar!/hadoop-default.xml
060305 233256 parsing file:/home/hdiwan/nutch/conf/nutch-default.xml
060305 233256 parsing file:/home/hdiwan/nutch/conf/crawl-tool.xml
060305 233256 parsing
jar:file:/home/hdiwan/nutch/lib/hadoop-0.1-dev.jar!/mapred-default.xml
060305 233256 parsing
jar:file:/home/hdiwan/nutch/lib/hadoop-0.1-dev.jar!/mapred-default.xml
060305 233256 parsing file:/home/hdiwan/nutch/conf/nutch-site.xml
060305 233256 parsing file:/home/hdiwan/nutch/conf/hadoop-site.xml
060305 233256 parsing
jar:file:/home/hdiwan/nutch/lib/hadoop-0.1-dev.jar!/hadoop-default.xml
060305 233256 parsing file:/home/hdiwan/nutch/conf/nutch-default.xml
060305 233256 parsing file:/home/hdiwan/nutch/conf/crawl-tool.xml
060305 233256 parsing
jar:file:/home/hdiwan/nutch/lib/hadoop-0.1-dev.jar!/mapred-default.xml
060305 233256 parsing
jar:file:/home/hdiwan/nutch/lib/hadoop-0.1-dev.jar!/mapred-default.xml
060305 233256 parsing
jar:file:/home/hdiwan/nutch/lib/hadoop-0.1-dev.jar!/mapred-default.xml
060305 233256 parsing file:/home/hdiwan/nutch/conf/nutch-site.xml
060305 233256 parsing file:/home/hdiwan/nutch/conf/hadoop-site.xml
060305 233256 Running job: job_7n6bsm
060305 233256 parsing
jar:file:/home/hdiwan/nutch/lib/hadoop-0.1-dev.jar!/hadoop-default.xml
060305 233256 parsing
jar:file:/home/hdiwan/nutch/lib/hadoop-0.1-dev.jar!/mapred-default.xml
060305 233256 parsing /tmp/hadoop/mapred/local/localRunner/job_7n6bsm.xml
060305 233256 parsing file:/home/hdiwan/nutch/conf/hadoop-site.xml
java.io.IOException: No input directories specified in: Configuration:
defaults: hadoop-default.xml , mapred-default.xml ,
/tmp/hadoop/mapred/local/localRunner/job_7n6bsm.xmlfinal:
hadoop-site.xml
        at org.apache.hadoop.mapred.InputFormatBase.listFiles(InputFormatBase.java:84)
        at org.apache.hadoop.mapred.InputFormatBase.getSplits(InputFormatBase.java:94)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:70)
060305 233257  map 0%  reduce 0%
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:310)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:114)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:104)
I need to sleep now, so I'll check back tomorrow. Thanks for all the help!
--
Cheers,
Hasan Diwan <ha...@gmail.com>
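
For reference: the IOException above ("No input directories specified") usually
means the Injector found nothing to read at the seed path. One likely cause,
though it is an assumption and not confirmed in this thread, is that
../SpectraSearch/urls is still a flat file as in the 0.7 tutorial, whereas the
0.8 Injector expects a directory containing one or more plain-text files of seed
URLs. A minimal sketch of that layout (the file name seeds.txt and the blog root
URL are illustrative):

 mkdir ../SpectraSearch/urls
 echo 'http://hasan.wits2020.net/~hdiwan/blog/' > ../SpectraSearch/urls/seeds.txt
 ./bin/nutch crawl ../SpectraSearch/urls -dir ../SpectraSearch/crawl -depth 2 -threads 20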

Re: NullPointerException

Posted by Jack Tang <hi...@gmail.com>.
On 3/6/06, Hasan Diwan <ha...@gmail.com> wrote:
> On 05/03/06, Jack Tang <hi...@gmail.com> wrote:
> > You can still build it on local file system:)
>
> Build, yes, but what of deployment? Can I use it in the same way?
Of course yes.

> At
> present, I don't have enough resources to run a distributed crawl.
> --
> Cheers,
> Hasan Diwan <ha...@gmail.com>
>


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
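
For reference, a minimal sketch of running the 0.8/svn build on the local
filesystem, as Jack describes. These are the early Hadoop property names, and
"local" is assumed here to be the default for both, so an empty
conf/hadoop-site.xml normally has the same effect for a single-machine crawl:

 <configuration>
  <property>
   <name>fs.default.name</name>
   <value>local</value>
  </property>
  <property>
   <name>mapred.job.tracker</name>
   <value>local</value>
  </property>
 </configuration>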

Re: NullPointerException

Posted by Hasan Diwan <ha...@gmail.com>.
On 05/03/06, Jack Tang <hi...@gmail.com> wrote:
> You can still build it on local file system:)

Build, yes, but what of deployment? Can I use it in the same way? At
present, I don't have enough resources to run a distributed crawl.
--
Cheers,
Hasan Diwan <ha...@gmail.com>

Re: NullPointerException

Posted by Jack Tang <hi...@gmail.com>.
On 3/6/06, Hasan Diwan <ha...@gmail.com> wrote:
> On 05/03/06, Jack Tang <hi...@gmail.com> wrote:
> > I am not sure what's wrong with nutch-0.7.1 indexing, but is it
> > possible to upgrade to nutch 0.8 (svn version)?
>
> It is possible, but I was under the assumption that 0.8 required NDFS?
You can still build it on local file system:)
Good luck!

/Jack
> --
> Cheers,
> Hasan Diwan <ha...@gmail.com>
>


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: NullPointerException

Posted by Hasan Diwan <ha...@gmail.com>.
On 05/03/06, Jack Tang <hi...@gmail.com> wrote:
> I am not sure what's wrong with nutch-0.7.1 indexing, but is it
> possible to upgrade to nutch 0.8 (svn version)?

It is possible, but I was under the assumption that 0.8 required NDFS?
--
Cheers,
Hasan Diwan <ha...@gmail.com>

Re: NullPointerException

Posted by Jack Tang <hi...@gmail.com>.
Hasan

It seems your index is not complete.
If you have a whole (correct) index, the index dir should include
1. a segments file
2. a deletable file
3. other files
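
As a rough check, listing the merged index directory from the crawl logs should
show something like the following, assuming the pre-2.x Lucene layout that Nutch
0.7.1 writes (segment file names such as _0.cfs vary):

 ls /home/hdiwan/SpectraSearch/crawl/index
 deletable  segments  _0.cfs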

I am not sure what's wrong with nutch-0.7.1 indexing, but is it
possible to upgrade to nutch 0.8 (svn version)?

/Jack

On 3/6/06, Jack Tang <hi...@gmail.com> wrote:
> On 3/6/06, Hasan Diwan <ha...@gmail.com> wrote:
> > On 05/03/06, Jack Tang <hi...@gmail.com> wrote:
> > >
> > > ok. As stepan said, can you get any hit when you try to search "http" or "www"?
> >
> > No
> Hey, can you zip the index and send it to me directly?
>
> > --
> > Cheers,
> > Hasan Diwan <ha...@gmail.com>
> >
>
>
> --
> Keep Discovering ... ...
> http://www.jroller.com/page/jmars
>


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: NullPointerException

Posted by Jack Tang <hi...@gmail.com>.
On 3/6/06, Hasan Diwan <ha...@gmail.com> wrote:
> On 05/03/06, Jack Tang <hi...@gmail.com> wrote:
> >
> > ok. As stepan said, can you get any hit when you try to search "http" or "www"?
>
> No
Hey, can you zip the index and send it to me directly?

> --
> Cheers,
> Hasan Diwan <ha...@gmail.com>
>


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: NullPointerException

Posted by Hasan Diwan <ha...@gmail.com>.
On 05/03/06, Jack Tang <hi...@gmail.com> wrote:
>
> ok. As stepan said, can you get any hit when you try to search "http" or "www"?

No

--
Cheers,
Hasan Diwan <ha...@gmail.com>

Re: NullPointerException

Posted by Jack Tang <hi...@gmail.com>.
On 3/6/06, Hasan Diwan <ha...@gmail.com> wrote:
> Mr Tang:
> On 05/03/06, Jack Tang <hi...@gmail.com> wrote:
> > Weird! You are running nutch on local file system or distributed file system?
> Local file system
>
> > And can you find the same query "hasan" via luke?
> Nope

ok. As stepan said, can you get any hit when you try to search "http" or "www"?

> --
> Cheers,
> Hasan Diwan <ha...@gmail.com>
>


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: NullPointerException

Posted by Hasan Diwan <ha...@gmail.com>.
Mr Tang:
On 05/03/06, Jack Tang <hi...@gmail.com> wrote:
> Weird! You are running nutch on local file system or distributed file system?
Local file system

> And can you find the same query "hasan" via luke?
Nope

--
Cheers,
Hasan Diwan <ha...@gmail.com>

Re: NullPointerException

Posted by Jack Tang <hi...@gmail.com>.
> Total hits: 0
Weird! You are running nutch on local file system or distributed file system?
And can you find the same query "hasan" via luke?


PS: you can install luke from http://www.getopt.org/luke/
/Jack

On 3/6/06, Hasan Diwan <ha...@gmail.com> wrote:
> Mr Tang:
> > Crawling seems ok. Can you pls try org.apache.nutch.searcher.NutchBean
> > [your-query-string] in shell/cmd?
>
> server: 7:20pm % ./bin/nutch org.apache.nutch.searcher.NutchBean hasan
> 060305 192042 10 parsing file:/home/hdiwan/nutch-0.7.1/conf/nutch-default.xml
> 060305 192042 10 parsing file:/home/hdiwan/nutch-0.7.1/conf/nutch-site.xml
> 060305 192042 10 opening merged index in /home/hdiwan/SpectraSearch/crawl/index
> 060305 192042 10 Plugins: looking in: /home/hdiwan/nutch-0.7.1/build/plugins
> 060305 192042 10 parsing:
> /home/hdiwan/nutch-0.7.1/build/plugins/nutch-extensionpoints/plugin.xml
> 060305 192042 10 not including:
> /home/hdiwan/nutch-0.7.1/build/plugins/protocol-file
> 060305 192042 10 not including:
> /home/hdiwan/nutch-0.7.1/build/plugins/protocol-ftp
> 060305 192042 10 not including:
> /home/hdiwan/nutch-0.7.1/build/plugins/protocol-http
> 060305 192042 10 parsing:
> /home/hdiwan/nutch-0.7.1/build/plugins/protocol-httpclient/plugin.xml
> 060305 192042 10 impl: point=org.apache.nutch.protocol.Protocol
> class=org.apache.nutch.protocol.httpclient.Http
> 060305 192042 10 impl: point=org.apache.nutch.protocol.Protocol
> class=org.apache.nutch.protocol.httpclient.Http
> 060305 192042 10 parsing:
> /home/hdiwan/nutch-0.7.1/build/plugins/parse-html/plugin.xml
> 060305 192042 10 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutc
> che.nutch.searcher.more.TypeQueryFilter
> 060305 192043 10 impl: point=org.apache.nutch.searcher.QueryFilter
> class=org.apache.nutch.searcher.more.DateQueryFilter
> 060305 192043 10 parsing:
> /home/hdiwan/nutch-0.7.1/build/plugins/query-site/plugin.xml
> 060305 192043 10 impl: point=org.apache.nutch.searcher.QueryFilter
> class=org.apache.nutch.searcher.site.SiteQueryFilter
> 060305 192043 10 parsing:
> /home/hdiwan/nutch-0.7.1/build/plugins/query-url/plugin.xml
> 060305 192043 10 impl: point=org.apache.nutch.searcher.QueryFilter
> class=org.apache.nutch.searcher.url.URLQueryFilter
> 060305 192043 10 not including:
> /home/hdiwan/nutch-0.7.1/build/plugins/urlfilter-regex
> 060305 192043 10 not including:
> /home/hdiwan/nutch-0.7.1/build/plugins/urlfilter-prefix
> 060305 192043 10 not including:
> /home/hdiwan/nutch-0.7.1/build/plugins/creativecommons
> 060305 192043 10 not including:
> /home/hdiwan/nutch-0.7.1/build/plugins/language-identifier
> 060305 192043 10 not including:
> /home/hdiwan/nutch-0.7.1/build/plugins/clustering-carrot2
> 060305 192043 10 not including: /home/hdiwan/nutch-0.7.1/build/plugins/ontology

> --
> Cheers,
> Hasan Diwan <ha...@gmail.com>
>


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: NullPointerException

Posted by Howie Wang <ho...@hotmail.com>.
I didn't see query-basic or query-more in your list of included plugins.
Those are what usually handle most queries. query-url will only handle
parts of the query that look like url:http://www.google.com, and
query-site handles site:www.google.com. Nothing seems to be handling
just regular text in the content.
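
To make that division of labour concrete, here are example query forms and the
plugin that would handle each under the plugin list Hasan posted (illustrative
queries; the host name is taken from the crawl logs, and the plain-term case is
the one nothing currently handles):

 hasan                            -> plain content term, needs query-basic
 site:hasan.wits2020.net hasan    -> site: clause, handled by query-site
 url:http://hasan.wits2020.net/   -> url: clause, handled by query-url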

Is query-basic or query-more included in your nutch-default.xml?

I'm not sure why you don't see anything in Luke though.

Howie

>From: "Hasan Diwan" <ha...@gmail.com>

>Mr Tang:
> > Crawling seems ok. Can you pls try org.apache.nutch.searcher.NutchBean
> > [your-query-string] in shell/cmd?
>
>server: 7:20pm % ./bin/nutch org.apache.nutch.searcher.NutchBean hasan
>060305 192042 10 parsing 
>file:/home/hdiwan/nutch-0.7.1/conf/nutch-default.xml
>060305 192042 10 parsing file:/home/hdiwan/nutch-0.7.1/conf/nutch-site.xml
>060305 192042 10 opening merged index in 
>/home/hdiwan/SpectraSearch/crawl/index
>060305 192042 10 Plugins: looking in: 
>/home/hdiwan/nutch-0.7.1/build/plugins
>060305 192042 10 parsing:
>/home/hdiwan/nutch-0.7.1/build/plugins/nutch-extensionpoints/plugin.xml
>060305 192042 10 not including:
>/home/hdiwan/nutch-0.7.1/build/plugins/protocol-file
>060305 192042 10 not including:
>/home/hdiwan/nutch-0.7.1/build/plugins/protocol-ftp
>060305 192042 10 not including:
>/home/hdiwan/nutch-0.7.1/build/plugins/protocol-http
>060305 192042 10 parsing:
>/home/hdiwan/nutch-0.7.1/build/plugins/protocol-httpclient/plugin.xml
>060305 192042 10 impl: point=org.apache.nutch.protocol.Protocol
>class=org.apache.nutch.protocol.httpclient.Http
>060305 192042 10 impl: point=org.apache.nutch.protocol.Protocol
>class=org.apache.nutch.protocol.httpclient.Http
>060305 192042 10 parsing:
>/home/hdiwan/nutch-0.7.1/build/plugins/parse-html/plugin.xml
>060305 192042 10 impl: point=org.apache.nutch.parse.Parser 
>class=org.apache.nutc
>che.nutch.searcher.more.TypeQueryFilter
>060305 192043 10 impl: point=org.apache.nutch.searcher.QueryFilter
>class=org.apache.nutch.searcher.more.DateQueryFilter
>060305 192043 10 parsing:
>/home/hdiwan/nutch-0.7.1/build/plugins/query-site/plugin.xml
>060305 192043 10 impl: point=org.apache.nutch.searcher.QueryFilter
>class=org.apache.nutch.searcher.site.SiteQueryFilter
>060305 192043 10 parsing:
>/home/hdiwan/nutch-0.7.1/build/plugins/query-url/plugin.xml
>060305 192043 10 impl: point=org.apache.nutch.searcher.QueryFilter
>class=org.apache.nutch.searcher.url.URLQueryFilter
>060305 192043 10 not including:
>/home/hdiwan/nutch-0.7.1/build/plugins/urlfilter-regex
>060305 192043 10 not including:
>/home/hdiwan/nutch-0.7.1/build/plugins/urlfilter-prefix
>060305 192043 10 not including:
>/home/hdiwan/nutch-0.7.1/build/plugins/creativecommons
>060305 192043 10 not including:
>/home/hdiwan/nutch-0.7.1/build/plugins/language-identifier
>060305 192043 10 not including:
>/home/hdiwan/nutch-0.7.1/build/plugins/clustering-carrot2
>060305 192043 10 not including: 
>/home/hdiwan/nutch-0.7.1/build/plugins/ontology
>Total hits: 0
>--
>Cheers,
>Hasan Diwan <ha...@gmail.com>



Re: NullPointerException

Posted by Hasan Diwan <ha...@gmail.com>.
Mr Tang:
> Crawling seems ok. Can you pls try org.apache.nutch.searcher.NutchBean
> [your-query-string] in shell/cmd?

server: 7:20pm % ./bin/nutch org.apache.nutch.searcher.NutchBean hasan
060305 192042 10 parsing file:/home/hdiwan/nutch-0.7.1/conf/nutch-default.xml
060305 192042 10 parsing file:/home/hdiwan/nutch-0.7.1/conf/nutch-site.xml
060305 192042 10 opening merged index in /home/hdiwan/SpectraSearch/crawl/index
060305 192042 10 Plugins: looking in: /home/hdiwan/nutch-0.7.1/build/plugins
060305 192042 10 parsing:
/home/hdiwan/nutch-0.7.1/build/plugins/nutch-extensionpoints/plugin.xml
060305 192042 10 not including:
/home/hdiwan/nutch-0.7.1/build/plugins/protocol-file
060305 192042 10 not including:
/home/hdiwan/nutch-0.7.1/build/plugins/protocol-ftp
060305 192042 10 not including:
/home/hdiwan/nutch-0.7.1/build/plugins/protocol-http
060305 192042 10 parsing:
/home/hdiwan/nutch-0.7.1/build/plugins/protocol-httpclient/plugin.xml
060305 192042 10 impl: point=org.apache.nutch.protocol.Protocol
class=org.apache.nutch.protocol.httpclient.Http
060305 192042 10 impl: point=org.apache.nutch.protocol.Protocol
class=org.apache.nutch.protocol.httpclient.Http
060305 192042 10 parsing:
/home/hdiwan/nutch-0.7.1/build/plugins/parse-html/plugin.xml
060305 192042 10 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutc
che.nutch.searcher.more.TypeQueryFilter
060305 192043 10 impl: point=org.apache.nutch.searcher.QueryFilter
class=org.apache.nutch.searcher.more.DateQueryFilter
060305 192043 10 parsing:
/home/hdiwan/nutch-0.7.1/build/plugins/query-site/plugin.xml
060305 192043 10 impl: point=org.apache.nutch.searcher.QueryFilter
class=org.apache.nutch.searcher.site.SiteQueryFilter
060305 192043 10 parsing:
/home/hdiwan/nutch-0.7.1/build/plugins/query-url/plugin.xml
060305 192043 10 impl: point=org.apache.nutch.searcher.QueryFilter
class=org.apache.nutch.searcher.url.URLQueryFilter
060305 192043 10 not including:
/home/hdiwan/nutch-0.7.1/build/plugins/urlfilter-regex
060305 192043 10 not including:
/home/hdiwan/nutch-0.7.1/build/plugins/urlfilter-prefix
060305 192043 10 not including:
/home/hdiwan/nutch-0.7.1/build/plugins/creativecommons
060305 192043 10 not including:
/home/hdiwan/nutch-0.7.1/build/plugins/language-identifier
060305 192043 10 not including:
/home/hdiwan/nutch-0.7.1/build/plugins/clustering-carrot2
060305 192043 10 not including: /home/hdiwan/nutch-0.7.1/build/plugins/ontology
Total hits: 0
--
Cheers,
Hasan Diwan <ha...@gmail.com>

RE: NullPointerException

Posted by Richard Braman <rb...@bramantax.com>.
It did fetch some urls such as:
> 060305 182201 fetching
> http://hasan.wits2020.net/~hdiwan/blog/2006/03/04/vacation.html

I don't know why it is going wrong, but make sure that you have the right
searcher.dir.

In your webapp's TOMCAT/webapps/ROOT/conf, open nutch-site.xml and make sure
searcher.dir points to your crawl directory. Maybe it is set to a directory
of a previous crawl or something like that.
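
A sketch of the property Richard means, using the crawl directory that appears
in the logs (the exact value depends on where your crawl actually lives):

 <property>
  <name>searcher.dir</name>
  <value>/home/hdiwan/SpectraSearch/crawl</value>
 </property>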


-----Original Message-----
From: Jack Tang [mailto:himars@gmail.com] 
Sent: Sunday, March 05, 2006 9:35 PM
To: nutch-user@lucene.apache.org
Subject: Re: NullPointerException


Hey Hasan

Crawling seems ok. Can you pls try org.apache.nutch.searcher.NutchBean
[your-query-string] in shell/cmd?

I guess it works..

/Jack

You fetched one single website, I think

On 3/6/06, Hasan Diwan <ha...@gmail.com> wrote:
> Gentlemen:
> On 05/03/06, Richard Braman <rb...@bramantax.com> wrote:
> > This sounds like your crawl didn't get anything.  I have seen that
> > happen when the url wasn't added right, or the filter was bad.  Pipe

> > the crawl to crawl.log and look in there.  It should show some pages

> > being fetched.  If none are being fetched, something is definitely 
> > wrong with your filter or url file.
> 060305 182159 parsing
> file:/home/hdiwan/nutch-0.7.1/conf/nutch-default.xml
> 060305 182200 parsing
file:/home/hdiwan/nutch-0.7.1/conf/crawl-tool.xml
> 060305 182200 parsing
file:/home/hdiwan/nutch-0.7.1/conf/nutch-site.xml
> 060305 182200 No FS indicated, using default:local
> 060305 182200 crawl started in: crawl
> 060305 182200 rootUrlFile = urls
> 060305 182200 threads = 15
> 060305 182200 depth = 2
> 060305 182200 Created webdb at
LocalFS,/home/hdiwan/SpectraSearch/crawl/db
> 060305 182200 Starting URL processing
> 060305 182200 Plugins: looking in:
/home/hdiwan/nutch-0.7.1/build/plugins
> 060305 182200 parsing:
>
/home/hdiwan/nutch-0.7.1/build/plugins/nutch-extensionpoints/plugin.xml
> 060305 182200 not including:
> /home/hdiwan/nutch-0.7.1/build/plugins/protocol-file
> 060305 182200 not including:
/home/hdiwan/nutch-0.7.1/build/plugins/protocol-ftp
> 060305 182200 not including:
> /home/hdiwan/nutch-0.7.1/build/plugins/protocol-http
> 060305 182200 parsing:
> /home/hdiwan/nutch-0.7.1/build/plugins/protocol-httpclient/plugin.xml
> 060305 182200 impl: point=org.apache.nutch.protocol.Protocol
> class=org.apache.nutch.protocol.httpclient.Http
> 060305 182200 impl: point=org.apache.nutch.protocol.Protocol
> class=org.apache.nutch.protocol.httpclient.Http
> 060305 182200 parsing:
> /home/hdiwan/nutch-0.7.1/build/plugins/parse-html/plugin.xml
> 060305 182200 impl: point=org.apache.nutch.parse.Parser
> class=org.apache.nutch.parse.html.HtmlParser
> 060305 182200 not including:
/home/hdiwan/nutch-0.7.1/build/plugins/parse-js
> 060305 182200 parsing:
> /home/hdiwan/nutch-0.7.1/build/plugins/parse-text/plugin.xml
> 060305 182200 impl: point=org.apache.nutch.parse.Parser
> class=org.apache.nutch.parse.text.TextParser
> 060305 182200 not including:
/home/hdiwan/nutch-0.7.1/build/plugins/parse-pdf
> 060305 182200 not including:
/home/hdiwan/nutch-0.7.1/build/plugins/parse-rss
> 060305 182200 not including:
/home/hdiwan/nutch-0.7.1/build/plugins/parse-msword
> 060305 182200 not including:
/home/hdiwan/nutch-0.7.1/build/plugins/parse-ext
> 060305 182200 not including:
/home/hdiwan/nutch-0.7.1/build/plugins/index-basic
> 060305 182200 parsing:
> /home/hdiwan/nutch-0.7.1/build/plugins/index-more/plugin.xml
> 060305 182200 impl: point=org.apache.nutch.indexer.IndexingFilter
> class=org.apache.nutch.indexer.more.MoreIndexingFilter
> 060305 182200 parsing:
> /home/hdiwan/nutch-0.7.1/build/plugins/query-basic/plugin.xml
> 060305 182200 impl: point=org.apache.nutch.searcher.QueryFilter
> class=org.apache.nutch.searcher.basic.BasicQueryFilter
> 060305 182200 parsing:
> /home/hdiwan/nutch-0.7.1/build/plugins/query-more/plugin.xml
> 060305 182200 impl: point=org.apache.nutch.searcher.QueryFilter
> class=org.apache.nutch.searcher.more.TypeQueryFilter
> 060305 182200 impl: point=org.apache.nutch.searcher.QueryFilter
> class=org.apache.nutch.searcher.more.DateQueryFilter
> 060305 182200 parsing:
> /home/hdiwan/nutch-0.7.1/build/plugins/query-site/plugin.xml
> 060305 182200 impl: point=org.apache.nutch.searcher.QueryFilter
> class=org.apache.nutch.searcher.site.SiteQueryFilter
> 060305 182200 parsing:
> /home/hdiwan/nutch-0.7.1/build/plugins/query-url/plugin.xml
> 060305 182200 impl: point=org.apache.nutch.searcher.QueryFilter
> class=org.apache.nutch.searcher.url.URLQueryFilter
> 060305 182200 not including:
> /home/hdiwan/nutch-0.7.1/build/plugins/urlfilter-regex
> 060305 182200 not including:
> /home/hdiwan/nutch-0.7.1/build/plugins/urlfilter-prefix
> 060305 182200 not including:
> /home/hdiwan/nutch-0.7.1/build/plugins/creativecommons
> 060305 182200 not including:
> /home/hdiwan/nutch-0.7.1/build/plugins/language-identifier
> 060305 182200 not including:
> /home/hdiwan/nutch-0.7.1/build/plugins/clustering-carrot2
> 060305 182200 not including:
/home/hdiwan/nutch-0.7.1/build/plugins/ontology
> 060305 182200 Using URL normalizer:
org.apache.nutch.net.BasicUrlNormalizer
> 060305 182200 Added 15 pages
> 060305 182200 Processing pagesByURL: Sorted 15 instructions in 0.0070
seconds.
> 060305 182200 Processing pagesByURL: Sorted 2142.8571428571427
> instructions/second
> 060305 182200 Processing pagesByURL: Merged to new DB containing 15
> records in 0.0040 seconds
> 060305 182200 Processing pagesByURL: Merged 3750.0 records/second
> 060305 182200 Processing pagesByMD5: Sorted 15 instructions in 0.0040
seconds.
> 060305 182200 Processing pagesByMD5: Sorted 3750.0 instructions/second
> 060305 182200 Processing pagesByMD5: Merged to new DB containing 15
> records in 0.0020 seconds
> 060305 182200 Processing pagesByMD5: Merged 7500.0 records/second
> 060305 182200 Processing linksByMD5: Copied file (4096 bytes) in
0.0050 secs.
> 060305 182200 Processing linksByURL: Copied file (4096 bytes) in
0.0030 secs.
> 060305 182200 FetchListTool started
> 060305 182201 Processing pagesByURL: Sorted 15 instructions in 0.0030
seconds.
> 060305 182201 Processing pagesByURL: Sorted 5000.0 instructions/second
> 060305 182201 Processing pagesByURL: Merged to new DB containing 15
> records in 0.0040 seconds
> 060305 182201 Processing pagesByURL: Merged 3750.0 records/second
> 060305 182201 Processing pagesByMD5: Sorted 15 instructions in 0.0040
seconds.
> 060305 182201 Processing pagesByMD5: Sorted 3750.0 instructions/second
> 060305 182201 Processing pagesByMD5: Merged to new DB containing 15
> records in 0.0030 seconds
> 060305 182201 Processing pagesByMD5: Merged 5000.0 records/second
> 060305 182201 Processing linksByMD5: Copied file (4096 bytes) in
0.0010 secs.
> 060305 182201 Processing linksByURL: Copied file (4096 bytes) in
0.0020 secs.
> 060305 182201 Processing
>
/home/hdiwan/SpectraSearch/crawl/segments/20060305182200/fetchlist.unsor
ted:
> Sorted 15 entries in 0.0030 seconds.
> 060305 182201 Processing
>
/home/hdiwan/SpectraSearch/crawl/segments/20060305182200/fetchlist.unsor
ted:
> Sorted 5000.0 entries/second
> 060305 182201 Overall processing: Sorted 15 entries in 0.0030 seconds.
> 060305 182201 Overall processing: Sorted 2.0E-4 entries/second
> 060305 182201 FetchListTool completed
> 060305 182201 logging at INFO
> 060305 182201 fetching
> http://hasan.wits2020.net/~hdiwan/blog/2006/03/04/vacation.html
> 060305 182201 fetching
>
http://hasan.wits2020.net/~hdiwan/blog/2006/03/03/stand_up_speak_up.html
> 060305 182201 fetching
>
http://hasan.wits2020.net/~hdiwan/blog/2006/02/27/murder_in_samarkand.ht
ml
> 060305 182201 fetching
>
http://hasan.wits2020.net/~hdiwan/blog/2006/02/26/automating_photographi
c_workfl.html
> 060305 182201 fetching
>
http://hasan.wits2020.net/~hdiwan/blog/2006/03/03/punching_at_the_sun_ma
rch_17.html
> 060305 182201 fetching
>
http://hasan.wits2020.net/~hdiwan/blog/2006/02/27/directtv_videoondemand
.html
> 060305 182201 fetching
>
http://hasan.wits2020.net/~hdiwan/blog/2006/03/05/to_the_dear_neighbour_
of_mine.html
> 060305 182201 fetching
>
http://hasan.wits2020.net/~hdiwan/blog/2006/03/03/bsc_has_no_properties_
really.html
> 060305 182201 fetching
>
http://hasan.wits2020.net/~hdiwan/blog/2006/03/03/creative_commons_salon
_march_8.html
> 060305 182201 http.proxy.host = null
> 060305 182201 http.proxy.port = 8118
> 060305 182201 http.timeout = 10000
> 060305 182201 http.content.limit = -1
> 060305 182201 http.agent = Spectra/200602 (Spectra;
> http://hasan.wits2020.net/typo/public; spectrasearch+agent@gmail.com)
> 060305 182201 http.auth.ntlm.username =
> 060305 182201 fetcher.server.delay = 1000
> 060305 182201 http.max.delays = 100
> 060305 182201 fetching
> http://hasan.wits2020.net/~hdiwan/blog/2006/03/02/gmail_whinging.html
> 060305 182201 Configured Client
> 060305 182201 fetching
> http://hasan.wits2020.net/~hdiwan/blog/2006/02/27/nobody_likes_me.html
> 060305 182202 fetching
>
http://hasan.wits2020.net/~hdiwan/blog/2006/03/01/virtual_hosts_suck_pt_
2.html
> 060305 182202 fetching
>
http://hasan.wits2020.net/~hdiwan/blog/2006/03/01/i_hate_hosting_provide
rs.html
> 060305 182202 fetching
>
http://hasan.wits2020.net/~hdiwan/blog/2006/03/03/spectras_challenge.htm
l
> 060305 182202 fetching
> http://hasan.wits2020.net/~hdiwan/blog/2006/02/28/opml.html
> 060305 182203 Updating /home/hdiwan/SpectraSearch/crawl/db
> 060305 182203 Updating for
> /home/hdiwan/SpectraSearch/crawl/segments/20060305182200
> 060305 182203 Processing document 0
> 060305 182203 Finishing update
> 060305 182203 Processing pagesByURL: Sorted 15 instructions in 0.0030
seconds.
> 060305 182203 Processing pagesByURL: Sorted 5000.0 instructions/second
> 060305 182203 Processing pagesByURL: Merged to new DB containing 15
> records in 0.0040 seconds
> 060305 182203 Processing pagesByURL: Merged 3750.0 records/second
> 060305 182203 Processing pagesByMD5: Sorted 15 instructions in 0.0070
seconds.
> 060305 182203 Processing pagesByMD5: Sorted 2142.8571428571427
> instructions/second
> 060305 182203 Processing pagesByMD5: Merged to new DB containing 15
> records in 0.0070 seconds
> 060305 182203 Processing pagesByMD5: Merged 2142.8571428571427
records/second
> 060305 182203 Processing linksByMD5: Copied file (4096 bytes) in
0.0020 secs.
> 060305 182203 Processing linksByURL: Copied file (4096 bytes) in
0.0020 secs.
> 060305 182203 Update finished
> 060305 182203 FetchListTool started
> 060305 182203 Overall processing: Sorted 0 entries in 0.0 seconds.
> 060305 182203 Overall processing: Sorted NaN entries/second
> 060305 182203 FetchListTool completed
> 060305 182203 logging at INFO
> 060305 182204 Updating /home/hdiwan/SpectraSearch/crawl/db
> 060305 182204 Updating for
> /home/hdiwan/SpectraSearch/crawl/segments/20060305182203
> 060305 182204 Finishing update
> 060305 182204 Update finished
> 060305 182204 Updating /home/hdiwan/SpectraSearch/crawl/segments from
> /home/hdiwan/SpectraSearch/crawl/db
> 060305 182204  reading
/home/hdiwan/SpectraSearch/crawl/segments/20060305182200
> 060305 182204  reading
/home/hdiwan/SpectraSearch/crawl/segments/20060305182203
> 060305 182204 Sorting pages by url...
> 060305 182204 Getting updated scores and anchors from db...
> 060305 182204 Sorting updates by segment...
> 060305 182204 Updating segments...
> 060305 182204  updating
/home/hdiwan/SpectraSearch/crawl/segments/20060305182200
> 060305 182204 Done updating /home/hdiwan/SpectraSearch/crawl/segments
> from /home/hdiwan/SpectraSearch/crawl/db
> 060305 182204 indexing segment:
> /home/hdiwan/SpectraSearch/crawl/segments/20060305182200
> 060305 182205 * Opening segment 20060305182200
> 060305 182205 * Indexing segment 20060305182200
> 060305 182205 * Optimizing index...
> 060305 182205 * Moving index to NFS if needed...
> 060305 182205 DONE indexing segment 20060305182200: total 15 records
> in 0.031 s (Infinity rec/s).
> 060305 182205 done indexing
> 060305 182205 indexing segment:
> /home/hdiwan/SpectraSearch/crawl/segments/20060305182203
> 060305 182205 * Opening segment 20060305182203
> 060305 182205 * Indexing segment 20060305182203
> 060305 182205 * Optimizing index...
> 060305 182205 * Moving index to NFS if needed...
> 060305 182205 DONE indexing segment 20060305182203: total 0 records in
> 0.075 s (NaN rec/s).
> 060305 182205 done indexing
> 060305 182205 Reading url hashes...
> 060305 182205 Sorting url hashes...
> 060305 182205 Deleting url duplicates...
> 060305 182205 Deleted 0 url duplicates.
> 060305 182205 Reading content hashes...
> 060305 182205 Sorting content hashes...
> 060305 182205 Deleting content duplicates...
> 060305 182205 Deleted 0 content duplicates.
> 060305 182205 Duplicate deletion complete locally.  Now returning to
NFS...
> 060305 182205 DeleteDuplicates complete
> 060305 182205 Merging segment indexes...
> 060305 182205 crawl finished: crawl
>
> That's the entire log. Hope it helps! My crawl-urlfilter.txt: # The
> url filter file used by the crawl command.
>
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
>
> # Each non-comment, non-blank line contains a regular expression #
> prefixed by '+' or '-'.  The first matching pattern in the file # 
> determines whether a URL is included or ignored.  If no pattern # 
> matches, the URL is ignored.
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz
> |mov|MOV|exe|png|PNG)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> # accept hosts in any domain
> +^http://([a-z0-9]*\.)*/
>
> # skip everything else
> -.
> So, why isn't it fetching anything, if that is indeed the case?
> --
> Cheers,
> Hasan Diwan <ha...@gmail.com>
>


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars


RE: NullPointerException

Posted by Richard Braman <rb...@bramantax.com>.
It did fetch some urls:



-----Original Message-----
From: Jack Tang [mailto:himars@gmail.com] 
Sent: Sunday, March 05, 2006 9:35 PM
To: nutch-user@lucene.apache.org
Subject: Re: NullPointerException


Hey Hasan

Crawling seems ok. Can you pls try org.apache.nutch.searcher.NutchBean
[your-query-string] in shell/cmd?

I guess it works..

/Jack

You fetched one single website, I think

On 3/6/06, Hasan Diwan <ha...@gmail.com> wrote:
> Gentlemen:
> On 05/03/06, Richard Braman <rb...@bramantax.com> wrote:
> > This sounds like your crawl didn't get anything.  I have seen that 
> > happen when the url wasn't added right, or the filter was bad.  Pipe

> > the crawl to crawl.log and look in there.  It should show some pages

> > being fetched.  If none are being fetched, something is definitely 
> > wrong with your filter or url file.
> 060305 182159 parsing 
> file:/home/hdiwan/nutch-0.7.1/conf/nutch-default.xml
> 060305 182200 parsing
file:/home/hdiwan/nutch-0.7.1/conf/crawl-tool.xml
> 060305 182200 parsing
file:/home/hdiwan/nutch-0.7.1/conf/nutch-site.xml
> 060305 182200 No FS indicated, using default:local
> 060305 182200 crawl started in: crawl
> 060305 182200 rootUrlFile = urls
> 060305 182200 threads = 15
> 060305 182200 depth = 2
> 060305 182200 Created webdb at
LocalFS,/home/hdiwan/SpectraSearch/crawl/db
> 060305 182200 Starting URL processing
> 060305 182200 Plugins: looking in:
/home/hdiwan/nutch-0.7.1/build/plugins
> 060305 182200 parsing:
>
/home/hdiwan/nutch-0.7.1/build/plugins/nutch-extensionpoints/plugin.xml
> 060305 182200 not including:
> /home/hdiwan/nutch-0.7.1/build/plugins/protocol-file
> 060305 182200 not including:
/home/hdiwan/nutch-0.7.1/build/plugins/protocol-ftp
> 060305 182200 not including:
> /home/hdiwan/nutch-0.7.1/build/plugins/protocol-http
> 060305 182200 parsing:
> /home/hdiwan/nutch-0.7.1/build/plugins/protocol-httpclient/plugin.xml
> 060305 182200 impl: point=org.apache.nutch.protocol.Protocol
> class=org.apache.nutch.protocol.httpclient.Http
> 060305 182200 impl: point=org.apache.nutch.protocol.Protocol
> class=org.apache.nutch.protocol.httpclient.Http
> 060305 182200 parsing:
> /home/hdiwan/nutch-0.7.1/build/plugins/parse-html/plugin.xml
> 060305 182200 impl: point=org.apache.nutch.parse.Parser
> class=org.apache.nutch.parse.html.HtmlParser
> 060305 182200 not including:
/home/hdiwan/nutch-0.7.1/build/plugins/parse-js
> 060305 182200 parsing:
> /home/hdiwan/nutch-0.7.1/build/plugins/parse-text/plugin.xml
> 060305 182200 impl: point=org.apache.nutch.parse.Parser
> class=org.apache.nutch.parse.text.TextParser
> 060305 182200 not including:
/home/hdiwan/nutch-0.7.1/build/plugins/parse-pdf
> 060305 182200 not including:
/home/hdiwan/nutch-0.7.1/build/plugins/parse-rss
> 060305 182200 not including:
/home/hdiwan/nutch-0.7.1/build/plugins/parse-msword
> 060305 182200 not including:
/home/hdiwan/nutch-0.7.1/build/plugins/parse-ext
> 060305 182200 not including:
/home/hdiwan/nutch-0.7.1/build/plugins/index-basic
> 060305 182200 parsing:
> /home/hdiwan/nutch-0.7.1/build/plugins/index-more/plugin.xml
> 060305 182200 impl: point=org.apache.nutch.indexer.IndexingFilter
> class=org.apache.nutch.indexer.more.MoreIndexingFilter
> 060305 182200 parsing:
> /home/hdiwan/nutch-0.7.1/build/plugins/query-basic/plugin.xml
> 060305 182200 impl: point=org.apache.nutch.searcher.QueryFilter
> class=org.apache.nutch.searcher.basic.BasicQueryFilter
> 060305 182200 parsing:
> /home/hdiwan/nutch-0.7.1/build/plugins/query-more/plugin.xml
> 060305 182200 impl: point=org.apache.nutch.searcher.QueryFilter
> class=org.apache.nutch.searcher.more.TypeQueryFilter
> 060305 182200 impl: point=org.apache.nutch.searcher.QueryFilter
> class=org.apache.nutch.searcher.more.DateQueryFilter
> 060305 182200 parsing:
> /home/hdiwan/nutch-0.7.1/build/plugins/query-site/plugin.xml
> 060305 182200 impl: point=org.apache.nutch.searcher.QueryFilter
> class=org.apache.nutch.searcher.site.SiteQueryFilter
> 060305 182200 parsing:
> /home/hdiwan/nutch-0.7.1/build/plugins/query-url/plugin.xml
> 060305 182200 impl: point=org.apache.nutch.searcher.QueryFilter
> class=org.apache.nutch.searcher.url.URLQueryFilter
> 060305 182200 not including:
> /home/hdiwan/nutch-0.7.1/build/plugins/urlfilter-regex
> 060305 182200 not including:
> /home/hdiwan/nutch-0.7.1/build/plugins/urlfilter-prefix
> 060305 182200 not including:
> /home/hdiwan/nutch-0.7.1/build/plugins/creativecommons
> 060305 182200 not including:
> /home/hdiwan/nutch-0.7.1/build/plugins/language-identifier
> 060305 182200 not including:
> /home/hdiwan/nutch-0.7.1/build/plugins/clustering-carrot2
> 060305 182200 not including:
/home/hdiwan/nutch-0.7.1/build/plugins/ontology
> 060305 182200 Using URL normalizer:
org.apache.nutch.net.BasicUrlNormalizer
> 060305 182200 Added 15 pages
> 060305 182200 Processing pagesByURL: Sorted 15 instructions in 0.0070
seconds.
> 060305 182200 Processing pagesByURL: Sorted 2142.8571428571427
> instructions/second
> 060305 182200 Processing pagesByURL: Merged to new DB containing 15
> records in 0.0040 seconds
> 060305 182200 Processing pagesByURL: Merged 3750.0 records/second
> 060305 182200 Processing pagesByMD5: Sorted 15 instructions in 0.0040
seconds.
> 060305 182200 Processing pagesByMD5: Sorted 3750.0 instructions/second
> 060305 182200 Processing pagesByMD5: Merged to new DB containing 15
> records in 0.0020 seconds
> 060305 182200 Processing pagesByMD5: Merged 7500.0 records/second
> 060305 182200 Processing linksByMD5: Copied file (4096 bytes) in
0.0050 secs.
> 060305 182200 Processing linksByURL: Copied file (4096 bytes) in
0.0030 secs.
> 060305 182200 FetchListTool started
> 060305 182201 Processing pagesByURL: Sorted 15 instructions in 0.0030
seconds.
> 060305 182201 Processing pagesByURL: Sorted 5000.0 instructions/second
> 060305 182201 Processing pagesByURL: Merged to new DB containing 15
> records in 0.0040 seconds
> 060305 182201 Processing pagesByURL: Merged 3750.0 records/second
> 060305 182201 Processing pagesByMD5: Sorted 15 instructions in 0.0040
seconds.
> 060305 182201 Processing pagesByMD5: Sorted 3750.0 instructions/second
> 060305 182201 Processing pagesByMD5: Merged to new DB containing 15
> records in 0.0030 seconds
> 060305 182201 Processing pagesByMD5: Merged 5000.0 records/second
> 060305 182201 Processing linksByMD5: Copied file (4096 bytes) in
0.0010 secs.
> 060305 182201 Processing linksByURL: Copied file (4096 bytes) in
0.0020 secs.
> 060305 182201 Processing
>
/home/hdiwan/SpectraSearch/crawl/segments/20060305182200/fetchlist.unsor
ted:
> Sorted 15 entries in 0.0030 seconds.
> 060305 182201 Processing
>
/home/hdiwan/SpectraSearch/crawl/segments/20060305182200/fetchlist.unsor
ted:
> Sorted 5000.0 entries/second
> 060305 182201 Overall processing: Sorted 15 entries in 0.0030 seconds.
> 060305 182201 Overall processing: Sorted 2.0E-4 entries/second
> 060305 182201 FetchListTool completed
> 060305 182201 logging at INFO
> 060305 182201 fetching
> http://hasan.wits2020.net/~hdiwan/blog/2006/03/04/vacation.html
> 060305 182201 fetching
>
http://hasan.wits2020.net/~hdiwan/blog/2006/03/03/stand_up_speak_up.html
> 060305 182201 fetching
>
http://hasan.wits2020.net/~hdiwan/blog/2006/02/27/murder_in_samarkand.ht
ml
> 060305 182201 fetching
>
http://hasan.wits2020.net/~hdiwan/blog/2006/02/26/automating_photographi
c_workfl.html
> 060305 182201 fetching
>
http://hasan.wits2020.net/~hdiwan/blog/2006/03/03/punching_at_the_sun_ma
rch_17.html
> 060305 182201 fetching
>
http://hasan.wits2020.net/~hdiwan/blog/2006/02/27/directtv_videoondemand
.html
> 060305 182201 fetching
>
http://hasan.wits2020.net/~hdiwan/blog/2006/03/05/to_the_dear_neighbour_
of_mine.html
> 060305 182201 fetching
>
http://hasan.wits2020.net/~hdiwan/blog/2006/03/03/bsc_has_no_properties_
really.html
> 060305 182201 fetching
>
http://hasan.wits2020.net/~hdiwan/blog/2006/03/03/creative_commons_salon
_march_8.html
> 060305 182201 http.proxy.host = null
> 060305 182201 http.proxy.port = 8118
> 060305 182201 http.timeout = 10000
> 060305 182201 http.content.limit = -1
> 060305 182201 http.agent = Spectra/200602 (Spectra;
> http://hasan.wits2020.net/typo/public; spectrasearch+agent@gmail.com)
> 060305 182201 http.auth.ntlm.username =
> 060305 182201 fetcher.server.delay = 1000
> 060305 182201 http.max.delays = 100
> 060305 182201 fetching
> http://hasan.wits2020.net/~hdiwan/blog/2006/03/02/gmail_whinging.html
> 060305 182201 Configured Client
> 060305 182201 fetching
> http://hasan.wits2020.net/~hdiwan/blog/2006/02/27/nobody_likes_me.html
> 060305 182202 fetching
>
http://hasan.wits2020.net/~hdiwan/blog/2006/03/01/virtual_hosts_suck_pt_
2.html
> 060305 182202 fetching
>
http://hasan.wits2020.net/~hdiwan/blog/2006/03/01/i_hate_hosting_provide
rs.html
> 060305 182202 fetching
>
http://hasan.wits2020.net/~hdiwan/blog/2006/03/03/spectras_challenge.htm
l
> 060305 182202 fetching
> http://hasan.wits2020.net/~hdiwan/blog/2006/02/28/opml.html
> 060305 182203 Updating /home/hdiwan/SpectraSearch/crawl/db
> 060305 182203 Updating for
> /home/hdiwan/SpectraSearch/crawl/segments/20060305182200
> 060305 182203 Processing document 0
> 060305 182203 Finishing update
> 060305 182203 Processing pagesByURL: Sorted 15 instructions in 0.0030
seconds.
> 060305 182203 Processing pagesByURL: Sorted 5000.0 instructions/second
> 060305 182203 Processing pagesByURL: Merged to new DB containing 15
> records in 0.0040 seconds
> 060305 182203 Processing pagesByURL: Merged 3750.0 records/second
> 060305 182203 Processing pagesByMD5: Sorted 15 instructions in 0.0070
seconds.
> 060305 182203 Processing pagesByMD5: Sorted 2142.8571428571427
> instructions/second
> 060305 182203 Processing pagesByMD5: Merged to new DB containing 15
> records in 0.0070 seconds
> 060305 182203 Processing pagesByMD5: Merged 2142.8571428571427
records/second
> 060305 182203 Processing linksByMD5: Copied file (4096 bytes) in
0.0020 secs.
> 060305 182203 Processing linksByURL: Copied file (4096 bytes) in
0.0020 secs.
> 060305 182203 Update finished
> 060305 182203 FetchListTool started
> 060305 182203 Overall processing: Sorted 0 entries in 0.0 seconds.
> 060305 182203 Overall processing: Sorted NaN entries/second
> 060305 182203 FetchListTool completed
> 060305 182203 logging at INFO
> 060305 182204 Updating /home/hdiwan/SpectraSearch/crawl/db
> 060305 182204 Updating for
> /home/hdiwan/SpectraSearch/crawl/segments/20060305182203
> 060305 182204 Finishing update
> 060305 182204 Update finished
> 060305 182204 Updating /home/hdiwan/SpectraSearch/crawl/segments from
> /home/hdiwan/SpectraSearch/crawl/db
> 060305 182204  reading
/home/hdiwan/SpectraSearch/crawl/segments/20060305182200
> 060305 182204  reading
/home/hdiwan/SpectraSearch/crawl/segments/20060305182203
> 060305 182204 Sorting pages by url...
> 060305 182204 Getting updated scores and anchors from db...
> 060305 182204 Sorting updates by segment...
> 060305 182204 Updating segments...
> 060305 182204  updating
/home/hdiwan/SpectraSearch/crawl/segments/20060305182200
> 060305 182204 Done updating /home/hdiwan/SpectraSearch/crawl/segments
> from /home/hdiwan/SpectraSearch/crawl/db
> 060305 182204 indexing segment:
> /home/hdiwan/SpectraSearch/crawl/segments/20060305182200
> 060305 182205 * Opening segment 20060305182200
> 060305 182205 * Indexing segment 20060305182200
> 060305 182205 * Optimizing index...
> 060305 182205 * Moving index to NFS if needed...
> 060305 182205 DONE indexing segment 20060305182200: total 15 records
> in 0.031 s (Infinity rec/s).
> 060305 182205 done indexing
> 060305 182205 indexing segment:
> /home/hdiwan/SpectraSearch/crawl/segments/20060305182203
> 060305 182205 * Opening segment 20060305182203
> 060305 182205 * Indexing segment 20060305182203
> 060305 182205 * Optimizing index...
> 060305 182205 * Moving index to NFS if needed...
> 060305 182205 DONE indexing segment 20060305182203: total 0 records in
> 0.075 s (NaN rec/s).
> 060305 182205 done indexing
> 060305 182205 Reading url hashes...
> 060305 182205 Sorting url hashes...
> 060305 182205 Deleting url duplicates...
> 060305 182205 Deleted 0 url duplicates.
> 060305 182205 Reading content hashes...
> 060305 182205 Sorting content hashes...
> 060305 182205 Deleting content duplicates...
> 060305 182205 Deleted 0 content duplicates.
> 060305 182205 Duplicate deletion complete locally.  Now returning to
NFS...
> 060305 182205 DeleteDuplicates complete
> 060305 182205 Merging segment indexes...
> 060305 182205 crawl finished: crawl
>
> That's the entire log. Hope it helps! My crawl-urlfilter.txt: # The 
> url filter file used by the crawl command.
>
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
>
> # Each non-comment, non-blank line contains a regular expression # 
> prefixed by '+' or '-'.  The first matching pattern in the file # 
> determines whether a URL is included or ignored.  If no pattern # 
> matches, the URL is ignored.
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse 
> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz
> |mov|MOV|exe|png|PNG)$
>
> # skip URLs containing certain characters as probable queries, etc. 
> -[?*!@=]
>
> # accept hosts in any domain
> +^http://([a-z0-9]*\.)*/
>
> # skip everything else
> -.
> So, why isn't it fetching anything, if that is indeed the case?
> --
> Cheers,
> Hasan Diwan <ha...@gmail.com>
>


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars


Re: NullPointerException

Posted by Jack Tang <hi...@gmail.com>.
Hey Hasan

Crawling seems ok. Can you pls try org.apache.nutch.searcher.NutchBean
[your-query-string] in shell/cmd?

I guess it works..

/Jack

You fetched one single website, I think

On 3/6/06, Hasan Diwan <ha...@gmail.com> wrote:
> Gentlemen:
> On 05/03/06, Richard Braman <rb...@bramantax.com> wrote:
> > This sounds like your crawl didn't get anything.  I have seen that
> > happen when the url wasn't added right, or the filter was bad.  Pipe the
> > crawl to crawl.log and look in there.  It should show some pages being
> > fetched.  If none are being fetched, something is definitely wrong with
> > your filter or url file.
> 060305 182159 parsing file:/home/hdiwan/nutch-0.7.1/conf/nutch-default.xml
> 060305 182200 parsing file:/home/hdiwan/nutch-0.7.1/conf/crawl-tool.xml
> 060305 182200 parsing file:/home/hdiwan/nutch-0.7.1/conf/nutch-site.xml
> 060305 182200 No FS indicated, using default:local
> 060305 182200 crawl started in: crawl
> 060305 182200 rootUrlFile = urls
> 060305 182200 threads = 15
> 060305 182200 depth = 2
> 060305 182200 Created webdb at LocalFS,/home/hdiwan/SpectraSearch/crawl/db
> 060305 182200 Starting URL processing
> 060305 182200 Plugins: looking in: /home/hdiwan/nutch-0.7.1/build/plugins
> 060305 182200 parsing:
> /home/hdiwan/nutch-0.7.1/build/plugins/nutch-extensionpoints/plugin.xml
> 060305 182200 not including:
> /home/hdiwan/nutch-0.7.1/build/plugins/protocol-file
> 060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/protocol-ftp
> 060305 182200 not including:
> /home/hdiwan/nutch-0.7.1/build/plugins/protocol-http
> 060305 182200 parsing:
> /home/hdiwan/nutch-0.7.1/build/plugins/protocol-httpclient/plugin.xml
> 060305 182200 impl: point=org.apache.nutch.protocol.Protocol
> class=org.apache.nutch.protocol.httpclient.Http
> 060305 182200 impl: point=org.apache.nutch.protocol.Protocol
> class=org.apache.nutch.protocol.httpclient.Http
> 060305 182200 parsing:
> /home/hdiwan/nutch-0.7.1/build/plugins/parse-html/plugin.xml
> 060305 182200 impl: point=org.apache.nutch.parse.Parser
> class=org.apache.nutch.parse.html.HtmlParser
> 060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-js
> 060305 182200 parsing:
> /home/hdiwan/nutch-0.7.1/build/plugins/parse-text/plugin.xml
> 060305 182200 impl: point=org.apache.nutch.parse.Parser
> class=org.apache.nutch.parse.text.TextParser
> 060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-pdf
> 060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-rss
> 060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-msword
> 060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-ext
> 060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/index-basic
> 060305 182200 parsing:
> /home/hdiwan/nutch-0.7.1/build/plugins/index-more/plugin.xml
> 060305 182200 impl: point=org.apache.nutch.indexer.IndexingFilter
> class=org.apache.nutch.indexer.more.MoreIndexingFilter
> 060305 182200 parsing:
> /home/hdiwan/nutch-0.7.1/build/plugins/query-basic/plugin.xml
> 060305 182200 impl: point=org.apache.nutch.searcher.QueryFilter
> class=org.apache.nutch.searcher.basic.BasicQueryFilter
> 060305 182200 parsing:
> /home/hdiwan/nutch-0.7.1/build/plugins/query-more/plugin.xml
> 060305 182200 impl: point=org.apache.nutch.searcher.QueryFilter
> class=org.apache.nutch.searcher.more.TypeQueryFilter
> 060305 182200 impl: point=org.apache.nutch.searcher.QueryFilter
> class=org.apache.nutch.searcher.more.DateQueryFilter
> 060305 182200 parsing:
> /home/hdiwan/nutch-0.7.1/build/plugins/query-site/plugin.xml
> 060305 182200 impl: point=org.apache.nutch.searcher.QueryFilter
> class=org.apache.nutch.searcher.site.SiteQueryFilter
> 060305 182200 parsing:
> /home/hdiwan/nutch-0.7.1/build/plugins/query-url/plugin.xml
> 060305 182200 impl: point=org.apache.nutch.searcher.QueryFilter
> class=org.apache.nutch.searcher.url.URLQueryFilter
> 060305 182200 not including:
> /home/hdiwan/nutch-0.7.1/build/plugins/urlfilter-regex
> 060305 182200 not including:
> /home/hdiwan/nutch-0.7.1/build/plugins/urlfilter-prefix
> 060305 182200 not including:
> /home/hdiwan/nutch-0.7.1/build/plugins/creativecommons
> 060305 182200 not including:
> /home/hdiwan/nutch-0.7.1/build/plugins/language-identifier
> 060305 182200 not including:
> /home/hdiwan/nutch-0.7.1/build/plugins/clustering-carrot2
> 060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/ontology
> 060305 182200 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
> 060305 182200 Added 15 pages
> 060305 182200 Processing pagesByURL: Sorted 15 instructions in 0.0070 seconds.
> 060305 182200 Processing pagesByURL: Sorted 2142.8571428571427
> instructions/second
> 060305 182200 Processing pagesByURL: Merged to new DB containing 15
> records in 0.0040 seconds
> 060305 182200 Processing pagesByURL: Merged 3750.0 records/second
> 060305 182200 Processing pagesByMD5: Sorted 15 instructions in 0.0040 seconds.
> 060305 182200 Processing pagesByMD5: Sorted 3750.0 instructions/second
> 060305 182200 Processing pagesByMD5: Merged to new DB containing 15
> records in 0.0020 seconds
> 060305 182200 Processing pagesByMD5: Merged 7500.0 records/second
> 060305 182200 Processing linksByMD5: Copied file (4096 bytes) in 0.0050 secs.
> 060305 182200 Processing linksByURL: Copied file (4096 bytes) in 0.0030 secs.
> 060305 182200 FetchListTool started
> 060305 182201 Processing pagesByURL: Sorted 15 instructions in 0.0030 seconds.
> 060305 182201 Processing pagesByURL: Sorted 5000.0 instructions/second
> 060305 182201 Processing pagesByURL: Merged to new DB containing 15
> records in 0.0040 seconds
> 060305 182201 Processing pagesByURL: Merged 3750.0 records/second
> 060305 182201 Processing pagesByMD5: Sorted 15 instructions in 0.0040 seconds.
> 060305 182201 Processing pagesByMD5: Sorted 3750.0 instructions/second
> 060305 182201 Processing pagesByMD5: Merged to new DB containing 15
> records in 0.0030 seconds
> 060305 182201 Processing pagesByMD5: Merged 5000.0 records/second
> 060305 182201 Processing linksByMD5: Copied file (4096 bytes) in 0.0010 secs.
> 060305 182201 Processing linksByURL: Copied file (4096 bytes) in 0.0020 secs.
> 060305 182201 Processing
> /home/hdiwan/SpectraSearch/crawl/segments/20060305182200/fetchlist.unsorted:
> Sorted 15 entries in 0.0030 seconds.
> 060305 182201 Processing
> /home/hdiwan/SpectraSearch/crawl/segments/20060305182200/fetchlist.unsorted:
> Sorted 5000.0 entries/second
> 060305 182201 Overall processing: Sorted 15 entries in 0.0030 seconds.
> 060305 182201 Overall processing: Sorted 2.0E-4 entries/second
> 060305 182201 FetchListTool completed
> 060305 182201 logging at INFO
> 060305 182201 fetching
> http://hasan.wits2020.net/~hdiwan/blog/2006/03/04/vacation.html
> 060305 182201 fetching
> http://hasan.wits2020.net/~hdiwan/blog/2006/03/03/stand_up_speak_up.html
> 060305 182201 fetching
> http://hasan.wits2020.net/~hdiwan/blog/2006/02/27/murder_in_samarkand.html
> 060305 182201 fetching
> http://hasan.wits2020.net/~hdiwan/blog/2006/02/26/automating_photographic_workfl.html
> 060305 182201 fetching
> http://hasan.wits2020.net/~hdiwan/blog/2006/03/03/punching_at_the_sun_march_17.html
> 060305 182201 fetching
> http://hasan.wits2020.net/~hdiwan/blog/2006/02/27/directtv_videoondemand.html
> 060305 182201 fetching
> http://hasan.wits2020.net/~hdiwan/blog/2006/03/05/to_the_dear_neighbour_of_mine.html
> 060305 182201 fetching
> http://hasan.wits2020.net/~hdiwan/blog/2006/03/03/bsc_has_no_properties_really.html
> 060305 182201 fetching
> http://hasan.wits2020.net/~hdiwan/blog/2006/03/03/creative_commons_salon_march_8.html
> 060305 182201 http.proxy.host = null
> 060305 182201 http.proxy.port = 8118
> 060305 182201 http.timeout = 10000
> 060305 182201 http.content.limit = -1
> 060305 182201 http.agent = Spectra/200602 (Spectra;
> http://hasan.wits2020.net/typo/public; spectrasearch+agent@gmail.com)
> 060305 182201 http.auth.ntlm.username =
> 060305 182201 fetcher.server.delay = 1000
> 060305 182201 http.max.delays = 100
> 060305 182201 fetching
> http://hasan.wits2020.net/~hdiwan/blog/2006/03/02/gmail_whinging.html
> 060305 182201 Configured Client
> 060305 182201 fetching
> http://hasan.wits2020.net/~hdiwan/blog/2006/02/27/nobody_likes_me.html
> 060305 182202 fetching
> http://hasan.wits2020.net/~hdiwan/blog/2006/03/01/virtual_hosts_suck_pt_2.html
> 060305 182202 fetching
> http://hasan.wits2020.net/~hdiwan/blog/2006/03/01/i_hate_hosting_providers.html
> 060305 182202 fetching
> http://hasan.wits2020.net/~hdiwan/blog/2006/03/03/spectras_challenge.html
> 060305 182202 fetching
> http://hasan.wits2020.net/~hdiwan/blog/2006/02/28/opml.html
> 060305 182203 Updating /home/hdiwan/SpectraSearch/crawl/db
> 060305 182203 Updating for
> /home/hdiwan/SpectraSearch/crawl/segments/20060305182200
> 060305 182203 Processing document 0
> 060305 182203 Finishing update
> 060305 182203 Processing pagesByURL: Sorted 15 instructions in 0.0030 seconds.
> 060305 182203 Processing pagesByURL: Sorted 5000.0 instructions/second
> 060305 182203 Processing pagesByURL: Merged to new DB containing 15
> records in 0.0040 seconds
> 060305 182203 Processing pagesByURL: Merged 3750.0 records/second
> 060305 182203 Processing pagesByMD5: Sorted 15 instructions in 0.0070 seconds.
> 060305 182203 Processing pagesByMD5: Sorted 2142.8571428571427
> instructions/second
> 060305 182203 Processing pagesByMD5: Merged to new DB containing 15
> records in 0.0070 seconds
> 060305 182203 Processing pagesByMD5: Merged 2142.8571428571427 records/second
> 060305 182203 Processing linksByMD5: Copied file (4096 bytes) in 0.0020 secs.
> 060305 182203 Processing linksByURL: Copied file (4096 bytes) in 0.0020 secs.
> 060305 182203 Update finished
> 060305 182203 FetchListTool started
> 060305 182203 Overall processing: Sorted 0 entries in 0.0 seconds.
> 060305 182203 Overall processing: Sorted NaN entries/second
> 060305 182203 FetchListTool completed
> 060305 182203 logging at INFO
> 060305 182204 Updating /home/hdiwan/SpectraSearch/crawl/db
> 060305 182204 Updating for
> /home/hdiwan/SpectraSearch/crawl/segments/20060305182203
> 060305 182204 Finishing update
> 060305 182204 Update finished
> 060305 182204 Updating /home/hdiwan/SpectraSearch/crawl/segments from
> /home/hdiwan/SpectraSearch/crawl/db
> 060305 182204  reading /home/hdiwan/SpectraSearch/crawl/segments/20060305182200
> 060305 182204  reading /home/hdiwan/SpectraSearch/crawl/segments/20060305182203
> 060305 182204 Sorting pages by url...
> 060305 182204 Getting updated scores and anchors from db...
> 060305 182204 Sorting updates by segment...
> 060305 182204 Updating segments...
> 060305 182204  updating /home/hdiwan/SpectraSearch/crawl/segments/20060305182200
> 060305 182204 Done updating /home/hdiwan/SpectraSearch/crawl/segments
> from /home/hdiwan/SpectraSearch/crawl/db
> 060305 182204 indexing segment:
> /home/hdiwan/SpectraSearch/crawl/segments/20060305182200
> 060305 182205 * Opening segment 20060305182200
> 060305 182205 * Indexing segment 20060305182200
> 060305 182205 * Optimizing index...
> 060305 182205 * Moving index to NFS if needed...
> 060305 182205 DONE indexing segment 20060305182200: total 15 records
> in 0.031 s (Infinity rec/s).
> 060305 182205 done indexing
> 060305 182205 indexing segment:
> /home/hdiwan/SpectraSearch/crawl/segments/20060305182203
> 060305 182205 * Opening segment 20060305182203
> 060305 182205 * Indexing segment 20060305182203
> 060305 182205 * Optimizing index...
> 060305 182205 * Moving index to NFS if needed...
> 060305 182205 DONE indexing segment 20060305182203: total 0 records in
> 0.075 s (NaN rec/s).
> 060305 182205 done indexing
> 060305 182205 Reading url hashes...
> 060305 182205 Sorting url hashes...
> 060305 182205 Deleting url duplicates...
> 060305 182205 Deleted 0 url duplicates.
> 060305 182205 Reading content hashes...
> 060305 182205 Sorting content hashes...
> 060305 182205 Deleting content duplicates...
> 060305 182205 Deleted 0 content duplicates.
> 060305 182205 Duplicate deletion complete locally.  Now returning to NFS...
> 060305 182205 DeleteDuplicates complete
> 060305 182205 Merging segment indexes...
> 060305 182205 crawl finished: crawl
>
> That's the entire log. Hope it helps! My crawl-urlfilter.txt:
> # The url filter file used by the crawl command.
>
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'.  The first matching pattern in the file
> # determines whether a URL is included or ignored.  If no pattern
> # matches, the URL is ignored.
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> # accept hosts in any domain
> +^http://([a-z0-9]*\.)*/
>
> # skip everything else
> -.
> So, why isn't it fetching anything, if that is indeed the case?
> --
> Cheers,
> Hasan Diwan <ha...@gmail.com>
>


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: NullPointerException

Posted by Hasan Diwan <ha...@gmail.com>.
Gentlemen:
On 05/03/06, Richard Braman <rb...@bramantax.com> wrote:
> This sounds like your crawl didn't get anything.  I have seen that
> happen when the url wasn't added right, or the filter was bad.  Pipe the
> crawl to crawl.log and look in there.  It should show some pages being
> fetched.  If none are being fetched, something is definitely wrong with
> your filter or url file.
060305 182159 parsing file:/home/hdiwan/nutch-0.7.1/conf/nutch-default.xml
060305 182200 parsing file:/home/hdiwan/nutch-0.7.1/conf/crawl-tool.xml
060305 182200 parsing file:/home/hdiwan/nutch-0.7.1/conf/nutch-site.xml
060305 182200 No FS indicated, using default:local
060305 182200 crawl started in: crawl
060305 182200 rootUrlFile = urls
060305 182200 threads = 15
060305 182200 depth = 2
060305 182200 Created webdb at LocalFS,/home/hdiwan/SpectraSearch/crawl/db
060305 182200 Starting URL processing
060305 182200 Plugins: looking in: /home/hdiwan/nutch-0.7.1/build/plugins
060305 182200 parsing:
/home/hdiwan/nutch-0.7.1/build/plugins/nutch-extensionpoints/plugin.xml
060305 182200 not including:
/home/hdiwan/nutch-0.7.1/build/plugins/protocol-file
060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/protocol-ftp
060305 182200 not including:
/home/hdiwan/nutch-0.7.1/build/plugins/protocol-http
060305 182200 parsing:
/home/hdiwan/nutch-0.7.1/build/plugins/protocol-httpclient/plugin.xml
060305 182200 impl: point=org.apache.nutch.protocol.Protocol
class=org.apache.nutch.protocol.httpclient.Http
060305 182200 impl: point=org.apache.nutch.protocol.Protocol
class=org.apache.nutch.protocol.httpclient.Http
060305 182200 parsing:
/home/hdiwan/nutch-0.7.1/build/plugins/parse-html/plugin.xml
060305 182200 impl: point=org.apache.nutch.parse.Parser
class=org.apache.nutch.parse.html.HtmlParser
060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-js
060305 182200 parsing:
/home/hdiwan/nutch-0.7.1/build/plugins/parse-text/plugin.xml
060305 182200 impl: point=org.apache.nutch.parse.Parser
class=org.apache.nutch.parse.text.TextParser
060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-pdf
060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-rss
060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-msword
060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/parse-ext
060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/index-basic
060305 182200 parsing:
/home/hdiwan/nutch-0.7.1/build/plugins/index-more/plugin.xml
060305 182200 impl: point=org.apache.nutch.indexer.IndexingFilter
class=org.apache.nutch.indexer.more.MoreIndexingFilter
060305 182200 parsing:
/home/hdiwan/nutch-0.7.1/build/plugins/query-basic/plugin.xml
060305 182200 impl: point=org.apache.nutch.searcher.QueryFilter
class=org.apache.nutch.searcher.basic.BasicQueryFilter
060305 182200 parsing:
/home/hdiwan/nutch-0.7.1/build/plugins/query-more/plugin.xml
060305 182200 impl: point=org.apache.nutch.searcher.QueryFilter
class=org.apache.nutch.searcher.more.TypeQueryFilter
060305 182200 impl: point=org.apache.nutch.searcher.QueryFilter
class=org.apache.nutch.searcher.more.DateQueryFilter
060305 182200 parsing:
/home/hdiwan/nutch-0.7.1/build/plugins/query-site/plugin.xml
060305 182200 impl: point=org.apache.nutch.searcher.QueryFilter
class=org.apache.nutch.searcher.site.SiteQueryFilter
060305 182200 parsing:
/home/hdiwan/nutch-0.7.1/build/plugins/query-url/plugin.xml
060305 182200 impl: point=org.apache.nutch.searcher.QueryFilter
class=org.apache.nutch.searcher.url.URLQueryFilter
060305 182200 not including:
/home/hdiwan/nutch-0.7.1/build/plugins/urlfilter-regex
060305 182200 not including:
/home/hdiwan/nutch-0.7.1/build/plugins/urlfilter-prefix
060305 182200 not including:
/home/hdiwan/nutch-0.7.1/build/plugins/creativecommons
060305 182200 not including:
/home/hdiwan/nutch-0.7.1/build/plugins/language-identifier
060305 182200 not including:
/home/hdiwan/nutch-0.7.1/build/plugins/clustering-carrot2
060305 182200 not including: /home/hdiwan/nutch-0.7.1/build/plugins/ontology
060305 182200 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
060305 182200 Added 15 pages
060305 182200 Processing pagesByURL: Sorted 15 instructions in 0.0070 seconds.
060305 182200 Processing pagesByURL: Sorted 2142.8571428571427
instructions/second
060305 182200 Processing pagesByURL: Merged to new DB containing 15
records in 0.0040 seconds
060305 182200 Processing pagesByURL: Merged 3750.0 records/second
060305 182200 Processing pagesByMD5: Sorted 15 instructions in 0.0040 seconds.
060305 182200 Processing pagesByMD5: Sorted 3750.0 instructions/second
060305 182200 Processing pagesByMD5: Merged to new DB containing 15
records in 0.0020 seconds
060305 182200 Processing pagesByMD5: Merged 7500.0 records/second
060305 182200 Processing linksByMD5: Copied file (4096 bytes) in 0.0050 secs.
060305 182200 Processing linksByURL: Copied file (4096 bytes) in 0.0030 secs.
060305 182200 FetchListTool started
060305 182201 Processing pagesByURL: Sorted 15 instructions in 0.0030 seconds.
060305 182201 Processing pagesByURL: Sorted 5000.0 instructions/second
060305 182201 Processing pagesByURL: Merged to new DB containing 15
records in 0.0040 seconds
060305 182201 Processing pagesByURL: Merged 3750.0 records/second
060305 182201 Processing pagesByMD5: Sorted 15 instructions in 0.0040 seconds.
060305 182201 Processing pagesByMD5: Sorted 3750.0 instructions/second
060305 182201 Processing pagesByMD5: Merged to new DB containing 15
records in 0.0030 seconds
060305 182201 Processing pagesByMD5: Merged 5000.0 records/second
060305 182201 Processing linksByMD5: Copied file (4096 bytes) in 0.0010 secs.
060305 182201 Processing linksByURL: Copied file (4096 bytes) in 0.0020 secs.
060305 182201 Processing
/home/hdiwan/SpectraSearch/crawl/segments/20060305182200/fetchlist.unsorted:
Sorted 15 entries in 0.0030 seconds.
060305 182201 Processing
/home/hdiwan/SpectraSearch/crawl/segments/20060305182200/fetchlist.unsorted:
Sorted 5000.0 entries/second
060305 182201 Overall processing: Sorted 15 entries in 0.0030 seconds.
060305 182201 Overall processing: Sorted 2.0E-4 entries/second
060305 182201 FetchListTool completed
060305 182201 logging at INFO
060305 182201 fetching
http://hasan.wits2020.net/~hdiwan/blog/2006/03/04/vacation.html
060305 182201 fetching
http://hasan.wits2020.net/~hdiwan/blog/2006/03/03/stand_up_speak_up.html
060305 182201 fetching
http://hasan.wits2020.net/~hdiwan/blog/2006/02/27/murder_in_samarkand.html
060305 182201 fetching
http://hasan.wits2020.net/~hdiwan/blog/2006/02/26/automating_photographic_workfl.html
060305 182201 fetching
http://hasan.wits2020.net/~hdiwan/blog/2006/03/03/punching_at_the_sun_march_17.html
060305 182201 fetching
http://hasan.wits2020.net/~hdiwan/blog/2006/02/27/directtv_videoondemand.html
060305 182201 fetching
http://hasan.wits2020.net/~hdiwan/blog/2006/03/05/to_the_dear_neighbour_of_mine.html
060305 182201 fetching
http://hasan.wits2020.net/~hdiwan/blog/2006/03/03/bsc_has_no_properties_really.html
060305 182201 fetching
http://hasan.wits2020.net/~hdiwan/blog/2006/03/03/creative_commons_salon_march_8.html
060305 182201 http.proxy.host = null
060305 182201 http.proxy.port = 8118
060305 182201 http.timeout = 10000
060305 182201 http.content.limit = -1
060305 182201 http.agent = Spectra/200602 (Spectra;
http://hasan.wits2020.net/typo/public; spectrasearch+agent@gmail.com)
060305 182201 http.auth.ntlm.username =
060305 182201 fetcher.server.delay = 1000
060305 182201 http.max.delays = 100
060305 182201 fetching
http://hasan.wits2020.net/~hdiwan/blog/2006/03/02/gmail_whinging.html
060305 182201 Configured Client
060305 182201 fetching
http://hasan.wits2020.net/~hdiwan/blog/2006/02/27/nobody_likes_me.html
060305 182202 fetching
http://hasan.wits2020.net/~hdiwan/blog/2006/03/01/virtual_hosts_suck_pt_2.html
060305 182202 fetching
http://hasan.wits2020.net/~hdiwan/blog/2006/03/01/i_hate_hosting_providers.html
060305 182202 fetching
http://hasan.wits2020.net/~hdiwan/blog/2006/03/03/spectras_challenge.html
060305 182202 fetching
http://hasan.wits2020.net/~hdiwan/blog/2006/02/28/opml.html
060305 182203 Updating /home/hdiwan/SpectraSearch/crawl/db
060305 182203 Updating for
/home/hdiwan/SpectraSearch/crawl/segments/20060305182200
060305 182203 Processing document 0
060305 182203 Finishing update
060305 182203 Processing pagesByURL: Sorted 15 instructions in 0.0030 seconds.
060305 182203 Processing pagesByURL: Sorted 5000.0 instructions/second
060305 182203 Processing pagesByURL: Merged to new DB containing 15
records in 0.0040 seconds
060305 182203 Processing pagesByURL: Merged 3750.0 records/second
060305 182203 Processing pagesByMD5: Sorted 15 instructions in 0.0070 seconds.
060305 182203 Processing pagesByMD5: Sorted 2142.8571428571427
instructions/second
060305 182203 Processing pagesByMD5: Merged to new DB containing 15
records in 0.0070 seconds
060305 182203 Processing pagesByMD5: Merged 2142.8571428571427 records/second
060305 182203 Processing linksByMD5: Copied file (4096 bytes) in 0.0020 secs.
060305 182203 Processing linksByURL: Copied file (4096 bytes) in 0.0020 secs.
060305 182203 Update finished
060305 182203 FetchListTool started
060305 182203 Overall processing: Sorted 0 entries in 0.0 seconds.
060305 182203 Overall processing: Sorted NaN entries/second
060305 182203 FetchListTool completed
060305 182203 logging at INFO
060305 182204 Updating /home/hdiwan/SpectraSearch/crawl/db
060305 182204 Updating for
/home/hdiwan/SpectraSearch/crawl/segments/20060305182203
060305 182204 Finishing update
060305 182204 Update finished
060305 182204 Updating /home/hdiwan/SpectraSearch/crawl/segments from
/home/hdiwan/SpectraSearch/crawl/db
060305 182204  reading /home/hdiwan/SpectraSearch/crawl/segments/20060305182200
060305 182204  reading /home/hdiwan/SpectraSearch/crawl/segments/20060305182203
060305 182204 Sorting pages by url...
060305 182204 Getting updated scores and anchors from db...
060305 182204 Sorting updates by segment...
060305 182204 Updating segments...
060305 182204  updating /home/hdiwan/SpectraSearch/crawl/segments/20060305182200
060305 182204 Done updating /home/hdiwan/SpectraSearch/crawl/segments
from /home/hdiwan/SpectraSearch/crawl/db
060305 182204 indexing segment:
/home/hdiwan/SpectraSearch/crawl/segments/20060305182200
060305 182205 * Opening segment 20060305182200
060305 182205 * Indexing segment 20060305182200
060305 182205 * Optimizing index...
060305 182205 * Moving index to NFS if needed...
060305 182205 DONE indexing segment 20060305182200: total 15 records
in 0.031 s (Infinity rec/s).
060305 182205 done indexing
060305 182205 indexing segment:
/home/hdiwan/SpectraSearch/crawl/segments/20060305182203
060305 182205 * Opening segment 20060305182203
060305 182205 * Indexing segment 20060305182203
060305 182205 * Optimizing index...
060305 182205 * Moving index to NFS if needed...
060305 182205 DONE indexing segment 20060305182203: total 0 records in
0.075 s (NaN rec/s).
060305 182205 done indexing
060305 182205 Reading url hashes...
060305 182205 Sorting url hashes...
060305 182205 Deleting url duplicates...
060305 182205 Deleted 0 url duplicates.
060305 182205 Reading content hashes...
060305 182205 Sorting content hashes...
060305 182205 Deleting content duplicates...
060305 182205 Deleted 0 content duplicates.
060305 182205 Duplicate deletion complete locally.  Now returning to NFS...
060305 182205 DeleteDuplicates complete
060305 182205 Merging segment indexes...
060305 182205 crawl finished: crawl

That's the entire log. Hope it helps! My crawl-urlfilter.txt:
# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept hosts in any domain
+^http://([a-z0-9]*\.)*/

# skip everything else
-.
So, why isn't it fetching anything, if that is indeed the case?
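(For reference, the accept line above is the stock pattern with the
MY.DOMAIN.NAME placeholder taken out rather than replaced; if I restricted
it to my own domain, going by the URLs in the log, it would presumably read
something like:

+^http://([a-z0-9]*\.)*wits2020.net/

but that is only a guess on my part.)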
--
Cheers,
Hasan Diwan <ha...@gmail.com>

Re: NullPointerException

Posted by Stefan Groschupf <sg...@media-style.com>.
>  If none are being fetched, something is definitely wrong with
> your filter or url file.

Yes, since it is a blog it may have dynamic pages like foo.com?entry=23;
these are definitely filtered out by default.
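In the crawl-urlfilter.txt posted earlier that is this line, which drops
any URL containing ?, *, !, @ or =:

-[?*!@=]

If query-style URLs ever need to be crawled, one option (just a sketch, not
tested against this setup) would be to take ? and = out of that character
class, e.g. -[*!@], so entries like foo.com?entry=23 pass the filter.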

---------------------------------------------
blog: http://www.find23.org
company: http://www.media-style.com



RE: NullPointerException

Posted by Richard Braman <rb...@bramantax.com>.
If you don't have the searcher.dir set right it will usually throw a
servlet error.
This sounds like your crawl didn't get anything.  I have seen that
happen when the url wasn't added right, or the filter was bad.  Pipe the
crawl to crawl.log and look in there.  It should show some pages being
fetched.  If none are being fetched, something is definitely wrong with
your filter or url file.
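Something along these lines should do it (assuming the 0.7.x crawl command
and the url file and crawl directory used elsewhere in this thread; adjust
the paths as needed):

bin/nutch crawl urls -dir crawl -depth 2 -threads 15 > crawl.log 2>&1
grep fetching crawl.log

The grep should list every page the fetcher actually retrieved.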
Richard Braman
mailto:rbraman@taxcodesoftware.org
561.748.4002 (voice) 

http://www.taxcodesoftware.org
Free Open Source Tax Software


-----Original Message-----
From: Stefan Groschupf [mailto:sg@media-style.com] 
Sent: Sunday, March 05, 2006 8:36 PM
To: nutch-user@lucene.apache.org
Subject: Re: NullPointerException


Hi,
http or www are very good test queries.
double check that the nutch-default.xml which is inside the nutch.war
points to the correct folder in <name>searcher.dir</name>.
Stefan
On 06.03.2006 at 02:31, Hasan Diwan wrote:

> I've followed the nutch tutorial for crawling and started tomcat from
> the crawl directory. When I run the crawl, it ends with: 060305 171044
> crawl finished: crawl which looks as it should, then I start tomcat.
> Access nutch using index.jsp in the root context and whatever searches
> I do yield 0 results. I know for a fact that my blog, which is my test
> dataset contains at least one instance of my name, which is what I'm
> searching for. Thanks a bunch for the help!
> --
> Cheers,
> Hasan Diwan <ha...@gmail.com>

---------------------------------------------
blog: http://www.find23.org
company: http://www.media-style.com


Re: NullPointerException

Posted by Stefan Groschupf <sg...@media-style.com>.
Hi,
http or www are very good test queries.
double check that the nutch-default.xml which is inside the nutch.war
points to the correct folder in <name>searcher.dir</name>.
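For example (the path below is just my assumption, taken from the crawl
directory in your log), the property in whichever config file the webapp
actually reads would look roughly like:

<property>
  <name>searcher.dir</name>
  <value>/home/hdiwan/SpectraSearch/crawl</value>
</property>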
Stefan
On 06.03.2006 at 02:31, Hasan Diwan wrote:

> I've followed the nutch tutorial for crawling and started tomcat from
> the crawl directory. When I run the crawl, it ends with:
> 060305 171044 crawl finished: crawl
> which looks as it should, then I start tomcat. Access nutch using
> index.jsp in the root context and whatever searches I do yield 0
> results. I know for a fact that my blog, which is my test dataset
> contains at least one instance of my name, which is what I'm searching
> for. Thanks a bunch for the help!
> --
> Cheers,
> Hasan Diwan <ha...@gmail.com>

---------------------------------------------
blog: http://www.find23.org
company: http://www.media-style.com



Re: NullPointerException

Posted by Jack Tang <hi...@gmail.com>.
On 3/6/06, Hasan Diwan <ha...@gmail.com> wrote:
> I've followed the nutch tutorial for crawling and started tomcat from
> the crawl directory. When I run the crawl, it ends with:
> 060305 171044 crawl finished: crawl
Hey, the nutch crawler ending normally does not mean everything is OK.
Maybe it fetched nothing; make sure your url filters work.

And finally, the crawler logs tell everything :)
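A quick sanity check on the webdb (command from memory for 0.7.x, so treat
it as a sketch):

bin/nutch readdb crawl/db -stats

It should report how many pages and links actually made it into the db.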

/Jack
> which looks as it should, then I start tomcat. Access nutch using
> index.jsp in the root context and whatever searches I do yield 0
> results. I know for a fact that my blog, which is my test dataset
> contains at least one instance of my name, which is what I'm searching
> for. Thanks a bunch for the help!
> --
> Cheers,
> Hasan Diwan <ha...@gmail.com>
>


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars