Posted to user@nutch.apache.org by arul velusamy <ar...@gmail.com> on 2009/02/03 21:34:40 UTC

Crawl process seems to complete but all output files seem to be empty

Dear All,

I downloaded Nutch 0.8.1 and am running it with Eclipse 3.4.1 (OS: windows
vista).


(1) I have set in crawl-urlfilter.txt the following -

+^http://([a-z0-9]*\.)*cricinfo.com/
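(For reference, that accept line sits inside the stock crawl-urlfilter.txt between the default skip rules and the final catch-all; quoting the 0.8.x template roughly from memory, so the exact suffix list may differ:

    # skip file:, ftp:, and mailto: urls
    -^(file|ftp|mailto):
    # skip image and other suffixes we can't parse
    -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|eps|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
    # skip URLs containing characters that look like queries
    -[?*!@=]
    # accept hosts in cricinfo.com
    +^http://([a-z0-9]*\.)*cricinfo.com/
    # skip everything else
    -.
)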

(2) I have created NUTCH_HOME/urls/nutch with the content -

http://www.cricinfo.com

(3) My command line parameters -

urls -dir crawl -depth 3 -topN 50
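(These are the arguments passed to the Crawl class inside Eclipse; the equivalent command-line invocation would be, if I am not mistaken,

    bin/nutch crawl urls -dir crawl -depth 3 -topN 50

with "urls" being the seed directory from step 2.)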

When I run Crawl using Eclipse, I see all output directories and files
created, _BUT_ I don't see any useful crawled content in them.
In fact, running SegmentReader with the command line parameters "-list -dir
crawl/segments/" gives the following output -

NAME GENERATED FETCHER START FETCHER END FETCHED PARSED

20090203201844 0 292278994-08-17T07:12:55 292269055-12-02T16:47:04 0 0

20090203201851 0 292278994-08-17T07:12:55 292269055-12-02T16:47:04 0 0

20090203201857 0 292278994-08-17T07:12:55 292269055-12-02T16:47:04 0 0
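(For reference, the same listing can be produced from the command line with, if I remember the sub-command name correctly,

    bin/nutch readseg -list -dir crawl/segments/

The 0 in the GENERATED column means no URLs made it into these segments at all, and the impossible start/end dates are presumably just the placeholder values SegmentReader prints when there are no fetch timestamps to aggregate.)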

What is going wrong? Please help on this.

Thanks,

Arul.

Re: Crawl process seems to complete but all output files seem to be empty

Posted by arul velusamy <ar...@gmail.com>.
Saurabh,

I see the following lines in the log file -

*2009-02-03 20:18:36,646 INFO  crawl.Injector - Injector: starting
2009-02-03 20:18:36,646 INFO  crawl.Injector - Injector: crawlDb:
crawl/crawldb
2009-02-03 20:18:36,646 INFO  crawl.Injector - Injector: urlDir: urls
2009-02-03 20:18:36,678 INFO  crawl.Injector - Injector: Converting injected
urls to crawl db entries.
2009-02-03 20:18:39,627 INFO  net.UrlNormalizerFactory - Using URL
normalizer: org.apache.nutch.net.BasicUrlNormalizer
2009-02-03 20:18:39,746 INFO  plugin.PluginRepository - Plugins: looking in:
C:\Installs\nutch\.\src\plugin
2009-02-03 20:18:39,764 WARN  plugin.PluginRepository -
java.io.FileNotFoundException:
C:\Installs\nutch\.\src\plugin\.svn\plugin.xml (The system cannot find the
file specified)
2009-02-03 20:18:41,207 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2009-02-03 20:18:41,207 INFO  plugin.PluginRepository - Registered Plugins:
2009-02-03 20:18:41,207 INFO  plugin.PluginRepository -  Pdf Parse Plug-in
(parse-pdf)
2009-02-03 20:18:41,207 INFO  plugin.PluginRepository -  Jakarta POI - Java
API To Access Microsoft Format Files (lib-jakarta-poi)
2009-02-03 20:18:41,207 INFO  plugin.PluginRepository -  Creative Commons
Plugins (creativecommons)
*
Do you think the inability to find plugin.xml is a critical error?
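(For what it's worth, the "Plugins: looking in" path above is controlled by the plugin.folders property, which in a stock install points at the built "plugins" directory rather than src/plugin; quoting nutch-default.xml roughly from memory:

    <property>
      <name>plugin.folders</name>
      <value>plugins</value>
      <description>Directories where nutch plugins are located.</description>
    </property>

so when running from Eclipse it is common to override this in nutch-site.xml.)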

Thanks,
Arul.

On Thu, Feb 12, 2009 at 5:47 AM, Saurabh Bhutyani <sa...@in.com> wrote:

>
> Check the logs under nutch-x.x/logs/hadoop.log. If indexing completed
> properly, the folders should not be empty. Also, if any error occurred,
> you will be able to see it in the logs.
>
>
>
> ---------- Original message ----------
> From: arul velusamy <arulvelusamy@gmail.com>
> Date: 09 Feb 09 17:48:51
> Subject: Re: Crawl process seems to complete but all output files seem to
> be empty
> To: nutch-user@lucene.apache.org
>
> >
> > Dear All,
> >
> > I downloaded Nutch 0.8.1 and am running it with Eclipse 3.4.1 (OS:
> windows
> > vista).
> >
> >
> > (1) I have set in crawl-urlfilter.txt the following -
> >
> > +^http://([a-z0-9]*\.)*cricinfo.com/
> >
> > (2) I have created NUTCH_HOME/urls/nutch with the content -
> >
> > http://www.cricinfo.com
> >
> > (3) My command line parameters -
> >
> > urls -dir crawl -depth 3 -topN 50
> >
> > When I run Crawl using Eclipse, I see all output directories and files
> > created, _BUT_ I don't see any useful crawled content in them.
> > In fact, running SegmentReader with the command line parameters "-list
> > -dir crawl/segments/" gives the following output -
> >
> > NAME GENERATED FETCHER START FETCHER END FETCHED PARSED
> >
> > 20090203201844 0 292278994-08-17T07:12:55 292269055-12-02T16:47:04 0 0
> >
> > 20090203201851 0 292278994-08-17T07:12:55 292269055-12-02T16:47:04 0 0
> >
> > 20090203201857 0 292278994-08-17T07:12:55 292269055-12-02T16:47:04 0 0
> >
> > What is going wrong? Please help on this.
> >
> > Thanks,
> >
> > Arul.
> >
>
>

Re: Crawl process seems to complete but all output files seem to be empty

Posted by Saurabh Bhutyani <sa...@in.com>.
Check the logs under nutch-x.x/logs/hadoop.log. If indexing completed
properly, the folders should not be empty. Also, if any error occurred, you
will be able to see it in the logs.

---------- Original message ----------
From: arul velusamy <arulvelusamy@gmail.com>
Date: 09 Feb 09 17:48:51
Subject: Re: Crawl process seems to complete but all output files seem to be empty
To: nutch-user@lucene.apache.org

> [snip - original question]
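If one of the jobs failed, a one-line search of that log will usually surface it (assuming a Unix-like shell; findstr does the same job on plain Windows):

    grep -iE "error|exception" logs/hadoop.log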

Re: Crawl process seems to complete but all output files seem to be empty

Posted by arul velusamy <ar...@gmail.com>.
>
> Dear All,
>
> I downloaded Nutch 0.8.1 and am running it with Eclipse 3.4.1 (OS: windows
> vista).
>
>
> (1) I have set in crawl-urlfilter.txt the following -
>
> +^http://([a-z0-9]*\.)*cricinfo.com/
>
> (2) I have created NUTCH_HOME/urls/nutch with the content -
>
> http://www.cricinfo.com
>
> (3) My command line parameters -
>
> urls -dir crawl -depth 3 -topN 50
>
> When I run Crawl using Eclipse, I see all output directories and files
> created, _BUT_ I don't see any useful crawled content in them.
> In fact, running SegmentReader with the command line parameters "-list
> -dir crawl/segments/" gives the following output -
>
> NAME GENERATED FETCHER START FETCHER END FETCHED PARSED
>
> 20090203201844 0 292278994-08-17T07:12:55 292269055-12-02T16:47:04 0 0
>
> 20090203201851 0 292278994-08-17T07:12:55 292269055-12-02T16:47:04 0 0
>
> 20090203201857 0 292278994-08-17T07:12:55 292269055-12-02T16:47:04 0 0
>
> What is going wrong? Please help on this.
>
> Thanks,
>
> Arul.
>