Posted to user@nutch.apache.org by arul velusamy <ar...@gmail.com> on 2009/02/03 21:34:40 UTC
Crawl process seems to complete but all output files seem to be empty
Dear All,
I downloaded Nutch 0.8.1 and am running it with Eclipse 3.4.1 (OS: Windows
Vista).
(1) I have set in crawl-urlfilter.txt the following -
+^http://([a-z0-9]*\.)*cricinfo.com/
(2) I have created NUTCH_HOME/urls/nutch with the content -
http://www.cricinfo.com
(3) My command line parameters -
urls -dir crawl -depth 3 -topN 50
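[Editor's note in thread context: one thing worth double-checking is whether the seed URL in (2) actually passes the filter rule in (1). Nutch's RegexURLFilter rejects any URL the `+` pattern does not match, and the rule above ends in a trailing "/" that the seed URL lacks. A quick standalone regex sketch (outside Nutch, with the dot in cricinfo.com escaped; the class name is hypothetical) illustrates the difference:]

```java
import java.util.regex.Pattern;

// Hypothetical standalone check of the crawl-urlfilter.txt rule against the
// seed URL. The rule requires a "/" after cricinfo.com, so the bare seed
// "http://www.cricinfo.com" is rejected and nothing gets injected.
public class FilterCheck {
    public static void main(String[] args) {
        Pattern rule = Pattern.compile("^http://([a-z0-9]*\\.)*cricinfo\\.com/");
        // Seed exactly as given in urls/nutch: no trailing slash, no match.
        System.out.println(rule.matcher("http://www.cricinfo.com").find());  // false
        // Same URL with a trailing slash: matches the rule.
        System.out.println(rule.matcher("http://www.cricinfo.com/").find()); // true
    }
}
```

If that is the problem here, either adding a trailing slash to the seed URL or dropping the trailing "/" from the filter rule should let the injector accept the seed.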
When I run Crawl using Eclipse, I see all output directories and files
created. _BUT_ I don't see any useful crawled content in them.
In fact, running SegmentReader with the command line parameters - "-list -dir
crawl/segments/" - gives the following output -
NAME GENERATED FETCHER START FETCHER END FETCHED PARSED
20090203201844 0 292278994-08-17T07:12:55 292269055-12-02T16:47:04 0 0
20090203201851 0 292278994-08-17T07:12:55 292269055-12-02T16:47:04 0 0
20090203201857 0 292278994-08-17T07:12:55 292269055-12-02T16:47:04 0 0
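[Editor's note in thread context: the far-future FETCHER START and far-past FETCHER END values look like uninitialized sentinels rather than real times - they correspond to new Date(Long.MAX_VALUE) and new Date(Long.MIN_VALUE), which is consistent with the fetcher never having processed a single URL. A small sketch (assuming UTC formatting; the class name is hypothetical) reproduces them:]

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// The odd segment dates are Long.MAX_VALUE / Long.MIN_VALUE sentinels:
// "fetcher start" begins at the latest representable time and "fetcher end"
// at the earliest, and neither is updated when zero pages are fetched.
public class SentinelDates {
    public static void main(String[] args) {
        SimpleDateFormat f = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
        f.setTimeZone(TimeZone.getTimeZone("UTC"));
        System.out.println(f.format(new Date(Long.MAX_VALUE))); // 292278994-08-17T07:12:55
        System.out.println(f.format(new Date(Long.MIN_VALUE))); // year 292269055 (BC)
    }
}
```

So the listing itself already says FETCHED and PARSED are both 0: the problem is upstream of the fetcher, most likely at inject/generate time.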
What is going wrong? Please help on this.
Thanks,
Arul.
Re: Crawl process seems to complete but all output files seem to be empty
Posted by arul velusamy <ar...@gmail.com>.
Saurabh,
I see the following lines in the log file -
2009-02-03 20:18:36,646 INFO crawl.Injector - Injector: starting
2009-02-03 20:18:36,646 INFO crawl.Injector - Injector: crawlDb:
crawl/crawldb
2009-02-03 20:18:36,646 INFO crawl.Injector - Injector: urlDir: urls
2009-02-03 20:18:36,678 INFO crawl.Injector - Injector: Converting injected
urls to crawl db entries.
2009-02-03 20:18:39,627 INFO net.UrlNormalizerFactory - Using URL
normalizer: org.apache.nutch.net.BasicUrlNormalizer
2009-02-03 20:18:39,746 INFO plugin.PluginRepository - Plugins: looking in:
C:\Installs\nutch\.\src\plugin
2009-02-03 20:18:39,764 WARN plugin.PluginRepository -
java.io.FileNotFoundException:
C:\Installs\nutch\.\src\plugin\.svn\plugin.xml (The system cannot find the
file specified)
2009-02-03 20:18:41,207 INFO plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2009-02-03 20:18:41,207 INFO plugin.PluginRepository - Registered Plugins:
2009-02-03 20:18:41,207 INFO plugin.PluginRepository - Pdf Parse Plug-in
(parse-pdf)
2009-02-03 20:18:41,207 INFO plugin.PluginRepository - Jakarta POI - Java
API To Access Microsoft Format Files (lib-jakarta-poi)
2009-02-03 20:18:41,207 INFO plugin.PluginRepository - Creative Commons
Plugins (creativecommons)
Do you think the inability to find plugin.xml is a critical error?
Thanks,
Arul.
On Thu, Feb 12, 2009 at 5:47 AM, Saurabh Bhutyani <sa...@in.com> wrote:
>
> Check the logs under nutch-x.x/logs/hadoop.log file. If you have indexing
> of files properly then the folders should not be empty. Also in case any
> error occurred you will be able to see it in the logs.
Re: Crawl process seems to complete but all output files seem to be empty
Posted by Saurabh Bhutyani <sa...@in.com>.
Check the logs under nutch-x.x/logs/hadoop.log. If the files have been indexed
properly, the folders should not be empty. Also, in case any error occurred,
you will be able to see it in the logs.