Posted to user@nutch.apache.org by carmmello <ca...@globo.com> on 2005/12/22 21:12:25 UTC

New Tutorial Needed

I have downloaded the latest Nutch nightly build, from 2005-12-18, and
tried to run it as usual (using the crawl method). As a result, I got
the following error messages:

"051222 175202 parsing file:/usr/nutch-nightly/conf/nutch-site.xml
java.io.IOException: No input directories specified in: NutchConf: nutch-default.xml , mapred-default.xml , /tmp/nutch/mapred/local/localRunner/job_xom6lb.xml , nutch-site.xml
        at org.apache.nutch.mapred.InputFormatBase.listFiles(InputFormatBase.java:85)
        at org.apache.nutch.mapred.InputFormatBase.getSplits(InputFormatBase.java:95)
        at org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:63)
051222 175203  map 0%
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:102)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:101)
[root@localhost nutch-nightly]# "

Also, reviewing some posts I came across the following statement:

"The next version will be map reduce based in any case. So 0.7 is
already the 'old' one and people will not continue to develop it (maybe
just some maintenance releases). Map reduce doesn't mean that you need
more than one computer or need the ndfs. It is just a technology to
process large data sets." (Stefan Groschupf)

I know that the new official release of Nutch (0.8, I think) is not
out yet, but I think a new tutorial is needed on how to set up
Nutch to run properly, as the tutorial on the Nutch site, it seems,
cannot cope with the new nightly distributions or with the releases
that will follow.

Thanks






Re: New Tutorial Needed

Posted by Stefan Groschupf <sg...@media-style.com>.
Hi,
for clarification:
The next release will be 0.7.2. It will mostly fix a set of bugs and
also add some new features, but it does not use map reduce.
However, the current source stream (trunk) is 0.8 (map reduce), and
there is a 0.7 branch so that bugs in 0.7.x can be fixed; there may
be more maintenance releases.
Anyway, developers are focused on map reduce based tool
development.

Stefan



---------------------------------------------------------------
company:        http://www.media-style.com
forum:        http://www.text-mining.org
blog:            http://www.find23.net



Re: New Tutorial Needed

Posted by Doug Cutting <cu...@nutch.org>.
carmmello wrote:
> I have downloaded the last Nutch nightly version, from 2005-12-18, and
> tried to run it,  as usual (using the crawl method).  As a result, I got
> the following error messages: 

Where are you finding the tutorial?  The version on the website is for 
the currently released version, 0.7.1.  There is a newer tutorial 
bundled with 0.8.  Please refer to the tutorial included with the code 
you download.

Doug

Re: New Tutorial Needed

Posted by Raghavendra Prabhu <rr...@gmail.com>.
Hi guys

It is the same problem.
In the new Nutch, the URL folder can contain many flat files with URLs
in them.

In the nutch-0.7.1 version, it was a single file.
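
For anyone moving an existing setup across, the change from a single file to a folder could be handled with something like the sketch below. It is only an illustration: the helper name seed_file_to_dir and the file name seeds.txt are made up for this example, and it assumes the old flat file is called urls and sits in the current directory.

```shell
# Hypothetical helper: turn a 0.7.x-style flat seed file into a seed folder
# of the same name, keeping the original URL list inside it.
seed_file_to_dir() {
    seed_path="$1"
    # Nothing to do unless the path is a regular file (the 0.7.x layout).
    [ -f "$seed_path" ] || return 0
    mv "$seed_path" "$seed_path.flat" &&
        mkdir "$seed_path" &&
        mv "$seed_path.flat" "$seed_path/seeds.txt"   # file name is an arbitrary choice
}

# Assumed name of the old flat file; a no-op if it is missing or already a folder.
seed_file_to_dir urls
```

The helper leaves an existing folder untouched, so it should be safe to run more than once.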

We could have a separate tutorial for each version.

Also, the latest Nutch makes use of the new crawl code,

whereas nutch-0.7.1 uses org.apache.nutch.tools.CrawlTool for crawling.

Hope this helps

Raghavendra Prabhu


Re: New Tutorial Needed

Posted by Lukas Vlcek <lu...@gmail.com>.
Hi,

I found the solution to the original problem at the beginning of this
mail thread. I am not sure if anybody is still interested in it,
anyway, here it comes:

The problem is very simple. The current nutch-trunk version requires
the initial URL list to be stored in a folder. In other words, when
using the crawl command (as in the following example: "bin/nutch crawl
urls -dir some_dir -depth n"), urls MUST be a folder and not a flat
file.
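
As a rough sketch of what that looks like in practice (not taken from any tutorial): the seed URL, the file name seeds.txt, the output folder crawl_out and the depth of 3 are all placeholders chosen for illustration; only the shape of the crawl command comes from the example above.

```shell
# Create the seed folder and put a flat file with one URL per line inside it.
# The URL and the file name are placeholders; substitute your own.
mkdir -p urls
printf '%s\n' 'http://www.example.com/' > urls/seeds.txt

# Pass the folder (not the file) to the crawl command.
# crawl_out and the depth value are illustrative choices.
if [ -x bin/nutch ]; then
    bin/nutch crawl urls -dir crawl_out -depth 3
else
    echo "bin/nutch not found; run this from the Nutch install directory" >&2
fi
```

The check for bin/nutch is only there so the snippet fails gently when run outside a Nutch checkout.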

This is one of the (minor) changes made to Nutch as it matured from
nutch-0.7.x to nutch-0.8. I didn't notice it originally. Again, this
is a simple issue, but in case anybody is still interested....

Regards,
Lukas


Re: New Tutorial Needed

Posted by Stefan Groschupf <sg...@media-style.com>.
Hi

> I was struggling with the same problem two days ago. I posted it
> to the nutch-dev mailing list under the following subject: "nutch-0.8-dev
> *mapred.input.subdir* problem

As soon as I find some time over the next few days, I will try to
reproduce the problem.
Meanwhile, it would be good to know whether you see this problem with
the nightly build, and whether it also occurs when using a build from
the latest sources in Subversion.

Stefan

Re: New Tutorial Needed

Posted by Lukas Vlcek <lu...@gmail.com>.
Hi,

I was struggling with the same problem two days ago. I posted it
to the nutch-dev mailing list under the following subject: "nutch-0.8-dev
*mapred.input.subdir* problem ?".
Stefan and Paul responded, but I am not sure this will solve my problem
(truthfully, I have not had time to fully test their suggestions).

To me it seems that the problem may be related to the setting of the
mapred.input.subdir property (I was looking into the code). But I was
not able to find anything about the mapred.input.subdir property on
the web.

Now I know that I am not the only one who has this problem, so either
both you and I are doing something wrong or there is a real problem in
the latest nutch trunk package.

Is anybody else facing this problem?
Lukas
