Posted to dev@nutch.apache.org by Jeremy Bensley <jb...@gmail.com> on 2005/08/16 20:56:52 UTC

(mapred branch) Job.xml as a directory instead of a file, other issues.

I have been attempting to get the mapred branch version of the crawler
working and have hit some snags.

First, I have observed the same behavior as a previous poster from
yesterday: instead of specifying a file for the URLs to be read
from, one must now specify the full path of a directory in which a
file containing the URL list is stored. From the response to that
thread I gather that requiring a directory instead of a file isn't
the desired behavior.

Second, and more importantly, I am having issues with task trackers. I
have three machines running task tracker, and a fourth running the job
tracker, and they seem to be talking well. Whenever I try to invoke
crawl using the job tracker, however, all of my task trackers
continually fail with this:

050816 134532 parsing /tmp/nutch/mapred/local/tracker/task_m_5o5uvx/job.xml
[Fatal Error] :-1:-1: Premature end of file.
050816 134532 SEVERE error parsing conf file:
org.xml.sax.SAXParseException: Premature end of file.
java.lang.RuntimeException: org.xml.sax.SAXParseException: Premature
end of file.
        at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:355)
        at org.apache.nutch.util.NutchConf.getProps(NutchConf.java:290)
        at org.apache.nutch.util.NutchConf.get(NutchConf.java:91)
        at org.apache.nutch.mapred.JobConf.getJar(JobConf.java:80)
        at org.apache.nutch.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:335)
        at org.apache.nutch.mapred.TaskTracker$TaskInProgress.<init>(TaskTracker.java:319)
        at org.apache.nutch.mapred.TaskTracker.offerService(TaskTracker.java:221)
        at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:269)
        at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:610)
Caused by: org.xml.sax.SAXParseException: Premature end of file.
        at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
        at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:172)
        at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:315)
        ... 8 more

Whenever I look at the job.xml file specified by this location, it
turns out that it is a directory, not a file.

drwxrwxr-x  2 jeremy  users 4096 Aug 16 13:45 job.xml


Any help / observation of these issues is most appreciated.

Thanks,

Jeremy

Re: Merge Lucene to Nutch

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Nutch simply uses the Lucene JAR file.  Upgrading Nutch to use a new  
Lucene release would involve replacing the JAR file with the new  
version, and depending on the changes to Lucene itself it may involve  
rebuilding indexes (to ensure normalization factors and such changes  
are incorporated), but Lucene remains quite backwards compatible with  
indexes built with previous versions and reindexing likely wouldn't  
be needed.  The specifics are in the details of what versions of  
Lucene we're talking about, of course.
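The jar swap Erik describes can be sketched as below; this is a hypothetical simulation with placeholder file names (the real jar names depend on the Lucene version bundled in lib/):

```shell
# Simulate a Nutch checkout's lib/ directory with stand-in jar files
# (names are placeholders, not the actual bundled versions).
mkdir -p nutch/lib
touch nutch/lib/lucene-old.jar   # stand-in for the bundled Lucene jar
# Upgrading: remove the old jar, drop in the new one, rebuild.
rm nutch/lib/lucene-old.jar
touch nutch/lib/lucene-new.jar   # stand-in for the new release
ls nutch/lib                     # prints: lucene-new.jar
# Then rebuild with "ant package", and reindex only if the index
# format or scoring (e.g. normalization factors) changed.
```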

     Erik


On Aug 17, 2005, at 8:29 PM, Michael Ji wrote:

> As I understand, Nutch is a crawling/searching
> application based on Lucene;
>
> Just a curious question, when Lucene has a new
> version/release, how to merge Lucene to Nutch?
>
> I didn't see explicit Lucene Java source in the Nutch
> source tree. I don't think Nutch and Lucene implement
> the low-level API independently.
>
> Thanks,
>
> Michael Ji
>
> __________________________________________________
> Do You Yahoo!?
> Tired of spam?  Yahoo! Mail has the best spam protection around
> http://mail.yahoo.com
>


RE: Merge Lucene to Nutch

Posted by Fuad Efendi <fu...@efendi.ca>.
Nutch (the distribution, built by "ant package") has a folder /lib/
containing the Lucene jar files... As usual, the specific library
versions were tested for production use, so any upgrade of the libs
is not recommended...


-----Original Message-----
From: Michael Ji [mailto:fji_00@yahoo.com] 
Sent: Wednesday, August 17, 2005 8:30 PM
To: nutch-dev@lucene.apache.org
Subject: Merge Lucene to Nutch


As I understand, Nutch is a crawling/searching
application based on Lucene;

Just a curious question, when Lucene has a new
version/release, how to merge Lucene to Nutch? 

I didn't see explicit Lucene Java source in the Nutch
source tree. I don't think Nutch and Lucene implement
the low-level API independently.

Thanks,

Michael Ji




Merge Lucene to Nutch

Posted by Michael Ji <fj...@yahoo.com>.
As I understand, Nutch is a crawling/searching
application based on Lucene;

Just a curious question, when Lucene has a new
version/release, how to merge Lucene to Nutch? 

I didn't see explicit Lucene Java source in the Nutch
source tree. I don't think Nutch and Lucene implement
the low-level API independently.

Thanks,

Michael Ji


result tuning

Posted by webmaster <we...@www.poundwebhosting.com>.
Where would I change things so that every search returns only 10
results instead of 100, so it won't cache 10 pages of sub-results? I
take it that it is not the io.sort.factor option!!!
-Jay

Re: (mapred branch) Job.xml as a directory instead of a file, other issues.

Posted by Doug Cutting <cu...@nutch.org>.
Jeremy Bensley wrote:
> After going through your checklist, I realized that my view on how the
> MapReduce function behaves was slightly flawed, as I did not realize
> that the temporary storage phase between map and reduce had to be in a
> shared location.

The temporary storage between map and reduce is actually not stored in 
NDFS, but on the nodes' local disks.  But the input (the url file in 
this case) must be shared.

> So, my process for running crawl is now:
> 1. Set up / start NDFS name and data nodes
> 2. Copy url file into NDFS 
> 3. Set up / start job and task trackers
> 4. run crawl with arguments referencing the NDFS positions of my
> inputs and outputs

That looks right to me.

We really need a mapred & ndfs-based tutorial...

> The only lasting issue I have is that, whenever I attempt to start a
> tasktracker or jobtracker and have the configuration parameters for
> mapred specified only in mapred-default.xml, I get the following
> error:
> 
> 050816 164343 parsing
> file:/main/home/jeremy/small_project/nutch-mapred/conf/nutch-default.xml
> 050816 164343 parsing
> file:/main/home/jeremy/small_project/nutch-mapred/conf/nutch-site.xml
> Exception in thread "main" java.lang.RuntimeException: Bad
> mapred.job.tracker: local
>         at org.apache.nutch.mapred.JobTracker.getAddress(JobTracker.java:245)
>         at org.apache.nutch.mapred.TaskTracker.<init>(TaskTracker.java:72)
>         at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:609)
> 
> It is as if the mapred-default.xml is not being parsed for its
> options. If I specify the same options in nutch-site.xml it works just
> fine.

The config files are a bit confusing.  mapred-default.xml is for stuff 
that may be reasonably overridden by applications, while nutch-site.xml 
is for stuff that should not be overridden by applications.  So the name 
of the shared filesystem and of the job tracker should be in 
nutch-site.xml, since they should not be overridden.  But, e.g., the 
default number of map and reduce tasks should be in mapred-default.xml, 
since applications do sometimes change these.

The "local" job tracker should only be used in standalone 
configurations, when everything runs in the same process.  It doesn't 
make sense to start a task tracker process configured with a "local" job 
tracker.  If you want to run them on the same host then you might 
configure "localhost:xxxx" as the job tracker.
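Putting those rules together, a minimal nutch-site.xml for a distributed
setup might look like the sketch below. Host names and ports are
placeholders, and the fs.default.name property name is my recollection of
the NDFS setting; verify both against conf/nutch-default.xml in your
checkout.

```xml
<?xml version="1.0"?>
<nutch-conf>
  <!-- Cluster-wide settings that applications should not override. -->
  <property>
    <name>fs.default.name</name>
    <!-- the shared NDFS name node; placeholder host:port -->
    <value>namenode-host:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <!-- never "local" when starting a real task tracker -->
    <value>jobtracker-host:9001</value>
  </property>
</nutch-conf>
```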

Doug

Re: (mapred branch) Job.xml as a directory instead of a file, other issues.

Posted by Jeremy Bensley <jb...@gmail.com>.
After going through your checklist, I realized that my view on how the
MapReduce function behaves was slightly flawed, as I did not realize
that the temporary storage phase between map and reduce had to be in a
shared location. So, my process for running crawl is now:

1. Set up / start NDFS name and data nodes
2. Copy url file into NDFS 
3. Set up / start job and task trackers
4. run crawl with arguments referencing the NDFS positions of my
inputs and outputs

Following these steps I was able to get it to work as expected.
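The four steps above map onto shell invocations roughly as follows; the
command names are my recollection of the mapred-branch bin/nutch scripts,
so treat this as a sketch and check it against the scripts in your
checkout:

```
# 1. Start NDFS: name node on one machine, a data node on each storage machine
bin/nutch namenode &
bin/nutch datanode &
# 2. Copy the url directory into NDFS so every task tracker can read it
bin/nutch ndfs -put urls urls
# 3. Start the job tracker on one machine, a task tracker on each worker
bin/nutch jobtracker &
bin/nutch tasktracker &
# 4. Run the crawl; input and output paths now resolve inside NDFS
bin/nutch crawl urls -depth 3
```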


The only lasting issue I have is that, whenever I attempt to start a
tasktracker or jobtracker and have the configuration parameters for
mapred specified only in mapred-default.xml, I get the following
error:

050816 164343 parsing
file:/main/home/jeremy/small_project/nutch-mapred/conf/nutch-default.xml
050816 164343 parsing
file:/main/home/jeremy/small_project/nutch-mapred/conf/nutch-site.xml
Exception in thread "main" java.lang.RuntimeException: Bad
mapred.job.tracker: local
        at org.apache.nutch.mapred.JobTracker.getAddress(JobTracker.java:245)
        at org.apache.nutch.mapred.TaskTracker.<init>(TaskTracker.java:72)
        at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:609)

It is as if the mapred-default.xml is not being parsed for its
options. If I specify the same options in nutch-site.xml it works just
fine.

I appreciate the help, and look forward to experimenting with the software.

Jeremy


On 8/16/05, Doug Cutting <cu...@nutch.org> wrote:
> Jeremy Bensley wrote:
> > First, I have observed the same behavior as a previous poster from
> > yesterday: instead of specifying a file for the URLs to be read
> > from, one must now specify the full path of a directory in which a
> > file containing the URL list is stored. From the response to that
> > thread I gather that requiring a directory instead of a file isn't
> > the desired behavior.
> 
> A directory is required.  For consistency, all inputs and outputs are
> now directories of files rather than individual files.
> 
> > Second, and more importantly, I am having issues with task trackers. I
> > have three machines running task tracker, and a fourth running the job
> > tracker, and they seem to be talking well. Whenever I try to invoke
> > crawl using the job tracker, however, all of my task trackers
> > continually fail with this:
> >
> > 050816 134532 parsing /tmp/nutch/mapred/local/tracker/task_m_5o5uvx/job.xml
> > [Fatal Error] :-1:-1: Premature end of file.
> > 050816 134532 SEVERE error parsing conf file:
> > org.xml.sax.SAXParseException: Premature end of file.
> > java.lang.RuntimeException: org.xml.sax.SAXParseException: Premature
> > end of file.
> >         at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:355)
> >         at org.apache.nutch.util.NutchConf.getProps(NutchConf.java:290)
> >         at org.apache.nutch.util.NutchConf.get(NutchConf.java:91)
> >         at org.apache.nutch.mapred.JobConf.getJar(JobConf.java:80)
> >         at org.apache.nutch.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:335)
> >         at org.apache.nutch.mapred.TaskTracker$TaskInProgress.<init>(TaskTracker.java:319)
> >         at org.apache.nutch.mapred.TaskTracker.offerService(TaskTracker.java:221)
> >         at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:269)
> >         at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:610)
> > Caused by: org.xml.sax.SAXParseException: Premature end of file.
> >         at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
> >         at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
> >         at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:172)
> >         at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:315)
> >         ... 8 more
> >
> > Whenever I look at the job.xml file specified by this location, it
> > turns out that it is a directory, not a file.
> >
> > drwxrwxr-x  2 jeremy  users 4096 Aug 16 13:45 job.xml
> 
> I have not seen this before.  If you remove everything in /tmp/nutch, is
> this reproducible?  Are you using NDFS?  If not, how are you sharing
> files between task trackers?  Is this on Win32, Linux or what?  Are you
> running the latest mapred code?  If your troubles continue, please post
> your nutch-site.xml and mapred-default.xml.
> 
> Doug
>

Re: (mapred branch) Job.xml as a directory instead of a file, other issues.

Posted by Doug Cutting <cu...@nutch.org>.
Jeremy Bensley wrote:
> First, I have observed the same behavior as a previous poster from
> yesterday: instead of specifying a file for the URLs to be read
> from, one must now specify the full path of a directory in which a
> file containing the URL list is stored. From the response to that
> thread I gather that requiring a directory instead of a file isn't
> the desired behavior.

A directory is required.  For consistency, all inputs and outputs are 
now directories of files rather than individual files.
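In other words, the fix on the user's side is just to wrap the seed file
in a directory and pass the directory path; a minimal sketch (file and
directory names are illustrative):

```shell
# Inputs are directories of files, so wrap the seed list in one:
mkdir -p urls
echo "http://lucene.apache.org/" > urls/seeds.txt
# Then pass the directory (not the file) as the crawl input, e.g.:
#   bin/nutch crawl urls ...
ls urls   # prints: seeds.txt
```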

> Second, and more importantly, I am having issues with task trackers. I
> have three machines running task tracker, and a fourth running the job
> tracker, and they seem to be talking well. Whenever I try to invoke
> crawl using the job tracker, however, all of my task trackers
> continually fail with this:
> 
> 050816 134532 parsing /tmp/nutch/mapred/local/tracker/task_m_5o5uvx/job.xml
> [Fatal Error] :-1:-1: Premature end of file.
> 050816 134532 SEVERE error parsing conf file:
> org.xml.sax.SAXParseException: Premature end of file.
> java.lang.RuntimeException: org.xml.sax.SAXParseException: Premature
> end of file.
>         at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:355)
>         at org.apache.nutch.util.NutchConf.getProps(NutchConf.java:290)
>         at org.apache.nutch.util.NutchConf.get(NutchConf.java:91)
>         at org.apache.nutch.mapred.JobConf.getJar(JobConf.java:80)
>         at org.apache.nutch.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:335)
>         at org.apache.nutch.mapred.TaskTracker$TaskInProgress.<init>(TaskTracker.java:319)
>         at org.apache.nutch.mapred.TaskTracker.offerService(TaskTracker.java:221)
>         at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:269)
>         at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:610)
> Caused by: org.xml.sax.SAXParseException: Premature end of file.
>         at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>         at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>         at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:172)
>         at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:315)
>         ... 8 more
> 
> Whenever I look at the job.xml file specified by this location, it
> turns out that it is a directory, not a file.
> 
> drwxrwxr-x  2 jeremy  users 4096 Aug 16 13:45 job.xml

I have not seen this before.  If you remove everything in /tmp/nutch, is 
this reproducible?  Are you using NDFS?  If not, how are you sharing 
files between task trackers?  Is this on Win32, Linux or what?  Are you 
running the latest mapred code?  If your troubles continue, please post 
your nutch-site.xml and mapred-default.xml.

Doug