Posted to dev@nutch.apache.org by Jeremy Bensley <jb...@gmail.com> on 2005/08/16 20:56:52 UTC
(mapred branch) Job.xml as a directory instead of a file, other issues.
I have been attempting to get the mapred branch version of the crawler
working and have hit some snags.
First, I have observed the same behavior as a previous poster from
yesterday who, instead of specifying a file for the URLs to be read
from, must now specify a directory (full path) in which a file
containing the URL list is stored. From the response to that thread I
gather that it isn't desired behavior to specify a directory
instead of a file.
Second, and more importantly, I am having issues with task trackers. I
have three machines running task tracker, and a fourth running the job
tracker, and they seem to be talking well. Whenever I try to invoke
crawl using the job tracker, however, all of my task trackers
continually fail with this:
050816 134532 parsing /tmp/nutch/mapred/local/tracker/task_m_5o5uvx/job.xml
[Fatal Error] :-1:-1: Premature end of file.
050816 134532 SEVERE error parsing conf file:
org.xml.sax.SAXParseException: Premature end of file.
java.lang.RuntimeException: org.xml.sax.SAXParseException: Premature
end of file.
at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:355)
at org.apache.nutch.util.NutchConf.getProps(NutchConf.java:290)
at org.apache.nutch.util.NutchConf.get(NutchConf.java:91)
at org.apache.nutch.mapred.JobConf.getJar(JobConf.java:80)
at org.apache.nutch.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:335)
at org.apache.nutch.mapred.TaskTracker$TaskInProgress.<init>(TaskTracker.java:319)
at org.apache.nutch.mapred.TaskTracker.offerService(TaskTracker.java:221)
at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:269)
at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:610)
Caused by: org.xml.sax.SAXParseException: Premature end of file.
at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:172)
at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:315)
... 8 more
When I look at the job.xml specified by this path, it
turns out that it is a directory, not a file.
drwxrwxr-x 2 jeremy users 4096 Aug 16 13:45 job.xml
Any help / observation of these issues is most appreciated.
Thanks,
Jeremy
Re: Merge Lucene to Nutch
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Nutch simply uses the Lucene JAR file. Upgrading Nutch to use a new
Lucene release would involve replacing the JAR file with the new
version, and depending on the changes to Lucene itself it may involve
rebuilding indexes (to ensure normalization factors and such changes
are incorporated). But Lucene remains quite backwards compatible with
indexes built by previous versions, so reindexing likely wouldn't
be needed. The specifics depend on which versions of Lucene
we're talking about, of course.
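For example, such a drop-in upgrade amounts to swapping the jar under lib/. A self-contained sketch (the version numbers are hypothetical, and the jars are simulated with empty files):

```shell
# Simulate a Nutch lib/ directory holding an old Lucene jar.
# (Jar names are hypothetical; a real upgrade copies the actual
# release jar from a Lucene distribution.)
mkdir -p nutch/lib
touch nutch/lib/lucene-1.4.3.jar

# Drop in the new release and remove the old jar.
touch nutch/lib/lucene-1.9-rc1.jar
rm nutch/lib/lucene-1.4.3.jar

ls nutch/lib
```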
Erik
On Aug 17, 2005, at 8:29 PM, Michael Ji wrote:
> As I understand, Nutch is a crawling/searching
> application based on Lucene;
>
> Just a curious question, when Lucene has a new
> version/release, how to merge Lucene to Nutch?
>
> I didn't see any explicit Lucene Java source in the Nutch
> source tree, and I don't think Nutch implements the low-level
> API independently of Lucene.
>
> Thanks,
>
> Michael Ji
>
> __________________________________________________
> Do You Yahoo!?
> Tired of spam? Yahoo! Mail has the best spam protection around
> http://mail.yahoo.com
>
RE: Merge Lucene to Nutch
Posted by Fuad Efendi <fu...@efendi.ca>.
The Nutch distribution ("ant package") has a /lib/ folder containing the
Lucene jar files... As usual, specific versions of the library files were
tested for production use, so upgrading the libs is not recommended...
-----Original Message-----
From: Michael Ji [mailto:fji_00@yahoo.com]
Sent: Wednesday, August 17, 2005 8:30 PM
To: nutch-dev@lucene.apache.org
Subject: Merge Lucene to Nutch
As I understand, Nutch is a crawling/searching
application based on Lucene;
Just a curious question, when Lucene has a new
version/release, how to merge Lucene to Nutch?
I didn't see any explicit Lucene Java source in the Nutch
source tree, and I don't think Nutch implements the low-level
API independently of Lucene.
Thanks,
Michael Ji
Merge Lucene to Nutch
Posted by Michael Ji <fj...@yahoo.com>.
As I understand, Nutch is a crawling/searching
application based on Lucene;
Just a curious question, when Lucene has a new
version/release, how to merge Lucene to Nutch?
I didn't see any explicit Lucene Java source in the Nutch
source tree, and I don't think Nutch implements the low-level
API independently of Lucene.
Thanks,
Michael Ji
result tuning
Posted by webmaster <we...@www.poundwebhosting.com>.
Where would I change the query so that every search returns only 10
results instead of 100, so it won't cache 10 pages of sub-results? I take
it that it is not the io.sort.factor option!
-Jay
Re: (mapred branch) Job.xml as a directory instead of a file, other issues.
Posted by Doug Cutting <cu...@nutch.org>.
Jeremy Bensley wrote:
> After going through your checklist, I realized that my view on how the
> MapReduce function behaves was slightly flawed, as I did not realize
> that the temporary storage phase between map and reduce had to be in a
> shared location.
The temporary storage between map and reduce is actually not kept in
NDFS, but on the nodes' local disks. But the input (the url file in this
case) must be shared.
> So, my process for running crawl is now:
> 1. Set up / start NDFS name and data nodes
> 2. Copy url file into NDFS
> 3. Set up / start job and task trackers
> 4. run crawl with arguments referencing the NDFS positions of my
> inputs and outputs
That looks right to me.
We really need a mapred & ndfs-based tutorial...
> The only lasting issue I have is that, whenever I attempt to start a
> tasktracker or jobtracker and have the configuration parameters for
> mapred specified only in mapred-default.xml, I get the following
> error:
>
> 050816 164343 parsing
> file:/main/home/jeremy/small_project/nutch-mapred/conf/nutch-default.xml
> 050816 164343 parsing
> file:/main/home/jeremy/small_project/nutch-mapred/conf/nutch-site.xml
> Exception in thread "main" java.lang.RuntimeException: Bad
> mapred.job.tracker: local
> at org.apache.nutch.mapred.JobTracker.getAddress(JobTracker.java:245)
> at org.apache.nutch.mapred.TaskTracker.<init>(TaskTracker.java:72)
> at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:609)
>
> It is as if the mapred-default.xml is not being parsed for its
> options. If I specify the same options in nutch-site.xml it works just
> fine.
The config files are a bit confusing. mapred-default.xml is for stuff
that may be reasonably overridden by applications, while nutch-site.xml
is for stuff that should not be overridden by applications. So the name
of the shared filesystem and of the job tracker should be in
nutch-site.xml, since they should not be overridden. But, e.g., the
default number of map and reduce tasks should be in mapred-default.xml,
since applications do sometimes change these.
The "local" job tracker should only be used in standalone
configurations, when everything runs in the same process. It doesn't
make sense to start a task tracker process configured with a "local" job
tracker. If you want to run them on the same host then you might
configure "localhost:xxxx" as the job tracker.
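A minimal nutch-site.xml along those lines might look like this (the host names and ports are illustrative, not values from this thread):

```xml
<!-- nutch-site.xml: cluster-wide settings that applications
     should not override; host/port values are illustrative. -->
<nutch-conf>
  <property>
    <name>fs.default.name</name>
    <value>namenode-host:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker-host:9001</value>
  </property>
</nutch-conf>
```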
Doug
Re: (mapred branch) Job.xml as a directory instead of a file, other issues.
Posted by Jeremy Bensley <jb...@gmail.com>.
After going through your checklist, I realized that my view on how the
MapReduce function behaves was slightly flawed, as I did not realize
that the temporary storage phase between map and reduce had to be in a
shared location. So, my process for running crawl is now:
1. Set up / start NDFS name and data nodes
2. Copy url file into NDFS
3. Set up / start job and task trackers
4. run crawl with arguments referencing the NDFS positions of my
inputs and outputs
Following these steps I was able to get it to work as expected.
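A pseudocode sketch of those four steps (the exact bin/nutch commands and flags here are my assumptions about the mapred branch, so check bin/nutch's usage output before relying on them):

```shell
# Pseudocode sketch only -- command names are assumptions.
bin/nutch namenode &                 # 1. start the NDFS name node
bin/nutch datanode &                 #    ...and a data node per machine
bin/nutch ndfs -put urls urls       # 2. copy the url directory into NDFS
bin/nutch jobtracker &               # 3. start the job tracker
bin/nutch tasktracker &              #    ...and a task tracker per machine
bin/nutch crawl urls -dir crawl      # 4. crawl with NDFS input/output paths
```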
The only lasting issue I have is that, whenever I attempt to start a
tasktracker or jobtracker and have the configuration parameters for
mapred specified only in mapred-default.xml, I get the following
error:
050816 164343 parsing
file:/main/home/jeremy/small_project/nutch-mapred/conf/nutch-default.xml
050816 164343 parsing
file:/main/home/jeremy/small_project/nutch-mapred/conf/nutch-site.xml
Exception in thread "main" java.lang.RuntimeException: Bad
mapred.job.tracker: local
at org.apache.nutch.mapred.JobTracker.getAddress(JobTracker.java:245)
at org.apache.nutch.mapred.TaskTracker.<init>(TaskTracker.java:72)
at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:609)
It is as if the mapred-default.xml is not being parsed for its
options. If I specify the same options in nutch-site.xml it works just
fine.
I appreciate the help, and look forward to experimenting with the software.
Jeremy
On 8/16/05, Doug Cutting <cu...@nutch.org> wrote:
> Jeremy Bensley wrote:
> > First, I have observed the same behavior as a previous poster from
> > yesterday who, instead of specifying a file for the URLs to be read
> > from, must now specify a directory (full path) in which a file
> > containing the URL list is stored. From the response to that thread I
> > gather that it isn't desired behavior to specify a directory
> > instead of a file.
>
> A directory is required. For consistency, all inputs and outputs are
> now directories of files rather than individual files.
>
> > Second, and more importantly, I am having issues with task trackers. I
> > have three machines running task tracker, and a fourth running the job
> > tracker, and they seem to be talking well. Whenever I try to invoke
> > crawl using the job tracker, however, all of my task trackers
> > continually fail with this:
> >
> > 050816 134532 parsing /tmp/nutch/mapred/local/tracker/task_m_5o5uvx/job.xml
> > [Fatal Error] :-1:-1: Premature end of file.
> > 050816 134532 SEVERE error parsing conf file:
> > org.xml.sax.SAXParseException: Premature end of file.
> > java.lang.RuntimeException: org.xml.sax.SAXParseException: Premature
> > end of file.
> > at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:355)
> > at org.apache.nutch.util.NutchConf.getProps(NutchConf.java:290)
> > at org.apache.nutch.util.NutchConf.get(NutchConf.java:91)
> > at org.apache.nutch.mapred.JobConf.getJar(JobConf.java:80)
> > at org.apache.nutch.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:335)
> > at org.apache.nutch.mapred.TaskTracker$TaskInProgress.<init>(TaskTracker.java:319)
> > at org.apache.nutch.mapred.TaskTracker.offerService(TaskTracker.java:221)
> > at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:269)
> > at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:610)
> > Caused by: org.xml.sax.SAXParseException: Premature end of file.
> > at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
> > at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
> > at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:172)
> > at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:315)
> > ... 8 more
> >
> > When I look at the job.xml specified by this path, it
> > turns out that it is a directory, not a file.
> >
> > drwxrwxr-x 2 jeremy users 4096 Aug 16 13:45 job.xml
>
> I have not seen this before. If you remove everything in /tmp/nutch, is
> this reproducible? Are you using NDFS? If not, how are you sharing
> files between task trackers? Is this on Win32, Linux or what? Are you
> running the latest mapred code? If your troubles continue, please post
> your nutch-site.xml and mapred-default.xml.
>
> Doug
>
Re: (mapred branch) Job.xml as a directory instead of a file, other issues.
Posted by Doug Cutting <cu...@nutch.org>.
Jeremy Bensley wrote:
> First, I have observed the same behavior as a previous poster from
> yesterday who, instead of specifying a file for the URLs to be read
> from, must now specify a directory (full path) in which a file
> containing the URL list is stored. From the response to that thread I
> gather that it isn't desired behavior to specify a directory
> instead of a file.
A directory is required. For consistency, all inputs and outputs are
now directories of files rather than individual files.
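For example, the seed list becomes a directory containing one or more files (the file and directory names below are illustrative):

```shell
# Put the seed list in a directory rather than pointing at a
# single file; the directory is what the crawl takes as input.
# (Names here are illustrative.)
mkdir -p urls
echo "http://lucene.apache.org/nutch/" > urls/seeds.txt
ls urls
```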
> Second, and more importantly, I am having issues with task trackers. I
> have three machines running task tracker, and a fourth running the job
> tracker, and they seem to be talking well. Whenever I try to invoke
> crawl using the job tracker, however, all of my task trackers
> continually fail with this:
>
> 050816 134532 parsing /tmp/nutch/mapred/local/tracker/task_m_5o5uvx/job.xml
> [Fatal Error] :-1:-1: Premature end of file.
> 050816 134532 SEVERE error parsing conf file:
> org.xml.sax.SAXParseException: Premature end of file.
> java.lang.RuntimeException: org.xml.sax.SAXParseException: Premature
> end of file.
> at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:355)
> at org.apache.nutch.util.NutchConf.getProps(NutchConf.java:290)
> at org.apache.nutch.util.NutchConf.get(NutchConf.java:91)
> at org.apache.nutch.mapred.JobConf.getJar(JobConf.java:80)
> at org.apache.nutch.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:335)
> at org.apache.nutch.mapred.TaskTracker$TaskInProgress.<init>(TaskTracker.java:319)
> at org.apache.nutch.mapred.TaskTracker.offerService(TaskTracker.java:221)
> at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:269)
> at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:610)
> Caused by: org.xml.sax.SAXParseException: Premature end of file.
> at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
> at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
> at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:172)
> at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:315)
> ... 8 more
>
> When I look at the job.xml specified by this path, it
> turns out that it is a directory, not a file.
>
> drwxrwxr-x 2 jeremy users 4096 Aug 16 13:45 job.xml
I have not seen this before. If you remove everything in /tmp/nutch, is
this reproducible? Are you using NDFS? If not, how are you sharing
files between task trackers? Is this on Win32, Linux or what? Are you
running the latest mapred code? If your troubles continue, please post
your nutch-site.xml and mapred-default.xml.
Doug