Posted to dev@nutch.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2005/11/19 17:24:16 UTC
Problem with CRC files on NDFS
Hi,
I have a problem with the recently added CRC files, when "put"-ting
stuff to NDFS. NDFS complains that these files already exist - I suspect
that it creates them on the fly just before they are actually
transmitted from the NDFSClient - and aborts the transfer. I was able to
succeed in -put operation only if I first deleted all .*.crc files.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: Problem with CRC files on NDFS
Posted by Doug Cutting <cu...@nutch.org>.
Andrzej Bialecki wrote:
> I have a problem with the recently added CRC files, when "put"-ting
> stuff to NDFS. NDFS complains that these files already exist - I suspect
> that it creates them on the fly just before they are actually
> transmitted from the NDFSClient - and aborts the transfer. I was able to
> succeed in -put operation only if I first deleted all .*.crc files.
I have not seen this. Can you tell me more how to cause this problem,
perhaps providing the transcript of a session? Are you overwriting
existing files?
A crc file is created just after a file is opened for output. It
overwrites any existing crc file. See NFSDataOutputStream.java line 44.
There are a few cases where things will complain about non-existent .crc
files. This happens, e.g., when putting a file that was not created by
Nutch tools.
It also notably happens with Lucene indexes, since these are created by
FSDirectory rather than NDFSDirectory: NDFS does not permit overwrites,
and Lucene overwrites in one place (TermInfosWriter.java line 141). If
we modify Lucene to write the term count at EOF-8 then Lucene indexes
can be written directly through a NutchFileSystem API and will be
correctly checksummed at creation. Is this change to Lucene justified?
Doug
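[Editor's note] Doug's description above, a hidden .name.crc companion file written when the data file is created and checked on read, can be sketched in isolation. This is a minimal illustration of the idea only, not the NDFS code: the class name, the textual checksum format, and the use of CRC32 over the whole file are assumptions for the example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;

// Minimal sketch of the companion-checksum idea: next to each data file
// "name", keep a hidden ".name.crc" file holding a checksum of the
// contents, written when the file is created and compared on read.
public class CrcSketch {

    // Compute a CRC32 checksum over a byte array.
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    // Write the data file, then (over)write its companion .crc file,
    // mirroring the "overwrites any existing crc file" behavior above.
    static void writeWithCrc(Path dir, String name, byte[] data) throws IOException {
        Files.write(dir.resolve(name), data);
        byte[] crcBytes = Long.toString(checksum(data)).getBytes(StandardCharsets.US_ASCII);
        Files.write(dir.resolve("." + name + ".crc"), crcBytes);
    }

    // Read the data file back and verify it against the stored checksum;
    // a file created outside this scheme has no .crc and fails here,
    // which is the "file not created by Nutch tools" case in the thread.
    static byte[] readWithCrc(Path dir, String name) throws IOException {
        byte[] data = Files.readAllBytes(dir.resolve(name));
        String stored = new String(
                Files.readAllBytes(dir.resolve("." + name + ".crc")), StandardCharsets.US_ASCII);
        if (Long.parseLong(stored) != checksum(data)) {
            throw new IOException("checksum mismatch for " + name);
        }
        return data;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("crc-sketch");
        writeWithCrc(dir, "urls", "http://example.com/\n".getBytes(StandardCharsets.US_ASCII));
        System.out.println(readWithCrc(dir, "urls").length); // prints 20
    }
}
```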
Re: jobdetails.jsp and jobtracker.jsp
Posted by Andrzej Bialecki <ab...@getopt.org>.
anton@orbita1.ru wrote:
>They don't need Tomcat? But then, what must we type as the browser address?
No, they don't - Jobtracker runs an embedded Jetty.
>http://<host_jobtracker>:<port_jobtracker>/jobtracker/jobtracker.jsp ?
You need to use the hostname of the machine that runs the JobTracker, and
the port you set for mapred.job.tracker.info.port in your config files.
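[Editor's note] Andrzej's answer can be made concrete with a nutch-site.xml fragment in the same style as the configuration shown later in this thread. The property name comes from his message; the port value 50030 is only an example, not a documented default:

```xml
<property>
  <name>mapred.job.tracker.info.port</name>
  <value>50030</value>
  <description>Example port for the JobTracker's embedded web
  server; the UI would then be reachable at
  http://<host_jobtracker>:50030/jobtracker/jobtracker.jsp
  </description>
</property>
```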
RE: mapred.map.tasks
Posted by an...@orbita1.ru.
I tried to launch mapred on 2 machines: 192.168.0.250 and 192.168.0.111.
In nutch-site.xml I specified parameters:
1) On both machines:
<property>
<name>fs.default.name</name>
<value>192.168.0.250:9009</value>
<description>The name of the default file system. Either the
literal string "local" or a host:port for NDFS.</description>
</property>
<property>
<name>mapred.job.tracker</name>
<value>192.168.0.250:9010</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
<property>
<name>mapred.map.tasks</name>
<value>2</value>
<description>The default number of map tasks per job. Typically set
to a prime several times greater than number of available hosts.
Ignored when mapred.job.tracker is "local".
</description>
</property>
<property>
<name>mapred.tasktracker.tasks.maximum</name>
<value>2</value>
<description>The maximum number of tasks that will be run
simultaneously by a task tracker.
</description>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>2</value>
<description>The default number of reduce tasks per job. Typically set
to a prime close to the number of available hosts. Ignored when
mapred.job.tracker is "local".
</description>
</property>
On 192.168.0.250 I started:
2) bin/nutch-daemon.sh start datanode
3) bin/nutch-daemon.sh start namenode
4) bin/nutch-daemon.sh start jobtracker
5) bin/nutch-daemon.sh start tasktracker
I created a directory 'seeds' containing a file 'urls'; the urls file
contained 2 links.
Then I added that directory to NDFS (bin/nutch ndfs -put ./seeds seeds).
The directory was added successfully.
Then I launched the command:
bin/nutch crawl seeds -depth 2
As a result I received this log written by the jobtracker:
....
051123 053118 Adding task 'task_m_z66npx' to set for tracker 'tracker_53845'
051123 053118 Adding task 'task_m_xaynqo' to set for tracker 'tracker_11518'
051123 053130 Task 'task_m_z66npx' has finished successfully.
Log written by tasktracker on 192.168.0.111:
......
051110 142607 task_m_z66npx 0.0% /user/root/seeds/urls:0+31
051110 142607 task_m_z66npx 1.0% /user/root/seeds/urls:0+31
051110 142607 Task task_m_z66npx is done.
Log written by tasktracker on 192.168.0.250:
....
051123 053125 task_m_xaynqo 0.12903225% /user/root/seeds/urls:31+31
051123 053126 task_m_xaynqo -683.9677% /user/root/seeds/urls:31+31
051123 053127 task_m_xaynqo -2129.9678% /user/root/seeds/urls:31+31
051123 053128 task_m_xaynqo -3483.0322% /user/root/seeds/urls:31+31
051123 053129 task_m_xaynqo -4976.2256% /user/root/seeds/urls:31+31
051123 053130 task_m_xaynqo -6449.1934% /user/root/seeds/urls:31+31
051123 053131 task_m_xaynqo -7898.258% /user/root/seeds/urls:31+31
051123 053132 task_m_xaynqo -9232.193% /user/root/seeds/urls:31+31
051123 053133 task_m_xaynqo -10694.3545% /user/root/seeds/urls:31+31
051123 053134 task_m_xaynqo -12139.226% /user/root/seeds/urls:31+31
051123 053135 task_m_xaynqo -13416.677% /user/root/seeds/urls:31+31
051123 053136 task_m_xaynqo -14885.741% /user/root/seeds/urls:31+31
... and so on, i.e. this log contained records with decreasing percentages.
I concluded that this was an attempt to split the inject step across the 2
machines, i.e. there were 2 tasks: 'task_m_z66npx' and 'task_m_xaynqo'.
'task_m_z66npx' finished successfully, while 'task_m_xaynqo' ran into
problems (negative progress).
But if I change the parameter mapred.reduce.tasks to 4, all tasks finish
successfully and everything works correctly.
-----Original Message-----
From: Doug Cutting [mailto:cutting@nutch.org]
Sent: Tuesday, November 22, 2005 2:10 AM
To: nutch-dev@lucene.apache.org
Subject: Re: mapred.map.tasks
anton@orbita1.ru wrote:
> Why do we need the parameter mapred.map.tasks to be greater than the
> number of available hosts? If we set it equal to the number of hosts, we
> get the "negative progress percentages" problem.
Can you please post a simple example that demonstrates the "negative
progress" problem? E.g., the minimal changes to your conf/ directory
required to illustrate this, how you start your daemons, etc.
Thanks,
Doug
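[Editor's note] The log format "/user/root/seeds/urls:31+31" suggests each map task reads a byte range start+length of the input file, and progress is reported as the fraction of that range consumed. The sketch below shows that arithmetic and one way (purely an assumption, not a diagnosis of the actual Nutch bug) such percentages can leave the [0,1] range; the class and method names are hypothetical.

```java
// Sketch of split-relative progress reporting for a map task that
// reads the byte range [start, start+length) of an input file.
public class ProgressSketch {

    // Progress is the fraction of the task's own range consumed:
    // position measured relative to the split's start, over its length.
    static float progress(long pos, long start, long length) {
        return (pos - start) / (float) length;
    }

    public static void main(String[] args) {
        long start = 31, length = 31;  // the split "urls:31+31" from the log
        System.out.println(progress(31, start, length));  // 0.0 (just started)
        System.out.println(progress(62, start, length));  // 1.0 (done)
        // If the position counter ever falls below 'start' (e.g. it is
        // reset, or tracks a different stream than the split expects),
        // the reported progress goes negative, as in the log above.
        System.out.println(progress(0, start, length));   // -1.0
    }
}
```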
Re: mapred.map.tasks
Posted by Doug Cutting <cu...@nutch.org>.
anton@orbita1.ru wrote:
> Why do we need the parameter mapred.map.tasks to be greater than the
> number of available hosts? If we set it equal to the number of hosts, we
> get the "negative progress percentages" problem.
Can you please post a simple example that demonstrates the "negative
progress" problem? E.g., the minimal changes to your conf/ directory
required to illustrate this, how you start your daemons, etc.
Thanks,
Doug
mapred.map.tasks
Posted by an...@orbita1.ru.
Why do we need the parameter mapred.map.tasks to be greater than the
number of available hosts? If we set it equal to the number of hosts, we
get the "negative progress percentages" problem.
RE: jobdetails.jsp and jobtracker.jsp
Posted by an...@orbita1.ru.
They don't need Tomcat? But then, what must we type as the browser address?
http://<host_jobtracker>:<port_jobtracker>/jobtracker/jobtracker.jsp ?
-----Original Message-----
From: Andrzej Bialecki [mailto:ab@getopt.org]
Sent: Monday, November 21, 2005 12:46 PM
To: nutch-dev@lucene.apache.org
Subject: Re: jobdetails.jsp and jobtracker.jsp
anton@orbita1.ru wrote:
>How do I use jobtracker.jsp and jobdetails.jsp?
>Do they need Tomcat?
No, but jobdetails.jsp requires a parameter (job_id) - start with
jobtracker.jsp, and then follow the links.
Re: jobdetails.jsp and jobtracker.jsp
Posted by Andrzej Bialecki <ab...@getopt.org>.
anton@orbita1.ru wrote:
>Why do we need the parameter mapred.map.tasks to be greater than the
>number of available hosts? If we set it equal to the number of hosts, we
>get the "negative progress percentages" problem.
Because the whole point of MapReduce tasktrackers is that they are able
to run more than 1 task simultaneously on a single host.
RE: jobdetails.jsp and jobtracker.jsp
Posted by an...@orbita1.ru.
Why do we need the parameter mapred.map.tasks to be greater than the
number of available hosts? If we set it equal to the number of hosts, we
get the "negative progress percentages" problem.
Re: jobdetails.jsp and jobtracker.jsp
Posted by Andrzej Bialecki <ab...@getopt.org>.
anton@orbita1.ru wrote:
>How do I use jobtracker.jsp and jobdetails.jsp?
>Do they need Tomcat?
No, but jobdetails.jsp requires a parameter (job_id) - start with
jobtracker.jsp, and then follow the links.
jobdetails.jsp and jobtracker.jsp
Posted by an...@orbita1.ru.
How do I use jobtracker.jsp and jobdetails.jsp?
Do they need Tomcat?
When I try to start jobdetails.jsp with Tomcat, it returns this error:
java.lang.NullPointerException
        at org.apache.jsp.m.jobdetails_jsp._jspService(org.apache.jsp.m.jobdetails_jsp:53)
        at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:97)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
        at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:322)
        at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:291)
        at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:241)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:856)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:744)
        at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:527)
        at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:80)
        at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
        at java.lang.Thread.run(Thread.java:595)