Posted to dev@nutch.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2005/11/19 17:24:16 UTC

Problem with CRC files on NDFS

Hi,

I have a problem with the recently added CRC files when "put"-ting 
stuff to NDFS. NDFS complains that these files already exist - I suspect 
that it creates them on the fly just before they are actually 
transmitted from the NDFSClient - and aborts the transfer. I was able to 
succeed with the -put operation only after I first deleted all .*.crc files.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Problem with CRC files on NDFS

Posted by Doug Cutting <cu...@nutch.org>.
Andrzej Bialecki wrote:
> I have a problem with the recently added CRC files when "put"-ting 
> stuff to NDFS. NDFS complains that these files already exist - I suspect 
> that it creates them on the fly just before they are actually 
> transmitted from the NDFSClient - and aborts the transfer. I was able to 
> succeed with the -put operation only after I first deleted all .*.crc files.

I have not seen this.  Can you tell me more how to cause this problem, 
perhaps providing the transcript of a session?  Are you overwriting 
existing files?

A crc file is created just after a file is opened for output.  It 
overwrites any existing crc file.  See NFSDataOutputStream.java line 44.

There are a few cases where things will complain about non-existent .crc 
files.  This happens, e.g., when putting a file that was not created by 
Nutch tools.

It also notably happens with Lucene indexes.  These are created by 
FSDirectory rather than NDFSDirectory, because NDFS does not permit 
overwrites and Lucene overwrites in one place (TermInfosWriter.java line 
141).  If we modified Lucene to write the term count at EOF-8, then 
Lucene indexes could be written directly through a NutchFileSystem API 
and would be correctly checksummed at creation.  Is this change to 
Lucene justified?

Doug

Re: jobdetails.jsp and jobtracker.jsp

Posted by Andrzej Bialecki <ab...@getopt.org>.
anton@orbita1.ru wrote:

>So they don't need Tomcat? But then, what must we type in the browser
>address bar?

No, they don't - the JobTracker runs an embedded Jetty server.

>http://<host_jobtracker>:<port_jobtracker>/jobtracker/jobtracker.jsp ?

You need to use the hostname of the machine that runs the JobTracker, and 
the port you set for mapred.job.tracker.info.port in your config files.
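As a sketch, assuming the property lives in conf/nutch-site.xml (the config fragment, hostname, and port value below are all illustrative), the status URL can be assembled like this:

```shell
# Build the JobTracker status URL from the configured info port.
# The config fragment and port number here are made up for illustration.
conf=$(mktemp)
cat > "$conf" <<'EOF'
<property>
  <name>mapred.job.tracker.info.port</name>
  <value>50030</value>
</property>
EOF
# Grab the line after the <name> line and strip the <value> tags.
port=$(sed -n '/mapred.job.tracker.info.port/{n;s/.*<value>\(.*\)<\/value>.*/\1/p;}' "$conf")
url="http://jobtracker-host:${port}/jobtracker/jobtracker.jsp"
echo "$url"
rm -f "$conf"
```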

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



RE: mapred.map.tasks

Posted by an...@orbita1.ru.
I tried to launch mapred on 2 machines: 192.168.0.250 and 192.168.0.111.

In nutch-site.xml I specified these parameters:

1) On both machines:
<property>
  <name>fs.default.name</name>
  <value>192.168.0.250:9009</value>
  <description>The name of the default file system.  Either the
  literal string "local" or a host:port for NDFS.</description>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>192.168.0.250:9010</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>
<property>
  <name>mapred.map.tasks</name>
  <value>2</value>
  <description>The default number of map tasks per job.  Typically set
  to a prime several times greater than number of available hosts.
  Ignored when mapred.job.tracker is "local".  
  </description>
</property>

<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>2</value>
  <description>The maximum number of tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>2</value>
  <description>The default number of reduce tasks per job.  Typically set
  to a prime close to the number of available hosts.  Ignored when
  mapred.job.tracker is "local".
  </description>
</property>
 



On 192.168.0.250 I started:
2)       bin/nutch-daemon.sh start datanode
3)       bin/nutch-daemon.sh start namenode
4)       bin/nutch-daemon.sh start jobtracker
5)       bin/nutch-daemon.sh start tasktracker

I created a directory seeds with a file urls in it; urls contained 2 links.
Then I added that directory to NDFS (bin/nutch ndfs -put ./seeds seeds).
The directory was added successfully.

Then I launched the command:
bin/nutch crawl seeds -depth 2

As a result I received this log written by the jobtracker:
....
051123 053118 Adding task 'task_m_z66npx' to set for tracker 'tracker_53845'
051123 053118 Adding task 'task_m_xaynqo' to set for tracker 'tracker_11518'
051123 053130 Task 'task_m_z66npx' has finished successfully.

Log written by the tasktracker on 192.168.0.111:
......
051110 142607 task_m_z66npx 0.0% /user/root/seeds/urls:0+31
051110 142607 task_m_z66npx 1.0% /user/root/seeds/urls:0+31
051110 142607 Task task_m_z66npx is done.

Log written by the tasktracker on 192.168.0.250:
....
051123 053125 task_m_xaynqo 0.12903225% /user/root/seeds/urls:31+31
051123 053126 task_m_xaynqo -683.9677% /user/root/seeds/urls:31+31
051123 053127 task_m_xaynqo -2129.9678% /user/root/seeds/urls:31+31
051123 053128 task_m_xaynqo -3483.0322% /user/root/seeds/urls:31+31
051123 053129 task_m_xaynqo -4976.2256% /user/root/seeds/urls:31+31
051123 053130 task_m_xaynqo -6449.1934% /user/root/seeds/urls:31+31
051123 053131 task_m_xaynqo -7898.258% /user/root/seeds/urls:31+31
051123 053132 task_m_xaynqo -9232.193% /user/root/seeds/urls:31+31
051123 053133 task_m_xaynqo -10694.3545% /user/root/seeds/urls:31+31
051123 053134 task_m_xaynqo -12139.226% /user/root/seeds/urls:31+31
051123 053135 task_m_xaynqo -13416.677% /user/root/seeds/urls:31+31
051123 053136 task_m_xaynqo -14885.741% /user/root/seeds/urls:31+31
... and so on - the log contained records with decreasing (negative)
percentages.

I concluded that this was an attempt to split the inject step across the
2 machines, i.e. there were 2 tasks: 'task_m_z66npx' and 'task_m_xaynqo'.
'task_m_z66npx' finished successfully, while 'task_m_xaynqo' caused
problems (negative progress).

But if I change the parameter mapred.reduce.tasks to 4, all tasks finish
successfully and everything works right.



-----Original Message-----
From: Doug Cutting [mailto:cutting@nutch.org] 
Sent: Tuesday, November 22, 2005 2:10 AM
To: nutch-dev@lucene.apache.org
Subject: Re: mapred.map.tasks

anton@orbita1.ru wrote:
> Why do we need the parameter mapred.map.tasks to be greater than the
> number of available hosts? If we set it equal to the number of hosts, we
> get the "negative progress percentages" problem.

Can you please post a simple example that demonstrates the "negative 
progress" problem?  E.g., the minimal changes to your conf/ directory 
required to illustrate this, how you start your daemons, etc.

Thanks,

Doug



Re: mapred.map.tasks

Posted by Doug Cutting <cu...@nutch.org>.
anton@orbita1.ru wrote:
> Why do we need the parameter mapred.map.tasks to be greater than the
> number of available hosts? If we set it equal to the number of hosts, we
> get the "negative progress percentages" problem.

Can you please post a simple example that demonstrates the "negative 
progress" problem?  E.g., the minimal changes to your conf/ directory 
required to illustrate this, how you start your daemons, etc.

Thanks,

Doug

mapred.map.tasks

Posted by an...@orbita1.ru.
Why do we need the parameter mapred.map.tasks to be greater than the
number of available hosts? If we set it equal to the number of hosts, we
get the "negative progress percentages" problem.



RE: jobdetails.jsp and jobtracker.jsp

Posted by an...@orbita1.ru.
So they don't need Tomcat? But then, what must we type in the browser
address bar?

http://<host_jobtracker>:<port_jobtracker>/jobtracker/jobtracker.jsp ?


-----Original Message-----
From: Andrzej Bialecki [mailto:ab@getopt.org] 
Sent: Monday, November 21, 2005 12:46 PM
To: nutch-dev@lucene.apache.org
Subject: Re: jobdetails.jsp and jobtracker.jsp

anton@orbita1.ru wrote:

>How do I use jobtracker.jsp and jobdetails.jsp?
>Do they need Tomcat?

No, but jobdetails.jsp requires a parameter (job_id) - start with 
jobtracker.jsp, and then follow the links.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





Re: jobdetails.jsp and jobtracker.jsp

Posted by Andrzej Bialecki <ab...@getopt.org>.
anton@orbita1.ru wrote:

>Why do we need the parameter mapred.map.tasks to be greater than the
>number of available hosts? If we set it equal to the number of hosts, we
>get the "negative progress percentages" problem.

Because the whole point of MapReduce tasktrackers is that they are able 
to run more than 1 task simultaneously on a single host.
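The guidance in the config description quoted earlier ("a prime several times greater than number of available hosts") can be followed mechanically. Here is a sketch that picks the first prime at or above three times the host count; the multiplier 3 and the host count are arbitrary illustrations, not values from this thread:

```shell
# Pick the first prime >= 3 * number-of-hosts as a mapred.map.tasks value.
hosts=2                               # illustrative cluster size
is_prime() {
  local n=$1 k=2
  [ "$n" -lt 2 ] && return 1
  while [ $((k * k)) -le "$n" ]; do
    [ $((n % k)) -eq 0 ] && return 1  # found a divisor: composite
    k=$((k + 1))
  done
  return 0
}
n=$((hosts * 3))
while ! is_prime "$n"; do n=$((n + 1)); done
echo "$n"                             # with hosts=2 this prints 7
```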

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



RE: jobdetails.jsp and jobtracker.jsp

Posted by an...@orbita1.ru.
Why do we need the parameter mapred.map.tasks to be greater than the
number of available hosts? If we set it equal to the number of hosts, we
get the "negative progress percentages" problem.



Re: jobdetails.jsp and jobtracker.jsp

Posted by Andrzej Bialecki <ab...@getopt.org>.
anton@orbita1.ru wrote:

>How do I use jobtracker.jsp and jobdetails.jsp?
>Do they need Tomcat?

No, but jobdetails.jsp requires a parameter (job_id) - start with 
jobtracker.jsp, and then follow the links.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



jobdetails.jsp and jobtracker.jsp

Posted by an...@orbita1.ru.
How do I use jobtracker.jsp and jobdetails.jsp?
Do they need Tomcat?

When I try to start jobdetails.jsp with Tomcat, it returns this error:
java.lang.NullPointerException
        at org.apache.jsp.m.jobdetails_jsp._jspService(org.apache.jsp.m.jobdetails_jsp:53)
        at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:97)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
        at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:322)
        at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:291)
        at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:241)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:856)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:744)
        at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:527)
        at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:80)
        at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
        at java.lang.Thread.run(Thread.java:595)