Posted to user@nutch.apache.org by Casey McTaggart <ca...@gmail.com> on 2012/09/16 01:22:23 UTC

problem running Nutch 1.5.1 in distributed mode- simple crawl

Hi everyone,

I'm using Hadoop as installed by Cloudera (CDH4)... I think it's version
1.0.1. I can run a local filesystem crawl with Nutch, and it returns what
I'd expect. However, I need to take advantage of the mapreduce
functionality, since I want to crawl a local filesystem with many GB of
files. I'm going to put all of these files on an apache server so they can
be crawled. First, though, I want to just crawl a simple website, and I
can't make it work.

My urls/seed.txt is on HDFS and contains just this line:
http://lucene.apache.org

I run this command:
sudo -u hdfs hadoop jar build/apache-nutch-1.5.1.job
org.apache.nutch.crawl.Crawl urls/seed.txt -dir crawl
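
As a quick sanity check (paths assumed from the defaults above, since the job is
submitted as the hdfs user), something like this confirms the seed list really is
where the Injector will look for it:

sudo -u hdfs hadoop fs -ls urls/seed.txt     # resolves to /user/hdfs/urls/seed.txt
sudo -u hdfs hadoop fs -cat urls/seed.txt    # should print http://lucene.apache.org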

Sometimes, it fetches the URL, but does not go beyond depth 1... and when I
examine the CrawlDatum that's in
/user/hdfs/crawl/crawldb/current/part-00000/data, it has one entry: the
seed url as the key, and the value of the CrawlDatum is
_pst_=exception(16), lastModified=0: java.lang.NoClassDefFoundError:
org/apache/tika/mime/MimeTypeException

Okay, so I tried running the command again with -libjars nutch1.5.1.jar,
and it fails with an ArrayIndexOutOfBoundsException. I tried running it
with -libjars /user/hdfs/lib/tika-core-1.1.jar, and that fails with:

12/09/15 17:09:55 WARN crawl.Generator: Generator: 0 records selected for
fetching, exiting ...
12/09/15 17:09:55 INFO crawl.Crawl: Stopping at depth=0 - no more URLs to
fetch.
12/09/15 17:09:55 WARN crawl.Crawl: No URLs to fetch - check your seed list
and URL filters.
12/09/15 17:09:55 INFO crawl.Crawl: crawl finished: crawl

I tried copying lib/tika-core-1.1.jar to /usr/local/hadoop-1.0.1/lib, and
still 0 URLs are fetched.

I'm totally at a loss. Can someone help?

Here's my regex-urlfilter:

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept anything else
+.
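
A quick way to see whether these rules (rather than a missing class) are rejecting
the seed is to pipe it through the URL filter checker; the class and flag names
below are assumed from the Nutch 1.x sources, run against the local runtime:

echo "http://lucene.apache.org" | runtime/local/bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
# a leading '+' in the output means the URL passes all active filters, '-' means it is filtered out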


here's my nutch-site.xml:

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>nutchtest</value>
  </property>
  <property>
    <name>plugin.folders</name>

<value>/projects/nutch/apache-nutch-1.5.1/build/plugins,/projects/nutch/apache-nutch-1.5.1/lib</value>
  </property>
</configuration>


which also does not work if I include this part:

<property>
    <name>plugin.includes</name>

<value>protocol-http|urlnormalizer-(basic|pass|regex)|urlfilter-regex|parse-(xml|text|html|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|addhdfskey</value>
  </property>
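
One detail worth double-checking (an assumption about the failure, not something
the errors state directly): plugin.folders is resolved on whichever machine runs
the task, so in distributed mode these absolute /projects/nutch/... paths must
exist on every worker node, or the value must name a directory that resolves
inside the unpacked job. A quick hypothetical check on a worker:

ls /projects/nutch/apache-nutch-1.5.1/build/plugins | head   # must succeed on every task node, not only the submitting host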

Re: problem running Nutch 1.5.1 in distributed mode- simple crawl

Posted by jiuling <ji...@gmail.com>.
Thanks a lot, Walter. It works following your advice. Thank you again.




Re: problem running Nutch 1.5.1 in distributed mode- simple crawl

Posted by Walter Tietze <ti...@neofonie.de>.
Hi Jiuling,


It should suffice to recompile! You don't have to unpack your job.



I start the job with the command

'runtime/deploy/bin/nutch crawl your_seeds_dir -depth 1'

which does nothing other than call

'hadoop jar apache-nutch-1.5.1.job ....'!

That should suffice.


For the job to be able to access its plugins, the parameter

<property>
  <name>plugin.folders</name>
  <!-- value>plugins</value -->
  <value>classes/plugins</value>
</property>

may have to be adjusted as in the example above.


Please check the structure of the plugins directory
in your job.
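
For example (job file name assumed to be the one built under runtime/deploy),
listing the job archive shows which prefix the plugin descriptors actually live
under:

jar tf runtime/deploy/apache-nutch-1.5.1.job | grep plugin.xml | head
# whatever prefix the plugin.xml entries carry (e.g. classes/plugins/...) is what plugin.folders needs to name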



I made one further modification, which came from the need to set Hadoop
parameters for the jobs.


I modified class ./src/java/org/apache/nutch/util/NutchJob.java to



public class NutchJob extends JobConf {
  public NutchJob(Configuration conf) {
    super(conf, NutchJob.class);
    checkMyOpts();
  }

  /** Copies space-separated key=value pairs from the MY_CRAWLER_OPTS
   *  environment variable into this job's configuration. */
  public void checkMyOpts() {
    Map<String, String> env = System.getenv();
    String myOpts = env.get("MY_CRAWLER_OPTS");
    if (null != myOpts) {
      String[] myOptsArray = myOpts.split(" ");
      for (int i = 0; i < myOptsArray.length; i++) {
        // each token is expected to look like key=value, e.g. mapreduce.job.maps=21
        String[] keyval = myOptsArray[i].split("=");
        if (null != keyval && keyval.length == 2) {
          set(keyval[0], keyval[1]);
        }
        // tokens whose value itself contains '=' are silently skipped here
      }
    }
  }
}


so that Hadoop parameters for the jobs can be set from the command line,
because I had problems with the default settings for the Hadoop child
processes.


If you add the code above, you can set an environment variable such as


export MY_CRAWLER_OPTS="mapreduce.map.java.opts=-Xmx4096m mapreduce.reduce.java.opts=-Xmx4096m mapreduce.map.memory.mb=4096 mapreduce.reduce.memory.mb=4096 mapreduce.job.maps=21 mapreduce.job.reduces=21"


which sets the -Xmx Java parameter of the YarnChild processes to 4 GB
and requests 21 maps and 21 reduces for the crawl.

These variables become important when, for example, you want to generate
a Nutch webgraph and the Hadoop default settings are chosen for
'normally' sized jobs.
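
Put together, a run with these settings would then look like this (seed
directory name assumed):

# with MY_CRAWLER_OPTS exported as above in the same shell, launch the crawl from the deploy runtime
runtime/deploy/bin/nutch crawl urls -dir crawl -depth 1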


Please note that if Hadoop unpacks the job, the container must have at
least enough space for the unpacked files and enough memory to load
the jars into the JVMs of the child processes.


Hope this helps!




Cheers, Walter




-- 

--------------------------------
Walter Tietze
Senior Softwareengineer
Research

Neofonie GmbH
Robert-Koch-Platz 4
10115 Berlin

T +49.30 24627 318
F +49.30 24627 120

Walter.Tietze@neofonie.de
http://www.neofonie.de

Handelsregister
Berlin-Charlottenburg: HRB 67460

Geschäftsführung:
Thomas Kitlitschko
--------------------------------


Re: problem running Nutch 1.5.1 in distributed mode- simple crawl

Posted by jiuling <ji...@gmail.com>.
Dear Walter:

    I am sorry, but I need more of your help.

    I have updated the corresponding Java file and recompiled. As a first step,
I did not unpack the job and directly executed hadoop jar *.job ...; it still
does not work.
    Finally, I unpacked the job, but I don't know what command to run next.
Can you give me more information about "Something one can do is to unpack the
job on the NodeManager manually and to load the classes from within the code
into the current classloader."?

    Thank you very much.




Re: problem running Nutch 1.5.1 in distributed mode- simple crawl

Posted by Casey McTaggart <ca...@gmail.com>.
Including /plugins/classes in plugin.folders made it work. Thank you!!!


Re: problem running Nutch 1.5.1 in distributed mode- simple crawl

Posted by Walter Tietze <ti...@neofonie.de>.
On 18.09.2012 18:46, Casey McTaggart wrote:
> thanks Walter, I still am unable to get anything to run- I think it's
> because Hadoop is for some reason not finding the tika jar. I tried
> running Hadoop with -libjars and including both the Nutch jar and the
> Tika jar, and when I do this it gives me 0 URLs - it doesn't even fetch
> the seed list! When I don't run it with -libjars, it fetches the seed
> list, then stops with the ClassNotFound exception in the CrawlDatum.
> 
> I'll try your solution that you just posted. But, any idea why this is
> happening?
> thanks!
> Casey
> 

Hi Casey,



sorry, but I think the changes I mentioned really were all the changes I made.

I'll check my code again to see whether I forgot to post something.


Remark: I also tried to port the workaround to the nutch-2.0 code base,
but was unable to make it work, because nutch-2.0 already uses the new
MapReduce classes and does not seem to implement the same loading
mechanism for the plugin repository.



Any other ideas?



Cheers, Walter




Re: problem running Nutch 1.5.1 in distributed mode- simple crawl

Posted by Casey McTaggart <ca...@gmail.com>.
Thanks Walter, I am still unable to get anything to run - I think it's
because Hadoop is for some reason not finding the Tika jar. I tried running
Hadoop with -libjars and including both the Nutch jar and the Tika jar, and
when I do this it gives me 0 URLs - it doesn't even fetch the seed list!
When I don't run it with -libjars, it fetches the seed list, then stops
with the ClassNotFound exception in the CrawlDatum.

I'll try your solution that you just posted. But, any idea why this is
happening?
thanks!
Casey


Re: problem running Nutch 1.5.1 in distributed mode- simple crawl

Posted by Walter Tietze <ti...@neofonie.de>.

Hi,

I had the same problems and couldn't find a properly satisfying way
around them.

I also tried nutch-2.0 with CDH4 and YARN / MR_v2 (without MR_v1) and
couldn't simply make it work.


But I found a workaround to make nutch 1.5.1 work on CDH4.


Since MR_v2 it is no longer possible to pack a whole project as a *nutch*.job,
and since the responsibilities of the former JobTracker and TaskTracker are
now split between the ResourceManager and the NodeManager, the NodeManager
seems not to be able to handle the packed Nutch project.

(see also:
http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/
)


Something one can do is to unpack the job on the NodeManager manually
and to load the classes from within the code into the current
classloader.

I modified org/apache/nutch/plugin/PluginManifestParser.java slightly
and everything works fine, at least for the moment.


I attached the modified file.
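
For reference, the manual-unpack part on its own can be sketched roughly like
this (paths hypothetical; the classloader change itself is in the attached
parser):

mkdir -p /tmp/nutch-job-unpacked && cd /tmp/nutch-job-unpacked
jar xf /path/to/apache-nutch-1.5.1.job    # a .job file is an ordinary jar/zip archive
find . -name plugin.xml | head            # locate the plugin descriptors in the unpacked tree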


Please note that I don't yet have experience with whether CDH4 removes the
application directories and the unpacked files properly. You should
consider checking whether those directories are still needed after the
crawl has succeeded.



Hope this helps, cheers, Walter






Re: problem running Nutch 1.5.1 in distributed mode- simple crawl

Posted by Casey McTaggart <ca...@gmail.com>.
I would also like to add that I can run the same crawl locally and it's
successful. So it's just the distributed mode that's not working. Can
anyone offer any advice? Do you think it might be something with CDH4?


Re: problem running Nutch 1.5.1 in distributed mode- simple crawl

Posted by jiuling <ji...@gmail.com>.
Dear Lewis:
    I have met the same problem. I compiled in the same way as you, but it
still causes the problem. The configuration of seeds and filters does work for
a local crawl, but fails in deploy mode. Please help me, thank you a lot.

    The procedure is as follows:
[Jiuling@crawler-3 deploy]$ bin/nutch crawl urls -dir crawls -depth 20
(I have also executed it with "bin/hadoop jar apache-nutch-1.6-SNAPSHOT.job
org.apache.nutch.crawl.Crawl urls -dir crawls -depth 20")
Warning: $HADOOP_HOME is deprecated.

12/09/16 18:40:16 WARN crawl.Crawl: solrUrl is not set, indexing will be
skipped...
12/09/16 18:40:16 INFO crawl.Crawl: crawl started in: crawls
12/09/16 18:40:16 INFO crawl.Crawl: rootUrlDir = urls
12/09/16 18:40:16 INFO crawl.Crawl: threads = 10
12/09/16 18:40:16 INFO crawl.Crawl: depth = 20
12/09/16 18:40:16 INFO crawl.Crawl: solrUrl=null
12/09/16 18:40:16 INFO crawl.Injector: Injector: starting at 2012-09-16
18:40:16
12/09/16 18:40:16 INFO crawl.Injector: Injector: crawlDb: crawls/crawldb
12/09/16 18:40:16 INFO crawl.Injector: Injector: urlDir: urls
12/09/16 18:40:16 INFO crawl.Injector: Injector: Converting injected urls to
crawl db entries.
12/09/16 18:40:23 INFO util.NativeCodeLoader: Loaded the native-hadoop
library
12/09/16 18:40:23 WARN snappy.LoadSnappy: Snappy native library not loaded
12/09/16 18:40:23 INFO mapred.FileInputFormat: Total input paths to process
: 1
12/09/16 18:40:23 INFO mapred.JobClient: Running job: job_201209161612_0047
12/09/16 18:40:24 INFO mapred.JobClient:  map 0% reduce 0%
12/09/16 18:40:39 INFO mapred.JobClient:  map 100% reduce 0%
12/09/16 18:40:51 INFO mapred.JobClient:  map 100% reduce 50%
12/09/16 18:40:54 INFO mapred.JobClient:  map 100% reduce 100%
12/09/16 18:40:59 INFO mapred.JobClient: Job complete: job_201209161612_0047
12/09/16 18:40:59 INFO mapred.JobClient: Counters: 30
12/09/16 18:40:59 INFO mapred.JobClient:   Job Counters 
12/09/16 18:40:59 INFO mapred.JobClient:     Launched reduce tasks=2
12/09/16 18:40:59 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=16534
12/09/16 18:40:59 INFO mapred.JobClient:     Total time spent by all reduces
waiting after reserving slots (ms)=0
12/09/16 18:40:59 INFO mapred.JobClient:     Total time spent by all maps
waiting after reserving slots (ms)=0
12/09/16 18:40:59 INFO mapred.JobClient:     Launched map tasks=2
12/09/16 18:40:59 INFO mapred.JobClient:     Data-local map tasks=2
12/09/16 18:40:59 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=20086
12/09/16 18:40:59 INFO mapred.JobClient:   File Input Format Counters 
12/09/16 18:40:59 INFO mapred.JobClient:     Bytes Read=321
12/09/16 18:40:59 INFO mapred.JobClient:   File Output Format Counters 
12/09/16 18:40:59 INFO mapred.JobClient:     Bytes Written=716
12/09/16 18:40:59 INFO mapred.JobClient:   FileSystemCounters
12/09/16 18:40:59 INFO mapred.JobClient:     FILE_BYTES_READ=502
12/09/16 18:40:59 INFO mapred.JobClient:     HDFS_BYTES_READ=517
12/09/16 18:40:59 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=132358
12/09/16 18:40:59 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=716
12/09/16 18:40:59 INFO mapred.JobClient:   Map-Reduce Framework
12/09/16 18:40:59 INFO mapred.JobClient:     Map output materialized
bytes=514
12/09/16 18:40:59 INFO mapred.JobClient:     Map input records=11
12/09/16 18:40:59 INFO mapred.JobClient:     Reduce shuffle bytes=231
12/09/16 18:40:59 INFO mapred.JobClient:     Spilled Records=18
12/09/16 18:40:59 INFO mapred.JobClient:     Map output bytes=472
12/09/16 18:40:59 INFO mapred.JobClient:     Total committed heap usage
(bytes)=358285312
12/09/16 18:40:59 INFO mapred.JobClient:     CPU time spent (ms)=3070
12/09/16 18:40:59 INFO mapred.JobClient:     Map input bytes=213
12/09/16 18:40:59 INFO mapred.JobClient:     SPLIT_RAW_BYTES=196
12/09/16 18:40:59 INFO mapred.JobClient:     Combine input records=0
12/09/16 18:40:59 INFO mapred.JobClient:     Reduce input records=9
12/09/16 18:40:59 INFO mapred.JobClient:     Reduce input groups=9
12/09/16 18:40:59 INFO mapred.JobClient:     Combine output records=0
12/09/16 18:40:59 INFO mapred.JobClient:     Physical memory (bytes)
snapshot=580689920
12/09/16 18:40:59 INFO mapred.JobClient:     Reduce output records=9
12/09/16 18:40:59 INFO mapred.JobClient:     Virtual memory (bytes)
snapshot=8829870080
12/09/16 18:40:59 INFO mapred.JobClient:     Map output records=9
12/09/16 18:40:59 INFO crawl.Injector: Injector: Merging injected urls into
crawl db.
12/09/16 18:41:05 INFO mapred.FileInputFormat: Total input paths to process
: 4
12/09/16 18:41:06 INFO mapred.JobClient: Running job: job_201209161612_0048
12/09/16 18:41:07 INFO mapred.JobClient:  map 0% reduce 0%
12/09/16 18:41:22 INFO mapred.JobClient:  map 50% reduce 0%
12/09/16 18:41:28 INFO mapred.JobClient:  map 100% reduce 0%
12/09/16 18:41:31 INFO mapred.JobClient:  map 100% reduce 8%
12/09/16 18:41:37 INFO mapred.JobClient:  map 100% reduce 58%
12/09/16 18:41:40 INFO mapred.JobClient:  map 100% reduce 100%
12/09/16 18:41:45 INFO mapred.JobClient: Job complete: job_201209161612_0048
12/09/16 18:41:45 INFO mapred.JobClient: Counters: 30
12/09/16 18:41:45 INFO mapred.JobClient:   Job Counters 
12/09/16 18:41:45 INFO mapred.JobClient:     Launched reduce tasks=2
12/09/16 18:41:45 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=26468
12/09/16 18:41:45 INFO mapred.JobClient:     Total time spent by all reduces
waiting after reserving slots (ms)=0
12/09/16 18:41:45 INFO mapred.JobClient:     Total time spent by all maps
waiting after reserving slots (ms)=0
12/09/16 18:41:45 INFO mapred.JobClient:     Launched map tasks=4
12/09/16 18:41:45 INFO mapred.JobClient:     Data-local map tasks=4
12/09/16 18:41:45 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=26867
12/09/16 18:41:45 INFO mapred.JobClient:   File Input Format Counters 
12/09/16 18:41:45 INFO mapred.JobClient:     Bytes Read=51222
12/09/16 18:41:45 INFO mapred.JobClient:   File Output Format Counters 
12/09/16 18:41:45 INFO mapred.JobClient:     Bytes Written=51056
12/09/16 18:41:45 INFO mapred.JobClient:   FileSystemCounters
12/09/16 18:41:45 INFO mapred.JobClient:     FILE_BYTES_READ=46201
12/09/16 18:41:45 INFO mapred.JobClient:     HDFS_BYTES_READ=51754
12/09/16 18:41:45 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=290892
12/09/16 18:41:45 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=51056
12/09/16 18:41:45 INFO mapred.JobClient:   Map-Reduce Framework
12/09/16 18:41:45 INFO mapred.JobClient:     Map output materialized
bytes=46237
12/09/16 18:41:45 INFO mapred.JobClient:     Map input records=703
12/09/16 18:41:45 INFO mapred.JobClient:     Reduce shuffle bytes=46010
12/09/16 18:41:45 INFO mapred.JobClient:     Spilled Records=1406
12/09/16 18:41:45 INFO mapred.JobClient:     Map output bytes=44774
12/09/16 18:41:45 INFO mapred.JobClient:     Total committed heap usage
(bytes)=599851008
12/09/16 18:41:45 INFO mapred.JobClient:     CPU time spent (ms)=2690
12/09/16 18:41:45 INFO mapred.JobClient:     Map input bytes=50878
12/09/16 18:41:45 INFO mapred.JobClient:     SPLIT_RAW_BYTES=532
12/09/16 18:41:45 INFO mapred.JobClient:     Combine input records=0
12/09/16 18:41:45 INFO mapred.JobClient:     Reduce input records=703
12/09/16 18:41:45 INFO mapred.JobClient:     Reduce input groups=694
12/09/16 18:41:45 INFO mapred.JobClient:     Combine output records=0
12/09/16 18:41:45 INFO mapred.JobClient:     Physical memory (bytes)
snapshot=923774976
12/09/16 18:41:45 INFO mapred.JobClient:     Reduce output records=694
12/09/16 18:41:45 INFO mapred.JobClient:     Virtual memory (bytes)
snapshot=12767576064
12/09/16 18:41:45 INFO mapred.JobClient:     Map output records=703
12/09/16 18:41:45 INFO crawl.Injector: Injector: finished at 2012-09-16
18:41:45, elapsed: 00:01:28
12/09/16 18:41:45 INFO crawl.Generator: Generator: starting at 2012-09-16
18:41:45
12/09/16 18:41:45 INFO crawl.Generator: Generator: Selecting best-scoring
urls due for fetch.
12/09/16 18:41:45 INFO crawl.Generator: Generator: filtering: true
12/09/16 18:41:45 INFO crawl.Generator: Generator: normalizing: true
12/09/16 18:41:51 INFO mapred.FileInputFormat: Total input paths to process
: 2
12/09/16 18:41:51 INFO mapred.JobClient: Running job: job_201209161612_0049
12/09/16 18:41:52 INFO mapred.JobClient:  map 0% reduce 0%
12/09/16 18:42:07 INFO mapred.JobClient:  map 100% reduce 0%
12/09/16 18:42:16 INFO mapred.JobClient:  map 100% reduce 8%
12/09/16 18:42:19 INFO mapred.JobClient:  map 100% reduce 66%
12/09/16 18:42:22 INFO mapred.JobClient:  map 100% reduce 100%
12/09/16 18:42:27 INFO mapred.JobClient: Job complete: job_201209161612_0049
12/09/16 18:42:27 INFO mapred.JobClient: Counters: 29
12/09/16 18:42:27 INFO mapred.JobClient:   Job Counters 
12/09/16 18:42:27 INFO mapred.JobClient:     Launched reduce tasks=2
12/09/16 18:42:27 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=17772
12/09/16 18:42:27 INFO mapred.JobClient:     Total time spent by all reduces
waiting after reserving slots (ms)=0
12/09/16 18:42:27 INFO mapred.JobClient:     Total time spent by all maps
waiting after reserving slots (ms)=0
12/09/16 18:42:27 INFO mapred.JobClient:     Launched map tasks=2
12/09/16 18:42:27 INFO mapred.JobClient:     Data-local map tasks=2
12/09/16 18:42:27 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=20043
12/09/16 18:42:27 INFO mapred.JobClient:   File Input Format Counters 
12/09/16 18:42:27 INFO mapred.JobClient:     Bytes Read=50506
12/09/16 18:42:27 INFO mapred.JobClient:   File Output Format Counters 
12/09/16 18:42:27 INFO mapred.JobClient:     Bytes Written=0
12/09/16 18:42:27 INFO mapred.JobClient:   FileSystemCounters
12/09/16 18:42:27 INFO mapred.JobClient:     FILE_BYTES_READ=12
12/09/16 18:42:27 INFO mapred.JobClient:     HDFS_BYTES_READ=50752
12/09/16 18:42:27 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=135098
12/09/16 18:42:27 INFO mapred.JobClient:   Map-Reduce Framework
12/09/16 18:42:27 INFO mapred.JobClient:     Map output materialized
bytes=24
12/09/16 18:42:27 INFO mapred.JobClient:     Map input records=694
12/09/16 18:42:27 INFO mapred.JobClient:     Reduce shuffle bytes=18
12/09/16 18:42:27 INFO mapred.JobClient:     Spilled Records=0
12/09/16 18:42:27 INFO mapred.JobClient:     Map output bytes=0
12/09/16 18:42:27 INFO mapred.JobClient:     Total committed heap usage
(bytes)=369360896
12/09/16 18:42:27 INFO mapred.JobClient:     CPU time spent (ms)=3330
12/09/16 18:42:27 INFO mapred.JobClient:     Map input bytes=50334
12/09/16 18:42:27 INFO mapred.JobClient:     SPLIT_RAW_BYTES=246
12/09/16 18:42:27 INFO mapred.JobClient:     Combine input records=0
12/09/16 18:42:27 INFO mapred.JobClient:     Reduce input records=0
12/09/16 18:42:27 INFO mapred.JobClient:     Reduce input groups=0
12/09/16 18:42:27 INFO mapred.JobClient:     Combine output records=0
12/09/16 18:42:27 INFO mapred.JobClient:     Physical memory (bytes)
snapshot=582873088
12/09/16 18:42:27 INFO mapred.JobClient:     Reduce output records=0
12/09/16 18:42:27 INFO mapred.JobClient:     Virtual memory (bytes)
snapshot=8829927424
12/09/16 18:42:27 INFO mapred.JobClient:     Map output records=0
12/09/16 18:42:27 WARN crawl.Generator: Generator: 0 records selected for
fetching, exiting ...
12/09/16 18:42:28 INFO crawl.Crawl: Stopping at depth=0 - no more URLs to
fetch.
12/09/16 18:42:28 WARN crawl.Crawl: No URLs to fetch - check your seed list
and URL filters.
12/09/16 18:42:28 INFO crawl.Crawl: crawl finished: crawls





Re: problem running Nutch 1.5.1 in distributed mode- simple crawl

Posted by Casey McTaggart <ca...@gmail.com>.
Hi Lewis,

I get the exact same results when I run the bin/nutch script from
runtime/deploy... any other help? sorry, thanks!

I run it like this
sudo -u hdfs bin/nutch crawl urls/seed.txt -dir crawl



Re: problem running Nutch 1.5.1 in distributed mode- simple crawl

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Casey,

On Sun, Sep 16, 2012 at 12:22 AM, Casey McTaggart
<ca...@gmail.com> wrote:

> I run this command:
> sudo -u hdfs hadoop jar build/apache-nutch-1.5.1.job
> org.apache.nutch.crawl.Crawl urls/seed.txt -dir crawl

I don't think you should do this.

Please see a similar post a couple days back [0] and Julien's [1] answer.

Get back to us if you have probs. I hope this works for you.

Lewis


[1] http://www.mail-archive.com/user%40nutch.apache.org/msg07564.html
[0] http://www.mail-archive.com/user%40nutch.apache.org/msg07565.html