Posted to user@nutch.apache.org by atawfik <co...@gmail.com> on 2019/02/19 18:03:10 UTC
Nutch 1.15 runtime/local does not run in Standalone mode
Hi all,
I downloaded Nutch 1.15 and built it using *ant runtime*. When I issue the
following crawl command from *runtime/local*
Nutch generates Hadoop jobs and single-node Hadoop logs. See the content of
the *hadoop.log* file below:
If I understand right, it seems that Nutch is running in single-node mode.
We are not running Nutch in a cluster; we are just running locally.
Please correct me if I misunderstood anything.
Regards
Ameer
--
Sent from: http://lucene.472066.n3.nabble.com/Nutch-User-f603147.html
Re: Nutch 1.15 runtime/local does not run in Standalone mode
Posted by Sebastian Nagel <wa...@googlemail.com.INVALID>.
Hi Ameer,
> The log issue was added recently, it was confusing me.
No problem. The Hadoop job logs are written only to hadoop.log
and not to stdout. That should be acceptable;
if not, anybody is free to modify log4j.properties.
> Do you think it will be better to use the Nutch server and monitor the jobs
> and their statuses? I will then delete the failed ones.
It's your decision. But the NutchServer isn't super-stable and also does not
include all possible options and tools available on the command line.
bin/crawl and bin/nutch signal via their exit value whether they have
succeeded or failed. You could do the cleanup in a shell script:
if ! .../bin/crawl ... ; then
  echo "Job ... failed, cleaning up"
  rm -rf ...   # the hadoop.tmp.dir path
fi
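As a fuller sketch of that pattern (the crawl command here is a stand-in, `false`, so the cleanup branch actually runs; replace it and the temp directory with your real crawl invocation and hadoop.tmp.dir — NUTCH_TMP and CRAWL_CMD are placeholder names, not anything Nutch defines):

```shell
#!/bin/sh
# Sketch: run a crawl and remove Hadoop's local temp dir if the crawl fails.
# NUTCH_TMP and CRAWL_CMD are placeholders, not Nutch-defined names.
NUTCH_TMP="/tmp/nutch-demo-tmpdir"
mkdir -p "$NUTCH_TMP"

# Tell local-mode Hadoop where to put temporary job data.
export NUTCH_OPTS="-Dhadoop.tmp.dir=$NUTCH_TMP"

# Stand-in for something like: runtime/local/bin/crawl -s urls/ crawl/ 2
# 'false' always exits non-zero, so the cleanup branch below is taken.
CRAWL_CMD="false"

if ! $CRAWL_CMD; then
    echo "Crawl failed, removing $NUTCH_TMP"
    rm -rf "$NUTCH_TMP"
fi
```

Because bin/crawl reports success or failure through its exit status, the same `if ! ...` test works unchanged with the real command.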
Best,
Sebastian
On 2/20/19 1:24 PM, Ameer Tawfik wrote:
> Thanks Sebastian ,
>
> Now after I have looked into the Jira issue, I knew the cause of the
> confusion. I have used all previous versions of Nutch up to 1.13. I recall
> seeing these messages in the pseudo or cluster modes. The log issue was
> added recently, it was confusing me.
>
> Do you think it will be better to use the Nutch server and monitor the jobs
> and their statuses? I will then delete the failed ones.
>
>
> Regards
> Ameer
>
>
>
> On Wed, Feb 20, 2019 at 8:58 PM Sebastian Nagel <wa...@googlemail.com>
> wrote:
>
>> Hi Ameer,
>>
>> (bringing this back to user@nutch - sorry, I hit the wrong reply to)
>>
>>> So, does that mean we do not have the standalone mode anymore as it used
>> be in the past
>>
>> Nutch is based on Hadoop since the beginning and the "local" mode is an
>> emulated Hadoop system in a
>> single process/JVM.
>> There has been no change to this behavior in recent Nutch versions.
>>
>>> Any thoughts in getting back the old behavior with no jobs being created
>> in the
>>> *tmp* directory.
>>
>> The issues with the /tmp directory have ever been there in local mode, see
>> http://lucene.472066.n3.nabble.com/tmp-folder-problem-td4008834.html
>>
>> In local mode, you can change the temporary folder used by Hadoop via the
>> Java
>> option
>> -Dhadoop.tmp.dir
>>
>> With bin/nutch or bin/crawl this is done by setting the environment
>> variable NUTCH_OPTS
>>
>> export NUTCH_OPTS=-Dhadoop.tmp.dir=/my/nutch/tmpdir
>>
>> Then all temporary data is written to /my/nutch/tmpdir but you're still
>> responsible
>> to clean-up this folder.
>>
>>
>>> It confuses me to see these messages
>>
>> You can suppress them by removing the following lines in
>> conf/log4j.properties:
>>
>> # log mapreduce job messages and counters
>> log4j.logger.org.apache.hadoop.mapreduce.Job=INFO
>>
>> However, for debugging these messages are really useful, esp. the job
>> counters.
>> See https://issues.apache.org/jira/browse/NUTCH-2519
>>
>>
>> Best,
>> Sebastian
>>
>>
>>
>> On 2/19/19 11:01 PM, Ameer Tawfik wrote:
>>> Thanks Sebastian for the reply.
>>>
>>> So, does that mean we do not have the standalone mode anymore as it used
>> be in the past. It confuses
>>> me to see these messages
>>>
>>> The url to track the job: http://localhost:8080/
>>> 2019-02-20 04:48:08,156 INFO mapreduce.Job - Running job: job_local2035597620_0001
>>> 2019-02-20 04:48:09,159 INFO mapreduce.Job - Job job_local2035597620_0001 running in uber mode : false
>>> 2019-02-20 04:48:09,161 INFO mapreduce.Job - map 0% reduce 100%
>>> 2019-02-20 04:48:09,163 INFO mapreduce.Job - Job job_local2035597620_0001 completed successfully
>>> 2019-02-20 04:48:09,194 INFO mapreduce.Job - Counters: 24
>>>
>>> In addition, it starts to create problems as these jobs accumulate in
>>> the /tmp/hadoop-ameer/mapred/local/localRunner/ameer/jobcache/ directory and eat up the
>>> hard disk space. Any thoughts on getting back the old behavior with no jobs being created in the
>>> tmp directory? It also seems slow to me.
>>>
>>> Regards
>>> Ameer
>>>
>>>
>>>
>>> On Wed, Feb 20, 2019 at 6:10 AM Sebastian Nagel <
>> wastl.nagel@googlemail.com
>>> <ma...@googlemail.com>> wrote:
>>>
>>> Hi Ameer,
>>>
>>> yes, you're correct. If launched by
>>> runtime/local/bin/nutch
>>> resp.
>>> runtime/local/bin/crawl
>>> Nutch runs in "local" mode - Hadoop is "emulated" running HDFS, job
>> and task clients
>>> in a single process (JVM).
>>>
>>> The other options are:
>>> - pseudo-distributed mode: HDFS namenode and datanode, job and task
>> clients
>>> as multiple processes on a single node
>>> - fully distributed mode: multiple processes on multiple nodes
>>>
>>> Best,
>>> Sebastian
>>>
>>>
>>>
>>> On 2/19/19 7:03 PM, atawfik wrote:
>>> > Hi all,
>>> >
>>> > I downloaded Nutch 1.15 and built using *ant runtime*. When I
>> issue the
>>> > following crawl command from *runtime/local*
>>> >
>>> >
>>> >
>>> > Nutch generates hadoop jobs and hadoop single node logs. See the
>> content of
>>> > the *hadoop.log* file below:
>>> >
>>> >
>>> >
>>> > If I understand right, it seems that nutch is running in a
>> SingleNode mode.
>>> > We are not running Nutch in a cluster. We are just running locally.
>>> >
>>> > Please correct me if I misunderstood anything.
>>> >
>>> > Regards
>>> > Ameer
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > Sent from:
>> http://lucene.472066.n3.nabble.com/Nutch-User-f603147.html
>>> >
>>>
>>
>>
>
Re: Nutch 1.15 runtime/local does not run in Standalone mode
Posted by Ameer Tawfik <co...@gmail.com>.
Thanks Sebastian,
Now that I have looked into the Jira issue, I understand the cause of the
confusion. I have used all previous versions of Nutch up to 1.13, and I recall
seeing these messages in the pseudo-distributed or cluster modes. The job
logging was added recently, which is what confused me.
Do you think it would be better to use the Nutch server and monitor the jobs
and their statuses? I would then delete the failed ones.
Regards
Ameer
On Wed, Feb 20, 2019 at 8:58 PM Sebastian Nagel <wa...@googlemail.com>
wrote:
> Hi Ameer,
>
> (bringing this back to user@nutch - sorry, I hit the wrong reply to)
>
> > So, does that mean we do not have the standalone mode anymore as it used
> be in the past
>
> Nutch is based on Hadoop since the beginning and the "local" mode is an
> emulated Hadoop system in a
> single process/JVM.
> There has been no change to this behavior in recent Nutch versions.
>
> > Any thoughts in getting back the old behavior with no jobs being created
> in the
> > *tmp* directory.
>
> The issues with the /tmp directory have ever been there in local mode, see
> http://lucene.472066.n3.nabble.com/tmp-folder-problem-td4008834.html
>
> In local mode, you can change the temporary folder used by Hadoop via the
> Java
> option
> -Dhadoop.tmp.dir
>
> With bin/nutch or bin/crawl this is done by setting the environment
> variable NUTCH_OPTS
>
> export NUTCH_OPTS=-Dhadoop.tmp.dir=/my/nutch/tmpdir
>
> Then all temporary data is written to /my/nutch/tmpdir but you're still
> responsible
> to clean-up this folder.
>
>
> > It confuses me to see these messages
>
> You can suppress them by removing the following lines in
> conf/log4j.properties:
>
> # log mapreduce job messages and counters
> log4j.logger.org.apache.hadoop.mapreduce.Job=INFO
>
> However, for debugging these messages are really useful, esp. the job
> counters.
> See https://issues.apache.org/jira/browse/NUTCH-2519
>
>
> Best,
> Sebastian
>
>
>
> On 2/19/19 11:01 PM, Ameer Tawfik wrote:
> > Thanks Sebastian for the reply.
> >
> > So, does that mean we do not have the standalone mode anymore as it used
> be in the past. It confuses
> > me to see these messages
> >
> > The url to track the job: http://localhost:8080/
> > 2019-02-20 04:48:08,156 INFO mapreduce.Job - Running job: job_local2035597620_0001
> > 2019-02-20 04:48:09,159 INFO mapreduce.Job - Job job_local2035597620_0001 running in uber mode : false
> > 2019-02-20 04:48:09,161 INFO mapreduce.Job - map 0% reduce 100%
> > 2019-02-20 04:48:09,163 INFO mapreduce.Job - Job job_local2035597620_0001 completed successfully
> > 2019-02-20 04:48:09,194 INFO mapreduce.Job - Counters: 24
> >
> > In addition, it starts to create problems as these jobs accumulate in
> > the /tmp/hadoop-ameer/mapred/local/localRunner/ameer/jobcache/ directory and eat up the
> > hard disk space. Any thoughts on getting back the old behavior with no jobs being created in the
> > tmp directory? It also seems slow to me.
> >
> > Regards
> > Ameer
> >
> >
> >
> > On Wed, Feb 20, 2019 at 6:10 AM Sebastian Nagel <
> wastl.nagel@googlemail.com
> > <ma...@googlemail.com>> wrote:
> >
> > Hi Ameer,
> >
> > yes, you're correct. If launched by
> > runtime/local/bin/nutch
> > resp.
> > runtime/local/bin/crawl
> > Nutch runs in "local" mode - Hadoop is "emulated" running HDFS, job
> and task clients
> > in a single process (JVM).
> >
> > The other options are:
> > - pseudo-distributed mode: HDFS namenode and datanode, job and task
> clients
> > as multiple processes on a single node
> > - fully distributed mode: multiple processes on multiple nodes
> >
> > Best,
> > Sebastian
> >
> >
> >
> > On 2/19/19 7:03 PM, atawfik wrote:
> > > Hi all,
> > >
> > > I downloaded Nutch 1.15 and built using *ant runtime*. When I
> issue the
> > > following crawl command from *runtime/local*
> > >
> > >
> > >
> > > Nutch generates hadoop jobs and hadoop single node logs. See the
> content of
> > > the *hadoop.log* file below:
> > >
> > >
> > >
> > > If I understand right, it seems that nutch is running in a
> SingleNode mode.
> > > We are not running Nutch in a cluster. We are just running locally.
> > >
> > > Please correct me if I misunderstood anything.
> > >
> > > Regards
> > > Ameer
> > >
> > >
> > >
> > >
> > > --
> > > Sent from:
> http://lucene.472066.n3.nabble.com/Nutch-User-f603147.html
> > >
> >
>
>
Re: Nutch 1.15 runtime/local does not run in Standalone mode
Posted by Sebastian Nagel <wa...@googlemail.com.INVALID>.
Hi Ameer,
(bringing this back to user@nutch - sorry, I hit the wrong reply to)
> So, does that mean we do not have the standalone mode anymore as it used to be in the past
Nutch has been based on Hadoop since the beginning, and the "local" mode is an emulated Hadoop
system in a single process/JVM.
There has been no change to this behavior in recent Nutch versions.
> Any thoughts in getting back the old behavior with no jobs being created in the
> *tmp* directory.
The issues with the /tmp directory have always been there in local mode, see
http://lucene.472066.n3.nabble.com/tmp-folder-problem-td4008834.html
In local mode, you can change the temporary folder used by Hadoop via the Java
option
-Dhadoop.tmp.dir
With bin/nutch or bin/crawl this is done by setting the environment variable NUTCH_OPTS
export NUTCH_OPTS=-Dhadoop.tmp.dir=/my/nutch/tmpdir
Then all temporary data is written to /my/nutch/tmpdir, but you're still responsible
for cleaning up this folder.
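One way to do that cleanup, sketched here as a self-contained demo (GNU find and touch are assumed; NUTCH_TMP and the job-directory names are placeholders, not anything Nutch creates itself), is a periodic sweep that deletes temp entries older than a few days:

```shell
#!/bin/sh
# Sketch: sweep local-mode Hadoop temp data older than 3 days.
# NUTCH_TMP is a placeholder; point it at your hadoop.tmp.dir.
NUTCH_TMP="/tmp/nutch-sweep-demo"

# Demo setup: one stale entry (mtime set to 10 days ago) and one fresh entry.
mkdir -p "$NUTCH_TMP/old_job" "$NUTCH_TMP/new_job"
touch -d "10 days ago" "$NUTCH_TMP/old_job"

# Remove top-level entries not modified within the last 3 days.
# -mindepth/-maxdepth keep the sweep from descending into job data.
find "$NUTCH_TMP" -mindepth 1 -maxdepth 1 -mtime +3 -exec rm -rf {} +
```

Run from cron, a sweep like this keeps the tmpdir bounded without touching data from recent jobs.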
> It confuses me to see these messages
You can suppress them by removing the following lines in
conf/log4j.properties:
# log mapreduce job messages and counters
log4j.logger.org.apache.hadoop.mapreduce.Job=INFO
However, for debugging these messages are really useful, esp. the job counters.
See https://issues.apache.org/jira/browse/NUTCH-2519
Best,
Sebastian
On 2/19/19 11:01 PM, Ameer Tawfik wrote:
> Thanks Sebastian for the reply.
>
> So, does that mean we do not have the standalone mode anymore as it used to be in the past? It confuses
> me to see these messages
>
> The url to track the job: http://localhost:8080/
> 2019-02-20 04:48:08,156 INFO mapreduce.Job - Running job: job_local2035597620_0001
> 2019-02-20 04:48:09,159 INFO mapreduce.Job - Job job_local2035597620_0001 running in uber mode : false
> 2019-02-20 04:48:09,161 INFO mapreduce.Job - map 0% reduce 100%
> 2019-02-20 04:48:09,163 INFO mapreduce.Job - Job job_local2035597620_0001 completed successfully
> 2019-02-20 04:48:09,194 INFO mapreduce.Job - Counters: 24
>
> In addition, it starts to create problems as these jobs accumulate in
> the /tmp/hadoop-ameer/mapred/local/localRunner/ameer/jobcache/ directory and eat up the
> hard disk space. Any thoughts on getting back the old behavior with no jobs being created in the
> tmp directory? It also seems slow to me.
>
> Regards
> Ameer
>
>
>
> On Wed, Feb 20, 2019 at 6:10 AM Sebastian Nagel <wastl.nagel@googlemail.com
> <ma...@googlemail.com>> wrote:
>
> Hi Ameer,
>
> yes, you're correct. If launched by
> runtime/local/bin/nutch
> resp.
> runtime/local/bin/crawl
> Nutch runs in "local" mode - Hadoop is "emulated" running HDFS, job and task clients
> in a single process (JVM).
>
> The other options are:
> - pseudo-distributed mode: HDFS namenode and datanode, job and task clients
> as multiple processes on a single node
> - fully distributed mode: multiple processes on multiple nodes
>
> Best,
> Sebastian
>
>
>
> On 2/19/19 7:03 PM, atawfik wrote:
> > Hi all,
> >
> > I downloaded Nutch 1.15 and built using *ant runtime*. When I issue the
> > following crawl command from *runtime/local*
> >
> >
> >
> > Nutch generates hadoop jobs and hadoop single node logs. See the content of
> > the *hadoop.log* file below:
> >
> >
> >
> > If I understand right, it seems that nutch is running in a SingleNode mode.
> > We are not running Nutch in a cluster. We are just running locally.
> >
> > Please correct me if I misunderstood anything.
> >
> > Regards
> > Ameer
> >
> >
> >
> >
> > --
> > Sent from: http://lucene.472066.n3.nabble.com/Nutch-User-f603147.html
> >
>