Posted to user@nutch.apache.org by atawfik <co...@gmail.com> on 2019/02/19 18:03:10 UTC
Nutch 1.15 runtime/local does not run in Standalone mode
Hi all,
I downloaded Nutch 1.15 and built it using *ant runtime*. When I issue the
following crawl command from *runtime/local*
Nutch generates Hadoop jobs and single-node Hadoop logs. See the content of
the *hadoop.log* file below:
If I understand right, it seems that Nutch is running in single-node mode.
We are not running Nutch in a cluster; we are just running locally.
Please correct me if I misunderstood anything.
Regards
Ameer
--
Sent from: http://lucene.472066.n3.nabble.com/Nutch-User-f603147.html
Re: Nutch 1.15 runtime/local does not run in Standalone mode
Posted by Sebastian Nagel <wa...@googlemail.com.INVALID>.
Hi Ameer,
> The log issue was added recently, it was confusing me.
No problem. The Hadoop job logs are written only to hadoop.log
and not to stdout. That should be acceptable;
if not, anybody is free to modify log4j.properties.
> Do you think it will be better to use the Nutch server and monitor the jobs
> and their statuses? I will then delete the failed ones.
It's your decision. But the NutchServer isn't super-stable and also does not
include all possible options and tools available on the command line.
bin/crawl and bin/nutch signal via their exit value whether they have
succeeded or failed. You could do the cleanup in a shell script:
if ! .../bin/crawl ... ; then
  echo "Job ... failed, cleaning up"
  rm -rf ...   # the hadoop.tmp.dir path
fi
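As a fuller sketch of that pattern (the crawl command here is a stand-in, `false`, so the cleanup branch actually runs; replace it and the temp directory with your real crawl invocation and hadoop.tmp.dir — NUTCH_TMP and CRAWL_CMD are placeholder names, not anything Nutch defines):

```shell
#!/bin/sh
# Sketch: run a crawl and remove Hadoop's local temp dir if the crawl fails.
# NUTCH_TMP and CRAWL_CMD are placeholders, not Nutch-defined names.
NUTCH_TMP="/tmp/nutch-demo-tmpdir"
mkdir -p "$NUTCH_TMP"

# Tell local-mode Hadoop where to put temporary job data.
export NUTCH_OPTS="-Dhadoop.tmp.dir=$NUTCH_TMP"

# Stand-in for something like: runtime/local/bin/crawl -s urls/ crawl/ 2
# 'false' always exits non-zero, so the cleanup branch below is taken.
CRAWL_CMD="false"

if ! $CRAWL_CMD; then
    echo "Crawl failed, removing $NUTCH_TMP"
    rm -rf "$NUTCH_TMP"
fi
```

Because bin/crawl reports success or failure through its exit status, the same `if ! ...` test works unchanged with the real command.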
Best,
Sebastian
On 2/20/19 1:24 PM, Ameer Tawfik wrote:
> Thanks Sebastian ,
>
> Now after I have looked into the Jira issue, I knew the cause of the
> confusion. I have used all previous versions of Nutch up to 1.13. I recall
> seeing these messages in the pseudo or cluster modes. The log issue was
> added recently, it was confusing me.
>
> Do you think it will be better to use the Nutch server and monitor the jobs
> and their statuses? I will then delete the failed ones.
>
>
> Regards
> Ameer
>
>
>
> On Wed, Feb 20, 2019 at 8:58 PM Sebastian Nagel <wa...@googlemail.com>
> wrote:
>
>> Hi Ameer,
>>
>> (bringing this back to user@nutch - sorry, I hit the wrong reply to)
>>
>>> So, does that mean we do not have the standalone mode anymore as it used
>> be in the past
>>
>> Nutch is based on Hadoop since the beginning and the "local" mode is an
>> emulated Hadoop system in a
>> single process/JVM.
>> There has been no change to this behavior in recent Nutch versions.
>>
>>> Any thoughts in getting back the old behavior with no jobs being created
>> in the
>>> *tmp* directory.
>>
>> The issues with the /tmp directory have ever been there in local mode, see
>> http://lucene.472066.n3.nabble.com/tmp-folder-problem-td4008834.html
>>
>> In local mode, you can change the temporary folder used by Hadoop via the
>> Java
>> option
>> -Dhadoop.tmp.dir
>>
>> With bin/nutch or bin/crawl this is done by setting the environment
>> variable NUTCH_OPTS
>>
>> export NUTCH_OPTS=-Dhadoop.tmp.dir=/my/nutch/tmpdir
>>
>> Then all temporary data is written to /my/nutch/tmpdir but you're still
>> responsible
>> to clean-up this folder.
>>
>>
>>> It confuses me to see these messages
>>
>> You can suppress them by removing the following lines in
>> conf/log4j.properties:
>>
>> # log mapreduce job messages and counters
>> log4j.logger.org.apache.hadoop.mapreduce.Job=INFO
>>
>> However, for debugging these messages are really useful, esp. the job
>> counters.
>> See https://issues.apache.org/jira/browse/NUTCH-2519
>>
>>
>> Best,
>> Sebastian
>>
>>
>>
>> On 2/19/19 11:01 PM, Ameer Tawfik wrote:
>>> Thanks Sebastian for the reply.
>>>
>>> So, does that mean we do not have the standalone mode anymore as it used
>> be in the past. It confuses
>>> me to see these messages
>>>
>>> The url to track the job: http://localhost:8080/
>>> 2019-02-20 04:48:08,156 INFO mapreduce.Job - Running job: job_local2035597620_0001
>>> 2019-02-20 04:48:09,159 INFO mapreduce.Job - Job job_local2035597620_0001 running in uber mode : false
>>> 2019-02-20 04:48:09,161 INFO mapreduce.Job - map 0% reduce 100%
>>> 2019-02-20 04:48:09,163 INFO mapreduce.Job - Job job_local2035597620_0001 completed successfully
>>> 2019-02-20 04:48:09,194 INFO mapreduce.Job - Counters: 24
>>>
>>> In addition, it starts to create problems as these jobs accumulate in
>>> the /tmp/hadoop-ameer/mapred/local/localRunner/ameer/jobcache/ directory and eat up the
>>> hard disk space. Any thoughts on getting back the old behavior with no jobs being created in the
>>> tmp directory? It also seems slow to me.
>>>
>>> Regards
>>> Ameer
>>>
>>>
>>>
>>> On Wed, Feb 20, 2019 at 6:10 AM Sebastian Nagel <
>> wastl.nagel@googlemail.com
>>> <ma...@googlemail.com>> wrote:
>>>
>>> Hi Ameer,
>>>
>>> yes, you're correct. If launched by
>>> runtime/local/bin/nutch
>>> resp.
>>> runtime/local/bin/crawl
>>> Nutch runs in "local" mode - Hadoop is "emulated" running HDFS, job
>> and task clients
>>> in a single process (JVM).
>>>
>>> The other options are:
>>> - pseudo-distributed mode: HDFS namenode and datanode, job and task
>> clients
>>> as multiple processes on a single node
>>> - fully distributed mode: multiple processes on multiple nodes
>>>
>>> Best,
>>> Sebastian
>>>
>>>
>>>
>>> On 2/19/19 7:03 PM, atawfik wrote:
>>> > Hi all,
>>> >
>>> > I downloaded Nutch 1.15 and built using *ant runtime*. When I
>> issue the
>>> > following crawl command from *runtime/local*
>>> >
>>> >
>>> >
>>> > Nutch generates hadoop jobs and hadoop single node logs. See the
>> content of
>>> > the *hadoop.log* file below:
>>> >
>>> >
>>> >
>>> > If I understand right, it seems that nutch is running in a
>> SingleNode mode.
>>> > We are not running Nutch in a cluster. We are just running locally.
>>> >
>>> > Please correct me if I misunderstood anything.
>>> >
>>> > Regards
>>> > Ameer
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > Sent from:
>> http://lucene.472066.n3.nabble.com/Nutch-User-f603147.html
>>> >
>>>
>>
>>
>
Re: Nutch 1.15 runtime/local does not run in Standalone mode
Posted by Ameer Tawfik <co...@gmail.com>.
Thanks Sebastian,
Now that I have looked into the Jira issue, I understand the cause of the
confusion. I have used all previous versions of Nutch up to 1.13, and I recall
seeing these messages in the pseudo-distributed or cluster modes. The job
logging was added recently, which is what confused me.
Do you think it would be better to use the Nutch server and monitor the jobs
and their statuses? I would then delete the failed ones.
Regards
Ameer
On Wed, Feb 20, 2019 at 8:58 PM Sebastian Nagel <wa...@googlemail.com>
wrote:
> Hi Ameer,
>
> (bringing this back to user@nutch - sorry, I hit the wrong reply to)
>
> > So, does that mean we do not have the standalone mode anymore as it used
> be in the past
>
> Nutch is based on Hadoop since the beginning and the "local" mode is an
> emulated Hadoop system in a
> single process/JVM.
> There has been no change to this behavior in recent Nutch versions.
>
> > Any thoughts in getting back the old behavior with no jobs being created
> in the
> > *tmp* directory.
>
> The issues with the /tmp directory have ever been there in local mode, see
> http://lucene.472066.n3.nabble.com/tmp-folder-problem-td4008834.html
>
> In local mode, you can change the temporary folder used by Hadoop via the
> Java
> option
> -Dhadoop.tmp.dir
>
> With bin/nutch or bin/crawl this is done by setting the environment
> variable NUTCH_OPTS
>
> export NUTCH_OPTS=-Dhadoop.tmp.dir=/my/nutch/tmpdir
>
> Then all temporary data is written to /my/nutch/tmpdir but you're still
> responsible
> to clean-up this folder.
>
>
> > It confuses me to see these messages
>
> You can suppress them by removing the following lines in
> conf/log4j.properties:
>
> # log mapreduce job messages and counters
> log4j.logger.org.apache.hadoop.mapreduce.Job=INFO
>
> However, for debugging these messages are really useful, esp. the job
> counters.
> See https://issues.apache.org/jira/browse/NUTCH-2519
>
>
> Best,
> Sebastian
>
>
>
> On 2/19/19 11:01 PM, Ameer Tawfik wrote:
> > Thanks Sebastian for the reply.
> >
> > So, does that mean we do not have the standalone mode anymore as it used
> be in the past. It confuses
> > me to see these messages
> >
> > The url to track the job: http://localhost:8080/
> > 2019-02-20 04:48:08,156 INFO mapreduce.Job - Running job: job_local2035597620_0001
> > 2019-02-20 04:48:09,159 INFO mapreduce.Job - Job job_local2035597620_0001 running in uber mode : false
> > 2019-02-20 04:48:09,161 INFO mapreduce.Job - map 0% reduce 100%
> > 2019-02-20 04:48:09,163 INFO mapreduce.Job - Job job_local2035597620_0001 completed successfully
> > 2019-02-20 04:48:09,194 INFO mapreduce.Job - Counters: 24
> >
> > In addition, it starts to create problems as these jobs accumulate in
> > the /tmp/hadoop-ameer/mapred/local/localRunner/ameer/jobcache/ directory and eat up the
> > hard disk space. Any thoughts on getting back the old behavior with no jobs being created in the
> > tmp directory? It also seems slow to me.
> >
> > Regards
> > Ameer
> >
> >
> >
> > On Wed, Feb 20, 2019 at 6:10 AM Sebastian Nagel <
> wastl.nagel@googlemail.com
> > <ma...@googlemail.com>> wrote:
> >
> > Hi Ameer,
> >
> > yes, you're correct. If launched by
> > runtime/local/bin/nutch
> > resp.
> > runtime/local/bin/crawl
> > Nutch runs in "local" mode - Hadoop is "emulated" running HDFS, job
> and task clients
> > in a single process (JVM).
> >
> > The other options are:
> > - pseudo-distributed mode: HDFS namenode and datanode, job and task
> clients
> > as multiple processes on a single node
> > - fully distributed mode: multiple processes on multiple nodes
> >
> > Best,
> > Sebastian
> >
> >
> >
> > On 2/19/19 7:03 PM, atawfik wrote:
> > > Hi all,
> > >
> > > I downloaded Nutch 1.15 and built using *ant runtime*. When I
> issue the
> > > following crawl command from *runtime/local*
> > >
> > >
> > >
> > > Nutch generates hadoop jobs and hadoop single node logs. See the
> content of
> > > the *hadoop.log* file below:
> > >
> > >
> > >
> > > If I understand right, it seems that nutch is running in a
> SingleNode mode.
> > > We are not running Nutch in a cluster. We are just running locally.
> > >
> > > Please correct me if I misunderstood anything.
> > >
> > > Regards
> > > Ameer
> > >
> > >
> > >
> > >
> > > --
> > > Sent from:
> http://lucene.472066.n3.nabble.com/Nutch-User-f603147.html
> > >
> >
>
>
Re: Nutch 1.15 runtime/local does not run in Standalone mode
Posted by Sebastian Nagel <wa...@googlemail.com.INVALID>.
Hi Ameer,
(bringing this back to user@nutch - sorry, I hit the wrong reply to)
> So, does that mean we do not have the standalone mode anymore as it used to be in the past
Nutch has been based on Hadoop since the beginning, and the "local" mode is an emulated Hadoop
system in a single process/JVM.
There has been no change to this behavior in recent Nutch versions.
> Any thoughts in getting back the old behavior with no jobs being created in the
> *tmp* directory.
The issues with the /tmp directory have always been there in local mode, see
http://lucene.472066.n3.nabble.com/tmp-folder-problem-td4008834.html
In local mode, you can change the temporary folder used by Hadoop via the Java
option
-Dhadoop.tmp.dir
With bin/nutch or bin/crawl this is done by setting the environment variable NUTCH_OPTS
export NUTCH_OPTS=-Dhadoop.tmp.dir=/my/nutch/tmpdir
Then all temporary data is written to /my/nutch/tmpdir, but you're still responsible
for cleaning up this folder.
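One way to do that cleanup, sketched here as a self-contained demo (GNU find and touch are assumed; NUTCH_TMP and the job-directory names are placeholders, not anything Nutch creates itself), is a periodic sweep that deletes temp entries older than a few days:

```shell
#!/bin/sh
# Sketch: sweep local-mode Hadoop temp data older than 3 days.
# NUTCH_TMP is a placeholder; point it at your hadoop.tmp.dir.
NUTCH_TMP="/tmp/nutch-sweep-demo"

# Demo setup: one stale entry (mtime set to 10 days ago) and one fresh entry.
mkdir -p "$NUTCH_TMP/old_job" "$NUTCH_TMP/new_job"
touch -d "10 days ago" "$NUTCH_TMP/old_job"

# Remove top-level entries not modified within the last 3 days.
# -mindepth/-maxdepth keep the sweep from descending into job data.
find "$NUTCH_TMP" -mindepth 1 -maxdepth 1 -mtime +3 -exec rm -rf {} +
```

Run from cron, a sweep like this keeps the tmpdir bounded without touching data from recent jobs.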
> It confuses me to see these messages
You can suppress them by removing the following lines in
conf/log4j.properties:
# log mapreduce job messages and counters
log4j.logger.org.apache.hadoop.mapreduce.Job=INFO
However, for debugging these messages are really useful, esp. the job counters.
See https://issues.apache.org/jira/browse/NUTCH-2519
Best,
Sebastian
On 2/19/19 11:01 PM, Ameer Tawfik wrote:
> Thanks Sebastian for the reply.
>
> So, does that mean we do not have the standalone mode anymore as it used to be in the past? It confuses
> me to see these messages
>
> The url to track the job: http://localhost:8080/
> 2019-02-20 04:48:08,156 INFO mapreduce.Job - Running job: job_local2035597620_0001
> 2019-02-20 04:48:09,159 INFO mapreduce.Job - Job job_local2035597620_0001 running in uber mode : false
> 2019-02-20 04:48:09,161 INFO mapreduce.Job - map 0% reduce 100%
> 2019-02-20 04:48:09,163 INFO mapreduce.Job - Job job_local2035597620_0001 completed successfully
> 2019-02-20 04:48:09,194 INFO mapreduce.Job - Counters: 24
>
> In addition, it starts to create problems as these jobs accumulate in
> the /tmp/hadoop-ameer/mapred/local/localRunner/ameer/jobcache/ directory and eat up the
> hard disk space. Any thoughts on getting back the old behavior with no jobs being created in the
> tmp directory? It also seems slow to me.
>
> Regards
> Ameer
>
>
>
> On Wed, Feb 20, 2019 at 6:10 AM Sebastian Nagel <wastl.nagel@googlemail.com
> <ma...@googlemail.com>> wrote:
>
> Hi Ameer,
>
> yes, you're correct. If launched by
> runtime/local/bin/nutch
> resp.
> runtime/local/bin/crawl
> Nutch runs in "local" mode - Hadoop is "emulated" running HDFS, job and task clients
> in a single process (JVM).
>
> The other options are:
> - pseudo-distributed mode: HDFS namenode and datanode, job and task clients
> as multiple processes on a single node
> - fully distributed mode: multiple processes on multiple nodes
>
> Best,
> Sebastian
>
>
>
> On 2/19/19 7:03 PM, atawfik wrote:
> > Hi all,
> >
> > I downloaded Nutch 1.15 and built using *ant runtime*. When I issue the
> > following crawl command from *runtime/local*
> >
> >
> >
> > Nutch generates hadoop jobs and hadoop single node logs. See the content of
> > the *hadoop.log* file below:
> >
> >
> >
> > If I understand right, it seems that nutch is running in a SingleNode mode.
> > We are not running Nutch in a cluster. We are just running locally.
> >
> > Please correct me if I misunderstood anything.
> >
> > Regards
> > Ameer
> >
> >
> >
> >
> > --
> > Sent from: http://lucene.472066.n3.nabble.com/Nutch-User-f603147.html
> >
>