Posted to user@nutch.apache.org by Clark Benham <cl...@thehive.ai> on 2021/07/14 18:27:23 UTC

Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormalizer not found'

Hi All,

Sebastian helped fix my issue: using S3 as a backend, I was able to get
nutch-1.19 working with pre-built hadoop-3.3.0 and Java 11. One oddity:
nutch-1.19 shipped with 11 Hadoop 3.1.3 jars (e.g. hadoop-hdfs-3.1.3.jar,
hadoop-yarn-api-3.1.3.jar, ...), which made `hadoop version` report 3.1.3,
so I replaced those 3.1.3 jars with the 3.3.0 jars from the Hadoop download.
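
Roughly, the swap was along these lines (just a sketch; the exact lib
directory depends on where your Nutch build keeps its Hadoop jars):

  cd ~/nutch/runtime/local/lib        # assumed location of the bundled 3.1.3 jars
  rm hadoop-*-3.1.3.jar
  cp ~/hadoop-3.3.0/share/hadoop/{common,hdfs,mapreduce,yarn}/hadoop-*-3.3.0.jar .
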
Also, in the main Nutch branch, ivy.xml
(https://github.com/apache/nutch/blob/master/ivy/ivy.xml) currently
declares dependencies on hadoop-3.1.3, e.g.:
<!-- Hadoop Dependencies -->
<dependency org="org.apache.hadoop" name="hadoop-common" rev="3.1.3" conf="*->default">
  <exclude org="hsqldb" name="hsqldb" />
  <exclude org="net.sf.kosmosfs" name="kfs" />
  <exclude org="net.java.dev.jets3t" name="jets3t" />
  <exclude org="org.eclipse.jdt" name="core" />
  <exclude org="org.mortbay.jetty" name="jsp-*" />
  <exclude org="ant" name="ant" />
</dependency>
<dependency org="org.apache.hadoop" name="hadoop-hdfs" rev="3.1.3" conf="*->default" />
<dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-core" rev="3.1.3" conf="*->default" />
<dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-jobclient" rev="3.1.3" conf="*->default" />
<!-- End of Hadoop Dependencies -->
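
If you build Nutch yourself, one way to avoid the jar mismatch is to bump
these revisions to match the cluster's Hadoop before rebuilding (e.g. with
`ant runtime`); a minimal sketch, assuming Hadoop 3.3.0:

<dependency org="org.apache.hadoop" name="hadoop-common" rev="3.3.0" conf="*->default">
  <!-- same excludes as in the block above -->
</dependency>
<dependency org="org.apache.hadoop" name="hadoop-hdfs" rev="3.3.0" conf="*->default" />
<dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-core" rev="3.3.0" conf="*->default" />
<dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-jobclient" rev="3.3.0" conf="*->default" />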

I set yarn.nodemanager.local-dirs to '${hadoop.tmp.dir}/nm-local-dir'.

I didn't change "mapreduce.job.dir" because there are no namenode or
datanode processes running when using Hadoop with S3, so the UI is blank.

Copied from email with Sebastian:
>  > The plugin loader doesn't appear to be able to read from s3 in
>  > nutch-1.18 with hadoop-3.2.1[1].

> I had a look into the plugin loader: it can only read from the local file
> system. But that's ok because the Nutch job file is copied to the local
> machine and unpacked. Here is how the paths look on one of the running
> Common Crawl task nodes:

The configs for the working Hadoop setup are as follows:

core-site.xml

<?xml version="1.0" encoding="UTF-8"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!--

  Licensed under the Apache License, Version 2.0 (the "License");

  you may not use this file except in compliance with the License.

  You may obtain a copy of the License at


    http://www.apache.org/licenses/LICENSE-2.0


  Unless required by applicable law or agreed to in writing, software

  distributed under the License is distributed on an "AS IS" BASIS,

  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

  See the License for the specific language governing permissions and

  limitations under the License. See accompanying LICENSE file.

-->


<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>

  <name>hadoop.tmp.dir</name>

  <value>/home/hdoop/tmpdata</value>

</property>

<property>

  <name>fs.defaultFS</name>

  <value>s3a://my-bucket</value>

</property>


<property>

        <name>fs.s3a.access.key</name>

        <value>KEY_PLACEHOLDER</value>

  <description>AWS access key ID.

   Omit for IAM role-based or provider-based authentication.</description>

</property>


<property>

  <name>fs.s3a.secret.key</name>

  <value>SECRET_PLACEHOLDER</value>

  <description>AWS secret key.

   Omit for IAM role-based or provider-based authentication.</description>

</property>


<property>

  <name>fs.s3a.aws.credentials.provider</name>

  <description>

    Comma-separated class names of credential provider classes which
implement

    com.amazonaws.auth.AWSCredentialsProvider.


    These are loaded and queried in sequence for a valid set of credentials.

    Each listed class must implement one of the following means of

    construction, which are attempted in order:

    1. a public constructor accepting java.net.URI and

        org.apache.hadoop.conf.Configuration,

    2. a public static method named getInstance that accepts no

       arguments and returns an instance of

       com.amazonaws.auth.AWSCredentialsProvider, or

    3. a public default constructor.


    Specifying org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider
allows

    anonymous access to a publicly accessible S3 bucket without any
credentials.

    Please note that allowing anonymous access to an S3 bucket compromises

    security and therefore is unsuitable for most use cases. It can be
useful

    for accessing public data sets without requiring AWS credentials.


    If unspecified, then the default list of credential provider classes,

    queried in sequence, is:

    1. org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider:

       Uses the values of fs.s3a.access.key and fs.s3a.secret.key.

    2. com.amazonaws.auth.EnvironmentVariableCredentialsProvider: supports

        configuration of AWS access key ID and secret access key in

        environment variables named AWS_ACCESS_KEY_ID and

        AWS_SECRET_ACCESS_KEY, as documented in the AWS SDK.

    3. com.amazonaws.auth.InstanceProfileCredentialsProvider: supports use

        of instance profile credentials if running in an EC2 VM.

  </description>

</property>



</configuration>


(Note: the following Maven <dependencies> snippet belongs in a build file
such as pom.xml, not in core-site.xml; it is kept here for completeness.)

<dependencies>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-aws</artifactId>
    <version>${hadoop.version}</version>
  </dependency>
</dependencies>



hadoop-env.sh

#

# Licensed to the Apache Software Foundation (ASF) under one

# or more contributor license agreements.  See the NOTICE file

# distributed with this work for additional information

# regarding copyright ownership.  The ASF licenses this file

# to you under the Apache License, Version 2.0 (the

# "License"); you may not use this file except in compliance

# with the License.  You may obtain a copy of the License at

#

#     http://www.apache.org/licenses/LICENSE-2.0

#

# Unless required by applicable law or agreed to in writing, software

# distributed under the License is distributed on an "AS IS" BASIS,

# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

# See the License for the specific language governing permissions and

# limitations under the License.


# Set Hadoop-specific environment variables here.


##

## THIS FILE ACTS AS THE MASTER FILE FOR ALL HADOOP PROJECTS.

## SETTINGS HERE WILL BE READ BY ALL HADOOP COMMANDS.  THEREFORE,

## ONE CAN USE THIS FILE TO SET YARN, HDFS, AND MAPREDUCE

## CONFIGURATION OPTIONS INSTEAD OF xxx-env.sh.

##

## Precedence rules:

##

## {yarn-env.sh|hdfs-env.sh} > hadoop-env.sh > hard-coded defaults

##

## {YARN_xyz|HDFS_xyz} > HADOOP_xyz > hard-coded defaults

##


# Many of the options here are built from the perspective that users

# may want to provide OVERWRITING values on the command line.

# For example:

#

#  JAVA_HOME=/usr/java/testing hdfs dfs -ls

#

# Therefore, the vast majority (BUT NOT ALL!) of these defaults

# are configured for substitution and not append.  If append

# is preferable, modify this file accordingly.


###

# Generic settings for HADOOP

###


# Technically, the only required environment variable is JAVA_HOME.

# All others are optional.  However, the defaults are probably not

# preferred.  Many sites configure these options outside of Hadoop,

# such as in /etc/profile.d


# The java implementation to use. By default, this environment

# variable is REQUIRED on ALL platforms except OS X!

export HADOOP_HOME=~/hadoop-3.3.0

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

export EXTRA_PATH=/home/hdoop/hadoop-3.3.0/share/hadoop/tools/lib/hadoop-aws-3.3.0.jar:/home/hdoop/hadoop-3.3.0/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.563.jar:/home/hdoop/nutch:/home/hdoop/nutch/lib/commons-jexl3-3.1.jar:/home/hdoop/nutch/build/plugins:/home/hdoop/nutch/build/lib/*

export PATH=$JAVA_HOME/bin:$EXTRA_PATH:$PATH


# Location of Hadoop.  By default, Hadoop will attempt to determine

# this location based upon its execution path.

# export HADOOP_HOME=


# Location of Hadoop's configuration information.  i.e., where this

# file is living. If this is not defined, Hadoop will attempt to

# locate it based upon its execution path.

#

# NOTE: It is recommend that this variable not be set here but in

# /etc/profile.d or equivalent.  Some options (such as

# --config) may react strangely otherwise.

#

# export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop


# The maximum amount of heap to use (Java -Xmx).  If no unit

# is provided, it will be converted to MB.  Daemons will

# prefer any Xmx setting in their respective _OPT variable.

# There is no default; the JVM will autoscale based upon machine

# memory size.

# export HADOOP_HEAPSIZE_MAX=


# The minimum amount of heap to use (Java -Xms).  If no unit

# is provided, it will be converted to MB.  Daemons will

# prefer any Xms setting in their respective _OPT variable.

# There is no default; the JVM will autoscale based upon machine

# memory size.

# export HADOOP_HEAPSIZE_MIN=


# Enable extra debugging of Hadoop's JAAS binding, used to set up

# Kerberos security.

# export HADOOP_JAAS_DEBUG=true


# Extra Java runtime options for all Hadoop commands. We don't support

# IPv6 yet/still, so by default the preference is set to IPv4.

# export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true"

# For Kerberos debugging, an extended option set logs more information

# export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true -Dsun.security.krb5.debug=true -Dsun.security.spnego.debug"


# Some parts of the shell code may do special things dependent upon

# the operating system.  We have to set this here. See the next

# section as to why....

export HADOOP_OS_TYPE=${HADOOP_OS_TYPE:-$(uname -s)}


# Extra Java runtime options for some Hadoop commands

# and clients (i.e., hdfs dfs -blah).  These get appended to HADOOP_OPTS for

# such commands.  In most cases, # this should be left empty and

# let users supply it on the command line.

# export HADOOP_CLIENT_OPTS=""


#

# A note about classpaths.

#

# By default, Apache Hadoop overrides Java's CLASSPATH

# environment variable.  It is configured such

# that it starts out blank with new entries added after passing

# a series of checks (file/dir exists, not already listed aka

# de-deduplication).  During de-deduplication, wildcards and/or

# directories are *NOT* expanded to keep it simple. Therefore,

# if the computed classpath has two specific mentions of

# awesome-methods-1.0.jar, only the first one added will be seen.

# If two directories are in the classpath that both contain

# awesome-methods-1.0.jar, then Java will pick up both versions.


# An additional, custom CLASSPATH. Site-wide configs should be

# handled via the shellprofile functionality, utilizing the

# hadoop_add_classpath function for greater control and much

# harder for apps/end-users to accidentally override.

# Similarly, end users should utilize ${HOME}/.hadooprc .

# This variable should ideally only be used as a short-cut,

# interactive way for temporary additions on the command line.

export HADOOP_CLASSPATH=$EXTRA_PATH:$JAVA_HOME/bin:$HADOOP_CLASSPATH


# Should HADOOP_CLASSPATH be first in the official CLASSPATH?

# export HADOOP_USER_CLASSPATH_FIRST="yes"


# If HADOOP_USE_CLIENT_CLASSLOADER is set, the classpath along

# with the main jar are handled by a separate isolated

# client classloader when 'hadoop jar', 'yarn jar', or 'mapred job'

# is utilized. If it is set, HADOOP_CLASSPATH and

# HADOOP_USER_CLASSPATH_FIRST are ignored.

# export HADOOP_USE_CLIENT_CLASSLOADER=true


# HADOOP_CLIENT_CLASSLOADER_SYSTEM_CLASSES overrides the default definition of

# system classes for the client classloader when HADOOP_USE_CLIENT_CLASSLOADER

# is enabled. Names ending in '.' (period) are treated as package names, and

# names starting with a '-' are treated as negative matches. For example,

# export HADOOP_CLIENT_CLASSLOADER_SYSTEM_CLASSES="-org.apache.hadoop.UserClass,java.,javax.,org.apache.hadoop."


# Enable optional, bundled Hadoop features

# This is a comma delimited list.  It may NOT be overridden via .hadooprc

# Entries may be added/removed as needed.

# export HADOOP_OPTIONAL_TOOLS="hadoop-aliyun,hadoop-openstack,hadoop-azure,hadoop-azure-datalake,hadoop-aws,hadoop-kafka"


###

# Options for remote shell connectivity

###


# There are some optional components of hadoop that allow for

# command and control of remote hosts.  For example,

# start-dfs.sh will attempt to bring up all NNs, DNS, etc.


# Options to pass to SSH when one of the "log into a host and

# start/stop daemons" scripts is executed

# export HADOOP_SSH_OPTS="-o BatchMode=yes -o StrictHostKeyChecking=no -o ConnectTimeout=10s"


# The built-in ssh handler will limit itself to 10 simultaneous connections.

# For pdsh users, this sets the fanout size ( -f )

# Change this to increase/decrease as necessary.

# export HADOOP_SSH_PARALLEL=10


# Filename which contains all of the hosts for any remote execution

# helper scripts # such as workers.sh, start-dfs.sh, etc.

# export HADOOP_WORKERS="${HADOOP_CONF_DIR}/workers"


###

# Options for all daemons

###

#


#

# Many options may also be specified as Java properties.  It is

# very common, and in many cases, desirable, to hard-set these

# in daemon _OPTS variables.  Where applicable, the appropriate

# Java property is also identified.  Note that many are re-used

# or set differently in certain contexts (e.g., secure vs

# non-secure)

#


# Where (primarily) daemon log files are stored.

# ${HADOOP_HOME}/logs by default.

# Java property: hadoop.log.dir

# export HADOOP_LOG_DIR=${HADOOP_HOME}/logs


# A string representing this instance of hadoop. $USER by default.

# This is used in writing log and pid files, so keep that in mind!

# Java property: hadoop.id.str

# export HADOOP_IDENT_STRING=$USER


# How many seconds to pause after stopping a daemon

# export HADOOP_STOP_TIMEOUT=5


# Where pid files are stored.  /tmp by default.

# export HADOOP_PID_DIR=/tmp


# Default log4j setting for interactive commands

# Java property: hadoop.root.logger

# export HADOOP_ROOT_LOGGER=INFO,console


# Default log4j setting for daemons spawned explicitly by

# --daemon option of hadoop, hdfs, mapred and yarn command.

# Java property: hadoop.root.logger

# export HADOOP_DAEMON_ROOT_LOGGER=INFO,RFA


# Default log level and output location for security-related messages.

# You will almost certainly want to change this on a per-daemon basis via

# the Java property (i.e., -Dhadoop.security.logger=foo). (Note that the

# defaults for the NN and 2NN override this by default.)

# Java property: hadoop.security.logger

# export HADOOP_SECURITY_LOGGER=INFO,NullAppender


# Default process priority level

# Note that sub-processes will also run at this level!

# export HADOOP_NICENESS=0


# Default name for the service level authorization file

# Java property: hadoop.policy.file

# export HADOOP_POLICYFILE="hadoop-policy.xml"


#

# NOTE: this is not used by default!  <-----

# You can define variables right here and then re-use them later on.

# For example, it is common to use the same garbage collection settings

# for all the daemons.  So one could define:

#

# export HADOOP_GC_SETTINGS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps"

#

# .. and then use it as per the b option under the namenode.


###

# Secure/privileged execution

###


#

# Out of the box, Hadoop uses jsvc from Apache Commons to launch daemons

# on privileged ports.  This functionality can be replaced by providing

# custom functions.  See hadoop-functions.sh for more information.

#


# The jsvc implementation to use. Jsvc is required to run secure datanodes

# that bind to privileged ports to provide authentication of data transfer

# protocol.  Jsvc is not required if SASL is configured for authentication of

# data transfer protocol using non-privileged ports.

# export JSVC_HOME=/usr/bin


#

# This directory contains pids for secure and privileged processes.

#export HADOOP_SECURE_PID_DIR=${HADOOP_PID_DIR}


#

# This directory contains the logs for secure and privileged processes.

# Java property: hadoop.log.dir

# export HADOOP_SECURE_LOG=${HADOOP_LOG_DIR}


#

# When running a secure daemon, the default value of HADOOP_IDENT_STRING

# ends up being a bit bogus.  Therefore, by default, the code will

# replace HADOOP_IDENT_STRING with HADOOP_xx_SECURE_USER.  If one wants

# to keep HADOOP_IDENT_STRING untouched, then uncomment this line.

# export HADOOP_SECURE_IDENT_PRESERVE="true"


###

# NameNode specific parameters

###


# Default log level and output location for file system related change

# messages. For non-namenode daemons, the Java property must be set in

# the appropriate _OPTS if one wants something other than INFO,NullAppender

# Java property: hdfs.audit.logger

# export HDFS_AUDIT_LOGGER=INFO,NullAppender


# Specify the JVM options to be used when starting the NameNode.

# These options will be appended to the options specified as HADOOP_OPTS

# and therefore may override any similar flags set in HADOOP_OPTS

#

# a) Set JMX options

# export HDFS_NAMENODE_OPTS="-Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=1026"

#

# b) Set garbage collection logs

# export HDFS_NAMENODE_OPTS="${HADOOP_GC_SETTINGS} -Xloggc:${HADOOP_LOG_DIR}/gc-rm.log-$(date +'%Y%m%d%H%M')"

#

# c) ... or set them directly

# export HDFS_NAMENODE_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xloggc:${HADOOP_LOG_DIR}/gc-rm.log-$(date +'%Y%m%d%H%M')"


# this is the default:

# export HDFS_NAMENODE_OPTS="-Dhadoop.security.logger=INFO,RFAS"


###

# SecondaryNameNode specific parameters

###

# Specify the JVM options to be used when starting the SecondaryNameNode.

# These options will be appended to the options specified as HADOOP_OPTS

# and therefore may override any similar flags set in HADOOP_OPTS

#

# This is the default:

# export HDFS_SECONDARYNAMENODE_OPTS="-Dhadoop.security.logger=INFO,RFAS"


###

# DataNode specific parameters

###

# Specify the JVM options to be used when starting the DataNode.

# These options will be appended to the options specified as HADOOP_OPTS

# and therefore may override any similar flags set in HADOOP_OPTS

#

# This is the default:

# export HDFS_DATANODE_OPTS="-Dhadoop.security.logger=ERROR,RFAS"


# On secure datanodes, user to run the datanode as after dropping privileges.

# This **MUST** be uncommented to enable secure HDFS if using privileged ports

# to provide authentication of data transfer protocol.  This **MUST NOT** be

# defined if SASL is configured for authentication of data transfer protocol

# using non-privileged ports.

# This will replace the hadoop.id.str Java property in secure mode.

# export HDFS_DATANODE_SECURE_USER=hdfs


# Supplemental options for secure datanodes

# By default, Hadoop uses jsvc which needs to know to launch a

# server jvm.

# export HDFS_DATANODE_SECURE_EXTRA_OPTS="-jvm server"


###

# NFS3 Gateway specific parameters

###

# Specify the JVM options to be used when starting the NFS3 Gateway.

# These options will be appended to the options specified as HADOOP_OPTS

# and therefore may override any similar flags set in HADOOP_OPTS

#

# export HDFS_NFS3_OPTS=""


# Specify the JVM options to be used when starting the Hadoop portmapper.

# These options will be appended to the options specified as HADOOP_OPTS

# and therefore may override any similar flags set in HADOOP_OPTS

#

# export HDFS_PORTMAP_OPTS="-Xmx512m"


# Supplemental options for priviliged gateways

# By default, Hadoop uses jsvc which needs to know to launch a

# server jvm.

# export HDFS_NFS3_SECURE_EXTRA_OPTS="-jvm server"


# On privileged gateways, user to run the gateway as after dropping privileges

# This will replace the hadoop.id.str Java property in secure mode.

# export HDFS_NFS3_SECURE_USER=nfsserver


###

# ZKFailoverController specific parameters

###

# Specify the JVM options to be used when starting the ZKFailoverController.

# These options will be appended to the options specified as HADOOP_OPTS

# and therefore may override any similar flags set in HADOOP_OPTS

#

# export HDFS_ZKFC_OPTS=""


###

# QuorumJournalNode specific parameters

###

# Specify the JVM options to be used when starting the QuorumJournalNode.

# These options will be appended to the options specified as HADOOP_OPTS

# and therefore may override any similar flags set in HADOOP_OPTS

#

# export HDFS_JOURNALNODE_OPTS=""


###

# HDFS Balancer specific parameters

###

# Specify the JVM options to be used when starting the HDFS Balancer.

# These options will be appended to the options specified as HADOOP_OPTS

# and therefore may override any similar flags set in HADOOP_OPTS

#

# export HDFS_BALANCER_OPTS=""


###

# HDFS Mover specific parameters

###

# Specify the JVM options to be used when starting the HDFS Mover.

# These options will be appended to the options specified as HADOOP_OPTS

# and therefore may override any similar flags set in HADOOP_OPTS

#

# export HDFS_MOVER_OPTS=""


###

# Router-based HDFS Federation specific parameters

# Specify the JVM options to be used when starting the RBF Routers.

# These options will be appended to the options specified as HADOOP_OPTS

# and therefore may override any similar flags set in HADOOP_OPTS

#

# export HDFS_DFSROUTER_OPTS=""


###

# HDFS StorageContainerManager specific parameters

###

# Specify the JVM options to be used when starting the HDFS Storage Container Manager.

# These options will be appended to the options specified as HADOOP_OPTS

# and therefore may override any similar flags set in HADOOP_OPTS

#

# export HDFS_STORAGECONTAINERMANAGER_OPTS=""


###

# Advanced Users Only!

###


#

# When building Hadoop, one can add the class paths to the commands

# via this special env var:

# export HADOOP_ENABLE_BUILD_PATHS="true"


#

# To prevent accidents, shell commands be (superficially) locked

# to only allow certain users to execute certain subcommands.

# It uses the format of (command)_(subcommand)_USER.

#

# For example, to limit who can execute the namenode command,

# export HDFS_NAMENODE_USER=hdfs

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64


# Enable s3

export HADOOP_OPTIONAL_TOOLS="hadoop-aws"

echo "Ensure AWS Credentials are added to hadoop-env.sh and core-site.xml,
by running add-aws-keys.sh"

export AWS_ACCESS_KEY_ID=KEY_PLACEHOLDER

export AWS_SECRET_ACCESS_KEY=SECRET_PLACEHOLDER
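
(add-aws-keys.sh, referenced by the echo above, is a local helper script and
not part of Hadoop; a minimal sketch of what such a script might look like,
assuming the KEY_PLACEHOLDER/SECRET_PLACEHOLDER strings used in these configs:)

  #!/usr/bin/env bash
  # Hypothetical helper: substitute real AWS credentials for the placeholders
  # in hadoop-env.sh and core-site.xml. Usage: ./add-aws-keys.sh <key> <secret>
  set -euo pipefail
  KEY="$1"
  SECRET="$2"
  CONF_DIR="$HOME/hadoop-3.3.0/etc/hadoop"
  sed -i "s|KEY_PLACEHOLDER|$KEY|g"       "$CONF_DIR/hadoop-env.sh" "$CONF_DIR/core-site.xml"
  sed -i "s|SECRET_PLACEHOLDER|$SECRET|g" "$CONF_DIR/hadoop-env.sh" "$CONF_DIR/core-site.xml"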



hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!--

  Licensed under the Apache License, Version 2.0 (the "License");

  you may not use this file except in compliance with the License.

  You may obtain a copy of the License at


    http://www.apache.org/licenses/LICENSE-2.0


  Unless required by applicable law or agreed to in writing, software

  distributed under the License is distributed on an "AS IS" BASIS,

  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

  See the License for the specific language governing permissions and

  limitations under the License. See accompanying LICENSE file.

-->


<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>

  <name>dfs.namenode.name.dir</name>

  <value>/home/hdoop/dfsdata/namenode</value>

</property>

<property>

  <name>dfs.datanode.data.dir</name>

  <value>/home/hdoop/dfsdata/datanode</value>

</property>

<property>

  <name>dfs.replication</name>

  <value>1</value>

</property>

</configuration>



mapred-site.xml

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!--

  Licensed under the Apache License, Version 2.0 (the "License");

  you may not use this file except in compliance with the License.

  You may obtain a copy of the License at


    http://www.apache.org/licenses/LICENSE-2.0


  Unless required by applicable law or agreed to in writing, software

  distributed under the License is distributed on an "AS IS" BASIS,

  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

  See the License for the specific language governing permissions and

  limitations under the License. See accompanying LICENSE file.

-->


<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>

  <name>mapreduce.framework.name</name>

  <value>yarn</value>

</property>

<property>

            <name>yarn.app.mapreduce.am.env</name>

    <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>

    </property>

    <property>

            <name>mapreduce.map.env</name>

    <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>

    </property>

    <property>

            <name>mapreduce.reduce.env</name>

    <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>

    </property>

    <!--

    <property>

    <name>mapreduce.application.classpath</name>


  <value>/home/hdoop/hadoop-3.3.0/etc/hadoop:/home/hdoop/hadoop-3.3.0/share/hadoop/common/lib/*:/home/hdoop/hadoop-3.3.0/share/hadoop/common/*:/home/hdoop/hadoop-3.3.0/share/hadoop/hdfs:/home/hdoop/hadoop-3.3.0/share/hadoop/hdfs/lib/*:/home/hdoop/hadoop-3.3.0/share/hadoop/hdfs/*:/home/hdoop/hadoop-3.3.0/share/hadoop/mapreduce/*:/home/hdoop/hadoop-3.3.0/share/hadoop/yarn:/home/hdoop/hadoop-3.3.0/share/hadoop/yarn/lib/*:/home/hdoop/hadoop-3.3.0/share/hadoop/yarn/*:/home/hdoop/hadoop-3.3.0/bin:/home/hdoop/hadoop-3.3.0/sbin</value>

    </property>

 -->

    <property>

    <name>mapreduce.application.classpath</name>


  <value>home/hdoop/hadoop-3.3.0/etc/hadoop:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/*:/home/hdoop/hadoop-3.3.0//share/hadoop/common/*:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/*:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/*:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/lib/*:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/*:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/*:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/*:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/accessors-smart-1.2.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/animal-sniffer-annotations-1.17.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/asm-5.0.4.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/audience-annotations-0.5.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/avro-1.7.7.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/checker-qual-2.5.2.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/commons-beanutils-1.9.3.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/commons-cli-1.2.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/commons-codec-1.11.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/commons-collections-3.2.2.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/commons-compress-1.18.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/commons-configuration2-2.1.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/commons-io-2.5.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/commons-lang3-3.7.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/commons-logging-1.1.3.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/commons-math3-3.1.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/commons-net-3.6.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/commons-text-1.4.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/curator-client-2.13.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/curator-framework-2.13.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/curator-recipes-2.13.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/dnsjava-2.1.7.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/error_prone_annotations-2.2.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/failureaccess-1.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/gson-2.2.4.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/guava-27.0-jre.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/hadoop-annotations-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/hadoop-auth-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/htrace-core4-4.1.0-incubating.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/httpclient-4.5.6.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/httpcore-4.4.10.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/j2objc-annotations-1.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jackson-annotations-2.9.8.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jackson-core-2.9.8.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jackson-core-asl-1.9.13.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jackson-databind-2.9.8.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jackson-jaxrs-1.9.13.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jackson-mapper-asl-1.9.13.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jackson-xc-1.9.13.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/javax.servlet-api-3.1.0.jar:/home/hdoop/hadoop-3.3.0//share/ha
doop/common/lib/jaxb-api-2.2.11.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jaxb-impl-2.2.3-1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jcip-annotations-1.0-1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jersey-core-1.19.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jersey-json-1.19.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jersey-server-1.19.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jersey-servlet-1.19.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jettison-1.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jetty-http-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jetty-io-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jetty-security-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jetty-server-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jetty-servlet-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jetty-util-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jetty-webapp-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jetty-xml-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jsch-0.1.54.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/json-smart-2.3.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jsp-api-2.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jsr305-3.0.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jsr311-api-1.1.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jul-to-slf4j-1.7.25.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/kerb-admin-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/kerb-client-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/kerb-common-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/kerb-core-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/kerb-crypto-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/kerb-identity-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/kerb-server-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/kerb-simplekdc-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/kerb-util-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/kerby-asn1-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/kerby-config-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/kerby-pkix-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/kerby-util-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/kerby-xdr-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/log4j-1.2.17.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/metrics-core-3.2.4.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/netty-3.10.5.Final.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/nimbus-jose-jwt-4.41.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/paranamer-2.3.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/protobuf-java-2.5.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/re2j-1.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/slf4j-api-1.7.25.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/snappy-java-1.0.5.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/stax2-api-3.1.4.jar:/home/hdoop/hadoop-3.3.0//share/hadoo
p/common/lib/token-provider-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/woodstox-core-5.0.3.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/zookeeper-3.4.13.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/hadoop-common-3.3.0-tests.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/hadoop-common-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/hadoop-kms-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/hadoop-nfs-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/jdiff:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib:/home/hdoop/hadoop-3.3.0//share/hadoop/common/sources:/home/hdoop/hadoop-3.3.0//share/hadoop/common/webapps:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/accessors-smart-1.2.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/animal-sniffer-annotations-1.17.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/asm-5.0.4.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/audience-annotations-0.5.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/avro-1.7.7.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/checker-qual-2.5.2.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/commons-beanutils-1.9.3.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/commons-cli-1.2.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/commons-codec-1.11.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/commons-collections-3.2.2.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/commons-compress-1.18.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/commons-configuration2-2.1.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/commons-daemon-1.0.13.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/commons-io-2.5.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/commons-lang3-3.7.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/commons-logging-1.1.3.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/commons-math3-3.1.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/commons-net-3.6.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/commons-text-1.4.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/curator-client-2.13.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/curator-framework-2.13.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/curator-recipes-2.13.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/dnsjava-2.1.7.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/error_prone_annotations-2.2.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/failureaccess-1.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/gson-2.2.4.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/guava-27.0-jre.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/hadoop-annotations-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/hadoop-auth-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/htrace-core4-4.1.0-incubating.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/httpclient-4.5.6.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/httpcore-4.4.10.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/j2objc-annotations-1.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jackson-annotations-2.9.8.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jackson-core-2.9.8.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jackson-core-asl-1.9.13.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jackson-databind-2.9.8.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jackson-jaxrs-1.9.13.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jackson-mapper-asl-1.9.13.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs
/lib/jackson-xc-1.9.13.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/javax.servlet-api-3.1.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jaxb-api-2.2.11.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jaxb-impl-2.2.3-1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jcip-annotations-1.0-1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jersey-core-1.19.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jersey-json-1.19.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jersey-server-1.19.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jersey-servlet-1.19.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jettison-1.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jetty-http-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jetty-io-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jetty-security-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jetty-server-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jetty-servlet-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jetty-util-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jetty-util-ajax-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jetty-webapp-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jetty-xml-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jsch-0.1.54.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/json-simple-1.1.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/json-smart-2.3.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jsr305-3.0.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jsr311-api-1.1.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/kerb-admin-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/kerb-client-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/kerb-common-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/kerb-core-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/kerb-crypto-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/kerb-identity-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/kerb-server-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/kerb-simplekdc-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/kerb-util-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/kerby-asn1-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/kerby-config-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/kerby-pkix-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/kerby-util-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/kerby-xdr-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/leveldbjni-all-1.8.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/log4j-1.2.17.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/netty-3.10.5.Final.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/netty-all-4.0.52.Final.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/nimbus-jose-jwt-4.41.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/okhttp-2.7.5.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/okio-1.6.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/paranamer-2.3.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/protobuf-java-2.5.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/re2j-1.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/snappy
-java-1.0.5.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/stax2-api-3.1.4.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/token-provider-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/woodstox-core-5.0.3.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/zookeeper-3.4.13.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/hadoop-hdfs-3.3.0-tests.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/hadoop-hdfs-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/hadoop-hdfs-client-3.3.0-tests.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/hadoop-hdfs-client-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/hadoop-hdfs-httpfs-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/hadoop-hdfs-native-client-3.3.0-tests.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/hadoop-hdfs-native-client-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/hadoop-hdfs-nfs-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/hadoop-hdfs-rbf-3.3.0-tests.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/hadoop-hdfs-rbf-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/jdiff:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/sources:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/webapps:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/lib/hamcrest-core-1.3.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/lib/junit-4.11.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/hadoop-mapreduce-client-app-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/hadoop-mapreduce-client-common-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/hadoop-mapreduce-client-hs-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/hadoop-mapreduce-client-hs-plugins-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.3.0-tests.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/hadoop-mapreduce-client-nativetask-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/hadoop-mapreduce-client-shuffle-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/hadoop-mapreduce-client-uploader-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/jdiff:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/lib:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/lib-examples:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/sources:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/HikariCP-java7-2.4.12.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/aopalliance-1.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/bcpkix-jdk15on-1.60.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/bcprov-jdk15on-1.60.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/ehcache-3.3.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/fst-2.50.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/geronimo-jcache_1.0_spec-1.0-alpha-1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/guice-4.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/guice-servlet-4.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/jackson-jaxrs-base-2.9.8.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/jackson-jaxrs-json-provider-2.9.8.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/jackson-module-jaxb-annotations-2.9.8.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/java-u
til-1.9.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/javax.inject-1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/jersey-client-1.19.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/jersey-guice-1.19.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/json-io-2.5.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/metrics-core-3.2.4.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/mssql-jdbc-6.2.1.jre7.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/objenesis-1.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/snakeyaml-1.16.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/swagger-annotations-1.5.4.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-api-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-applications-distributedshell-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-applications-unmanaged-am-launcher-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-client-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-common-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-registry-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-server-applicationhistoryservice-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-server-common-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-server-nodemanager-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-server-resourcemanager-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-server-router-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-server-sharedcachemanager-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-server-tests-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-server-timeline-pluginstorage-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-server-web-proxy-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-services-api-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-services-core-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-submarine-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/sources:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/test:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/timelineservice:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/webapps:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/yarn-service-examples:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/sources/hadoop-mapreduce-client-app-3.3.0-sources.jar:/home/hdoop/hadoop-3.3.0//hadoop-tools/hadoop-aws/target/hadoop-aws-3.3.0.jar:/home/hdoop/hadoop-3.3.0/hadoop-tools/hadoop-aws/target/hadoop-aws-3.3.0.jar:/home/hdoop/hadoop-3.3.0/hadoop-tools/hadoop-aws/target/lib/aws-java-sdk-bundle-1.11.563.jar:/home/hdoop/nutch/build/lib/commons-jexl3-3.1.jar:/home/hdoop/nutch/lib/*</value>

</property>


 <property>

        <name>yarn.app.mapreduce.am.resource.mb</name>

        <value>512</value>

</property>


<property>

        <name>mapreduce.map.memory.mb</name>

        <value>256</value>

</property>


<property>

        <name>mapreduce.reduce.memory.mb</name>

        <value>256</value>

</property>


<!--from NutchHadoop Tutorial -->

<property>

  <name>mapred.system.dir</name>

  <value>/home/hdoop/dfsdata/mapreduce/system</value>

</property>


<property>

  <name>mapred.local.dir</name>

  <value>/home/hdoop/dfsdata/mapreduce/local</value>

</property>


</configuration>



workers

hadoop02Name

hadoop01Name



yarn-site.xml

<?xml version="1.0"?>

<!--

  Licensed under the Apache License, Version 2.0 (the "License");

  you may not use this file except in compliance with the License.

  You may obtain a copy of the License at


    http://www.apache.org/licenses/LICENSE-2.0


  Unless required by applicable law or agreed to in writing, software

  distributed under the License is distributed on an "AS IS" BASIS,

  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

  See the License for the specific language governing permissions and

  limitations under the License. See accompanying LICENSE file.

-->

<configuration>

<property>

  <name>yarn.nodemanager.aux-services</name>

  <value>mapreduce_shuffle</value>

</property>

<property>

  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>

  <value>org.apache.hadoop.mapred.ShuffleHandler</value>

</property>

<property>

  <name>yarn.resourcemanager.hostname</name>

  <value>IP_PLACEHOLDER</value>

</property>

<property>

  <name>yarn.acl.enable</name>

  <value>0</value>

</property>

<property>

  <name>yarn.nodemanager.env-whitelist</name>

  <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>

</property>


<!--<property>

 <name>yarn.application.classpath</name>

 <value>$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,$HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*,$USS_HOME/*,$USS_CONF</value>

</property>-->


<property>

        <name>yarn.nodemanager.resource.memory-mb</name>

        <value>1536</value>

</property>


<property>

        <name>yarn.scheduler.maximum-allocation-mb</name>

        <value>1536</value>

</property>


<property>

        <name>yarn.scheduler.minimum-allocation-mb</name>

        <value>128</value>

</property>


<property>

        <name>yarn.nodemanager.vmem-check-enabled</name>

        <value>false</value>

</property>

<property>

        <name>yarn.nodemanager.local-dirs</name>

        <value>${hadoop.tmp.dir}/nm-local-dir</value>

</property>


</configuration>


On Thu, Jun 17, 2021 at 11:55 AM Clark Benham <cl...@thehive.ai> wrote:

>
> Hi Sebastian,
>
> NUTCH_HOME=~/nutch; the local filesystem. I am using a plain, pre-built
> hadoop.
> There's no "mapreduce.job.dir" I can grep in Hadoop 3.2.1,3.3.0, or
> Nutch-1.18, 1.19, but mapreduce.job.hdfs-servers defaults to
> ${fs.defaultFS}, so s3a://temp-crawler in our case.
> The plugin loader doesn't appear to be able to read from s3 in nutch-1.18
> with hadoop-3.2.1[1].
>
> Using java & javac 11 with hadoop-3.3.0 downloaded and untarred, and a
> nutch-1.19 I built:
> I can run a mapreduce job on S3 and a Nutch job on HDFS, but running
> Nutch on S3 still gives "URLNormalizer not found", with the plugin dir on
> the local filesystem or on S3a.
>
> How would you recommend I go about getting the plugin loader to read from
> other file systems?
>
> [1]  I still get 'x point org.apache.nutch.net.URLNormalizer not found'
> (same stack trace as previous email) with
> `<name>plugin.folders</name>
> <value>s3a://temp-crawler/user/hdoop/nutch-plugins</value>`
> set in my nutch-site.xml while `hadoop fs -ls
> s3a://temp-crawler/user/hdoop/nutch-plugins` lists all the plugins as there.
>
>
> For posterity:
> I got hadoop-3.3.0 working with a S3 backend by:
>
> cd ~/hadoop-3.3.0
>
> cp ./share/hadoop/tools/lib/hadoop-aws-3.3.0.jar ./share/hadoop/common/lib
>
> cp ./share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.563.jar
> ./share/hadoop/common/lib
> to solve "Class org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory not
> found" despite the class existing in
> ~/hadoop-3.3.0/share/hadoop/tools/lib/hadoop-aws-3.3.0.jar  checking it's
> on the classpath with `hadoop classpath | tr ":" "\n"  | grep
> share/hadoop/tools/lib/hadoop-aws-3.3.0.jar` as well as adding it to
> hadoop-env.sh.
> see
> https://stackoverflow.com/questions/58415928/spark-s3-error-java-lang-classnotfoundexception-class-org-apache-hadoop-f
>
> On Tue, Jun 15, 2021 at 2:01 AM Sebastian Nagel
> <wa...@googlemail.com.invalid> wrote:
>
>>  > The local file system? Or hdfs:// or even s3:// resp. s3a://?
>>
>> Also important: the value of "mapreduce.job.dir" - it's usually
>> on hdfs:// and I'm not sure whether the plugin loader is able to
>> read from other filesystems. At least, I haven't tried.
>>
>>
>> On 6/15/21 10:53 AM, Sebastian Nagel wrote:
>> > Hi Clark,
>> >
>> > sorry, I should read your mail until the end - you mentioned that
>> > you downgraded Nutch to run with JDK 8.
>> >
>> > Could you share to which filesystem does NUTCH_HOME point?
>> > The local file system? Or hdfs:// or even s3:// resp. s3a://?
>> >
>> > Best,
>> > Sebastian
>> >
>> >
>> > On 6/15/21 10:24 AM, Clark Benham wrote:
>> >> Hi,
>> >>
>> >>
>> >> I am trying to run Nutch-1.19 on hadoop-3.2.1 with an S3
>> >> backend/filesystem; however I get an error ‘URLNormalizer class not
>> found’.
>> >> I have edited nutch-site.xml so this plugin should be included:
>> >>
>> >> <property>
>> >>
>> >>    <name>plugin.includes</name>
>> >>
>> >>
>> >>
>> <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|mimetype-filter|urlnormalizer|urlnormalizer-basic|.*|nutch-extensionpoints</value>
>>
>> >>
>> >>
>> >>
>> >> </property>
>> >>
>> >>   and then built on both nodes (I only have 2 machines).  I’ve
>> successfully
>> >> run Nutch locally and in distributed mode using HDFS, and I’ve run a
>> >> mapreduce job with S3 as hadoop’s file system.
>> >>
>> >>
>> >> I thought it was possible Nutch is not reading nutch-site.xml, because I
>> >> can resolve an error by setting the config through the CLI even though
>> >> this duplicates nutch-site.xml.
>> >>
>> >> The command:
>> >>
>> >> `hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
>> >> org.apache.nutch.fetcher.Fetcher
>> >> crawl/crawldb crawl/segments`
>> >>
>> >> throws
>> >>
>> >> `java.lang.IllegalArgumentException: Fetcher: No agents listed in '
>> >> http.agent.name' property`
>> >>
>> >> while if I pass a value in for http.agent.name with
>> >> `-Dhttp.agent.name=myScrapper`,
>> >> (making the command `hadoop jar
>> >> $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
>> >> org.apache.nutch.fetcher.Fetcher
>> >> -Dhttp.agent.name=clark crawl/crawldb crawl/segments`),  I get an
>> error
>> >> about there being no input path, which makes sense as I haven’t been
>> able
>> >> to generate any segments.
>> >>
>> >>
>> >>   However, this method of setting Nutch configs doesn’t work for
>> >> injecting URLs; e.g.:
>> >>
>> >> `hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
>> >> org.apache.nutch.crawl.Injector
>> >> -Dplugin.includes=".*" crawl/crawldb urls`
>> >>
>> >> fails with the same “URLNormalizer” not found.
>> >>
>> >>
>> >> I tried copying the plugin dir to S3 and setting
>> >> <name>plugin.folders</name> to be a path on S3 without success. (I
>> expect
>> >> the plugin to be bundled with the .job so this step should be
>> unnecessary)
>> >>
>> >>
>> >> The full stack trace for `hadoop jar
>> >> $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
>> >> org.apache.nutch.crawl.Injector
>> >> crawl/crawldb urls`:
>> >>
>> >> SLF4J: Class path contains multiple SLF4J bindings.
>> >>
>> >> SLF4J: Found binding in
>> >>
>> [jar:file:/home/hdoop/hadoop-3.2.1/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> >>
>> >> SLF4J: Found binding in
>> >>
>> [jar:file:/home/hdoop/apache-nutch-1.18/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> >>
>> >> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
>> >> explanation.
>> >>
>> >> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>> >>
>> >> # Removed multiple INFO messages
>> >>
>> >> 2021-06-15 07:06:07,842 INFO mapreduce.Job: Task Id :
>> >> attempt_1623740678244_0001_m_000001_0, Status : FAILED
>> >>
>> >> Error: java.lang.RuntimeException: x point
>> >> org.apache.nutch.net.URLNormalizer not found.
>> >>
>> >> at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:145)
>> >>
>> >> at
>> org.apache.nutch.crawl.Injector$InjectMapper.setup(Injector.java:139)
>> >>
>> >> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
>> >>
>> >> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)
>> >>
>> >> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
>> >>
>> >> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
>> >>
>> >> at java.security.AccessController.doPrivileged(Native Method)
>> >>
>> >> at javax.security.auth.Subject.doAs(Subject.java:422)
>> >>
>> >> at
>> >>
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>> >>
>> >> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
>> >>
>> >>
>> >> #This error repeats 6 times total, 3 times for each node
>> >>
>> >>
>> >> 2021-06-15 07:06:26,035 INFO mapreduce.Job:  map 100% reduce 100%
>> >>
>> >> 2021-06-15 07:06:29,067 INFO mapreduce.Job: Job job_1623740678244_0001
>> >> failed with state FAILED due to: Task failed
>> >> task_1623740678244_0001_m_000001
>> >>
>> >> Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0
>> >> killedReduces: 0
>> >>
>> >>
>> >> 2021-06-15 07:06:29,190 INFO mapreduce.Job: Counters: 14
>> >>
>> >> Job Counters
>> >>
>> >> Failed map tasks=7
>> >>
>> >> Killed map tasks=1
>> >>
>> >> Killed reduce tasks=1
>> >>
>> >> Launched map tasks=8
>> >>
>> >> Other local map tasks=6
>> >>
>> >> Rack-local map tasks=2
>> >>
>> >> Total time spent by all maps in occupied slots (ms)=63196
>> >>
>> >> Total time spent by all reduces in occupied slots (ms)=0
>> >>
>> >> Total time spent by all map tasks (ms)=31598
>> >>
>> >> Total vcore-milliseconds taken by all map tasks=31598
>> >>
>> >> Total megabyte-milliseconds taken by all map tasks=8089088
>> >>
>> >> Map-Reduce Framework
>> >>
>> >> CPU time spent (ms)=0
>> >>
>> >> Physical memory (bytes) snapshot=0
>> >>
>> >> Virtual memory (bytes) snapshot=0
>> >>
>> >> 2021-06-15 07:06:29,195 ERROR crawl.Injector: Injector job did not
>> succeed,
>> >> job status: FAILED, reason: Task failed
>> task_1623740678244_0001_m_000001
>> >>
>> >> Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0
>> >> killedReduces: 0
>> >>
>> >>
>> >> 2021-06-15 07:06:29,562 ERROR crawl.Injector: Injector:
>> >> java.lang.RuntimeException: Injector job did not succeed, job status:
>> >> FAILED, reason: Task failed task_1623740678244_0001_m_000001
>> >>
>> >> Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0
>> >> killedReduces: 0
>> >>
>> >>
>> >> at org.apache.nutch.crawl.Injector.inject(Injector.java:444)
>> >>
>> >> at org.apache.nutch.crawl.Injector.run(Injector.java:571)
>> >>
>> >> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>> >>
>> >> at org.apache.nutch.crawl.Injector.main(Injector.java:535)
>> >>
>> >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> >>
>> >> at
>> >>
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>> >>
>> >> at
>> >>
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> >>
>> >> at java.lang.reflect.Method.invoke(Method.java:498)
>> >>
>> >> at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
>> >>
>> >> at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
>> >>
>> >>
>> >>
>> >>
>> >> P.S.
>> >>
>> >> I am using a downloaded hadoop-3.2.1; the only odd thing about my nutch
>> >> build is that I had to replace all instances of “javac.version” with
>> >> “ant.java.version”, as the javac version was 11 while java’s was 1.8, giving
>> >> the error ‘javac: invalid target release: 11’:
>> >>
>> >> grep -rl "javac.version" . --include "*.xml" | xargs sed -i
>> >> s^javac.version^ant.java.version^g
>> >>
>> >> grep -rl "ant.ant" . --include "*.xml" | xargs sed -i s^ant.ant.^ant.^g
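>> >>
>> >> (A quick check that the substitution caught everything, e.g.
>> >>
>> >> grep -rn "javac.version" . --include "*.xml" | head
>> >>
>> >> should then come back empty.)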
>> >>
>> >
>>
>>

Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormlizer not found'

Posted by Lewis John McGibbney <le...@apache.org>.
Hi Clark,
This is a lot of information... thank you for compiling it all.
Ideally the version of Hadoop being used with Nutch should ALWAYS match the hadoop binaries referenced in https://github.com/apache/nutch/blob/master/ivy/ivy.xml. This way you won't run into classpath issues.
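For example (a rough check, assuming the deploy job file built above): comparing the cluster's Hadoop version with the Hadoop jars bundled into the job usually makes a mismatch obvious.

hadoop version | head -1
unzip -l $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job | grep -o 'hadoop-common-[0-9.]*jar' | sort -u

If the two disagree, rebuild Nutch after bumping the Hadoop revision in ivy/ivy.xml.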
I would like to encourage you to create a wiki page so we can document this in a user-friendly way... would you be open to that?
You can create an account at https://cwiki.apache.org/confluence/display/NUTCH/Home
Thanks for your consideration.
lewismc

On 2021/07/14 18:27:23, Clark Benham <cl...@thehive.ai> wrote: 
> Hi All,
> 
> Sebastian Helped fix my issue: using S3 as a backend I was able to get
> nutch-1.19 working with pre-built hadoop-3.3.0 and java 11. There was an
> oddity that nutch-1.19 had 11 hadoop 3.1.3 jars, eg.
> hadoop-hdfs-3.1.3.jar, hadoop-yarn-api-3.1.3.jar... ; this made running
> `hadoop version`  give 3.1.3) so I replaced those 3.1.3 jars with the 3.3.0
> jars from the hadoop download.
> Also, in the main nutch branch (
> https://github.com/apache/nutch/blob/master/ivy/ivy.xml) ivy.xml currently
> has dependencies on hadoop-3.1.3; eg.
> <!-- Hadoop Dependencies -->
> <dependency org="org.apache.hadoop" name="hadoop-common" rev="3.1.3"
> conf="*->default">
> <exclude org="hsqldb" name="hsqldb" />
> <exclude org="net.sf.kosmosfs" name="kfs" />z
> <exclude org="net.java.dev.jets3t" name="jets3t" />
> <exclude org="org.eclipse.jdt" name="core" />
> <exclude org="org.mortbay.jetty" name="jsp-*" />
> <exclude org="ant" name="ant" />
> </dependency>
> <dependency org="org.apache.hadoop" name="hadoop-hdfs" rev="3.1.3"
> conf="*->default" />
> <dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-core"
> rev="3.1.3" conf="*->default" />
> <dependency org="org.apache.hadoop"
> name="hadoop-mapreduce-client-jobclient" rev="3.1.3" conf="*->default" />
> <!-- End of Hadoop Dependencies -->
> 
> I set yarn.nodemanager.local-dirs to '${hadoop.tmp.dir}/nm-local-dir'.
> 
> I didn't change "mapreduce.job.dir" because there's no namenode nor
> datanode processes running when using hadoop with S3, so the UI is blank.
> 
> Copied from Email with Sebastian:
> >  > The plugin loader doesn't appear to be able to read from s3 in
> nutch-1.18
> >  > with hadoop-3.2.1[1].
> 
> > I had a look into the plugin loader: it can only read from the local file
> system.
> > But that's ok because the Nutch job file is copied to the local machine
> > and unpacked. Here the paths how it looks like on one of the running
> Common Crawl
> > task nodes:
> 
> The configs for the working hadoop are as follows:
> 
> core-site.xml
> 
> <?xml version="1.0" encoding="UTF-8"?>
> 
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> 
> <!--
> 
>   Licensed under the Apache License, Version 2.0 (the "License");
> 
>   you may not use this file except in compliance with the License.
> 
>   You may obtain a copy of the License at
> 
> 
>     http://www.apache.org/licenses/LICENSE-2.0
> 
> 
>   Unless required by applicable law or agreed to in writing, software
> 
>   distributed under the License is distributed on an "AS IS" BASIS,
> 
>   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> 
>   See the License for the specific language governing permissions and
> 
>   limitations under the License. See accompanying LICENSE file.
> 
> -->
> 
> 
> <!-- Put site-specific property overrides in this file. -->
> 
> <configuration>
> 
> <property>
> 
>   <name>hadoop.tmp.dir</name>
> 
>   <value>/home/hdoop/tmpdata</value>
> 
> </property>
> 
> <property>
> 
>   <name>fs.defaultFS</name>
> 
>   <value>s3a://my-bucket</value>
> 
> </property>
> 
> 
> <property>
> 
>         <name>fs.s3a.access.key</name>
> 
>         <value>KEY_PLACEHOLDER</value>
> 
>   <description>AWS access key ID.
> 
>    Omit for IAM role-based or provider-based authentication.</description>
> 
> </property>
> 
> 
> <property>
> 
>   <name>fs.s3a.secret.key</name>
> 
>   <value>SECRET_PLACEHOLDER</value>
> 
>   <description>AWS secret key.
> 
>    Omit for IAM role-based or provider-based authentication.</description>
> 
> </property>
> 
> 
> <property>
> 
>   <name>fs.s3a.aws.credentials.provider</name>
> 
>   <description>
> 
>     Comma-separated class names of credential provider classes which
> implement
> 
>     com.amazonaws.auth.AWSCredentialsProvider.
> 
> 
>     These are loaded and queried in sequence for a valid set of credentials.
> 
>     Each listed class must implement one of the following means of
> 
>     construction, which are attempted in order:
> 
>     1. a public constructor accepting java.net.URI and
> 
>         org.apache.hadoop.conf.Configuration,
> 
>     2. a public static method named getInstance that accepts no
> 
>        arguments and returns an instance of
> 
>        com.amazonaws.auth.AWSCredentialsProvider, or
> 
>     3. a public default constructor.
> 
> 
>     Specifying org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider
> allows
> 
>     anonymous access to a publicly accessible S3 bucket without any
> credentials.
> 
>     Please note that allowing anonymous access to an S3 bucket compromises
> 
>     security and therefore is unsuitable for most use cases. It can be
> useful
> 
>     for accessing public data sets without requiring AWS credentials.
> 
> 
>     If unspecified, then the default list of credential provider classes,
> 
>     queried in sequence, is:
> 
>     1. org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider:
> 
>        Uses the values of fs.s3a.access.key and fs.s3a.secret.key.
> 
>     2. com.amazonaws.auth.EnvironmentVariableCredentialsProvider: supports
> 
>         configuration of AWS access key ID and secret access key in
> 
>         environment variables named AWS_ACCESS_KEY_ID and
> 
>         AWS_SECRET_ACCESS_KEY, as documented in the AWS SDK.
> 
>     3. com.amazonaws.auth.InstanceProfileCredentialsProvider: supports use
> 
>         of instance profile credentials if running in an EC2 VM.
> 
>   </description>
> 
> </property>
> 
> 
> 
> <dependencies>
> 
>   <dependency>
> 
>     <groupId>org.apache.hadoop</groupId>
> 
>     <artifactId>hadoop-client</artifactId>
> 
>     <version>${hadoop.version}</version>
> 
>   </dependency>
> 
>   <dependency>
> 
>     <groupId>org.apache.hadoop</groupId>
> 
>     <artifactId>hadoop-aws</artifactId>
> 
>     <version>${hadoop.version}</version>
> 
>   </dependency>
> 
> </dependencies>
> 
> 
> </configuration>
> 
> 
> 
> hadoop-env.sh
> 
> #
> 
> # Licensed to the Apache Software Foundation (ASF) under one
> 
> # or more contributor license agreements.  See the NOTICE file
> 
> # distributed with this work for additional information
> 
> # regarding copyright ownership.  The ASF licenses this file
> 
> # to you under the Apache License, Version 2.0 (the
> 
> # "License"); you may not use this file except in compliance
> 
> # with the License.  You may obtain a copy of the License at
> 
> #
> 
> #     http://www.apache.org/licenses/LICENSE-2.0
> 
> #
> 
> # Unless required by applicable law or agreed to in writing, software
> 
> # distributed under the License is distributed on an "AS IS" BASIS,
> 
> # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> 
> # See the License for the specific language governing permissions and
> 
> # limitations under the License.
> 
> 
> # Set Hadoop-specific environment variables here.
> 
> 
> ##
> 
> ## THIS FILE ACTS AS THE MASTER FILE FOR ALL HADOOP PROJECTS.
> 
> ## SETTINGS HERE WILL BE READ BY ALL HADOOP COMMANDS.  THEREFORE,
> 
> ## ONE CAN USE THIS FILE TO SET YARN, HDFS, AND MAPREDUCE
> 
> ## CONFIGURATION OPTIONS INSTEAD OF xxx-env.sh.
> 
> ##
> 
> ## Precedence rules:
> 
> ##
> 
> ## {yarn-env.sh|hdfs-env.sh} > hadoop-env.sh > hard-coded defaults
> 
> ##
> 
> ## {YARN_xyz|HDFS_xyz} > HADOOP_xyz > hard-coded defaults
> 
> ##
> 
> 
> # Many of the options here are built from the perspective that users
> 
> # may want to provide OVERWRITING values on the command line.
> 
> # For example:
> 
> #
> 
> #  JAVA_HOME=/usr/java/testing hdfs dfs -ls
> 
> #
> 
> # Therefore, the vast majority (BUT NOT ALL!) of these defaults
> 
> # are configured for substitution and not append.  If append
> 
> # is preferable, modify this file accordingly.
> 
> 
> ###
> 
> # Generic settings for HADOOP
> 
> ###
> 
> 
> # Technically, the only required environment variable is JAVA_HOME.
> 
> # All others are optional.  However, the defaults are probably not
> 
> # preferred.  Many sites configure these options outside of Hadoop,
> 
> # such as in /etc/profile.d
> 
> 
> # The java implementation to use. By default, this environment
> 
> # variable is REQUIRED on ALL platforms except OS X!
> 
> export HADOOP_HOME=~/hadoop-3.3.0
> 
> export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
> 
> export
> EXTRA_PATH=/home/hdoop/hadoop-3.3.0/share/hadoop/tools/lib/hadoop-aws-3.3.0.jar:/home/hdoop/hadoop-3.3.0/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.563.jar:/home/hdoop/nutch:/home/hdoop/nutch/lib/commons-jexl3-3.1.jar:/home/hdoop/nutch/build/plugins:/home/hdoop/nutch/build/lib/*
> 
> export PATH=$JAVA_HOME/bin:$EXTRA_PATH:$PATH
> 
> 
> # Location of Hadoop.  By default, Hadoop will attempt to determine
> 
> # this location based upon its execution path.
> 
> # export HADOOP_HOME=
> 
> 
> # Location of Hadoop's configuration information.  i.e., where this
> 
> # file is living. If this is not defined, Hadoop will attempt to
> 
> # locate it based upon its execution path.
> 
> #
> 
> # NOTE: It is recommend that this variable not be set here but in
> 
> # /etc/profile.d or equivalent.  Some options (such as
> 
> # --config) may react strangely otherwise.
> 
> #
> 
> # export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
> 
> 
> # The maximum amount of heap to use (Java -Xmx).  If no unit
> 
> # is provided, it will be converted to MB.  Daemons will
> 
> # prefer any Xmx setting in their respective _OPT variable.
> 
> # There is no default; the JVM will autoscale based upon machine
> 
> # memory size.
> 
> # export HADOOP_HEAPSIZE_MAX=
> 
> 
> # The minimum amount of heap to use (Java -Xms).  If no unit
> 
> # is provided, it will be converted to MB.  Daemons will
> 
> # prefer any Xms setting in their respective _OPT variable.
> 
> # There is no default; the JVM will autoscale based upon machine
> 
> # memory size.
> 
> # export HADOOP_HEAPSIZE_MIN=
> 
> 
> # Enable extra debugging of Hadoop's JAAS binding, used to set up
> 
> # Kerberos security.
> 
> # export HADOOP_JAAS_DEBUG=true
> 
> 
> # Extra Java runtime options for all Hadoop commands. We don't support
> 
> # IPv6 yet/still, so by default the preference is set to IPv4.
> 
> # export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true"
> 
> # For Kerberos debugging, an extended option set logs more information
> 
> # export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true
> -Dsun.security.krb5.debug=true -Dsun.security.spnego.debug"
> 
> 
> # Some parts of the shell code may do special things dependent upon
> 
> # the operating system.  We have to set this here. See the next
> 
> # section as to why....
> 
> export HADOOP_OS_TYPE=${HADOOP_OS_TYPE:-$(uname -s)}
> 
> 
> # Extra Java runtime options for some Hadoop commands
> 
> # and clients (i.e., hdfs dfs -blah).  These get appended to HADOOP_OPTS for
> 
> # such commands.  In most cases, # this should be left empty and
> 
> # let users supply it on the command line.
> 
> # export HADOOP_CLIENT_OPTS=""
> 
> 
> #
> 
> # A note about classpaths.
> 
> #
> 
> # By default, Apache Hadoop overrides Java's CLASSPATH
> 
> # environment variable.  It is configured such
> 
> # that it starts out blank with new entries added after passing
> 
> # a series of checks (file/dir exists, not already listed aka
> 
> # de-deduplication).  During de-deduplication, wildcards and/or
> 
> # directories are *NOT* expanded to keep it simple. Therefore,
> 
> # if the computed classpath has two specific mentions of
> 
> # awesome-methods-1.0.jar, only the first one added will be seen.
> 
> # If two directories are in the classpath that both contain
> 
> # awesome-methods-1.0.jar, then Java will pick up both versions.
> 
> 
> # An additional, custom CLASSPATH. Site-wide configs should be
> 
> # handled via the shellprofile functionality, utilizing the
> 
> # hadoop_add_classpath function for greater control and much
> 
> # harder for apps/end-users to accidentally override.
> 
> # Similarly, end users should utilize ${HOME}/.hadooprc .
> 
> # This variable should ideally only be used as a short-cut,
> 
> # interactive way for temporary additions on the command line.
> 
> export HADOOP_CLASSPATH=$EXTRA_PATH:$JAVA_HOME/bin:$HADOOP_CLASSPATH
> 
> 
> # Should HADOOP_CLASSPATH be first in the official CLASSPATH?
> 
> # export HADOOP_USER_CLASSPATH_FIRST="yes"
> 
> 
> # If HADOOP_USE_CLIENT_CLASSLOADER is set, the classpath along
> 
> # with the main jar are handled by a separate isolated
> 
> # client classloader when 'hadoop jar', 'yarn jar', or 'mapred job'
> 
> # is utilized. If it is set, HADOOP_CLASSPATH and
> 
> # HADOOP_USER_CLASSPATH_FIRST are ignored.
> 
> # export HADOOP_USE_CLIENT_CLASSLOADER=true
> 
> 
> # HADOOP_CLIENT_CLASSLOADER_SYSTEM_CLASSES overrides the default definition
> of
> 
> # system classes for the client classloader when
> HADOOP_USE_CLIENT_CLASSLOADER
> 
> # is enabled. Names ending in '.' (period) are treated as package names, and
> 
> # names starting with a '-' are treated as negative matches. For example,
> 
> # export
> HADOOP_CLIENT_CLASSLOADER_SYSTEM_CLASSES="-org.apache.hadoop.UserClass,java.,javax.,org.apache.hadoop."
> 
> 
> # Enable optional, bundled Hadoop features
> 
> # This is a comma delimited list.  It may NOT be overridden via .hadooprc
> 
> # Entries may be added/removed as needed.
> 
> # export
> HADOOP_OPTIONAL_TOOLS="hadoop-aliyun,hadoop-openstack,hadoop-azure,hadoop-azure-datalake,hadoop-aws,hadoop-kafka"
> 
> 
> ###
> 
> # Options for remote shell connectivity
> 
> ###
> 
> 
> # There are some optional components of hadoop that allow for
> 
> # command and control of remote hosts.  For example,
> 
> # start-dfs.sh will attempt to bring up all NNs, DNS, etc.
> 
> 
> # Options to pass to SSH when one of the "log into a host and
> 
> # start/stop daemons" scripts is executed
> 
> # export HADOOP_SSH_OPTS="-o BatchMode=yes -o StrictHostKeyChecking=no -o
> ConnectTimeout=10s"
> 
> 
> # The built-in ssh handler will limit itself to 10 simultaneous connections.
> 
> # For pdsh users, this sets the fanout size ( -f )
> 
> # Change this to increase/decrease as necessary.
> 
> # export HADOOP_SSH_PARALLEL=10
> 
> 
> # Filename which contains all of the hosts for any remote execution
> 
> # helper scripts # such as workers.sh, start-dfs.sh, etc.
> 
> # export HADOOP_WORKERS="${HADOOP_CONF_DIR}/workers"
> 
> 
> ###
> 
> # Options for all daemons
> 
> ###
> 
> #
> 
> 
> #
> 
> # Many options may also be specified as Java properties.  It is
> 
> # very common, and in many cases, desirable, to hard-set these
> 
> # in daemon _OPTS variables.  Where applicable, the appropriate
> 
> # Java property is also identified.  Note that many are re-used
> 
> # or set differently in certain contexts (e.g., secure vs
> 
> # non-secure)
> 
> #
> 
> 
> # Where (primarily) daemon log files are stored.
> 
> # ${HADOOP_HOME}/logs by default.
> 
> # Java property: hadoop.log.dir
> 
> # export HADOOP_LOG_DIR=${HADOOP_HOME}/logs
> 
> 
> # A string representing this instance of hadoop. $USER by default.
> 
> # This is used in writing log and pid files, so keep that in mind!
> 
> # Java property: hadoop.id.str
> 
> # export HADOOP_IDENT_STRING=$USER
> 
> 
> # How many seconds to pause after stopping a daemon
> 
> # export HADOOP_STOP_TIMEOUT=5
> 
> 
> # Where pid files are stored.  /tmp by default.
> 
> # export HADOOP_PID_DIR=/tmp
> 
> 
> # Default log4j setting for interactive commands
> 
> # Java property: hadoop.root.logger
> 
> # export HADOOP_ROOT_LOGGER=INFO,console
> 
> 
> # Default log4j setting for daemons spawned explicitly by
> 
> # --daemon option of hadoop, hdfs, mapred and yarn command.
> 
> # Java property: hadoop.root.logger
> 
> # export HADOOP_DAEMON_ROOT_LOGGER=INFO,RFA
> 
> 
> # Default log level and output location for security-related messages.
> 
> # You will almost certainly want to change this on a per-daemon basis via
> 
> # the Java property (i.e., -Dhadoop.security.logger=foo). (Note that the
> 
> # defaults for the NN and 2NN override this by default.)
> 
> # Java property: hadoop.security.logger
> 
> # export HADOOP_SECURITY_LOGGER=INFO,NullAppender
> 
> 
> # Default process priority level
> 
> # Note that sub-processes will also run at this level!
> 
> # export HADOOP_NICENESS=0
> 
> 
> # Default name for the service level authorization file
> 
> # Java property: hadoop.policy.file
> 
> # export HADOOP_POLICYFILE="hadoop-policy.xml"
> 
> 
> #
> 
> # NOTE: this is not used by default!  <-----
> 
> # You can define variables right here and then re-use them later on.
> 
> # For example, it is common to use the same garbage collection settings
> 
> # for all the daemons.  So one could define:
> 
> #
> 
> # export HADOOP_GC_SETTINGS="-verbose:gc -XX:+PrintGCDetails
> -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps"
> 
> #
> 
> # .. and then use it as per the b option under the namenode.
> 
> 
> ###
> 
> # Secure/privileged execution
> 
> ###
> 
> 
> #
> 
> # Out of the box, Hadoop uses jsvc from Apache Commons to launch daemons
> 
> # on privileged ports.  This functionality can be replaced by providing
> 
> # custom functions.  See hadoop-functions.sh for more information.
> 
> #
> 
> 
> # The jsvc implementation to use. Jsvc is required to run secure datanodes
> 
> # that bind to privileged ports to provide authentication of data transfer
> 
> # protocol.  Jsvc is not required if SASL is configured for authentication
> of
> 
> # data transfer protocol using non-privileged ports.
> 
> # export JSVC_HOME=/usr/bin
> 
> 
> #
> 
> # This directory contains pids for secure and privileged processes.
> 
> #export HADOOP_SECURE_PID_DIR=${HADOOP_PID_DIR}
> 
> 
> #
> 
> # This directory contains the logs for secure and privileged processes.
> 
> # Java property: hadoop.log.dir
> 
> # export HADOOP_SECURE_LOG=${HADOOP_LOG_DIR}
> 
> 
> #
> 
> # When running a secure daemon, the default value of HADOOP_IDENT_STRING
> 
> # ends up being a bit bogus.  Therefore, by default, the code will
> 
> # replace HADOOP_IDENT_STRING with HADOOP_xx_SECURE_USER.  If one wants
> 
> # to keep HADOOP_IDENT_STRING untouched, then uncomment this line.
> 
> # export HADOOP_SECURE_IDENT_PRESERVE="true"
> 
> 
> ###
> 
> # NameNode specific parameters
> 
> ###
> 
> 
> # Default log level and output location for file system related change
> 
> # messages. For non-namenode daemons, the Java property must be set in
> 
> # the appropriate _OPTS if one wants something other than INFO,NullAppender
> 
> # Java property: hdfs.audit.logger
> 
> # export HDFS_AUDIT_LOGGER=INFO,NullAppender
> 
> 
> # Specify the JVM options to be used when starting the NameNode.
> 
> # These options will be appended to the options specified as HADOOP_OPTS
> 
> # and therefore may override any similar flags set in HADOOP_OPTS
> 
> #
> 
> # a) Set JMX options
> 
> # export HDFS_NAMENODE_OPTS="-Dcom.sun.management.jmxremote=true
> -Dcom.sun.management.jmxremote.authenticate=false
> -Dcom.sun.management.jmxremote.ssl=false
> -Dcom.sun.management.jmxremote.port=1026"
> 
> #
> 
> # b) Set garbage collection logs
> 
> # export HDFS_NAMENODE_OPTS="${HADOOP_GC_SETTINGS}
> -Xloggc:${HADOOP_LOG_DIR}/gc-rm.log-$(date +'%Y%m%d%H%M')"
> 
> #
> 
> # c) ... or set them directly
> 
> # export HDFS_NAMENODE_OPTS="-verbose:gc -XX:+PrintGCDetails
> -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps
> -Xloggc:${HADOOP_LOG_DIR}/gc-rm.log-$(date +'%Y%m%d%H%M')"
> 
> 
> # this is the default:
> 
> # export HDFS_NAMENODE_OPTS="-Dhadoop.security.logger=INFO,RFAS"
> 
> 
> ###
> 
> # SecondaryNameNode specific parameters
> 
> ###
> 
> # Specify the JVM options to be used when starting the SecondaryNameNode.
> 
> # These options will be appended to the options specified as HADOOP_OPTS
> 
> # and therefore may override any similar flags set in HADOOP_OPTS
> 
> #
> 
> # This is the default:
> 
> # export HDFS_SECONDARYNAMENODE_OPTS="-Dhadoop.security.logger=INFO,RFAS"
> 
> 
> ###
> 
> # DataNode specific parameters
> 
> ###
> 
> # Specify the JVM options to be used when starting the DataNode.
> 
> # These options will be appended to the options specified as HADOOP_OPTS
> 
> # and therefore may override any similar flags set in HADOOP_OPTS
> 
> #
> 
> # This is the default:
> 
> # export HDFS_DATANODE_OPTS="-Dhadoop.security.logger=ERROR,RFAS"
> 
> 
> # On secure datanodes, user to run the datanode as after dropping
> privileges.
> 
> # This **MUST** be uncommented to enable secure HDFS if using privileged
> ports
> 
> # to provide authentication of data transfer protocol.  This **MUST NOT** be
> 
> # defined if SASL is configured for authentication of data transfer protocol
> 
> # using non-privileged ports.
> 
> # This will replace the hadoop.id.str Java property in secure mode.
> 
> # export HDFS_DATANODE_SECURE_USER=hdfs
> 
> 
> # Supplemental options for secure datanodes
> 
> # By default, Hadoop uses jsvc which needs to know to launch a
> 
> # server jvm.
> 
> # export HDFS_DATANODE_SECURE_EXTRA_OPTS="-jvm server"
> 
> 
> ###
> 
> # NFS3 Gateway specific parameters
> 
> ###
> 
> # Specify the JVM options to be used when starting the NFS3 Gateway.
> 
> # These options will be appended to the options specified as HADOOP_OPTS
> 
> # and therefore may override any similar flags set in HADOOP_OPTS
> 
> #
> 
> # export HDFS_NFS3_OPTS=""
> 
> 
> # Specify the JVM options to be used when starting the Hadoop portmapper.
> 
> # These options will be appended to the options specified as HADOOP_OPTS
> 
> # and therefore may override any similar flags set in HADOOP_OPTS
> 
> #
> 
> # export HDFS_PORTMAP_OPTS="-Xmx512m"
> 
> 
> # Supplemental options for priviliged gateways
> 
> # By default, Hadoop uses jsvc which needs to know to launch a
> 
> # server jvm.
> 
> # export HDFS_NFS3_SECURE_EXTRA_OPTS="-jvm server"
> 
> 
> # On privileged gateways, user to run the gateway as after dropping
> privileges
> 
> # This will replace the hadoop.id.str Java property in secure mode.
> 
> # export HDFS_NFS3_SECURE_USER=nfsserver
> 
> 
> ###
> 
> # ZKFailoverController specific parameters
> 
> ###
> 
> # Specify the JVM options to be used when starting the ZKFailoverController.
> 
> # These options will be appended to the options specified as HADOOP_OPTS
> 
> # and therefore may override any similar flags set in HADOOP_OPTS
> 
> #
> 
> # export HDFS_ZKFC_OPTS=""
> 
> 
> ###
> 
> # QuorumJournalNode specific parameters
> 
> ###
> 
> # Specify the JVM options to be used when starting the QuorumJournalNode.
> 
> # These options will be appended to the options specified as HADOOP_OPTS
> 
> # and therefore may override any similar flags set in HADOOP_OPTS
> 
> #
> 
> # export HDFS_JOURNALNODE_OPTS=""
> 
> 
> ###
> 
> # HDFS Balancer specific parameters
> 
> ###
> 
> # Specify the JVM options to be used when starting the HDFS Balancer.
> 
> # These options will be appended to the options specified as HADOOP_OPTS
> 
> # and therefore may override any similar flags set in HADOOP_OPTS
> 
> #
> 
> # export HDFS_BALANCER_OPTS=""
> 
> 
> ###
> 
> # HDFS Mover specific parameters
> 
> ###
> 
> # Specify the JVM options to be used when starting the HDFS Mover.
> 
> # These options will be appended to the options specified as HADOOP_OPTS
> 
> # and therefore may override any similar flags set in HADOOP_OPTS
> 
> #
> 
> # export HDFS_MOVER_OPTS=""
> 
> 
> ###
> 
> # Router-based HDFS Federation specific parameters
> 
> # Specify the JVM options to be used when starting the RBF Routers.
> 
> # These options will be appended to the options specified as HADOOP_OPTS
> 
> # and therefore may override any similar flags set in HADOOP_OPTS
> 
> #
> 
> # export HDFS_DFSROUTER_OPTS=""
> 
> 
> ###
> 
> # HDFS StorageContainerManager specific parameters
> 
> ###
> 
> # Specify the JVM options to be used when starting the HDFS Storage
> Container Manager.
> 
> # These options will be appended to the options specified as HADOOP_OPTS
> 
> # and therefore may override any similar flags set in HADOOP_OPTS
> 
> #
> 
> # export HDFS_STORAGECONTAINERMANAGER_OPTS=""
> 
> 
> ###
> 
> # Advanced Users Only!
> 
> ###
> 
> 
> #
> 
> # When building Hadoop, one can add the class paths to the commands
> 
> # via this special env var:
> 
> # export HADOOP_ENABLE_BUILD_PATHS="true"
> 
> 
> #
> 
> # To prevent accidents, shell commands be (superficially) locked
> 
> # to only allow certain users to execute certain subcommands.
> 
> # It uses the format of (command)_(subcommand)_USER.
> 
> #
> 
> # For example, to limit who can execute the namenode command,
> 
> # export HDFS_NAMENODE_USER=hdfs
> 
> export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
> 
> 
> # Enable s3
> 
> export HADOOP_OPTIONAL_TOOLS="hadoop-aws"
> 
> echo "Ensure AWS Credentials are added to hadoop-env.sh and core-site.xml,
> by running add-aws-keys.sh"
> 
> export AWS_ACCESS_KEY_ID=KEY_PLACEHOLDER
> 
> export AWS_SECRET_ACCESS_KEY=SECRET_PLACEHOLDER
> 
> 
> 
> hdfs-site.xml
> 
> <?xml version="1.0" encoding="UTF-8"?>
> 
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> 
> <!--
> 
>   Licensed under the Apache License, Version 2.0 (the "License");
> 
>   you may not use this file except in compliance with the License.
> 
>   You may obtain a copy of the License at
> 
> 
>     http://www.apache.org/licenses/LICENSE-2.0
> 
> 
>   Unless required by applicable law or agreed to in writing, software
> 
>   distributed under the License is distributed on an "AS IS" BASIS,
> 
>   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> 
>   See the License for the specific language governing permissions and
> 
>   limitations under the License. See accompanying LICENSE file.
> 
> -->
> 
> 
> <!-- Put site-specific property overrides in this file. -->
> 
> <configuration>
> 
> <property>
> 
>   <name>dfs.data.dir</name>
> 
>   <value>/home/hdoop/dfsdata/namenode</value>
> 
> </property>
> 
> <property>
> 
>   <name>dfs.data.dir</name>
> 
>   <value>/home/hdoop/dfsdata/datanode</value>
> 
> </property>
> 
> <property>
> 
>   <name>dfs.replication</name>
> 
>   <value>1</value>
> 
> </property>
> 
> </configuration>
> 
> 
> 
> mapred-site.xml
> 
> <?xml version="1.0"?>
> 
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> 
> <!--
> 
>   Licensed under the Apache License, Version 2.0 (the "License");
> 
>   you may not use this file except in compliance with the License.
> 
>   You may obtain a copy of the License at
> 
> 
>     http://www.apache.org/licenses/LICENSE-2.0
> 
> 
>   Unless required by applicable law or agreed to in writing, software
> 
>   distributed under the License is distributed on an "AS IS" BASIS,
> 
>   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> 
>   See the License for the specific language governing permissions and
> 
>   limitations under the License. See accompanying LICENSE file.
> 
> -->
> 
> 
> <!-- Put site-specific property overrides in this file. -->
> 
> <configuration>
> 
> <property>
> 
>   <name>mapreduce.framework.name</name>
> 
>   <value>yarn</value>
> 
> </property>
> 
> <property>
> 
>             <name>yarn.app.mapreduce.am.env</name>
> 
>     <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
> 
>     </property>
> 
>     <property>
> 
>             <name>mapreduce.map.env</name>
> 
>     <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
> 
>     </property>
> 
>     <property>
> 
>             <name>mapreduce.reduce.env</name>
> 
>     <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
> 
>     </property>
> 
>     <!--
> 
>     <property>
> 
>     <name>mapreduce.application.classpath</name>
> 
> 
>   <value>/home/hdoop/hadoop-3.3.0/etc/hadoop:/home/hdoop/hadoop-3.3.0/share/hadoop/common/lib/*:/home/hdoop/hadoop-3.3.0/share/hadoop/common/*:/home/hdoop/hadoop-3.3.0/share/hadoop/hdfs:/home/hdoop/hadoop-3.3.0/share/hadoop/hdfs/lib/*:/home/hdoop/hadoop-3.3.0/share/hadoop/hdfs/*:/home/hdoop/hadoop-3.3.0/share/hadoop/mapreduce/*:/home/hdoop/hadoop-3.3.0/share/hadoop/yarn:/home/hdoop/hadoop-3.3.0/share/hadoop/yarn/lib/*:/home/hdoop/hadoop-3.3.0/share/hadoop/yarn/*:/home/hdoop/hadoop-3.3.0/bin:/home/hdoop/hadoop-3.3.0/sbin</value>
> 
>     </property>
> 
>  -->
> 
>     <property>
> 
>     <name>mapreduce.application.classpath</name>
> 
> 
>   <value>home/hdoop/hadoop-3.3.0/etc/hadoop:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/*:/home/hdoop/hadoop-3.3.0//share/hadoop/common/*:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/*:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/*:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/lib/*:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/*:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/*:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/*:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/accessors-smart-1.2.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/animal-sniffer-annotations-1.17.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/asm-5.0.4.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/audience-annotations-0.5.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/avro-1.7.7.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/checker-qual-2.5.2.jar:/home/hdoop/hadoop-3.3.
 0//share/hadoop/common/lib/commons-beanutils-1.9.3.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/commons-cli-1.2.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/commons-codec-1.11.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/commons-collections-3.2.2.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/commons-compress-1.18.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/commons-configuration2-2.1.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/commons-io-2.5.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/commons-lang3-3.7.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/commons-logging-1.1.3.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/commons-math3-3.1.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/commons-net-3.6.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/commons-text-1.4.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/curator-client-2.13.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/cu
 rator-framework-2.13.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/curator-recipes-2.13.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/dnsjava-2.1.7.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/error_prone_annotations-2.2.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/failureaccess-1.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/gson-2.2.4.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/guava-27.0-jre.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/hadoop-annotations-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/hadoop-auth-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/htrace-core4-4.1.0-incubating.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/httpclient-4.5.6.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/httpcore-4.4.10.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/j2objc-annotations-1.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jackson-annotations-2.9.8.jar:/h
 ome/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jackson-core-2.9.8.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jackson-core-asl-1.9.13.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jackson-databind-2.9.8.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jackson-jaxrs-1.9.13.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jackson-mapper-asl-1.9.13.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jackson-xc-1.9.13.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/javax.servlet-api-3.1.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jaxb-api-2.2.11.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jaxb-impl-2.2.3-1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jcip-annotations-1.0-1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jersey-core-1.19.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jersey-json-1.19.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jersey-server-1.19.jar:/home/hdoop/hadoop-3.3.0//share/
 hadoop/common/lib/jersey-servlet-1.19.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jettison-1.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jetty-http-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jetty-io-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jetty-security-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jetty-server-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jetty-servlet-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jetty-util-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jetty-webapp-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jetty-xml-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jsch-0.1.54.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/json-smart-2.3.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jsp-api-2.1.jar:/home/hdoop/hadoop-3.3.
 0//share/hadoop/common/lib/jsr305-3.0.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jsr311-api-1.1.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/jul-to-slf4j-1.7.25.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/kerb-admin-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/kerb-client-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/kerb-common-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/kerb-core-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/kerb-crypto-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/kerb-identity-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/kerb-server-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/kerb-simplekdc-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/kerb-util-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/kerby-asn1-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/kerby-config-1.0.1.jar:/home/hdoop/hado
 op-3.3.0//share/hadoop/common/lib/kerby-pkix-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/kerby-util-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/kerby-xdr-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/log4j-1.2.17.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/metrics-core-3.2.4.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/netty-3.10.5.Final.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/nimbus-jose-jwt-4.41.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/paranamer-2.3.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/protobuf-java-2.5.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/re2j-1.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/slf4j-api-1.7.25.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/l
 ib/snappy-java-1.0.5.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/stax2-api-3.1.4.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/token-provider-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/woodstox-core-5.0.3.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib/zookeeper-3.4.13.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/hadoop-common-3.3.0-tests.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/hadoop-common-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/hadoop-kms-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/hadoop-nfs-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/common/jdiff:/home/hdoop/hadoop-3.3.0//share/hadoop/common/lib:/home/hdoop/hadoop-3.3.0//share/hadoop/common/sources:/home/hdoop/hadoop-3.3.0//share/hadoop/common/webapps:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/accessors-smart-1.2.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/animal-sniffer-annotations-1.17.jar:/home/hdoop/hadoop-3.3.0//share/hado
 op/hdfs/lib/asm-5.0.4.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/audience-annotations-0.5.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/avro-1.7.7.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/checker-qual-2.5.2.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/commons-beanutils-1.9.3.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/commons-cli-1.2.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/commons-codec-1.11.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/commons-collections-3.2.2.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/commons-compress-1.18.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/commons-configuration2-2.1.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/commons-daemon-1.0.13.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/commons-io-2.5.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/commons-lang3-3.7.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/commons-logging-1.1.3.jar:/home/hdoop/hadoop-3.3.0/
 /share/hadoop/hdfs/lib/commons-math3-3.1.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/commons-net-3.6.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/commons-text-1.4.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/curator-client-2.13.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/curator-framework-2.13.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/curator-recipes-2.13.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/dnsjava-2.1.7.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/error_prone_annotations-2.2.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/failureaccess-1.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/gson-2.2.4.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/guava-27.0-jre.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/hadoop-annotations-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/hadoop-auth-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/htrace-core4-4.1.0-incubating.jar:/home/hdoo
 p/hadoop-3.3.0//share/hadoop/hdfs/lib/httpclient-4.5.6.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/httpcore-4.4.10.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/j2objc-annotations-1.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jackson-annotations-2.9.8.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jackson-core-2.9.8.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jackson-core-asl-1.9.13.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jackson-databind-2.9.8.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jackson-jaxrs-1.9.13.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jackson-mapper-asl-1.9.13.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jackson-xc-1.9.13.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/javax.servlet-api-3.1.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jaxb-api-2.2.11.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jaxb-impl-2.2.3-1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jcip-annotat
 ions-1.0-1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jersey-core-1.19.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jersey-json-1.19.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jersey-server-1.19.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jersey-servlet-1.19.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jettison-1.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jetty-http-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jetty-io-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jetty-security-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jetty-server-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jetty-servlet-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jetty-util-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jetty-util-ajax-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jetty-webapp-9.3
 .24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jetty-xml-9.3.24.v20180605.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jsch-0.1.54.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/json-simple-1.1.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/json-smart-2.3.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jsr305-3.0.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/jsr311-api-1.1.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/kerb-admin-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/kerb-client-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/kerb-common-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/kerb-core-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/kerb-crypto-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/kerb-identity-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/kerb-server-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/kerb-simplekdc-1.0.1.jar:/ho
 me/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/kerb-util-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/kerby-asn1-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/kerby-config-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/kerby-pkix-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/kerby-util-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/kerby-xdr-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/leveldbjni-all-1.8.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/log4j-1.2.17.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/netty-3.10.5.Final.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/netty-all-4.0.52.Final.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/nimbus-jose-jwt-4.41.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/okhttp-2.7.5.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/okio-
 1.6.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/paranamer-2.3.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/protobuf-java-2.5.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/re2j-1.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/snappy-java-1.0.5.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/stax2-api-3.1.4.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/token-provider-1.0.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/woodstox-core-5.0.3.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib/zookeeper-3.4.13.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/hadoop-hdfs-3.3.0-tests.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/hadoop-hdfs-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/hadoop-hdfs-client-3.3.0-tests.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/hadoop-hdfs-client-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/hadoop-hdfs-httpfs-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/hadoop-hdfs-native-client-3.3
 .0-tests.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/hadoop-hdfs-native-client-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/hadoop-hdfs-nfs-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/hadoop-hdfs-rbf-3.3.0-tests.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/hadoop-hdfs-rbf-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/jdiff:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/lib:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/sources:/home/hdoop/hadoop-3.3.0//share/hadoop/hdfs/webapps:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/lib/hamcrest-core-1.3.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/lib/junit-4.11.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/hadoop-mapreduce-client-app-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/hadoop-mapreduce-client-common-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/hadoop-mapreduce-client-hs-3.
 3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/hadoop-mapreduce-client-hs-plugins-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.3.0-tests.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/hadoop-mapreduce-client-nativetask-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/hadoop-mapreduce-client-shuffle-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/hadoop-mapreduce-client-uploader-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/jdiff:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/lib:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/lib-examples:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/sources:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/HikariCP-java7-2.4.12.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/y
 arn/lib/aopalliance-1.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/bcpkix-jdk15on-1.60.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/bcprov-jdk15on-1.60.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/ehcache-3.3.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/fst-2.50.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/geronimo-jcache_1.0_spec-1.0-alpha-1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/guice-4.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/guice-servlet-4.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/jackson-jaxrs-base-2.9.8.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/jackson-jaxrs-json-provider-2.9.8.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/jackson-module-jaxb-annotations-2.9.8.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/java-util-1.9.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/javax.inject-1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/jersey-client-1.19.jar:/home/hdoop/hadoop
 -3.3.0//share/hadoop/yarn/lib/jersey-guice-1.19.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/json-io-2.5.1.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/metrics-core-3.2.4.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/mssql-jdbc-6.2.1.jre7.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/objenesis-1.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/snakeyaml-1.16.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/lib/swagger-annotations-1.5.4.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-api-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-applications-distributedshell-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-applications-unmanaged-am-launcher-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-client-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-common-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-registry-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/had
 oop/yarn/hadoop-yarn-server-applicationhistoryservice-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-server-common-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-server-nodemanager-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-server-resourcemanager-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-server-router-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-server-sharedcachemanager-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-server-tests-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-server-timeline-pluginstorage-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-server-web-proxy-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-services-api-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-services-core-3.3.0.jar:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/hadoop-yarn-submarine-3.3.0.jar:/home/hdoop/ha
 doop-3.3.0//share/hadoop/yarn/lib:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/sources:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/test:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/timelineservice:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/webapps:/home/hdoop/hadoop-3.3.0//share/hadoop/yarn/yarn-service-examples:/home/hdoop/hadoop-3.3.0//share/hadoop/mapreduce/sources/hadoop-mapreduce-client-app-3.3.0-sources.jar:/home/hdoop/hadoop-3.3.0//hadoop-tools/hadoop-aws/target/hadoop-aws-3.3.0.jar:/home/hdoop/hadoop-3.3.0/hadoop-tools/hadoop-aws/target/hadoop-aws-3.3.0.jar:/home/hdoop/hadoop-3.3.0/hadoop-tools/hadoop-aws/target/lib/aws-java-sdk-bundle-1.11.563.jar:/home/hdoop/nutch/build/lib/commons-jexl3-3.1.jar:/home/hdoop/nutch/lib/*</value>
> 
> </property>
> 
> 
>  <property>
> 
>         <name>yarn.app.mapreduce.am.resource.mb</name>
> 
>         <value>512</value>
> 
> </property>
> 
> 
> <property>
> 
>         <name>mapreduce.map.memory.mb</name>
> 
>         <value>256</value>
> 
> </property>
> 
> 
> <property>
> 
>         <name>mapreduce.reduce.memory.mb</name>
> 
>         <value>256</value>
> 
> </property>
> 
> 
> <!--from NutchHadoop Tutorial -->
> 
> <property>
> 
>   <name>mapred.system.dir</name>
> 
>   <value>/home/hdoop/dfsdata/mapreduce/system</value>
> 
> </property>
> 
> 
> <property>
> 
>   <name>mapred.local.dir</name>
> 
>   <value>/home/hdoop/dfsdata/mapreduce/local</value>
> 
> </property>
> 
> 
> </configuration>
> 
> 
> 
> workers
> 
> hadoop02Name
>
> hadoop01Name
> 
> 
> 
> yarn-site.xml
> 
> <?xml version="1.0"?>
> 
> <!--
> 
>   Licensed under the Apache License, Version 2.0 (the "License");
> 
>   you may not use this file except in compliance with the License.
> 
>   You may obtain a copy of the License at
> 
> 
>     http://www.apache.org/licenses/LICENSE-2.0
> 
> 
>   Unless required by applicable law or agreed to in writing, software
> 
>   distributed under the License is distributed on an "AS IS" BASIS,
> 
>   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> 
>   See the License for the specific language governing permissions and
> 
>   limitations under the License. See accompanying LICENSE file.
> 
> -->
> 
> <configuration>
> 
> <property>
> 
>   <name>yarn.nodemanager.aux-services</name>
> 
>   <value>mapreduce_shuffle</value>
> 
> </property>
> 
> <property>
> 
>   <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
> 
>   <value>org.apache.hadoop.mapred.ShuffleHandler</value>
> 
> </property>
> 
> <property>
> 
>   <name>yarn.resourcemanager.hostname</name>
> 
>   <value>IP_PLACEHOLDER</value>
> 
> </property>
> 
> <property>
> 
>   <name>yarn.acl.enable</name>
> 
>   <value>0</value>
> 
> </property>
> 
> <property>
> 
>   <name>yarn.nodemanager.env-whitelist</name>
> 
>   <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
> 
> </property>
> 
> 
> <!--<property>
> 
>  <name>yarn.application.classpath</name>
> 
>  <value>$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,$HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*,$USS_HOME/*,$USS_CONF</value>
> 
> </property>-->
> 
> 
> <property>
> 
>         <name>yarn.nodemanager.resource.memory-mb</name>
> 
>         <value>1536</value>
> 
> </property>
> 
> 
> <property>
> 
>         <name>yarn.scheduler.maximum-allocation-mb</name>
> 
>         <value>1536</value>
> 
> </property>
> 
> 
> <property>
> 
>         <name>yarn.scheduler.minimum-allocation-mb</name>
> 
>         <value>128</value>
> 
> </property>
> 
> 
> <property>
> 
>         <name>yarn.nodemanager.vmem-check-enabled</name>
> 
>         <value>false</value>
> 
> </property>
> 
> <property>
> 
>         <name>yarn.nodemanager.local-dirs</name>
> 
>         <value>${hadoop.tmp.dir}/nm-local-dir</value>
> 
> </property>
> 
> 
> </configuration>
> 
> 
> On Thu, Jun 17, 2021 at 11:55 AM Clark Benham <cl...@thehive.ai> wrote:
> 
> >
> > Hi Sebastian,
> >
> > NUTCH_HOME=~/nutch; the local filesystem. I am using a plain, pre-built
> > hadoop.
> > There's no "mapreduce.job.dir" I can grep in Hadoop 3.2.1, 3.3.0, or
> > Nutch-1.18, 1.19, but mapreduce.job.hdfs-servers defaults to
> > ${fs.defaultFS}, so s3a://temp-crawler in our case.
> > The plugin loader doesn't appear to be able to read from s3 in nutch-1.18
> > with hadoop-3.2.1[1].
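> >
> > (One quick way to see what those resolve to, assuming the standard Hadoop
> > client tools, is:
> >
> > hdfs getconf -confKey fs.defaultFS
> > hdfs getconf -confKey mapreduce.job.hdfs-servers
> >
> > which should both print s3a://temp-crawler here.)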
> >
> > Using java & javac 11 with hadoop-3.3.0 downloaded and untarred, and a
> > nutch-1.19 I built:
> > I can run a mapreduce job on S3 and a Nutch job on hdfs, but running
> > nutch on S3 still gives "URLNormalizer not found" with the plugin dir on
> > the local filesystem or on S3a.
> >
> > How would you recommend I go about getting the plugin loader to read from
> > other file systems?
> >
> > [1]  I still get 'x point org.apache.nutch.net.URLNormalizer not found'
> > (same stack trace as previous email) with
> > `<name>plugin.folders</name>
> > <value>s3a://temp-crawler/user/hdoop/nutch-plugins</value>`
> > set in my nutch-site.xml, even though `hadoop fs -ls
> > s3a://temp-crawler/user/hdoop/nutch-plugins` shows that all the plugins are there.
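> > 
> > Since the plugin loader only seems to read from the local filesystem, one
> > workaround to try is pointing plugin.folders at a local copy of the plugins
> > rather than at S3 (the path below is only illustrative):
> > 
> > <property>
> >   <name>plugin.folders</name>
> >   <!-- illustrative local path; it must exist on every node -->
> >   <value>/home/hdoop/nutch-plugins</value>
> > </property>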
> >
> >
> > For posterity:
> > I got hadoop-3.3.0 working with a S3 backend by:
> >
> > cd ~/hadoop-3.3.0
> >
> > cp ./share/hadoop/tools/lib/hadoop-aws-3.3.0.jar ./share/hadoop/common/lib
> >
> > cp ./share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.563.jar
> > ./share/hadoop/common/lib
> > to solve "Class org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory not
> > found" despite the class existing in
> > ~/hadoop-3.3.0/share/hadoop/tools/lib/hadoop-aws-3.3.0.jar  checking it's
> > on the classpath with `hadoop classpath | tr ":" "\n"  | grep
> > share/hadoop/tools/lib/hadoop-aws-3.3.0.jar` as well as adding it to
> > hadoop-env.sh.
> > see
> > https://stackoverflow.com/questions/58415928/spark-s3-error-java-lang-classnotfoundexception-class-org-apache-hadoop-f
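> > 
> > An alternative that may avoid copying the jars by hand (not tested here) is
> > to enable the optional S3A tools module in etc/hadoop/hadoop-env.sh, which
> > is meant to put hadoop-aws and the bundled AWS SDK on the Hadoop classpath:
> > 
> > # etc/hadoop/hadoop-env.sh
> > export HADOOP_OPTIONAL_TOOLS="hadoop-aws"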
> >
> > On Tue, Jun 15, 2021 at 2:01 AM Sebastian Nagel
> > <wa...@googlemail.com.invalid> wrote:
> >
> >>  > The local file system? Or hdfs:// or even s3:// resp. s3a://?
> >>
> >> Also important: the value of "mapreduce.job.dir" - it's usually
> >> on hdfs:// and I'm not sure whether the plugin loader is able to
> >> read from other filesystems. At least, I haven't tried.
> >>
> >>
> >> On 6/15/21 10:53 AM, Sebastian Nagel wrote:
> >> > Hi Clark,
> >> >
> >> > sorry, I should read your mail until the end - you mentioned that
> >> > you downgraded Nutch to run with JDK 8.
> >> >
> >> > Could you share to which filesystem does NUTCH_HOME point?
> >> > The local file system? Or hdfs:// or even s3:// resp. s3a://?
> >> >
> >> > Best,
> >> > Sebastian
> >> >
> >> >
> >> > On 6/15/21 10:24 AM, Clark Benham wrote:
> >> >> Hi,
> >> >>
> >> >>
> >> >> I am trying to run Nutch-1.19 on hadoop-3.2.1 with an S3
> >> >> backend/filesystem; however I get an error ‘URLNormalizer class not
> >> found’.
> >> >> I have edited nutch-site.xml so this plugin should be included:
> >> >>
> >> >> <property>
> >> >>
> >> >>    <name>plugin.includes</name>
> >> >>
> >> >>
> >> >>
> >> <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|mimetype-filter|urlnormalizer|urlnormalizer-basic|.*|nutch-extensionpoints</value>
> >>
> >> >>
> >> >>
> >> >>
> >> >> </property>
> >> >>
> >> >>   and then built on both nodes (I only have 2 machines).  I’ve
> >> successfully
> >> >> run Nutch locally and in distributed mode using HDFS, and I’ve run a
> >> >> mapreduce job with S3 as hadoop’s file system.
> >> >>
> >> >>
> >> >> I thought it was possible Nutch is not reading nutch-site.xml, because I can
> >> >> resolve an error by setting the config on the command line even though that
> >> >> setting duplicates what is already in nutch-site.xml.
> >> >>
> >> >> The command:
> >> >>
> >> >> `hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
> >> >> org.apache.nutch.fetcher.Fetcher
> >> >> crawl/crawldb crawl/segments`
> >> >>
> >> >> throws
> >> >>
> >> >> `java.lang.IllegalArgumentException: Fetcher: No agents listed in '
> >> >> http.agent.name' property`
> >> >>
> >> >> while if I pass a value in for http.agent.name with
> >> >> `-Dhttp.agent.name=clark`,
> >> >> (making the command `hadoop jar
> >> >> $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
> >> >> org.apache.nutch.fetcher.Fetcher
> >> >> -Dhttp.agent.name=clark crawl/crawldb crawl/segments`),  I get an
> >> error
> >> >> about there being no input path, which makes sense as I haven’t been
> >> able
> >> >> to generate any segments.
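> >> >> 
> >> >> For reference, the agent name is normally set once in nutch-site.xml and the
> >> >> job file rebuilt, rather than passed with -D on every run; the value below is
> >> >> only a placeholder:
> >> >> 
> >> >> <property>
> >> >>   <name>http.agent.name</name>
> >> >>   <!-- placeholder value -->
> >> >>   <value>myCrawler</value>
> >> >> </property>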
> >> >>
> >> >>
> >> >>   However, this method of setting Nutch configs doesn't work for injecting
> >> >> URLs; e.g.:
> >> >>
> >> >> `hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
> >> >> org.apache.nutch.crawl.Injector
> >> >> -Dplugin.includes=".*" crawl/crawldb urls`
> >> >>
> >> >> fails with the same “URLNormalizer” not found.
> >> >>
> >> >>
> >> >> I tried copying the plugin dir to S3 and setting
> >> >> <name>plugin.folders</name> to be a path on S3 without success. (I
> >> expect
> >> >> the plugin to be bundled with the .job so this step should be
> >> unnecessary)
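> >> >> 
> >> >> One way to check whether the plugins (and nutch-site.xml) actually made it
> >> >> into the job file is to list its contents (the .job is a plain zip archive,
> >> >> so this assumes unzip is available):
> >> >> 
> >> >> # list the packed plugins and the bundled nutch-site.xml
> >> >> unzip -l $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job | grep -E 'plugins/urlnormalizer|nutch-site.xml'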
> >> >>
> >> >>
> >> >> The full stack trace for `hadoop jar
> >> >> $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
> >> >> org.apache.nutch.crawl.Injector
> >> >> crawl/crawldb urls`:
> >> >>
> >> >> SLF4J: Class path contains multiple SLF4J bindings.
> >> >>
> >> >> SLF4J: Found binding in
> >> >>
> >> [jar:file:/home/hdoop/hadoop-3.2.1/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> >> >>
> >> >> SLF4J: Found binding in
> >> >>
> >> [jar:file:/home/hdoop/apache-nutch-1.18/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> >> >>
> >> >> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
> >> >> explanation.
> >> >>
> >> >> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> >> >>
> >> >> # Took out multiple INFO messages
> >> >>
> >> >> 2021-06-15 07:06:07,842 INFO mapreduce.Job: Task Id :
> >> >> attempt_1623740678244_0001_m_000001_0, Status : FAILED
> >> >>
> >> >> Error: java.lang.RuntimeException: x point
> >> >> org.apache.nutch.net.URLNormalizer not found.
> >> >>
> >> >> at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:145)
> >> >>
> >> >> at
> >> org.apache.nutch.crawl.Injector$InjectMapper.setup(Injector.java:139)
> >> >>
> >> >> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
> >> >>
> >> >> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)
> >> >>
> >> >> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
> >> >>
> >> >> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
> >> >>
> >> >> at java.security.AccessController.doPrivileged(Native Method)
> >> >>
> >> >> at javax.security.auth.Subject.doAs(Subject.java:422)
> >> >>
> >> >> at
> >> >>
> >> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> >> >>
> >> >> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
> >> >>
> >> >>
> >> >> #This error repeats 6 times total, 3 times for each node
> >> >>
> >> >>
> >> >> 2021-06-15 07:06:26,035 INFO mapreduce.Job:  map 100% reduce 100%
> >> >>
> >> >> 2021-06-15 07:06:29,067 INFO mapreduce.Job: Job job_1623740678244_0001
> >> >> failed with state FAILED due to: Task failed
> >> >> task_1623740678244_0001_m_000001
> >> >>
> >> >> Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0
> >> >> killedReduces: 0
> >> >>
> >> >>
> >> >> 2021-06-15 07:06:29,190 INFO mapreduce.Job: Counters: 14
> >> >>
> >> >> Job Counters
> >> >>
> >> >> Failed map tasks=7
> >> >>
> >> >> Killed map tasks=1
> >> >>
> >> >> Killed reduce tasks=1
> >> >>
> >> >> Launched map tasks=8
> >> >>
> >> >> Other local map tasks=6
> >> >>
> >> >> Rack-local map tasks=2
> >> >>
> >> >> Total time spent by all maps in occupied slots (ms)=63196
> >> >>
> >> >> Total time spent by all reduces in occupied slots (ms)=0
> >> >>
> >> >> Total time spent by all map tasks (ms)=31598
> >> >>
> >> >> Total vcore-milliseconds taken by all map tasks=31598
> >> >>
> >> >> Total megabyte-milliseconds taken by all map tasks=8089088
> >> >>
> >> >> Map-Reduce Framework
> >> >>
> >> >> CPU time spent (ms)=0
> >> >>
> >> >> Physical memory (bytes) snapshot=0
> >> >>
> >> >> Virtual memory (bytes) snapshot=0
> >> >>
> >> >> 2021-06-15 07:06:29,195 ERROR crawl.Injector: Injector job did not
> >> succeed,
> >> >> job status: FAILED, reason: Task failed
> >> task_1623740678244_0001_m_000001
> >> >>
> >> >> Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0
> >> >> killedReduces: 0
> >> >>
> >> >>
> >> >> 2021-06-15 07:06:29,562 ERROR crawl.Injector: Injector:
> >> >> java.lang.RuntimeException: Injector job did not succeed, job status:
> >> >> FAILED, reason: Task failed task_1623740678244_0001_m_000001
> >> >>
> >> >> Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0
> >> >> killedReduces: 0
> >> >>
> >> >>
> >> >> at org.apache.nutch.crawl.Injector.inject(Injector.java:444)
> >> >>
> >> >> at org.apache.nutch.crawl.Injector.run(Injector.java:571)
> >> >>
> >> >> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
> >> >>
> >> >> at org.apache.nutch.crawl.Injector.main(Injector.java:535)
> >> >>
> >> >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >> >>
> >> >> at
> >> >>
> >> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> >> >>
> >> >> at
> >> >>
> >> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >> >>
> >> >> at java.lang.reflect.Method.invoke(Method.java:498)
> >> >>
> >> >> at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
> >> >>
> >> >> at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> P.S.
> >> >>
> >> >> I am using a downloaded hadoop-3.2.1; the only odd thing about my Nutch
> >> >> build is that I had to replace all instances of "javac.version" with
> >> >> "ant.java.version", because the javac version was 11 while Java was 1.8,
> >> >> giving the error 'javac invalid target release: 11':
> >> >>
> >> >> grep -rl "javac.version" . --include "*.xml" | xargs sed -i
> >> >> s^javac.version^ant.java.version^g
> >> >>
> >> >> grep -rl "ant.ant" . --include "*.xml" | xargs sed -i s^ant.ant.^ant.^g
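> >> >> 
> >> >> An alternative to editing the build files (untested here) would be to
> >> >> override the property on the Ant command line, since properties passed with
> >> >> -D take precedence over those defined in the build files:
> >> >> 
> >> >> # build runtime/ with javac.version forced to the installed Java version
> >> >> ant -Djavac.version=1.8 clean runtime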
> >> >>
> >> >
> >>
> >>
> 

Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormlizer not found'

Posted by Sebastian Nagel <wa...@googlemail.com.INVALID>.
Hi Clark,

thanks for summarizing this discussion and sharing the final configuration!

Good to know that it's possible to run Nutch on Hadoop using S3A without
using HDFS (no namenode/datanodes running).

Best,
Sebastian