Posted to user@pig.apache.org by SOUFIANI Mustapha | السفياني مصطفى <s....@gmail.com> on 2016/05/02 12:12:56 UTC

Cannot load files/data from a remote Hadoop cluster or MongoDB database

Hi all,
Here is the architecture I'm using: a local Java client (Pentaho) that
interacts with a remote Hadoop cluster and a remote MongoDB database server.

When I try to execute a Pig script from my Java client (the script loads
data either from a file on HDFS or directly from the MongoDB server using
the mongo-hadoop connector), the system reports every time that it cannot
find the data to be loaded, even though the data definitely exists.

Here is the first Pig script, which loads data from an HDFS file:

*********************************************************************
weblogs = LOAD 'hdfs://sigma-server:54310/user/hduser/pdi/weblogs/parse/part/weblogs_parse.txt'
    USING PigStorage('\t')
    AS (
        client_ip:chararray,
        full_request_date:chararray,
        day:int,
        month:chararray,
        month_num:int,
        year:int,
        hour:int,
        minute:int,
        second:int,
        timezone:chararray,
        http_verb:chararray,
        uri:chararray,
        http_status_code:chararray,
        bytes_returned:chararray,
        referrer:chararray,
        user_agent:chararray
    );

weblog_group = GROUP weblogs BY (client_ip, year, month_num);
weblog_count = FOREACH weblog_group GENERATE group.client_ip, group.year, group.month_num, COUNT_STAR(weblogs) AS pageviews;

STORE weblog_count INTO 'hdfs://sigma-server:54310/user/hduser/pdi/weblogs/parse/part/mustapha-pentaho.txt';
*********************************************************************
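For what it's worth, the input path can also be checked directly with the HDFS CLI; a minimal sketch, assuming a shell on a machine with the Hadoop client configured for sigma-server (the path is the one from the script above):

```shell
# Sketch: confirm the input file exists, and check its owner/permissions,
# as seen through the same NameNode URI the script uses.
hdfs dfs -ls 'hdfs://sigma-server:54310/user/hduser/pdi/weblogs/parse/part/weblogs_parse.txt'

# Peek at the end of the file to confirm it is actually readable.
hdfs dfs -tail 'hdfs://sigma-server:54310/user/hduser/pdi/weblogs/parse/part/weblogs_parse.txt'
```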

Here is the output of the Java client:

*********************************************************************
2016/05/02 11:04:28 - Pentaho Data Integration - Starting job ...
2016/05/02 11:04:28 - Pig_script_executor - Starting job
2016/05/02 11:04:28 - Pig_script_executor - Starting execution of entry [Pig
Script Executor]
2016/05/02 11:04:28 - Pig_script_executor - Finished execution of job entry
[Pig Script Executor] (result=[true])
2016/05/02 11:04:28 - Pig_script_executor - Finished job execution
2016/05/02 11:04:28 - Pentaho Data Integration - Job execution completed.
2016/05/02 11:04:28 - Pig Script Executor - Pig Script Executor in
Pig_script_executor has been started asynchronously. Pig_script_executor
has been finished and logs from Pig Script Executor can be lost
2016/05/02 11:04:28 - Pig Script Executor - 2016/05/02 11:04:28 -
Connecting to hadoop file system at: hdfs://sigma-server:54310
2016/05/02 11:04:28 - Pig Script Executor - 2016/05/02 11:04:28 -
Connecting to map-reduce job tracker at: sigma-server:8032
2016/05/02 11:04:28 - Pig Script Executor - 2016/05/02 11:04:28 - Empty
string specified for jar path
2016/05/02 11:04:28 - Pig Script Executor - 2016/05/02 11:04:28 - Pig
features used in the script: GROUP_BY
2016/05/02 11:04:28 - Pig Script Executor - 2016/05/02 11:04:28 -
{RULES_ENABLED=[AddForEach, ColumnMapKeyPrune,
DuplicateForEachColumnRewrite, FilterLogicExpressionSimplifier,
GroupByConstParallelSetter, ImplicitSplitInserter, LimitOptimizer,
LoadTypeCastInserter, MergeFilter, MergeForEach,
NewPartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter,
SplitFilter, StreamTypeCastInserter],
RULES_DISABLED=[PartitionFilterOptimizer]}
2016/05/02 11:04:28 - Pig Script Executor - 2016/05/02 11:04:28 - File
concatenation threshold: 100 optimistic? false
2016/05/02 11:04:28 - Pig Script Executor - 2016/05/02 11:04:28 - Choosing
to move algebraic foreach to combiner
2016/05/02 11:04:28 - Pig Script Executor - 2016/05/02 11:04:28 - MR plan
size before optimization: 1
2016/05/02 11:04:28 - Pig Script Executor - 2016/05/02 11:04:28 - MR plan
size after optimization: 1
2016/05/02 11:04:28 - Pig Script Executor - 2016/05/02 11:04:28 - Pig
script settings are added to the job
2016/05/02 11:04:28 - Pig Script Executor - 2016/05/02 11:04:28 -
mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2016/05/02 11:04:28 - Pig Script Executor - 2016/05/02 11:04:28 - Reduce
phase detected, estimating # of required reducers.
2016/05/02 11:04:28 - Pig Script Executor - 2016/05/02 11:04:28 - Using
reducer estimator:
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.InputSizeReducerEstimator
2016/05/02 11:04:28 - Pig Script Executor - 2016/05/02 11:04:28 -
BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=81468050
2016/05/02 11:04:28 - Pig Script Executor - 2016/05/02 11:04:28 - Setting
Parallelism to 1
2016/05/02 11:04:28 - Pig Script Executor - 2016/05/02 11:04:28 - creating
jar file Job6383680088751493933.jar
2016/05/02 11:04:30 - Pig Script Executor - 2016/05/02 11:04:30 - jar file
Job6383680088751493933.jar created
2016/05/02 11:04:30 - Pig Script Executor - 2016/05/02 11:04:30 - Setting
up single store job
2016/05/02 11:04:30 - Pig Script Executor - 2016/05/02 11:04:30 - Key
[pig.schematuple] is false, will not generate code.
2016/05/02 11:04:30 - Pig Script Executor - 2016/05/02 11:04:30 - Starting
process to move generated code to distributed cache
2016/05/02 11:04:30 - Pig Script Executor - 2016/05/02 11:04:30 - Setting
key [pig.schematuple.classes] with classes to deserialize []
2016/05/02 11:04:30 - Pig Script Executor - 2016/05/02 11:04:30 - 1
map-reduce job(s) waiting for submission.
2016/05/02 11:04:31 - Pig Script Executor - 2016/05/02 11:04:31 - Total
input paths to process : 1
2016/05/02 11:04:31 - Pig Script Executor - 2016/05/02 11:04:31 - Total
input paths (combined) to process : 1
2016/05/02 11:04:31 - Pig Script Executor - 2016/05/02 11:04:31 -
HadoopJobId: job_1462181691937_0009
2016/05/02 11:04:31 - Pig Script Executor - 2016/05/02 11:04:31 -
Processing aliases weblog_count,weblog_group,weblogs
2016/05/02 11:04:31 - Pig Script Executor - 2016/05/02 11:04:31 - detailed
locations: M:
weblogs[1,10],weblogs[-1,-1],weblog_count[22,15],weblog_group[21,15] C:
weblog_count[22,15],weblog_group[21,15] R: weblog_count[22,15]
2016/05/02 11:04:31 - Pig Script Executor - 2016/05/02 11:04:31 - More
information at:
http://sigma-server:50030/jobdetails.jsp?jobid=job_1462181691937_0009
2016/05/02 11:04:31 - Pig Script Executor - 2016/05/02 11:04:31 - 0%
complete
2016/05/02 11:04:36 - Pig Script Executor - 2016/05/02 11:04:36 - Ooops!
Some job has failed! Specify -stop_on_failure if you want Pig to stop
immediately on failure.
2016/05/02 11:04:36 - Pig Script Executor - 2016/05/02 11:04:36 - job
job_1462181691937_0009 has failed! Stop running all dependent jobs
2016/05/02 11:04:36 - Pig Script Executor - 2016/05/02 11:04:36 - 100%
complete
2016/05/02 11:04:36 - Pig Script Executor - 2016/05/02 11:04:36 - 1 map
reduce job(s) failed!
2016/05/02 11:04:36 - Pig Script Executor - 2016/05/02 11:04:36 - Script
Statistics:

HadoopVersion    PigVersion    UserId    StartedAt    FinishedAt    Features
2.6.0-cdh5.5.0    0.12.0-cdh5.5.0    msoufiani    2016-05-02 11:04:28
2016-05-02 11:04:36    GROUP_BY

Failed!

Failed Jobs:
JobId    Alias    Feature    Message    Outputs
job_1462181691937_0009    weblog_count,weblog_group,weblogs
GROUP_BY,COMBINER    Message: Job failed!
hdfs://sigma-server:54310/user/hduser/pdi/weblogs/parse/part/mustapha-pentaho.txt,

Input(s):
Failed to read data from
"hdfs://sigma-server:54310/user/hduser/pdi/weblogs/parse/part/weblogs_parse.txt"

Output(s):
Failed to produce result in
"hdfs://sigma-server:54310/user/hduser/pdi/weblogs/parse/part/mustapha-pentaho.txt"

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_1462181691937_0009
2016/05/02 11:04:36 - Pig Script Executor - 2016/05/02 11:04:36 - Failed!
2016/05/02 11:04:36 - Pig Script Executor - 2016/05/02 11:04:36 - ERROR
2244: Job failed, hadoop does not return any error message
2016/05/02 11:04:36 - Pig Script Executor - 2016/05/02 11:04:36 - There is
no log file to write to.
2016/05/02 11:04:36 - Pig Script Executor - 2016/05/02 11:04:36 -
org.apache.pig.backend.executionengine.ExecException: ERROR 2244: Job
failed, hadoop does not return any error message
2016/05/02 11:04:36 - Pig Script Executor -     at
org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:148)
2016/05/02 11:04:36 - Pig Script Executor -     at
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:202)
2016/05/02 11:04:36 - Pig Script Executor -     at
org.pentaho.hadoop.shim.common.PigShimImpl.executeScript(PigShimImpl.java:46)
2016/05/02 11:04:36 - Pig Script Executor -     at
org.pentaho.hadoop.shim.common.delegating.DelegatingPigShim.executeScript(DelegatingPigShim.java:65)
2016/05/02 11:04:36 - Pig Script Executor -     at
org.pentaho.big.data.impl.shim.pig.PigServiceImpl.executeScript(PigServiceImpl.java:103)
2016/05/02 11:04:36 - Pig Script Executor -     at
org.pentaho.big.data.kettle.plugins.pig.JobEntryPigScriptExecutor$1.run(JobEntryPigScriptExecutor.java:499)
2016/05/02 11:04:36 - Pig Script Executor - Num successful jobs: 0 num
failed jobs: 1
*********************************************************************
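Pig only reports "ERROR 2244: Job failed, hadoop does not return any error message", so the underlying cause is usually only visible in the YARN container logs. A sketch of how the application id can be derived from the HadoopJobId printed above in order to pull those logs (assuming log aggregation is enabled on the cluster):

```shell
# The YARN application id is the printed HadoopJobId with the "job_"
# prefix replaced by "application_".
JOB_ID="job_1462181691937_0009"
APP_ID="application_${JOB_ID#job_}"
echo "$APP_ID"

# Then, on a machine with the Hadoop client configured for the cluster:
#   yarn logs -applicationId "$APP_ID"
```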

Here is the Pig script that loads data from the MongoDB server:

*********************************************************************
REGISTER C:\Users\msoufiani\Desktop\pig.mongo.connector\mongo-java-driver-3.2.3-SNAPSHOT.jar;
REGISTER C:\Users\msoufiani\Desktop\pig.mongo.connector\mongo-hadoop-pig-1.5.2.jar;
REGISTER C:\Users\msoufiani\Desktop\pig.mongo.connector\mongo-hadoop-core-1.5.2.jar;

raw = LOAD 'mongodb://sigma-server:27017/mongo_hadoop.MapReduce_test_in'
    USING com.mongodb.hadoop.pig.MongoLoader('id, CONTRACT, PL_PRODUCT_AMC', 'id');
raw_limited = LIMIT raw 100;
--DUMP raw_limited;


STORE raw_limited INTO 'mongodb://sigma-server:27017/mongo_hadoop.MapReduce_test_out'
    USING com.mongodb.hadoop.pig.MongoInsertStorage('');
*********************************************************************

Here is the output:

*********************************************************************
2016/05/02 11:08:49 - Pentaho Data Integration - Starting job ...
2016/05/02 11:08:49 - Pig_script_executor - Starting job
2016/05/02 11:08:49 - Pig_script_executor - Starting execution of entry [Pig
Script Executor]
2016/05/02 11:08:49 - Pig_script_executor - Finished execution of job entry
[Pig Script Executor] (result=[true])
2016/05/02 11:08:49 - Pig_script_executor - Finished job execution
2016/05/02 11:08:49 - Pentaho Data Integration - Job execution completed.
2016/05/02 11:08:49 - Pig Script Executor - Pig Script Executor in
Pig_script_executor has been started asynchronously. Pig_script_executor
has been finished and logs from Pig Script Executor can be lost
2016/05/02 11:08:49 - Pig Script Executor - 2016/05/02 11:08:49 -
Connecting to hadoop file system at: hdfs://sigma-server:54310
2016/05/02 11:08:49 - Pig Script Executor - 2016/05/02 11:08:49 -
Connecting to map-reduce job tracker at: sigma-server:8032
2016/05/02 11:08:49 - Pig Script Executor - 2016/05/02 11:08:49 - Empty
string specified for jar path
2016/05/02 11:08:49 - Pig Script Executor - 2016/05/02 11:08:49 - Pig
features used in the script: LIMIT
2016/05/02 11:08:49 - Pig Script Executor - 2016/05/02 11:08:49 -
{RULES_ENABLED=[AddForEach, ColumnMapKeyPrune,
DuplicateForEachColumnRewrite, FilterLogicExpressionSimplifier,
GroupByConstParallelSetter, ImplicitSplitInserter, LimitOptimizer,
LoadTypeCastInserter, MergeFilter, MergeForEach,
NewPartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter,
SplitFilter, StreamTypeCastInserter],
RULES_DISABLED=[PartitionFilterOptimizer]}
2016/05/02 11:08:49 - Pig Script Executor - 2016/05/02 11:08:49 - File
concatenation threshold: 100 optimistic? false
2016/05/02 11:08:49 - Pig Script Executor - 2016/05/02 11:08:49 - MR plan
size before optimization: 2
2016/05/02 11:08:49 - Pig Script Executor - 2016/05/02 11:08:49 - MR plan
size after optimization: 2
2016/05/02 11:08:49 - Pig Script Executor - 2016/05/02 11:08:49 - Pig
script settings are added to the job
2016/05/02 11:08:49 - Pig Script Executor - 2016/05/02 11:08:49 -
mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2016/05/02 11:08:49 - Pig Script Executor - 2016/05/02 11:08:49 - Reduce
phase detected, estimating # of required reducers.
2016/05/02 11:08:49 - Pig Script Executor - 2016/05/02 11:08:49 - Setting
Parallelism to 1
2016/05/02 11:08:50 - Pig Script Executor - 2016/05/02 11:08:50 - creating
jar file Job8094356474794659191.jar
2016/05/02 11:08:52 - Pig Script Executor - 2016/05/02 11:08:52 - jar file
Job8094356474794659191.jar created
2016/05/02 11:08:52 - Pig Script Executor - 2016/05/02 11:08:52 - Setting
up single store job
2016/05/02 11:08:52 - Pig Script Executor - 2016/05/02 11:08:52 - Key
[pig.schematuple] is false, will not generate code.
2016/05/02 11:08:52 - Pig Script Executor - 2016/05/02 11:08:52 - Starting
process to move generated code to distributed cache
2016/05/02 11:08:52 - Pig Script Executor - 2016/05/02 11:08:52 - Setting
key [pig.schematuple.classes] with classes to deserialize []
2016/05/02 11:08:52 - Pig Script Executor - 2016/05/02 11:08:52 - 1
map-reduce job(s) waiting for submission.
2016/05/02 11:08:53 - Pig Script Executor - 2016/05/02 11:08:53 - Total
input paths (combined) to process : 52
2016/05/02 11:08:53 - Pig Script Executor - 2016/05/02 11:08:53 -
HadoopJobId: job_1462181691937_0011
2016/05/02 11:08:53 - Pig Script Executor - 2016/05/02 11:08:53 -
Processing aliases raw,raw_limited
2016/05/02 11:08:53 - Pig Script Executor - 2016/05/02 11:08:53 - detailed
locations: M: raw[5,6],raw_limited[6,14] C:  R:
2016/05/02 11:08:53 - Pig Script Executor - 2016/05/02 11:08:53 - More
information at:
http://sigma-server:50030/jobdetails.jsp?jobid=job_1462181691937_0011
2016/05/02 11:08:53 - Pig Script Executor - 2016/05/02 11:08:53 - 0%
complete
2016/05/02 11:08:58 - Pig Script Executor - 2016/05/02 11:08:58 - Ooops!
Some job has failed! Specify -stop_on_failure if you want Pig to stop
immediately on failure.
2016/05/02 11:08:58 - Pig Script Executor - 2016/05/02 11:08:58 - job
job_1462181691937_0011 has failed! Stop running all dependent jobs
2016/05/02 11:08:58 - Pig Script Executor - 2016/05/02 11:08:58 - 100%
complete
2016/05/02 11:08:58 - Pig Script Executor - 2016/05/02 11:08:58 - 1 map
reduce job(s) failed!
2016/05/02 11:08:58 - Pig Script Executor - 2016/05/02 11:08:58 - Script
Statistics:

HadoopVersion    PigVersion    UserId    StartedAt    FinishedAt    Features
2.6.0-cdh5.5.0    0.12.0-cdh5.5.0    msoufiani    2016-05-02 11:08:49
2016-05-02 11:08:58    LIMIT

Failed!

Failed Jobs:
JobId    Alias    Feature    Message    Outputs
job_1462181691937_0011    raw,raw_limited        Message: Job failed!

Input(s):
Failed to read data from
"mongodb://sigma-server:27017/mongo_hadoop.MapReduce_test_in"

Output(s):

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_1462181691937_0011    ->    null,
null
2016/05/02 11:08:58 - Pig Script Executor - 2016/05/02 11:08:58 - Failed!
2016/05/02 11:08:58 - Pig Script Executor - 2016/05/02 11:08:58 - ERROR
2244: Job failed, hadoop does not return any error message
2016/05/02 11:08:58 - Pig Script Executor - 2016/05/02 11:08:58 - There is
no log file to write to.
2016/05/02 11:08:58 - Pig Script Executor - 2016/05/02 11:08:58 -
org.apache.pig.backend.executionengine.ExecException: ERROR 2244: Job
failed, hadoop does not return any error message
2016/05/02 11:08:58 - Pig Script Executor -     at
org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:148)
2016/05/02 11:08:58 - Pig Script Executor -     at
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:202)
2016/05/02 11:08:58 - Pig Script Executor -     at
org.pentaho.hadoop.shim.common.PigShimImpl.executeScript(PigShimImpl.java:46)
2016/05/02 11:08:58 - Pig Script Executor -     at
org.pentaho.hadoop.shim.common.delegating.DelegatingPigShim.executeScript(DelegatingPigShim.java:65)
2016/05/02 11:08:58 - Pig Script Executor -     at
org.pentaho.big.data.impl.shim.pig.PigServiceImpl.executeScript(PigServiceImpl.java:103)
2016/05/02 11:08:58 - Pig Script Executor -     at
org.pentaho.big.data.kettle.plugins.pig.JobEntryPigScriptExecutor$1.run(JobEntryPigScriptExecutor.java:499)
2016/05/02 11:08:58 - Pig Script Executor - 2016/05/02 11:08:58 - ERROR
2244: Job failed, hadoop does not return any error message
2016/05/02 11:08:58 - Pig Script Executor - 2016/05/02 11:08:58 - There is
no log file to write to.
2016/05/02 11:08:58 - Pig Script Executor - 2016/05/02 11:08:58 -
org.apache.pig.backend.executionengine.ExecException: ERROR 2244: Job
failed, hadoop does not return any error message
2016/05/02 11:08:58 - Pig Script Executor -     at
org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:148)
2016/05/02 11:08:58 - Pig Script Executor -     at
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:202)
2016/05/02 11:08:58 - Pig Script Executor -     at
org.pentaho.hadoop.shim.common.PigShimImpl.executeScript(PigShimImpl.java:46)
2016/05/02 11:08:58 - Pig Script Executor -     at
org.pentaho.hadoop.shim.common.delegating.DelegatingPigShim.executeScript(DelegatingPigShim.java:65)
2016/05/02 11:08:58 - Pig Script Executor -     at
org.pentaho.big.data.impl.shim.pig.PigServiceImpl.executeScript(PigServiceImpl.java:103)
2016/05/02 11:08:58 - Pig Script Executor -     at
org.pentaho.big.data.kettle.plugins.pig.JobEntryPigScriptExecutor$1.run(JobEntryPigScriptExecutor.java:499)
2016/05/02 11:08:58 - Pig Script Executor - Num successful jobs: 0 num
failed jobs: 2
*********************************************************************
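To rule out a plain connectivity problem between the cluster nodes and MongoDB, a quick check from one of the Hadoop worker nodes might help; a sketch, assuming the mongo shell client is installed there (database and collection names taken from the script above):

```shell
# Sketch: confirm the source collection is reachable and non-empty
# from a cluster node, not just from the Windows client machine.
mongo sigma-server:27017/mongo_hadoop --eval 'db.MapReduce_test_in.count()'
```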

Both of these scripts run successfully via the Pig shell on the server.

Can you help me with this, please?
Thanks in advance.