Posted to common-user@hadoop.apache.org by "Periya.Data" <pe...@gmail.com> on 2012/08/30 05:52:39 UTC

no output written to HDFS

Hi All,
   My Hadoop streaming job (in Python) runs to "completion" (both map and
reduce say 100% complete), but when I look at the output directory in
HDFS, the part files are empty. I do not know what is causing this
behavior. I understand that the percentages represent the records that
have been read in, not necessarily processed.

Some of the logs follow. The detailed logs from Cloudera Manager say that
there were no map outputs, which is interesting. Any suggestions?


12/08/30 03:27:14 INFO streaming.StreamJob: To kill this job, run:
12/08/30 03:27:14 INFO streaming.StreamJob: /usr/lib/hadoop-0.20/bin/hadoop
job  -Dmapred.job.tracker=xxxxx.yyy.com:8021 -kill job_201208232245_3182
12/08/30 03:27:14 INFO streaming.StreamJob: Tracking URL:
http://xxxxxx.yyyy.com:60030/jobdetails.jsp?jobid=job_201208232245_3182
12/08/30 03:27:15 INFO streaming.StreamJob:  map 0%  reduce 0%
12/08/30 03:27:20 INFO streaming.StreamJob:  map 33%  reduce 0%
12/08/30 03:27:23 INFO streaming.StreamJob:  map 67%  reduce 0%
12/08/30 03:27:29 INFO streaming.StreamJob:  map 100%  reduce 0%
12/08/30 03:27:33 INFO streaming.StreamJob:  map 100%  reduce 100%
12/08/30 03:27:35 INFO streaming.StreamJob: Job complete:
job_201208232245_3182
12/08/30 03:27:35 INFO streaming.StreamJob: Output: /user/GHU
Thu Aug 30 03:27:24 GMT 2012
*** END
bash-3.2$
bash-3.2$ hadoop fs -ls /user/ghu/
Found 5 items
-rw-r--r--   3 ghu hadoop          0 2012-08-30 03:27 /user/GHU/_SUCCESS
drwxrwxrwx   - ghu hadoop          0 2012-08-30 03:27 /user/GHU/_logs
-rw-r--r--   3 ghu hadoop          0 2012-08-30 03:27 /user/GHU/part-00000
-rw-r--r--   3 ghu hadoop          0 2012-08-30 03:27 /user/GHU/part-00001
-rw-r--r--   3 ghu hadoop          0 2012-08-30 03:27 /user/GHU/part-00002
bash-3.2$
--------------------------------------------------------------------------------------------------------------------


Metadata
  Status: Succeeded
  Type: MapReduce
  Id: job_201208232245_3182
  Name: CaidMatch
  User: srisrini
  Mapper class: PipeMapper
  Reducer class: -
  Scheduler pool name: default
  Job input directory: hdfs://xxxxx.yyy.txt,
      hdfs://xxxx.yyyy.com/user/GHUcaidlist.txt
  Job output directory: hdfs://xxxx.yyyy.com/user/GHU/

Timing
  Duration: 20.977s
  Submit time: Wed, 29 Aug 2012 08:27 PM
  Start time: Wed, 29 Aug 2012 08:27 PM
  Finish time: Wed, 29 Aug 2012 08:27 PM

Progress and Scheduling
  Map progress: 100.0%
  Reduce progress: 100.0%
  Launched maps: 4
  Data-local maps: 3
  Rack-local maps: 1
  Other local maps: -
  Desired maps: 3
  Launched reducers: -
  Desired reducers: 0
  Fairscheduler running tasks: -
  Fairscheduler minimum share: -
  Fairscheduler demand: -

Current Resource Usage
  Current User CPUs: 0
  Current System CPUs: 0
  Resident memory: 0 B
  Running maps: 0
  Running reducers: 0

Aggregate Resource Usage and Counters
  User CPU: 0s
  System CPU: 0s
  Map slot time: 12.135s
  Reduce slot time: 0s
  Cumulative disk reads: -
  Cumulative disk writes: 155.0 KiB
  Cumulative HDFS reads: 3.6 KiB
  Cumulative HDFS writes: -
  Map input bytes: 2.5 KiB
  Map input records: 45
  Map output records: 0
  Reducer input groups: -
  Reducer input records: -
  Reducer output records: -
  Reducer shuffle bytes: -
  Spilled records: -

Re: no output written to HDFS

Posted by Håvard Wahl Kongsgård <ha...@gmail.com>.
For Python streaming, go with dumbo: https://github.com/klbostee/dumbo/wiki

or with pipes via pydoop: http://pydoop.sourceforge.net/docs/pipes
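
A typical launch looks roughly like this (untested sketch: wordcount.py
stands for any dumbo-style script defining mapper/reducer functions, and
the paths are placeholders for your cluster):

dumbo start wordcount.py \
    -hadoop /usr/lib/hadoop \
    -input /user/ghu/input.txt \
    -output /user/ghu/wordcount_out

Either way you avoid hand-wiring the streaming command line, which is
where plain streaming jobs usually go wrong.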

-Håvard

On Thu, Aug 30, 2012 at 5:52 AM, Periya.Data <pe...@gmail.com> wrote:
> Hi All,
>    My Hadoop streaming job (in Python) runs to "completion" (both map
> and reduce say 100% complete), but when I look at the output directory
> in HDFS, the part files are empty.
> [rest of original message and counters snipped]



-- 
Håvard Wahl Kongsgård
Faculty of Medicine &
Department of Mathematical Sciences
NTNU

http://havard.security-review.net/

Re: no output written to HDFS

Posted by Hemanth Yamijala <yh...@gmail.com>.
Hi,

Do both input files contain data that needs to be processed by the
mapper in the same fashion? If so, you could just put both input files
under a directory in HDFS and provide that directory as input: the
-input option accepts a directory as its argument.
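
For example (paths are placeholders):

hadoop fs -mkdir /user/ghu/input
hadoop fs -put file1.txt file2.txt /user/ghu/input/

and then pass the single option -input /user/ghu/input to the streaming
jar.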

Otherwise, can you please explain a little more about what you're trying
to do with the two inputs?

Thanks
Hemanth

On Fri, Aug 31, 2012 at 3:00 AM, Periya.Data <pe...@gmail.com> wrote:
> This is interesting. I changed my command to:
>
> -mapper "cat $1 |  $GHU_HOME/test2.py $2" \
>
> and it now produces output in HDFS. [rest of quoted thread snipped]

Re: no output written to HDFS

Posted by "Periya.Data" <pe...@gmail.com>.
This is interesting. I changed my command to:

-mapper "cat $1 |  $GHU_HOME/test2.py $2" \

and it now produces output in HDFS. But the output is not what I
expected, and it is not the same as when I do "cat | map" on Linux. It
produces part-00000, part-00001, and part-00002; I expected a single
output file with just 2 records.

I think I have to understand what exactly "-file" does and what exactly
"-input" does. I am experimenting with what happens if I give my input
files on the command line (like: test2.py arg1 arg2) as opposed to
specifying them via the "-file" and "-input" options...

The problem is that I have 2 input files and have no idea how to pass
them. Should I keep one in HDFS and stream in the other?
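
My current understanding (still to be verified): -file ships a local
file into each task's working directory, where the script can open it by
its basename, while -input names HDFS data whose records are streamed to
the mapper on STDIN. If that is right, a two-input pattern would look
roughly like this (file names below are placeholders):

# bigfile.txt stays in HDFS and is streamed to the mapper's STDIN;
# caidlist.txt is shipped to each task via -file and opened by basename
# inside test2.py.
hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.*-cdh*.jar \
        -D mapred.reduce.tasks=0 \
        -input  /user/ghu/bigfile.txt \
        -output /user/ghu/out \
        -file   "$GHU_HOME/test2.py" \
        -file   "$GHU_HOME/caidlist.txt" \
        -mapper "python test2.py caidlist.txt"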

More digging,
PD/



On Thu, Aug 30, 2012 at 11:52 AM, Periya.Data <pe...@gmail.com> wrote:

> Hi Bertrand,
>     No, I do not observe the same behavior when I run it locally using
> cat | map; I can see the output on STDOUT when I run my program.
> [rest of quoted thread snipped]

Re: no output written to HDFS

Posted by "Periya.Data" <pe...@gmail.com>.
Hi Bertrand,
    No, I do not observe the same behavior when I run it locally using
cat | map; I can see the output on STDOUT when I run my program.

I do not have any reducer. In my command, I provide
"-D mapred.reduce.tasks=0". So, I expect the output of the mapper to be
written directly to HDFS.

Your suspicion about the output may be right. In my counters, "map input
records" = 40 and "map output records" = 0. I am trying to see if I am
messing up my command (see below).

Initially, my mapper (test2.py) took two arguments. Now I am streaming
one file in, and test2.py takes only one argument. How should I frame my
command below? I think that is where I am messing up.


run.sh:        (run as:   cat <arg2> | ./run.sh <arg1> )
-----------

hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.*-cdh*.jar \
        -D mapred.reduce.tasks=0 \
        -verbose \
        -input "$HDFS_INPUT" \
        -input "$HDFS_INPUT_2" \
        -output "$HDFS_OUTPUT" \
        -file   "$GHU_HOME/test2.py" \
        -mapper "python $GHU_HOME/test2.py $1" \
        -file   "$GHU_HOME/$1"



If I modify my mapper to take in 2 arguments, then, I would run it as:

run.sh:        (run as:   ./run.sh <arg1>  <arg2>)
-----------

hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.*-cdh*.jar \
        -D mapred.reduce.tasks=0 \
        -verbose \
        -input "$HDFS_INPUT" \
        -input "$HDFS_INPUT_2" \
        -output "$HDFS_OUTPUT" \
        -file   "$GHU_HOME/test2.py" \
        -mapper "python $GHU_HOME/test2.py $1 $2" \
        -file   "$GHU_HOME/$1" \
        -file   "GHU_HOME/$2"


Please let me know if I am making a mistake here.


Thanks.
PD





On Wed, Aug 29, 2012 at 10:45 PM, Bertrand Dechoux <de...@gmail.com>wrote:

> Do you observe the same thing when running without Hadoop (cat, map,
> sort, and then reduce)?
> [rest of quoted message snipped]

Re: no output written to HDFS

Posted by Bertrand Dechoux <de...@gmail.com>.
Do you observe the same thing when running without Hadoop (cat, map,
sort, and then reduce)?
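
For example, a streaming job can be simulated locally with (script names
are placeholders):

cat sample_input.txt | python mapper.py | sort | python reducer.py

If the local pipeline prints records but the cluster job writes none,
the difference is usually in how the command line and its inputs reach
the tasks.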

Could you provide the counters of your job? You should be able to get them
using the job tracker interface.

Without more information, the most probable answer is that your reducer
does not output any <key,value> pairs.

Regards

Bertrand



On Thu, Aug 30, 2012 at 5:52 AM, Periya.Data <pe...@gmail.com> wrote:

> Hi All,
>    My Hadoop streaming job (in Python) runs to "completion" (both map
> and reduce say 100% complete), but when I look at the output directory
> in HDFS, the part files are empty.
> [rest of original message and counters snipped]



-- 
Bertrand Dechoux