Posted to user@chukwa.apache.org by Guillermo Pérez <bi...@tuenti.com> on 2010/03/08 16:05:49 UTC

Launching different record & reducers from mapper

I'm emitting several Chukwa records with different keys and reducers,
so I can generate some aggregated record data directly while loading
data.

The map / reduce works well, but the data that passes to the aggregator
reducer is not stored to HDFS. Does anybody know why?

2010-03-08 15:28:28,784 INFO main JobClient - Job complete:
job_201002191418_1463
2010-03-08 15:28:28,800 INFO main JobClient - Counters: 29
2010-03-08 15:28:28,800 INFO main JobClient -   DemuxReduceOutput
2010-03-08 15:28:28,800 INFO main JobClient -     total records=1416018
2010-03-08 15:28:28,800 INFO main JobClient -     ActionLog records=1416018
2010-03-08 15:28:28,800 INFO main JobClient -   DemuxMapOutput
2010-03-08 15:28:28,800 INFO main JobClient -     ActionLogAggregateWeights records=1416018
2010-03-08 15:28:28,800 INFO main JobClient -     total records=2832036
2010-03-08 15:28:28,800 INFO main JobClient -     ActionLog records=1416018
2010-03-08 15:28:28,801 INFO main JobClient -   Job Counters
2010-03-08 15:28:28,801 INFO main JobClient -     Launched reduce tasks=9
2010-03-08 15:28:28,801 INFO main JobClient -     Rack-local map tasks=1
2010-03-08 15:28:28,801 INFO main JobClient -     Launched map tasks=2
2010-03-08 15:28:28,801 INFO main JobClient -     Data-local map tasks=1
2010-03-08 15:28:28,801 INFO main JobClient -   DemuxMapInput
2010-03-08 15:28:28,801 INFO main JobClient -     ActionLog chunks=610
2010-03-08 15:28:28,801 INFO main JobClient -     total chunks=610
2010-03-08 15:28:28,801 INFO main JobClient -   DemuxReduceInput
2010-03-08 15:28:28,801 INFO main JobClient -     total distinct keys=58400
2010-03-08 15:28:28,801 INFO main JobClient -     ActionLog total distinct keys=57600
2010-03-08 15:28:28,801 INFO main JobClient -     ActionLogAggregateWeights total distinct keys=800
2010-03-08 15:28:28,801 INFO main JobClient -   FileSystemCounters
2010-03-08 15:28:28,801 INFO main JobClient -     FILE_BYTES_READ=1001600403
2010-03-08 15:28:28,802 INFO main JobClient -     HDFS_BYTES_READ=85558794
2010-03-08 15:28:28,802 INFO main JobClient -     FILE_BYTES_WRITTEN=1501914817
2010-03-08 15:28:28,802 INFO main JobClient -     HDFS_BYTES_WRITTEN=325688807
2010-03-08 15:28:28,802 INFO main JobClient -   Map-Reduce Framework
2010-03-08 15:28:28,802 INFO main JobClient -     Reduce input groups=58400
2010-03-08 15:28:28,802 INFO main JobClient -     Combine output records=0
2010-03-08 15:28:28,802 INFO main JobClient -     Map input records=610
2010-03-08 15:28:28,802 INFO main JobClient -     Reduce shuffle bytes=342907105
2010-03-08 15:28:28,802 INFO main JobClient -     Reduce output records=1416018
2010-03-08 15:28:28,802 INFO main JobClient -     Spilled Records=8496108
2010-03-08 15:28:28,802 INFO main JobClient -     Map output bytes=493557805
2010-03-08 15:28:28,802 INFO main JobClient -     Map input bytes=85558588
2010-03-08 15:28:28,802 INFO main JobClient -     Combine input records=0
2010-03-08 15:28:28,802 INFO main JobClient -     Map output records=2832036
2010-03-08 15:28:28,802 INFO main JobClient -     Reduce input records=2832036

ActionLog is stored in the repository dir, but I can't find anything
about ActionLogAggregateWeights...


-- 
Guille -ℬḭṩḩø- <bi...@tuenti.com>
:wq

Re: Launching different record & reducers from mapper

Posted by Corbin Hoenes <co...@tynt.com>.
Logs are in /var/log/hadoop/userlogs/attempt*.

You can also set up your own log4j logger for your class to log your own messages; this is very helpful when creating custom demux processors.

static Logger log = Logger.getLogger(MyClass.class);
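For instance, a minimal sketch of how such a logger might be used inside a custom demux processor (the class name MyProcessor and the log message are illustrative, not from this thread):

import org.apache.log4j.Logger;

public class MyProcessor {
    static Logger log = Logger.getLogger(MyProcessor.class);

    void parse(String recordEntry) {
        try {
            // ... record parsing logic ...
        } catch (Exception e) {
            // These messages show up in the task attempt logs
            // under /var/log/hadoop/userlogs/attempt*
            log.warn("Failed to parse record: " + recordEntry, e);
        }
    }
}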

On Mar 9, 2010, at 1:57 AM, Guillermo Pérez wrote:

> Ok, I think I've found the problem: the reducer class was failing.
> Using a simpler one works, so I must fix the complex one :). Is there
> any way of capturing the error log of the map / reduce classes?
> 
> -- 
> Guille -ℬḭṩḩø- <bi...@tuenti.com>
> :wq


Re: Launching different record & reducers from mapper

Posted by Eric Yang <ey...@yahoo-inc.com>.
Yes, hook up your Hadoop with the Chukwa log4j appender, stream over the
task tracker logs, and run another demux to figure out the problem.  Or look
at the logs for the task attempts from the JobTracker UI.  I think the
latter is more efficient.
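For the first option, a rough log4j.properties sketch; the appender class and its options vary across Chukwa versions, so treat the names below (especially recordType) as assumptions to verify against your installation:

# Route Hadoop daemon logging through Chukwa's log4j appender
log4j.rootLogger=INFO, CHUKWA
log4j.appender.CHUKWA=org.apache.hadoop.chukwa.inputtools.log4j.ChukwaDailyRollingFileAppender
log4j.appender.CHUKWA.File=${hadoop.log.dir}/${hadoop.log.file}
log4j.appender.CHUKWA.recordType=HadoopLog
log4j.appender.CHUKWA.layout=org.apache.log4j.PatternLayout
log4j.appender.CHUKWA.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n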

Regards,
Eric

On 3/9/10 12:57 AM, "Guillermo Pérez" <bi...@tuenti.com> wrote:

> Ok, I think I've found the problem: the reducer class was failing.
> Using a simpler one works, so I must fix the complex one :). Is there
> any way of capturing the error log of the map / reduce classes?


Re: Launching different record & reducers from mapper

Posted by Guillermo Pérez <bi...@tuenti.com>.
Ok, I think I've found the problem: the reducer class was failing.
Using a simpler one works, so I must fix the complex one :). Is there
any way of capturing the error log of the map / reduce classes?

-- 
Guille -ℬḭṩḩø- <bi...@tuenti.com>
:wq

Re: Launching different record & reducers from mapper

Posted by Guillermo Pérez <bi...@tuenti.com>.
On Mon, Mar 8, 2010 at 20:15, Eric Yang <ey...@yahoo-inc.com> wrote:
> It doesn't look like you are splitting records in the mapper phase to the
> reducer type ActionLogAggregateWeights.  The current demux is partitioned
> by the reducer record type, so if the record is only split in the reduce
> phase, it will not work.  Take a look at the Top mapper class: it calls
> buildGenericRecord to set the reducer type for partitioning.  The ActionLog
> mapper should mirror the data and send it to both the ActionLog and
> ActionLogAggregateWeights reducer classes.  Hope this helps.

I think I'm doing that. In the mapper I prepare two records with two keys,
and call key.setReduceType() with a different type on each. One uses the
default identity reducer, and the other a special reduce class that combines
records to generate aggregates.

> Note that reducer partitioning by RecordType is not correctly implemented
> in the current demux: Chukwa requires a single reducer per data type to run
> correctly.  If a single record type generates a large amount of data, the
> reducer for that type becomes the bottleneck of demux.  Hence, demux is
> going to change when the Avro Input/Output format is ready.  I am not sure
> whether this will impact your implementation, but it is something to keep
> in mind.

I'm just generating two records out of each record I map: one just to log
it, and the other just for aggregation, with more fields in the key and just
a counter in the record itself.
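For reference, a minimal sketch of this pattern against the demux processor
API of that era; the class name ActionLogProcessor, the grouping key, and
the "count" field are illustrative assumptions, not code from this thread,
and the exact AbstractProcessor signatures should be checked against your
Chukwa version:

import org.apache.hadoop.chukwa.extraction.demux.processor.mapper.AbstractProcessor;
import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecord;
import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecordKey;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ActionLogProcessor extends AbstractProcessor {
    @Override
    protected void parse(String recordEntry,
                         OutputCollector<ChukwaRecordKey, ChukwaRecord> output,
                         Reporter reporter) throws Throwable {
        long time = System.currentTimeMillis(); // stand-in for a timestamp
                                                // parsed from recordEntry

        // Record 1: the raw entry, routed to the default identity reducer.
        ChukwaRecord logRecord = new ChukwaRecord();
        this.buildGenericRecord(logRecord, recordEntry, time, "ActionLog");
        output.collect(key, logRecord);

        // Record 2: the same data keyed for aggregation. buildGenericRecord
        // sets the reduce type on the key, which is what routes the record
        // to the ActionLogAggregateWeights reducer.
        ChukwaRecord aggRecord = new ChukwaRecord();
        this.buildGenericRecord(aggRecord, recordEntry, time,
                                "ActionLogAggregateWeights");
        key.setKey(time + "/someGroupingField"); // extra fields in the key
        aggRecord.add("count", "1");             // just a counter in the record
        output.collect(key, aggRecord);
    }
}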

-- 
Guille -ℬḭṩḩø- <bi...@tuenti.com>
:wq

Re: Launching different record & reducers from mapper

Posted by Eric Yang <ey...@yahoo-inc.com>.
It doesn't look like you are splitting records in the mapper phase to the
reducer type ActionLogAggregateWeights.  The current demux is partitioned
by the reducer record type, so if the record is only split in the reduce
phase, it will not work.  Take a look at the Top mapper class: it calls
buildGenericRecord to set the reducer type for partitioning.  The ActionLog
mapper should mirror the data and send it to both the ActionLog and
ActionLogAggregateWeights reducer classes.  Hope this helps.

Note that reducer partitioning by RecordType is not correctly implemented
in the current demux: Chukwa requires a single reducer per data type to run
correctly.  If a single record type generates a large amount of data, the
reducer for that type becomes the bottleneck of demux.  Hence, demux is
going to change when the Avro Input/Output format is ready.  I am not sure
whether this will impact your implementation, but it is something to keep
in mind.
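To illustrate why a hot record type becomes the bottleneck: if partitioning
is done by reduce type, every record of a given type lands on the same
reducer. A hypothetical partitioner sketch of the behavior described
(Chukwa's actual partitioner class may differ):

import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecord;
import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecordKey;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class ReduceTypePartitioner
        implements Partitioner<ChukwaRecordKey, ChukwaRecord> {
    public int getPartition(ChukwaRecordKey key, ChukwaRecord value,
                            int numReduceTasks) {
        // All records sharing a reduce type hash to the same reducer,
        // so one high-volume type serializes onto a single task.
        return (key.getReduceType().hashCode() & Integer.MAX_VALUE)
                % numReduceTasks;
    }

    public void configure(JobConf conf) {}
}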

Regards,
Eric

On 3/8/10 7:05 AM, "Guillermo Pérez" <bi...@tuenti.com> wrote:

> I'm emitting several Chukwa records with different keys and reducers,
> so I can generate some aggregated record data directly while loading
> data.
> 
> The map / reduce works well, but the data that passes to the aggregator
> reducer is not stored to HDFS. Does anybody know why?
> 
> [counter log snipped; identical to the counters in the original message above]
> 
> ActionLog is stored in the repository dir, but I can't find anything
> about ActionLogAggregateWeights...
>