Posted to mapreduce-user@hadoop.apache.org by Jonathan Coveney <jc...@gmail.com> on 2011/08/09 23:11:45 UTC

Can you see the name of the document being loaded?

I want to calculate some statistics on a per-document basis, and it seems
like the only way to do this would be to emit a compound key of
(key,documentname).
1) Is this the case, or is there a better way to do this?
2) If this is the only way to calculate on a per-input-file basis, where is
the right place to grab the document name? A custom line reader? What object
is exposed to it?

Re: Can you see the name of the document being loaded?

Posted by Jonathan Coveney <jc...@gmail.com>.

Much obliged, Harsh. Looks perfect.


Re: Can you see the name of the document being loaded?

Posted by Harsh J <ha...@cloudera.com>.

Jonathan,

1. Yes, the compound key method is correct, since you need the document ID
in the key in order to work per document. If you don't want the output
grouped/sorted by document, consider adding it as a value attribute instead.

2. The record reader is the right place, specifically the path attribute
of the FileSplit object. I've detailed before how to extract this
information from within Mappers (both old and new APIs of MR):
http://search-hadoop.com/m/9Nqjm1aqu8a1 has the pointers.
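
To make point 1 concrete, here is a minimal plain-Java sketch of the
compound key idea. It does not use Hadoop at all; the names
PerDocumentStats, DocKey, map and reduce are hypothetical and only imitate
the map/shuffle/reduce flow in memory. In a real job the document name
would not be passed in by hand but taken from the FileSplit's path as
described in point 2 (in the new API this is commonly done with
((FileSplit) context.getInputSplit()).getPath().getName(), assuming a
file-based input format).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch (no Hadoop dependency) of the compound-key idea.
class PerDocumentStats {

    // Hypothetical compound key: (document name, original key).
    static final class DocKey implements Comparable<DocKey> {
        final String document;
        final String key;

        DocKey(String document, String key) {
            this.document = document;
            this.key = key;
        }

        // Order by document first, then by key, so that all records
        // belonging to one document end up next to each other.
        @Override
        public int compareTo(DocKey other) {
            int byDocument = document.compareTo(other.document);
            return byDocument != 0 ? byDocument : key.compareTo(other.key);
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof DocKey)) return false;
            DocKey other = (DocKey) o;
            return document.equals(other.document) && key.equals(other.key);
        }

        @Override
        public int hashCode() {
            return 31 * document.hashCode() + key.hashCode();
        }

        @Override
        public String toString() {
            return "(" + key + "," + document + ")";
        }
    }

    // "Map" step: emit one compound key per whitespace-separated token.
    // Assumption: the caller supplies the document name; a real mapper
    // would read it from its input split instead.
    static List<DocKey> map(String document, String line) {
        List<DocKey> emitted = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) emitted.add(new DocKey(document, token));
        }
        return emitted;
    }

    // "Shuffle + reduce" step: group by compound key and count occurrences.
    static Map<DocKey, Integer> reduce(List<DocKey> emitted) {
        Map<DocKey, Integer> counts = new TreeMap<>();
        for (DocKey k : emitted) counts.merge(k, 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        List<DocKey> emitted = new ArrayList<>();
        emitted.addAll(map("a.txt", "hadoop map hadoop"));
        emitted.addAll(map("b.txt", "hadoop reduce"));
        reduce(emitted).forEach((k, n) -> System.out.println(k + "\t" + n));
    }
}
```

Because the document name is part of the key, the same token in two
different files is counted separately. If grouping by document is not
wanted, the document name can travel in the value instead, as noted in
point 1.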

-- 
Harsh J