Posted to user@crunch.apache.org by David Ortiz <dp...@gmail.com> on 2015/06/22 20:28:59 UTC

Retrieving Input File Name with MRPipeline

Hello,

      Is there a way in my Crunch pipeline to retrieve the name of the
input file for my MapFn?  This function is definitely applied as a Mapper,
so I think it should be possible; I'm just having some difficulty working
out the exact method of doing so.

Thanks,
      Dave

Re: Retrieving Input File Name with MRPipeline

Posted by David Ortiz <dp...@gmail.com>.
That did it.  Thanks Josh

On Mon, Jun 22, 2015 at 3:59 PM Josh Wills <jw...@cloudera.com> wrote:

> The InputSplit on the MapContext implements the InputSupplier interface,
> which allows you to get the underlying FileSplit that the map task is
> processing. So you have to do a bunch of casting, but you can get at it.
>
> On Monday, June 22, 2015, David Ortiz <dp...@gmail.com> wrote:
>
>> Gave it a shot in the following MapFn, but it seems to always return null.
>>
>> new MapFn<String, Pair<String, String>>() {
>>
>>    private static final long serialVersionUID = 1L;
>>    int min = minColumns;
>>    int max = maxColumns;
>>
>>    @Override
>>    public Pair<String, String> map(String input) {
>>       //int columns = StringUtils.countMatches(input, "\t") + 1;
>>       int columns = input.split("\t").length;
>>       if (columns >= min && columns <= max) {
>>          StringBuilder output = new StringBuilder(input);
>>          output.append('\t');
>>          String loc = this.getContext().getConfiguration().get(TaskInputOutputContext.MAP_INPUT_FILE);
>>          output.append(loc);
>>          return new Pair<>(output.toString(), null);
>>       } else {
>>          return new Pair<>(null, input);
>>       }
>>    }
>>
>> }
>>
>>
>> Also tried setting crunch.disable.combine.file to true figuring that combine files might mess with it.  No dice.  Does anything look suspect in that snippet?
>>
>>
>> Thanks,
>>
>>     Dave
>>
>>
>> On Mon, Jun 22, 2015 at 2:41 PM Micah Whitacre <mk...@gmail.com>
>> wrote:
>>
>>> The DoFn should give you access to the TaskInputOutputContext[1] which
>>> should contain that information.  I believe the context then should hold
>>> the file as a config like "MAP_INPUT_FILE".  I haven't really tested
>>> this out so definitely verify.
>>>
>>>
>>> [1] -
>>> https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/TaskInputOutputContext.html
>>>
>>> On Mon, Jun 22, 2015 at 1:28 PM, David Ortiz <dp...@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>>       Is there a way in my Crunch pipeline to retrieve the name of the
>>>> input file for my MapFn?  This function is definitely applied as a Mapper,
>>>> so I think it should be possible; I'm just having some difficulty working
>>>> out the exact method of doing so.
>>>>
>>>> Thanks,
>>>>       Dave
>>>>
>>>
>>>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>
>

Re: Retrieving Input File Name with MRPipeline

Posted by Josh Wills <jw...@cloudera.com>.
The InputSplit on the MapContext implements the InputSupplier interface,
which allows you to get the underlying FileSplit that the map task is
processing. So you have to do a bunch of casting, but you can get at it.
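
A rough sketch of that casting, kept deliberately defensive: it assumes, per
the description above, that the split Crunch hands to the map task is a
wrapper that exposes the real split through a Guava Supplier-style get().
That assumption may not hold for every Crunch version, in which case the
instanceof guard simply skips the unwrapping and the helper returns null, so
verify against the Crunch release you are actually running.

import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.MapContext;
import org.apache.hadoop.mapreduce.TaskInputOutputContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import com.google.common.base.Supplier;

// Helper for use inside a MapFn/DoFn running under MRPipeline.
private String currentInputFile() {
   TaskInputOutputContext<?, ?, ?, ?> ctx = getContext();
   if (!(ctx instanceof MapContext)) {
      return null; // only meaningful inside a map task
   }
   InputSplit split = ((MapContext<?, ?, ?, ?>) ctx).getInputSplit();
   // Assumption: the Crunch wrapper split exposes the underlying split via a
   // Supplier-style get(); adjust this cast if your version wraps it differently.
   if (split instanceof Supplier) {
      split = (InputSplit) ((Supplier<?>) split).get();
   }
   if (split instanceof FileSplit) {
      return ((FileSplit) split).getPath().toString();
   }
   return null; // e.g. a combined or otherwise non-file split
}

In the snippet quoted below, the getConfiguration().get(...) lookup that was
coming back null could be replaced with a call to this helper.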

On Monday, June 22, 2015, David Ortiz <dp...@gmail.com> wrote:

> Gave it a shot in the following MapFn, but it seems to always return null.
>
> new MapFn<String, Pair<String, String>>() {
>
>    private static final long serialVersionUID = 1L;
>    int min = minColumns;
>    int max = maxColumns;
>
>    @Override
>    public Pair<String, String> map(String input) {
>       //int columns = StringUtils.countMatches(input, "\t") + 1;
>       int columns = input.split("\t").length;
>       if (columns >= min && columns <= max) {
>          StringBuilder output = new StringBuilder(input);
>          output.append('\t');
>          String loc = this.getContext().getConfiguration().get(TaskInputOutputContext.MAP_INPUT_FILE);
>          output.append(loc);
>          return new Pair<>(output.toString(), null);
>       } else {
>          return new Pair<>(null, input);
>       }
>    }
>
> }
>
>
> Also tried setting crunch.disable.combine.file to true figuring that combine files might mess with it.  No dice.  Does anything look suspect in that snippet?
>
>
> Thanks,
>
>     Dave
>
>
> On Mon, Jun 22, 2015 at 2:41 PM Micah Whitacre <mkwhitacre@gmail.com> wrote:
>
>> The DoFn should give you access to the TaskInputOutputContext[1] which
>> should contain that information.  I believe the context then should hold
>> the file as a config like "MAP_INPUT_FILE".  I haven't really tested
>> this out so definitely verify.
>>
>>
>> [1] -
>> https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/TaskInputOutputContext.html
>>
>> On Mon, Jun 22, 2015 at 1:28 PM, David Ortiz <dpo5003@gmail.com> wrote:
>>
>>> Hello,
>>>
>>>       Is there a way in my Crunch pipeline to retrieve the name of the
>>> input file for my MapFn?  This function is definitely applied as a Mapper,
>>> so I think it should be possible; I'm just having some difficulty working
>>> out the exact method of doing so.
>>>
>>> Thanks,
>>>       Dave
>>>
>>
>>

-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

Re: Retrieving Input File Name with MRPipeline

Posted by David Ortiz <dp...@gmail.com>.
Gave it a shot in the following MapFn, but it seems to always return null.

new MapFn<String, Pair<String, String>>() {

   private static final long serialVersionUID = 1L;
   int min = minColumns;
   int max = maxColumns;

   @Override
   public Pair<String, String> map(String input) {
      //int columns = StringUtils.countMatches(input, "\t") + 1;
      int columns = input.split("\t").length;
      if (columns >= min && columns <= max) {
         StringBuilder output = new StringBuilder(input);
         output.append('\t');
         String loc = this.getContext().getConfiguration().get(TaskInputOutputContext.MAP_INPUT_FILE);
         output.append(loc);
         return new Pair<>(output.toString(), null);
      } else {
         return new Pair<>(null, input);
      }
   }

}


Also tried setting crunch.disable.combine.file to true figuring that
combine files might mess with it.  No dice.  Does anything look
suspect in that snippet?


Thanks,

    Dave


On Mon, Jun 22, 2015 at 2:41 PM Micah Whitacre <mk...@gmail.com> wrote:

> The DoFn should give you access to the TaskInputOutputContext[1] which
> should contain that information.  I believe the context then should hold
> the file as a config like "MAP_INPUT_FILE".  I haven't really tested this
> out so definitely verify.
>
>
> [1] -
> https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/TaskInputOutputContext.html
>
> On Mon, Jun 22, 2015 at 1:28 PM, David Ortiz <dp...@gmail.com> wrote:
>
>> Hello,
>>
>>       Is there a way in my Crunch pipeline to retrieve the name of the
>> input file for my MapFn?  This function is definitely applied as a Mapper,
>> so I think it should be possible; I'm just having some difficulty working
>> out the exact method of doing so.
>>
>> Thanks,
>>       Dave
>>
>
>

Re: Retrieving Input File Name with MRPipeline

Posted by Micah Whitacre <mk...@gmail.com>.
The DoFn should give you access to the TaskInputOutputContext[1] which
should contain that information.  I believe the context then should hold
the file as a config like "MAP_INPUT_FILE".  I haven't really tested this
out so definitely verify.


[1] -
https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/TaskInputOutputContext.html
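
For concreteness, a minimal version of that lookup. "MAP_INPUT_FILE" here is
the Hadoop config key, usually spelled "mapreduce.map.input.file" (new API)
or "map.input.file" (old API). As the rest of the thread shows, this came
back null under MRPipeline, so treat it as a first thing to try rather than
a guaranteed answer.

import org.apache.hadoop.conf.Configuration;

// Inside a MapFn/DoFn: ask the task configuration for the current input file.
// NOTE: elsewhere in this thread the lookup turned out to be null under
// Crunch's MRPipeline; the FileSplit-casting approach in Josh's reply worked.
Configuration conf = getContext().getConfiguration();
String inputFile = conf.get("mapreduce.map.input.file");
if (inputFile == null) {
   inputFile = conf.get("map.input.file"); // legacy key name
}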

On Mon, Jun 22, 2015 at 1:28 PM, David Ortiz <dp...@gmail.com> wrote:

> Hello,
>
>       Is there a way in my Crunch pipeline to retrieve the name of the
> input file for my MapFn?  This function is definitely applied as a Mapper,
> so I think it should be possible; I'm just having some difficulty working
> out the exact method of doing so.
>
> Thanks,
>       Dave
>