Posted to common-user@hadoop.apache.org by Tom Melendez <to...@supertom.com> on 2011/07/25 22:25:33 UTC

Custom FileOutputFormat / RecordWriter

Hi Folks,

Just doing a sanity check here.

I have a map-only job that produces a filename as the key and data as
the value.  I want to write the value (data) to a file named after the
key (filename), under the output path specified when I run the job.

The value (data) doesn't need any formatting, I can just write it to
HDFS without modification.

So, looking at this link (the Output Formats section):

http://developer.yahoo.com/hadoop/tutorial/module5.html

Looks like I want to:
- create a new output format
- override the RecordWriter's write() method so it skips the key, as I
don't want the key written to the output
- add a getRecordWriter() method that uses the key as the filename and
returns my record writer
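For what it's worth, the core behavior I'm after — one output file per key, containing only the raw value bytes — can be sketched outside Hadoop with plain java.io (class and file names here are made up for illustration, this is not a real RecordWriter):

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;

// Sketch of the idea behind the custom RecordWriter: for each (key, value)
// pair, open a file named after the key under the output directory and
// write only the value bytes -- the key itself is never written.
public class KeyNamedWriter {
    private final File outputDir;

    public KeyNamedWriter(File outputDir) {
        this.outputDir = outputDir;
    }

    // Write the value bytes to a file named after the key.
    public void write(String key, byte[] value) throws IOException {
        try (FileOutputStream out = new FileOutputStream(new File(outputDir, key))) {
            out.write(value);
        }
    }

    public static void main(String[] args) throws IOException {
        File dir = Files.createTempDirectory("out").toFile();
        KeyNamedWriter writer = new KeyNamedWriter(dir);
        writer.write("mykey.foo", "raw value bytes".getBytes());
        System.out.println(new String(Files.readAllBytes(new File(dir, "mykey.foo").toPath())));
    }
}
```

In the real thing, the OutputFormat's getRecordWriter() would hand back a writer like this, with the FileOutputStream swapped for an HDFS output stream.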

Sound reasonable?

Thanks,

Tom

-- 
===================
Skybox is hiring.
http://www.skyboximaging.com/careers/jobs

Re: Custom FileOutputFormat / RecordWriter

Posted by Harsh J <ha...@cloudera.com>.
Tom,

You can theoretically add any number of named outputs from a single
task, even from within the map() calls (addNamedOutput and
addMultiNamedOutput check for duplicates themselves, so you don't have
to). So yes, you can keep adding outputs and using them per-key, and
given your earlier details of how many files that's gonna be, I think
MultipleOutputs would behave just fine with its cache of record writers.
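The record-writer cache mentioned above can be pictured with a small stand-in (this is a mock of the caching behavior only, not the Hadoop MultipleOutputs class): the first write for a given name creates a writer, and every later write for that name reuses it.

```java
import java.io.StringWriter;
import java.util.HashMap;
import java.util.Map;

// Mock of MultipleOutputs' caching behavior: one writer per named output,
// created lazily on first use and reused on every subsequent write.
public class CachingOutputs {
    private final Map<String, StringWriter> writers = new HashMap<>();
    private int writersCreated = 0;

    public void write(String name, String value) {
        StringWriter w = writers.get(name);
        if (w == null) {          // first write for this name: create a writer
            w = new StringWriter();
            writers.put(name, w);
            writersCreated++;
        }
        w.write(value);           // later writes reuse the cached writer
    }

    public int writersCreated() { return writersCreated; }

    public String contentsOf(String name) { return writers.get(name).toString(); }
}
```

So with ~40 distinct names the cache holds ~40 writers, regardless of how many write() calls each one receives.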

Regarding your other question: there are certain restrictions on the
names given to MultipleOutputs as named outputs. Specifically, they
accept only [A-Za-z0-9], and an "_" is automatically appended if you are
using multi-named outputs. These restrictions may go away in the future
(0.23+) to allow more flexible naming, however.
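Given the [A-Za-z0-9] restriction, a key like "mykey-someotherstuff.foo" has to be sanitized before it can serve as a named output. A minimal helper (hypothetical, not part of Hadoop) might look like:

```java
// Hypothetical helper: strip every character MultipleOutputs would reject,
// keeping only [A-Za-z0-9] as the named-output restriction requires.
public class NamedOutputNames {
    public static String sanitize(String key) {
        return key.replaceAll("[^A-Za-z0-9]", "");
    }
}
```

Note this is lossy: distinct keys can collapse to the same name (e.g. "a-b" and "a.b" both become "ab"), so you'd want to check for collisions if your keys differ only in punctuation.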

On Tue, Jul 26, 2011 at 9:21 PM, Tom Melendez <to...@supertom.com> wrote:
> Hi Harsh,
>
> Cool, thanks for the details.  For anyone interested, with your tip
> and description I was able to find an example inside the "Hadoop in
> Action" (Chapter 7, p168) book.
>
> Another question, though, it doesn't look like MultipleOutputs will
> let me control the filename in a per-key (per map) manner.  So,
> basically, if my map receives a key of "mykey", I want my file to be
> "mykey-someotherstuff.foo" (this is a binary file).  Am I right about
> this?
>
> Thanks,
>
> Tom

-- 
Harsh J

Re: Custom FileOutputFormat / RecordWriter

Posted by Tom Melendez <to...@supertom.com>.
Hi Harsh,

Cool, thanks for the details.  For anyone interested, with your tip
and description I was able to find an example in the "Hadoop in
Action" book (Chapter 7, p. 168).

Another question, though: it doesn't look like MultipleOutputs will
let me control the filename on a per-key (per-map) basis.  So,
basically, if my map receives a key of "mykey", I want my file to be
"mykey-someotherstuff.foo" (this is a binary file).  Am I right about
this?

Thanks,

Tom

On Tue, Jul 26, 2011 at 1:34 AM, Harsh J <ha...@cloudera.com> wrote:
> Tom,
>
> What I meant to say was that doing this is well supported with
> existing API/libraries itself:
>
> - The class MultipleOutputs supports providing a filename for an
> output. See MultipleOutputs.addNamedOutput usage [1].
> - The type 'NullWritable' is a special writable that doesn't do
> anything. So if its configured into the above filename addition as a
> key-type, and you pass NullWritable.get() as the key in every write
> operation, you will end up just writing the value part of (key,
> value).
> - This way you do not have to write a custom OutputFormat for your use-case.
>
> [1] - http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html
> (Also available for the new API, depending on which
> version/distribution of Hadoop you are on)
>
> On Tue, Jul 26, 2011 at 3:36 AM, Tom Melendez <to...@supertom.com> wrote:
>> Hi Harsh,
>>
>> Thanks for the response.  Unfortunately, I'm not following your response.  :-)
>>
>> Could you elaborate a bit?
>>
>> Thanks,
>>
>> Tom
>>
>> On Mon, Jul 25, 2011 at 2:10 PM, Harsh J <ha...@cloudera.com> wrote:
>>> You can use MultipleOutputs (or MultiTextOutputFormat for direct
>>> key-file mapping, but I'd still prefer the stable MultipleOutputs).
>>> Your sinking Key can be of NullWritable type, and you can keep passing
>>> an instance of NullWritable.get() to it in every cycle. This would
>>> write just the value, while the filenames are added/sourced from the
>>> key inside the mapper code.
>>>
>>> This, if you are not comfortable writing your own code and maintaining
>>> it, I s'pose. Your approach is correct as well, if the question was
>>> specifically that.
>>>
>>> On Tue, Jul 26, 2011 at 1:55 AM, Tom Melendez <to...@supertom.com> wrote:
>>>> Hi Folks,
>>>>
>>>> Just doing a sanity check here.
>>>>
>>>> I have a map-only job, which produces a filename for a key and data as
>>>> a value.  I want to write the value (data) into the key (filename) in
>>>> the path specified when I run the job.
>>>>
>>>> The value (data) doesn't need any formatting, I can just write it to
>>>> HDFS without modification.
>>>>
>>>> So, looking at this link (the Output Formats section):
>>>>
>>>> http://developer.yahoo.com/hadoop/tutorial/module5.html
>>>>
>>>> Looks like I want to:
>>>> - create a new output format
>>>> - override write, tell it not to call writekey as I don't want that written
>>>> - new getRecordWriter method that use the key as the filename and
>>>> calls my outputformat
>>>>
>>>> Sound reasonable?
>>>>
>>>> Thanks,
>>>>
>>>> Tom
>>>>
>>>> --
>>>> ===================
>>>> Skybox is hiring.
>>>> http://www.skyboximaging.com/careers/jobs
>>>>
>>>
>>>
>>>
>>> --
>>> Harsh J
>>>
>>
>>
>>
>> --
>> ===================
>> Skybox is hiring.
>> http://www.skyboximaging.com/careers/jobs
>>
>
>
>
> --
> Harsh J
>



-- 
===================
Skybox is hiring.
http://www.skyboximaging.com/careers/jobs

Re: Custom FileOutputFormat / RecordWriter

Posted by Harsh J <ha...@cloudera.com>.
Tom,

What I meant to say was that this is well supported by the existing
APIs/libraries themselves:

- The class MultipleOutputs supports providing a filename for an
output. See MultipleOutputs.addNamedOutput usage [1].
- The type NullWritable is a special Writable that doesn't write
anything. So if it's configured as the key type for the named output
above, and you pass NullWritable.get() as the key in every write
operation, you will end up writing just the value part of (key,
value).
- This way you do not have to write a custom OutputFormat for your use case.

[1] - http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html
(Also available for the new API, depending on which
version/distribution of Hadoop you are on)


Re: Custom FileOutputFormat / RecordWriter

Posted by Tom Melendez <to...@supertom.com>.
Hi Harsh,

Thanks for the response.  Unfortunately, I'm not following your response.  :-)

Could you elaborate a bit?

Thanks,

Tom


Re: Custom FileOutputFormat / RecordWriter

Posted by Harsh J <ha...@cloudera.com>.
You can use MultipleOutputs (or MultipleTextOutputFormat for direct
key-to-file mapping, but I'd still prefer the stable MultipleOutputs).
Your sink key can be of NullWritable type, and you can keep passing
the singleton NullWritable.get() to it in every write cycle. This
would write just the value, while the filenames are derived from the
key inside the mapper code.

That is, if you'd rather not write and maintain your own code. Your
approach is correct as well, if the question was specifically about
that.
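The NullWritable trick can be illustrated with stand-in types (these only mimic the Writable contract; they are not the Hadoop classes): a key type whose serialization writes zero bytes means each record contributes only its value bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Stand-ins mimicking the Writable idea: each type knows how to serialize
// itself to a stream. A "null" key type writes nothing, so a (key, value)
// record contributes only the value bytes to the output.
public class NullKeyDemo {
    interface FakeWritable {
        void write(DataOutputStream out) throws IOException;
    }

    // Analogue of NullWritable: serializes to zero bytes.
    static final FakeWritable NULL_KEY = out -> { /* write nothing */ };

    // Analogue of a text value: serializes its bytes.
    static FakeWritable text(String s) {
        return out -> out.writeBytes(s);
    }

    // Analogue of a record writer that emits key then value.
    static byte[] writeRecord(FakeWritable key, FakeWritable value) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        key.write(out);
        value.write(out);
        return buf.toByteArray();
    }
}
```

With the real API the same effect comes from configuring the named output's key class as NullWritable and passing NullWritable.get() on every write, as described above.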


Re: Custom FileOutputFormat / RecordWriter

Posted by Tom Melendez <to...@supertom.com>.
Hi Robert,

In this specific case, that's OK.  I'll never write to the same file
from two different mappers.  Otherwise, does the approach look sound?
I haven't worked with output formats before.

Thanks,

Tom

On Mon, Jul 25, 2011 at 1:30 PM, Robert Evans <ev...@yahoo-inc.com> wrote:
> Tom,
>
> That assumes that you will never write to the same file from two different mappers or processes.  HDFS currently does not support writing to a single file from multiple processes.
>
> --Bobby

Re: Custom FileOutputFormat / RecordWriter

Posted by Tom Melendez <to...@supertom.com>.
Hi Bobby,

Yeah, that won't be a big deal in this case.  It will create about 40
files of roughly 60MB each.  This job is kind of an odd one that
won't be run very often.

Thanks,

Tom

On Mon, Jul 25, 2011 at 1:34 PM, Robert Evans <ev...@yahoo-inc.com> wrote:
> Tom,
>
> I also forgot to mention that if you are writing to lots of little files it could cause issues too.  HDFS is designed to handle relatively few BIG files.  There is some work to improve this, but it is still a ways off.  So it is likely going to be very slow and put a big load on the namenode if you are going to create lot of small files using this method.
>
> --Bobby

Re: Custom FileOutputFormat / RecordWriter

Posted by Robert Evans <ev...@yahoo-inc.com>.
Tom,

I also forgot to mention that writing lots of little files can cause issues too.  HDFS is designed to handle relatively few BIG files.  There is some work to improve this, but it is still a ways off.  So it is likely going to be very slow and put a big load on the namenode if you create a lot of small files using this method.

--Bobby



Re: Custom FileOutputFormat / RecordWriter

Posted by Robert Evans <ev...@yahoo-inc.com>.
Tom,

That assumes that you will never write to the same file from two different mappers or processes.  HDFS currently does not support writing to a single file from multiple processes.

--Bobby
