
Posted to common-user@hadoop.apache.org by Tony Burton <TB...@SportingIndex.com> on 2013/02/01 16:12:42 UTC

RE: hadoop 1.0.3 equivalent of MultipleTextOutputFormat

Thanks for the reply Alejandro. Using a temp output directory was my first guess as well. What's the best way to proceed? I've come across FileSystem.rename but it's consistently returning false for whatever Paths I provide. Specifically, I need to copy the following:

s3://<path to data>/<tmp folder>/<object type 1>/part-00000
...
s3://<path to data>/<tmp folder>/<object type 1>/part-nnnnn
s3://<path to data>/<tmp folder>/<object type 2>/part-00000
...
s3://<path to data>/<tmp folder>/<object type 2>/part-nnnnn
...
s3://<path to data>/<tmp folder>/<object type m>/part-nnnnn

to

s3://<path to data>/<object type 1>/part-00000
...
s3://<path to data>/<object type 1>/part-nnnnn
s3://<path to data>/<object type 2>/part-00000
...
s3://<path to data>/<object type 2>/part-nnnnn
...
s3://<path to data>/<object type m>/part-nnnnn

without doing a copyToLocal.

Any tips? Are there any better alternatives to FileSystem.rename? Or would using the AWS Java SDK be a better solution?
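
As a rough illustration of the move described above, the destination of each object can be derived by dropping the temp-folder segment from its path. The sketch below is plain Java and only builds the source/destination strings; the class name TmpPathMapper, its methods and the sample bucket are hypothetical. Assuming Hadoop's FileSystem API, each pair would then be passed to FileSystem.rename(Path, Path). It may be worth calling FileSystem.mkdirs on the destination's parent first, on the assumption (to be verified against the S3 implementation in use) that a missing parent is one reason rename can return false.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class TmpPathMapper {

    /** Maps a path under base/tmpFolder/ to the same relative path under base/. */
    static String finalPathFor(String base, String tmpFolder, String tmpPath) {
        String prefix = base + "/" + tmpFolder + "/";
        if (!tmpPath.startsWith(prefix)) {
            throw new IllegalArgumentException("Not under " + prefix + ": " + tmpPath);
        }
        return base + "/" + tmpPath.substring(prefix.length());
    }

    /** Builds {source, destination} pairs for a list of temp paths. */
    static List<String[]> movePlan(String base, String tmpFolder, List<String> tmpPaths) {
        List<String[]> plan = new ArrayList<String[]>();
        for (String src : tmpPaths) {
            plan.add(new String[] {src, finalPathFor(base, tmpFolder, src)});
        }
        return plan;
    }

    public static void main(String[] args) {
        // Hypothetical bucket and object-type names, for illustration only.
        List<String> tmpPaths = Arrays.asList(
                "s3://bucket/data/tmp/typeA/part-00000",
                "s3://bucket/data/tmp/typeB/part-00000");
        for (String[] pair : movePlan("s3://bucket/data", "tmp", tmpPaths)) {
            // With Hadoop on the classpath, this is where
            // fs.rename(new Path(pair[0]), new Path(pair[1])) would go.
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

Keeping the path mapping separate from the actual rename makes it easy to log the plan and check it before touching any data.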

Thanks!

Tony






From: Alejandro Abdelnur [mailto:tucu@cloudera.com]
Sent: 31 January 2013 18:45
To: common-user@hadoop.apache.org
Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat

Hi Tony, from what I understand, your problem is not with MTOF but with wanting to run two jobs using the same output directory: the second job will fail because the output dir already exists. My take would be to tweak your jobs to use a temp output dir and move the output to the required (final) location upon completion.

thx


On Thu, Jan 31, 2013 at 8:22 AM, Tony Burton <TB...@sportingindex.com> wrote:
Hi everyone,

Some of you might recall this topic, which I worked on with the list's help back in August last year - see email trail below. Despite the initial success of the discovery, I had to shelve the approach, as the implementation ultimately used for that particular project took a different route (for reasons I forget!).

I'm now in a position to be working on a similar new task, where I've successfully implemented the combination of LazyOutputFormat and MultipleOutputs using hadoop 1.0.3 to write out to multiple custom output locations. However, I've hit another snag which I'm hoping you might help me work through.

I'm going to be running daily tasks to extract data from XML files (specifically, the data stored in certain nodes of the XML), stored on AWS S3 using object names with the following format:

s3://inputbucket/data/2013/1/13/<list of xml data files.bz2>

I want to extract items from the XML and write out as follows:

s3://outputbucket/path/<xml node name>/20130113/<output from MR job>

For one day of data, this works fine. I pass in s3://inputbucket/data and s3://outputbucket/path as input and output arguments, along with my run date (20130113) which gets manipulated and appended where appropriate to form the precise read and write locations, for example

FileInputFormat.addInputPath(job, new Path("s3://inputbucket/data"));
FileOutputFormat.setOutputPath(job, new Path("s3://outputbucket/path"));

Then MultipleOutputs adds on my XML node names underneath s3://outputbucket/path automatically.
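
A minimal sketch of how the run date could be turned into the read location and into the relative path handed to MultipleOutputs, in plain Java (java.time, so Java 8 or later is assumed). The class and method names are hypothetical, and the "part" file-name prefix is an assumption borrowed from the "sml/part" example further down the thread. The string returned by baseOutputPath is the kind of value that would be passed as the last argument of MultipleOutputs.write(key, value, baseOutputPath).

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

final class RunDatePaths {

    /** Input directory for a run date given as yyyyMMdd, e.g. 20130113 -> root/2013/1/13. */
    static String inputDir(String inputRoot, String runDate) {
        LocalDate d = LocalDate.parse(runDate, DateTimeFormatter.BASIC_ISO_DATE);
        // Month and day are not zero-padded in the input layout quoted above.
        return inputRoot + "/" + d.getYear() + "/" + d.getMonthValue() + "/" + d.getDayOfMonth();
    }

    /** Path relative to the job output directory: node/runDate/part. */
    static String baseOutputPath(String xmlNodeName, String runDate) {
        // Parsing here only validates the date; the yyyyMMdd string is reused as-is.
        LocalDate.parse(runDate, DateTimeFormatter.BASIC_ISO_DATE);
        return xmlNodeName + "/" + runDate + "/part";
    }

    public static void main(String[] args) {
        System.out.println(inputDir("s3://inputbucket/data", "20130113"));
        System.out.println(baseOutputPath("nodeA", "20130113")); // "nodeA" is a made-up node name
    }
}
```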

However, for the next day's run, the job sees at submission that the output path (s3://outputbucket/path) already exists and throws a FileAlreadyExistsException from FileOutputFormat.checkOutputSpecs(), even though my ultimate subdirectory, to be constructed by MultipleOutputs, does not already exist.
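
The behaviour can be modelled with a small local-filesystem analogy in plain Java. This is not Hadoop's code: the class OutputDirCheck and its methods are hypothetical, and the sketch only assumes that the job-level check asks whether the configured output directory exists, without looking at the subdirectories a run would actually write to.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class OutputDirCheck {

    /** Mimics the job-level check: reject the job if the output dir exists at all. */
    static boolean wouldReject(Path jobOutputDir) {
        return Files.exists(jobOutputDir);
    }

    /** Simulates two daily runs that share one job-level output directory. */
    static boolean[] simulateTwoRuns() {
        try {
            Path root = Files.createTempDirectory("outputbucket");
            Path jobOutputDir = root.resolve("path");

            // Day 1: nothing exists yet, so the check passes.
            boolean day1Rejected = wouldReject(jobOutputDir);
            Files.createDirectories(jobOutputDir.resolve("nodeA").resolve("20130113"));

            // Day 2 would write to nodeA/20130114, which does not exist,
            // but the check only looks at the job-level directory.
            boolean day2Rejected = wouldReject(jobOutputDir);
            return new boolean[] {day1Rejected, day2Rejected};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] result = simulateTwoRuns();
        System.out.println("day 1 rejected: " + result[0]);
        System.out.println("day 2 rejected: " + result[1]);
    }
}
```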

Is there any way around this? I'm given hope by this, from http://hadoop.apache.org/docs/r1.0.3/api/org/apache/hadoop/fs/FileAlreadyExistsException.html: "public class FileAlreadyExistsException extends IOException - Used when target file already exists for any operation *and is not configured to be overwritten*" (my emphasis). Is it possible to deconfigure the overwrite protection?

If not, I suppose one other way ahead is to create my own FileOutputFormat where the checkOutputSpecs() is a bit less fussy; another might be to write to a "temp" directory and programmatically move it to the desired output when the job completes successfully, although this is getting to feel a bit "hacky" to me.

Thanks for any feedback!

Tony







________________________________________
From: Harsh J [harsh@cloudera.com]
Sent: 31 August 2012 10:47
To: user@hadoop.apache.org
Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat

Good find, that OutputFormat slipped my mind. We could mention in the MultipleOutputs javadocs for the new API that LazyOutputFormat should be used for the job-level config. Could you please file a JIRA for this under the MAPREDUCE project on the Apache JIRA?

On Fri, Aug 31, 2012 at 2:32 PM, Tony Burton <TB...@sportingindex.com> wrote:
> Hi Harsh,
>
> I tried using NullOutputFormat as you suggested, however simply using
>
> job.setOutputFormatClass(NullOutputFormat.class);
>
> resulted in no output at all. Although I've not tried overriding getOutputCommitter in NullOutputFormat as you suggested, I discovered LazyOutputFormat which only writes when it has to, "the output file is created only when the first record is emitted for a given partition" (from "Hadoop: The Definitive Guide").
>
> Instead of
>
> job.setOutputFormatClass(TextOutputFormat.class);
>
> use LazyOutputFormat like this:
>
> LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
>
> So now my unnamed MultipleOutputs are handling the segmented results, and LazyOutputFormat is suppressing the default output. Good job!
>
> Tony
>
>
>
>
>
> ________________________________________
> From: Harsh J [harsh@cloudera.com]
> Sent: 29 August 2012 17:05
> To: user@hadoop.apache.org
> Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
>
> Hi Tony,
>
> On Wed, Aug 29, 2012 at 9:30 PM, Tony Burton <TB...@sportingindex.com> wrote:
>> Success so far!
>>
>> I followed the example given by Tom on the link to the MultipleOutputs.html API you suggested.
>>
>> I implemented a WordCount MR job using hadoop 1.0.3 and segmented the output depending on word length: output to directory "sml" for less than 10 characters, "med" for between 10 and 20 characters, "lrg" otherwise.
>>
>> I used out.write(key, new IntWritable(sum), generateFileName(key,
>> sum)); to write the output, and generateFileName to create the custom
>> directory name/filename. You need to provide the start of the
>> filename as well, otherwise your output files will be named -r-00000,
>> -r-00001 etc. (so, for example, return "sml/part"; etc)
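
For readers following along, the naming helper described above might look roughly like this in plain Java. It is a sketch, not the code from the original job: the class name is hypothetical, the helper takes only the word (the original also received the count), and treating "between 10 and 20" as inclusive of both ends is an assumption.

```java
final class WordLengthBuckets {

    /**
     * Returns the base output path for a word. The directory part selects the
     * bucket; the trailing "part" is the start of the file name, to which the
     * framework appends suffixes such as -r-00000.
     */
    static String generateFileName(String word) {
        int length = word.length();
        if (length < 10) {
            return "sml/part";
        }
        if (length <= 20) { // assumption: 10 and 20 both count as "med"
            return "med/part";
        }
        return "lrg/part";
    }

    public static void main(String[] args) {
        System.out.println(generateFileName("hadoop"));
        System.out.println(generateFileName("mapreducejob"));
    }
}
```

The returned string is the kind of value that would go in the last argument of MultipleOutputs.write(key, value, baseOutputPath).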
>
> Thanks for these notes, they should come in handy for those who search!
>
>> Also required: as Tom states, override Reducer.setup() to create the MultipleOutputs. However, Tom's puzzle left for the reader is that you also need to override Reducer.cleanup() and call close() on your MultipleOutputs object. Forget to do this and your segmented files will be empty.
>
> Ah yes, this is important. Not closing the files would have you wait for
> an hour for the data to become available to readers (the open writer lease
> expiry period).
>
>> One observation: although it's not the end of the world, as well as my segmented output I also get a zero-size part-r-00000 file in the base of my output path. Is there any way to prevent creation of this file?
>
> Set the OutputFormat to NullOutputFormat.
>
> In case you face issues doing this in new API (you may notice some odd
> behavior) try to extend NullOutputFormat and in its getOutputCommitter
> method i.e.
> http://hadoop.apache.org/common/docs/r1.0.3/api/org/apache/hadoop/mapreduce/lib/output/NullOutputFormat.html#getOutputCommitter(org.apache.hadoop.mapreduce.TaskAttemptContext),
> return a FileOutputCommitter object. By default it returns a no-op
> OutputCommitter that may not gel well with a file-based writer such as
> MultipleOutputs. Then set this new OutputFormat as your job's output
> format.
>
>> Thanks again Harsh for pointing the way.
>>
>> Tony
>>
>>
>>
>>
>>
>>
>>
>> -----Original Message-----
>> From: Tony Burton [mailto:TBurton@SportingIndex.com]
>> Sent: 29 August 2012 11:38
>> To: user@hadoop.apache.org
>> Subject: RE: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
>>
>> Thanks Harsh! Will try it out and report back later.
>>
>>
>> -----Original Message-----
>> From: Harsh J [mailto:harsh@cloudera.com]
>> Sent: 29 August 2012 11:12
>> To: user@hadoop.apache.org
>> Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
>>
>> Hi Tony,
>>
>> Seeing your new question, I recalled Tom's post to a user once, here:
>> https://groups.google.com/a/cloudera.org/d/msg/cdh-user/pdyVyydt5Ys/1CaLukt4v1AJ
>>
>> This specific call allows you to specify / characters in your name,
>> which get translated into the creation of directories automatically:
>> http://hadoop.apache.org/common/docs/stable/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html#write(KEYOUT,%20VALUEOUT,%20java.lang.String)
>> (The last argument is where you will need to specify the path.)
>>
>> Try it out and let us know!
>>
>> On Tue, Aug 28, 2012 at 7:06 PM, Tony Burton <TB...@sportingindex.com> wrote:
>>> Hi Harsh
>>>
>>> Thanks for the reply - my understanding is that with MultipleOutputs I can write differently named files into the same target directory. With MultipleTextOutputFormat I was able to override the target directory name to perform the segmentation, by overriding generateFileNameForKeyValue().
>>>
>>> Does the 1.0.3 MultipleOutputs give me the ability to alter the target directory name as well as the file name?
>>>
>>> Thanks,
>>>
>>> Tony
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Harsh J [mailto:harsh@cloudera.com]
>>> Sent: 28 August 2012 13:44
>>> To: user@hadoop.apache.org
>>> Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
>>>
>>> The Multiple*OutputFormat have been deprecated in favor of the
>>> generic MultipleOutputs API. Would using that instead work for you?
>>>
>>> On Tue, Aug 28, 2012 at 6:05 PM, Tony Burton <TB...@sportingindex.com> wrote:
>>>> Hi,
>>>>
>>>> I've seen that org.apache.hadoop.mapred.lib.MultipleTextOutputFormat is good for writing results into (for example) different directories created on the fly. However, now I'm implementing a MapReduce job using Hadoop 1.0.3, I see that the new API no longer supports MultipleTextOutputFormat. Is there an equivalent that I can use, or will it be supported in a future release?
>>>>
>>>> Thanks,
>>>>
>>>> Tony
>>>>
>>>>
>>>> *******************************************************************
>>>> *** This email and any attachments are confidential, protected by
>>>> copyright and may be legally privileged.  If you are not the
>>>> intended recipient, then the dissemination or copying of this email is prohibited. If you have received this in error, please notify the sender by replying by email and then delete the email completely from your system.  Neither Sporting Index nor the sender accepts responsibility for any virus, or any other defect which might affect any computer or IT system into which the email is received and/or opened.  It is the responsibility of the recipient to scan the email and no responsibility is accepted for any loss or damage arising in any way from receipt or use of this email.  Sporting Index Ltd is a company registered in England and Wales with company number 2636842, whose registered office is at Gateway House, Milverton Street, London, SE11 4AP.  Sporting Index Ltd is authorised and regulated by the UK Financial Services Authority (reg. no. 150404) and Gambling Commission (reg. no. 000-027343-R-308898-001).  Any financial promotion contained herein has been issued and approved by Sporting Index Ltd.
>>>>
>>>> Outbound email has been scanned for viruses and SPAM
>>>>
>>>
>>>
>>>
>>> --
>>> Harsh J
>>> www.sportingindex.com
>>> Inbound Email has been scanned for viruses and SPAM
>>
>>
>>
>> --
>> Harsh J
>
>
>
> --
> Harsh J



--
Harsh J



--
Alejandro



Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat

Posted by Alejandro Abdelnur <tu...@cloudera.com>.

Tony, I think the first step would be to verify whether rename in the S3
filesystem implementation works as expected.

Thx


On Fri, Feb 1, 2013 at 7:12 AM, Tony Burton <TB...@sportingindex.com>wrote:

> ** **
>
> Thanks for the reply Alejandro. Using a temp output directory was my first
> guess as well. What’s the best way to proceed? I’ve come across
> FileSystem.rename but it’s consistently returning false for whatever Paths
> I provide. Specifically, I need to copy the following:****
>
> ** **
>
> s3://<path to data>/<tmp folder>/<object type 1>/part-00000****
>
> …****
>
> s3://<path to data>/<tmp folder>/<object type 1>/part-nnnnn****
>
> s3://<path to data>/<tmp folder>/<object type 2>/part-00000****
>
> …****
>
> s3://<path to data>/<tmp folder>/<object type 2>/part-nnnnn****
>
> …****
>
> s3://<path to data>/<tmp folder>/<object type m>/part-nnnnn****
>
> ** **
>
> to ****
>
> ** **
>
> s3://<path to data>/<object type 1>/part-00000****
>
> …****
>
> s3://<path to data>/<object type 1>/part-nnnnn****
>
> s3://<path to data>/<object type 2>/part-00000****
>
> …****
>
> s3://<path to data>/<object type 2>/part-nnnnn****
>
> …****
>
> s3://<path to data>/<object type m>/part-nnnnn****
>
> ** **
>
> without doing a copyToLocal.****
>
> ** **
>
> Any tips? Are there any better alternatives to FileSystem.rename? Or would
> using the AWS Java SDK be a better solution?****
>
> ** **
>
> Thanks!****
>
> ** **
>
> Tony****
>
> ** **
>
> ** **
>
> ** **
>
> ** **
>
> ** **
>
> ** **
>
> *From:* Alejandro Abdelnur [mailto:tucu@cloudera.com]
> *Sent:* 31 January 2013 18:45
> *To:* common-user@hadoop.apache.org
>
> *Subject:* Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat****
>
> ** **
>
> Hi Tony, from what i understand your prob is not with MTOF but with you
> wanting to run 2 jobs using the same output directory, the second job will
> fail because the output dir already existed. My take would be tweaking your
> jobs to use a temp output dir, and moving them to the required (final)
> location upon completion.****
>
> ** **
>
> thx****
>
> ** **
>
> ** **
>
> On Thu, Jan 31, 2013 at 8:22 AM, Tony Burton <TB...@sportingindex.com>
> wrote:****
>
> Hi everyone,
>
> Some of you might recall this topic, which I worked on with the list's
> help back in August last year - see email trail below. Despite initial
> success of the discovery, I had the shelve the approach as I ended up using
> a different solution (for reasons I forget!) with the implementation that
> was ultimately used for that particular project.
>
> I'm now in a position to be working on a similar new task, where I've
> successfully implemented the combination of LazyOutputFormat and
> MultipleOutputs using hadoop 1.0.3 to write out to multiple custom output
> locations. However, I've hit another snag which I'm hoping you might help
> me work through.
>
> I'm going to be running daily tasks to extract data from XML files
> (specifically, the data stored in certain nodes of the XML), stored on AWS
> S3 using object names with the following format:
>
> s3://inputbucket/data/2013/1/13/<list of xml data files.bz2>
>
> I want to extract items from the XML and write out as follows:
>
> s3://outputbucket/path/<xml node name>/20130113/<output from MR job>
>
> For one day of data, this works fine. I pass in s3://inputbucket/data and
> s3://outputbucket/path as input and output arguments, along with my run
> date (20130113) which gets manipulated and appended where appropriate to
> form the precise read and write locations, for example
>
> FileInputFormat.setInputhPath(job, " s3://inputbucket/data");
> FileOutputFormat.setOutputPath(job, "s3://outputbucket/path");
>
> Then MultipleOutputs adds on my XML node names underneath
> s3://outputbucket/path automatically.
>
> However, for the next day's run, the job gets to
> FileOutputFormat.setOutputPath and sees that the output path
> (s3://outputbucket/path) already exists, and throws a
> FileAlreadyExistsException from FileOutputFormat.checkOutputSpecs() - even
> though my ultimate subdirectory, to be constructed by MultipleOutputs does
> not already exist.
>
> Is there any way around this? I'm given hope by this, from
> http://hadoop.apache.org/docs/r1.0.3/api/org/apache/hadoop/fs/FileAlreadyExistsException.html:
> "public class FileAlreadyExistsException extends IOException - Used when
> target file already exists for any operation *and is not configured to be
> overwritten*" (my emphasis). Is it possible to deconfigure the overwrite
> protection?
>
> If not, I suppose one other way ahead is to create my own FileOutputFormat
> where the checkOutputSpecs() is a bit less fussy; another might be to write
> to a "temp" directory and programmatically move it to the desired output
> when the job completes successfully, although this is getting to feel a bit
> "hacky" to me.
>
> Thanks for any feedback!
>
> Tony
>
>
>
>
>
>
>
> ________________________________________
> From: Harsh J [harsh@cloudera.com]
> Sent: 31 August 2012 10:47
> To: user@hadoop.apache.org
> Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
>
> Good finding, that OF slipped my mind. We can mention on the
> MultipleOutputs javadocs for the new API to use the LazyOutputFormat for
> the job-level config. Please file a JIRA for this under MAPREDUCE project
> on the Apache JIRA?
>
> On Fri, Aug 31, 2012 at 2:32 PM, Tony Burton <TB...@sportingindex.com>
> wrote:
> > Hi Harsh,
> >
> > I tried using NullOutputFormat as you suggested, however simply using
> >
> > job.setOutputFormatClass(NullOutputFormat.class);
> >
> > resulted in no output at all. Although I've not tried overriding
> getOutputCommitter in NullOutputFormat as you suggested, I discovered
> LazyOutputFormat which only writes when it has to, "the output file is
> created only when the first record is emitted for a given partition" (from
> "Hadoop: The Definitive Guide").
> >
> > Instead of
> >
> > job.setOutputFormatClass(TextOutputFormat.class);
> >
> > use LazyOutputFormat like this:
> >
> > LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
> >
> > So now my unnamed MultipleOutputs are handling to segmented results, and
> LazyOutputFormat is suppressing the default output. Good job!
> >
> > Tony
> >
> >
> >
> >
> >
> > ________________________________________
> > From: Harsh J [harsh@cloudera.com]
> > Sent: 29 August 2012 17:05
> > To: user@hadoop.apache.org
> > Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
> >
> > Hi Tony,
> >
> > On Wed, Aug 29, 2012 at 9:30 PM, Tony Burton <TB...@sportingindex.com>
> wrote:
> >> Success so far!
> >>
> >> I followed the example given by Tom on the link to the
> MultipleOutputs.html API you suggested.
> >>
> >> I implemented a WordCount MR job using hadoop 1.0.3 and segmented the
> output depending on word length: output to directory "sml" for less than 10
> characters, "med" for between 10 and 20 characters, "lrg" otherwise.
> >>
> >> I used out.write(key, new IntWritable(sum), generateFilename(key,
> >> sum)); to write the output, and generateFileName to create the custom
> >> directory name/filename. You need to provide the start of the
> >> filename as well otherwise your output files will be -r-00000,
> >> -r-00001 etc. (so, for example, return "sml/part"; etc)
> >
> > Thanks for these notes, should come helpful for those who search!
> >
> >> Also required: as Tom states, override Reducer.setup() to create the
> MultipleOutputs. However, Tom's puzzle left for the reader is that you also
> need to override Reducer.cleanup() and call close() on your MultipleOutputs
> object. Forget to do this and your segmented files will be empty.
> >
> > Ah yes this is important. Non closure of files would have you wait for
> > an hour for data to get available to readers (open writer lease expiry
> > period).
> >
> >> One observation: although it's not the end of the world, as well as my
> segmented output I also get a zero-size part-r-00000 file in the base of my
> output path. Is there any way to prevent creation of this file?
> >
> > Set the OutputFormat to NullOutputFormat.
> >
> > In case you face issues doing this in new API (you may notice some odd
> > behavior) try to extend NullOutputFormat and in its getOutputCommitter
> > method i.e.
> > http://hadoop.apache.org/common/docs/r1.0.3/api/org/apache/hadoop/mapr
> > educe/lib/output/NullOutputFormat.html#getOutputCommitter(org.apache.h
> > adoop.mapreduce.TaskAttemptContext),
> > return a FileOutputCommitter object. By default it returns a no-op
> > OutputCommitter that may not gel well with a file-based writer such as
> > MultipleOutputs. Then set this new OutputFormat as your job's output
> > format.
> >
> >> Thanks again Harsh for pointing the way.
> >>
> >> Tony
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> -----Original Message-----
> >> From: Tony Burton [mailto:TBurton@SportingIndex.com]
> >> Sent: 29 August 2012 11:38
> >> To: user@hadoop.apache.org
> >> Subject: RE: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
> >>
> >> Thanks Harsh! Will try it out and report back later.
> >>
> >>
> >> -----Original Message-----
> >> From: Harsh J [mailto:harsh@cloudera.com]
> >> Sent: 29 August 2012 11:12
> >> To: user@hadoop.apache.org
> >> Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
> >>
> >> Hi Tony,
> >>
> >> Seeing your new question, I recalled Tom's post to a user once, here:
> >> https://groups.google.com/a/cloudera.org/d/msg/cdh-user/pdyVyydt5Ys/1
> >> CaLukt4v1AJ
> >>
> >> This specific call allows you to specify / characters in your name,
> >> that gets translated into creation of directories automatically:
> >> http://hadoop.apache.org/common/docs/stable/api/org/apache/hadoop/map
> >> reduce/lib/output/MultipleOutputs.html#write(KEYOUT,%20VALUEOUT,%20ja
> >> va.lang.String) (The last argument is where you will need to specify
> >> the path)
> >>
> >> Try it out and let us know!
> >>
> >> On Tue, Aug 28, 2012 at 7:06 PM, Tony Burton <TB...@sportingindex.com>
> wrote:
> >>> Hi Harsh
> >>>
> >>> Thanks for the reply - my understanding is that with MultipleOutputs I
> can write differently named files into the same target directory. With
> MultipleTextOutputFormat I was able to override the target directory name
> to perform the segmentation, by overriding generateFileNameForKeyValue().
> >>>
> >>> Does the 1.0.3 MultipleOutputs give me the ability to alter the target
> directory name as well as the file name?
> >>>
> >>> Thanks,
> >>>
> >>> Tony
> >>>
> >>>
> >>>
> >>> -----Original Message-----
> >>> From: Harsh J [mailto:harsh@cloudera.com]
> >>> Sent: 28 August 2012 13:44
> >>> To: user@hadoop.apache.org
> >>> Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
> >>>
> >>> The Multiple*OutputFormat have been deprecated in favor of the
> >>> generic MultipleOutputs API. Would using that instead work for you?
> >>>
> >>> On Tue, Aug 28, 2012 at 6:05 PM, Tony Burton <
> TBurton@sportingindex.com> wrote:
> >>>> Hi,
> >>>>
> >>>> I've seen that org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
> is good for writing results into (for example) different directories
> created on the fly. However, now I'm implementing a MapReduce job using
> Hadoop 1.0.3, I see that the new API no longer supports
> MultipleTextOutputFormat. Is there an equivalent that I can use, or will it
> be supported in a future release?
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Tony
> >>>>
> >>>>
> >>>> *******************************************************************
> >>>> *** This email and any attachments are confidential, protected by
> >>>> copyright and may be legally privileged.  If you are not the
> >>>> intended recipient, then the dissemination or copying of this email
> is prohibited. If you have received this in error, please notify the sender
> by replying by email and then delete the email completely from your system.
>  Neither Sporting Index nor the sender accepts responsibility for any
> virus, or any other defect which might affect any computer or IT system
> into which the email is received and/or opened.  It is the responsibility
> of the recipient to scan the email and no responsibility is accepted for
> any loss or damage arising in any way from receipt or use of this email.
>  Sporting Index Ltd is a company registered in England and Wales with
> company number 2636842, whose registered office is at Gateway House,
> Milverton Street, London, SE11 4AP.  Sporting Index Ltd is authorised and
> regulated by the UK Financial Services Authority (reg. no. 150404) and
> Gambling Commission (reg. no. 000-027343-R-308898-001).  Any financial
> promotion contained herein has been issued and approved by Sporting Index
> Ltd.
> >>>>
> >>>> Outbound email has been scanned for viruses and SPAM
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> Harsh J
> >>
> >>
> >>
> >> --
> >> Harsh J
> >
> >
> >
> > --
> > Harsh J
>
>
>
> --
> Harsh J
>
> --
> Alejandro
>



-- 
Alejandro

Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat

Posted by Alejandro Abdelnur <tu...@cloudera.com>.

Tony, I think the first step would be to verify if the S3 filesystem
implementation rename works as expected.
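
One thing that may be worth ruling out first (an assumption on my part, not a confirmed diagnosis for this thread): some Hadoop FileSystem implementations return false from rename, rather than throwing, when the source is missing or the destination's parent directory does not exist yet. The sketch below uses only plain Java strings, so it makes no Hadoop or S3 calls; it just works out, for each part file under the temp folder, the final path and the parent that would need to exist before a rename. The class name TempOutputPaths, the bucket name and the _tmp folder name are all made up for illustration.

```java
/** Hypothetical helper: maps part files under a temp folder to their final paths. */
public class TempOutputPaths {

    /**
     * Drops the first path segment equal to tmpFolder, so that
     * base/tmp/typeA/part-00000 becomes base/typeA/part-00000.
     * Returns the input unchanged when tmpFolder is not a segment of it.
     */
    public static String finalPathFor(String tmpPath, String tmpFolder) {
        String marker = "/" + tmpFolder + "/";
        int at = tmpPath.indexOf(marker);
        if (at < 0) {
            return tmpPath;
        }
        // Keep everything before the temp segment, add one '/', then the rest.
        return tmpPath.substring(0, at) + "/" + tmpPath.substring(at + marker.length());
    }

    /** Parent "directory" of a path: everything before the last '/'. */
    public static String parentOf(String path) {
        return path.substring(0, path.lastIndexOf('/'));
    }

    public static void main(String[] args) {
        // Illustrative paths only; the real bucket and folder names are not known.
        String[] sources = {
            "s3://bucket/data/_tmp/typeA/part-00000",
            "s3://bucket/data/_tmp/typeB/part-00000"
        };
        for (String src : sources) {
            String dst = finalPathFor(src, "_tmp");
            System.out.println(src + " -> " + dst + " (parent: " + parentOf(dst) + ")");
        }
    }
}
```

Each source/destination pair could then be passed to FileSystem.rename(src, dst), after checking fs.exists(src) and creating the destination parent with fs.mkdirs if it is missing. Whether that is enough on S3 would still need to be tested.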

Thx


On Fri, Feb 1, 2013 at 7:12 AM, Tony Burton <TB...@sportingindex.com> wrote:

> Thanks for the reply Alejandro. Using a temp output directory was my first
> guess as well. What’s the best way to proceed? I’ve come across
> FileSystem.rename but it’s consistently returning false for whatever Paths
> I provide. Specifically, I need to copy the following:
>
> s3://<path to data>/<tmp folder>/<object type 1>/part-00000
> …
> s3://<path to data>/<tmp folder>/<object type 1>/part-nnnnn
> s3://<path to data>/<tmp folder>/<object type 2>/part-00000
> …
> s3://<path to data>/<tmp folder>/<object type 2>/part-nnnnn
> …
> s3://<path to data>/<tmp folder>/<object type m>/part-nnnnn
>
> to
>
> s3://<path to data>/<object type 1>/part-00000
> …
> s3://<path to data>/<object type 1>/part-nnnnn
> s3://<path to data>/<object type 2>/part-00000
> …
> s3://<path to data>/<object type 2>/part-nnnnn
> …
> s3://<path to data>/<object type m>/part-nnnnn
>
> without doing a copyToLocal.
>
> Any tips? Are there any better alternatives to FileSystem.rename? Or would
> using the AWS Java SDK be a better solution?
>
> Thanks!
>
> Tony
>
> From: Alejandro Abdelnur [mailto:tucu@cloudera.com]
> Sent: 31 January 2013 18:45
> To: common-user@hadoop.apache.org
> Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
>
> Hi Tony, from what I understand your problem is not with MTOF but with
> wanting to run 2 jobs using the same output directory: the second job will
> fail because the output dir already exists. My take would be tweaking your
> jobs to use a temp output dir, and moving the output to the required (final)
> location upon completion.
>
> thx
>
> On Thu, Jan 31, 2013 at 8:22 AM, Tony Burton <TB...@sportingindex.com> wrote:
>
> Hi everyone,
>
> Some of you might recall this topic, which I worked on with the list's
> help back in August last year - see email trail below. Despite the initial
> success of the discovery, I had to shelve the approach as I ended up using
> a different solution (for reasons I forget!) in the implementation that
> was ultimately used for that particular project.
>
> I'm now in a position to be working on a similar new task, where I've
> successfully implemented the combination of LazyOutputFormat and
> MultipleOutputs using hadoop 1.0.3 to write out to multiple custom output
> locations. However, I've hit another snag which I'm hoping you might help
> me work through.
>
> I'm going to be running daily tasks to extract data from XML files
> (specifically, the data stored in certain nodes of the XML), stored on AWS
> S3 using object names with the following format:
>
> s3://inputbucket/data/2013/1/13/<list of xml data files.bz2>
>
> I want to extract items from the XML and write out as follows:
>
> s3://outputbucket/path/<xml node name>/20130113/<output from MR job>
>
> For one day of data, this works fine. I pass in s3://inputbucket/data and
> s3://outputbucket/path as input and output arguments, along with my run
> date (20130113) which gets manipulated and appended where appropriate to
> form the precise read and write locations, for example
>
> FileInputFormat.setInputPaths(job, "s3://inputbucket/data");
> FileOutputFormat.setOutputPath(job, new Path("s3://outputbucket/path"));
>
> Then MultipleOutputs adds on my XML node names underneath
> s3://outputbucket/path automatically.
>
> However, for the next day's run, the job gets to
> FileOutputFormat.setOutputPath and sees that the output path
> (s3://outputbucket/path) already exists, and throws a
> FileAlreadyExistsException from FileOutputFormat.checkOutputSpecs() - even
> though my ultimate subdirectory, to be constructed by MultipleOutputs, does
> not already exist.
>
> Is there any way around this? I'm given hope by this, from
> http://hadoop.apache.org/docs/r1.0.3/api/org/apache/hadoop/fs/FileAlreadyExistsException.html:
> "public class FileAlreadyExistsException extends IOException - Used when
> target file already exists for any operation *and is not configured to be
> overwritten*" (my emphasis). Is it possible to deconfigure the overwrite
> protection?
>
> If not, I suppose one other way ahead is to create my own FileOutputFormat
> where the checkOutputSpecs() is a bit less fussy; another might be to write
> to a "temp" directory and programmatically move it to the desired output
> when the job completes successfully, although this is getting to feel a bit
> "hacky" to me.
>
> Thanks for any feedback!
>
> Tony
>
>
>
>
>
>
>
> ________________________________________
> From: Harsh J [harsh@cloudera.com]
> Sent: 31 August 2012 10:47
> To: user@hadoop.apache.org
> Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
>
> Good finding, that OF slipped my mind. We can mention on the
> MultipleOutputs javadocs for the new API to use the LazyOutputFormat for
> the job-level config. Please file a JIRA for this under MAPREDUCE project
> on the Apache JIRA?
>
> On Fri, Aug 31, 2012 at 2:32 PM, Tony Burton <TB...@sportingindex.com>
> wrote:
> > Hi Harsh,
> >
> > I tried using NullOutputFormat as you suggested, however simply using
> >
> > job.setOutputFormatClass(NullOutputFormat.class);
> >
> > resulted in no output at all. Although I've not tried overriding
> getOutputCommitter in NullOutputFormat as you suggested, I discovered
> LazyOutputFormat which only writes when it has to, "the output file is
> created only when the first record is emitted for a given partition" (from
> "Hadoop: The Definitive Guide").
> >
> > Instead of
> >
> > job.setOutputFormatClass(TextOutputFormat.class);
> >
> > use LazyOutputFormat like this:
> >
> > LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
> >
> > So now my unnamed MultipleOutputs are handling the segmented results, and
> LazyOutputFormat is suppressing the default output. Good job!
> >
> > Tony
> >
> >
> >
> >
> >
> > ________________________________________
> > From: Harsh J [harsh@cloudera.com]
> > Sent: 29 August 2012 17:05
> > To: user@hadoop.apache.org
> > Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
> >
> > Hi Tony,
> >
> > On Wed, Aug 29, 2012 at 9:30 PM, Tony Burton <TB...@sportingindex.com>
> wrote:
> >> Success so far!
> >>
> >> I followed the example given by Tom on the link to the
> MultipleOutputs.html API you suggested.
> >>
> >> I implemented a WordCount MR job using hadoop 1.0.3 and segmented the
> output depending on word length: output to directory "sml" for less than 10
> characters, "med" for between 10 and 20 characters, "lrg" otherwise.
> >>
> >> I used out.write(key, new IntWritable(sum), generateFilename(key,
> >> sum)); to write the output, and generateFileName to create the custom
> >> directory name/filename. You need to provide the start of the
> >> filename as well otherwise your output files will be -r-00000,
> >> -r-00001 etc. (so, for example, return "sml/part"; etc)
> >
> > Thanks for these notes, they should prove helpful for those who search!
> >
> >> Also required: as Tom states, override Reducer.setup() to create the
> MultipleOutputs. However, Tom's puzzle left for the reader is that you also
> need to override Reducer.cleanup() and call close() on your MultipleOutputs
> object. Forget to do this and your segmented files will be empty.
> >
> > Ah yes this is important. Non closure of files would have you wait for
> > an hour for data to get available to readers (open writer lease expiry
> > period).
> >
> >> One observation: although it's not the end of the world, as well as my
> segmented output I also get a zero-size part-r-00000 file in the base of my
> output path. Is there any way to prevent creation of this file?
> >
> > Set the OutputFormat to NullOutputFormat.
> >
> > In case you face issues doing this in new API (you may notice some odd
> > behavior) try to extend NullOutputFormat and in its getOutputCommitter
> > method i.e.
> > http://hadoop.apache.org/common/docs/r1.0.3/api/org/apache/hadoop/mapreduce/lib/output/NullOutputFormat.html#getOutputCommitter(org.apache.hadoop.mapreduce.TaskAttemptContext),
> > return a FileOutputCommitter object. By default it returns a no-op
> > OutputCommitter that may not gel well with a file-based writer such as
> > MultipleOutputs. Then set this new OutputFormat as your job's output
> > format.
> >
> >> Thanks again Harsh for pointing the way.
> >>
> >> Tony
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> -----Original Message-----
> >> From: Tony Burton [mailto:TBurton@SportingIndex.com]
> >> Sent: 29 August 2012 11:38
> >> To: user@hadoop.apache.org
> >> Subject: RE: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
> >>
> >> Thanks Harsh! Will try it out and report back later.
> >>
> >>
> >> -----Original Message-----
> >> From: Harsh J [mailto:harsh@cloudera.com]
> >> Sent: 29 August 2012 11:12
> >> To: user@hadoop.apache.org
> >> Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
> >>
> >> Hi Tony,
> >>
> >> Seeing your new question, I recalled Tom's post to a user once, here:
> >> https://groups.google.com/a/cloudera.org/d/msg/cdh-user/pdyVyydt5Ys/1CaLukt4v1AJ
> >>
> >> This specific call allows you to specify / characters in your name,
> >> which get translated into the creation of directories automatically:
> >> http://hadoop.apache.org/common/docs/stable/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html#write(KEYOUT,%20VALUEOUT,%20java.lang.String)
> >> (The last argument is where you will need to specify the path)
> >>
> >> Try it out and let us know!
> >>
> >> On Tue, Aug 28, 2012 at 7:06 PM, Tony Burton <TB...@sportingindex.com>
> wrote:
> >>> Hi Harsh
> >>>
> >>> Thanks for the reply - my understanding is that with MultipleOutputs I
> can write differently named files into the same target directory. With
> MultipleTextOutputFormat I was able to override the target directory name
> to perform the segmentation, by overriding generateFileNameForKeyValue().
> >>>
> >>> Does the 1.0.3 MultipleOutputs give me the ability to alter the target
> directory name as well as the file name?
> >>>
> >>> Thanks,
> >>>
> >>> Tony
> >>>
> >>>
> >>>
> >>> -----Original Message-----
> >>> From: Harsh J [mailto:harsh@cloudera.com]
> >>> Sent: 28 August 2012 13:44
> >>> To: user@hadoop.apache.org
> >>> Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
> >>>
> >>> The Multiple*OutputFormat have been deprecated in favor of the
> >>> generic MultipleOutputs API. Would using that instead work for you?
> >>>
> >>> On Tue, Aug 28, 2012 at 6:05 PM, Tony Burton <
> TBurton@sportingindex.com> wrote:
> >>>> Hi,
> >>>>
> >>>> I've seen that org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
> is good for writing results into (for example) different directories
> created on the fly. However, now I'm implementing a MapReduce job using
> Hadoop 1.0.3, I see that the new API no longer supports
> MultipleTextOutputFormat. Is there an equivalent that I can use, or will it
> be supported in a future release?
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Tony
> >>>>
> >>>>
> >>>> *******************************************************************
> >>>> *** This email and any attachments are confidential, protected by
> >>>> copyright and may be legally privileged.  If you are not the
> >>>> intended recipient, then the dissemination or copying of this email
> is prohibited. If you have received this in error, please notify the sender
> by replying by email and then delete the email completely from your system.
>  Neither Sporting Index nor the sender accepts responsibility for any
> virus, or any other defect which might affect any computer or IT system
> into which the email is received and/or opened.  It is the responsibility
> of the recipient to scan the email and no responsibility is accepted for
> any loss or damage arising in any way from receipt or use of this email.
>  Sporting Index Ltd is a company registered in England and Wales with
> company number 2636842, whose registered office is at Gateway House,
> Milverton Street, London, SE11 4AP.  Sporting Index Ltd is authorised and
> regulated by the UK Financial Services Authority (reg. no. 150404) and
> Gambling Commission (reg. no. 000-027343-R-308898-001).  Any financial
> promotion contained herein has been issued and approved by Sporting Index
> Ltd.
> >>>>
> >>>> Outbound email has been scanned for viruses and SPAM
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> Harsh J
> >>> www.sportingindex.com
> >>> Inbound Email has been scanned for viruses and SPAM
> >>
> >>
> >>
> >> --
> >> Harsh J
> >
> >
> >
> > --
> > Harsh J
>
>
>
> --
> Harsh J
>
> --
> Alejandro



-- 
Alejandro
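
The word-length segmentation Tony describes in the quoted August 2012 message (directory "sml" for words under 10 characters, "med" for 10 to 20, "lrg" otherwise, each with a "part" filename prefix) can be sketched in plain Java. This is only an illustration: the class name OutputSegments and the single String parameter are assumptions (the original took the Hadoop key and the sum), and treating lengths 10 and 20 as "med" is a guess at the intended boundaries. The returned string is the kind of value that would be passed as the last argument of MultipleOutputs.write().

```java
// Hypothetical helper, not the code from the thread: maps a word to the
// base output path ("<directory>/part") used to segment reducer output.
final class OutputSegments {

    // Assumption: lengths 10 and 20 both count as "med".
    static String generateFileName(String word) {
        int length = word.length();
        if (length < 10) {
            return "sml/part";
        } else if (length <= 20) {
            return "med/part";
        }
        return "lrg/part";
    }

    public static void main(String[] args) {
        String[] words = {"hadoop", "mapreduces", "antidisestablishmentarianism"};
        for (String word : words) {
            System.out.println(word + " -> " + generateFileName(word));
        }
    }
}
```

Including the "part" prefix in the returned path matters because, as noted in the thread, leaving it out produces files named -r-00000, -r-00001 and so on.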

Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat

Posted by Alejandro Abdelnur <tu...@cloudera.com>.

Tony, I think the first step would be to verify if the S3 filesystem
implementation rename works as expected.

Thx
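
One way to check rename in isolation is to try it on a single part file first. For the move described in the quoted message, the destination of each file is its source path with the temp folder segment removed. A minimal sketch of that mapping in plain Java follows; TempOutputPaths, the "_tmp" folder and the "nodeA" object type are made-up names for illustration, and nothing here calls Hadoop or S3.

```java
// Hypothetical helper, not part of Hadoop or the AWS SDK: computes where a
// part file written under a temp folder should end up.
final class TempOutputPaths {

    // Removes the first "/<tmpFolder>/" segment from the object path.
    static String finalLocation(String tmpObject, String tmpFolder) {
        String marker = "/" + tmpFolder + "/";
        int at = tmpObject.indexOf(marker);
        if (at < 0) {
            throw new IllegalArgumentException(tmpObject + " is not under " + tmpFolder);
        }
        return tmpObject.substring(0, at) + "/" + tmpObject.substring(at + marker.length());
    }

    public static void main(String[] args) {
        String source = "s3://outputbucket/path/_tmp/nodeA/part-00000";
        System.out.println(source + " -> " + finalLocation(source, "_tmp"));
    }
}
```

Each (source, destination) pair could then be handed to FileSystem.rename, checking its boolean result. Whether the destination's parent directory has to exist beforehand on the S3 filesystem is something to verify rather than assume.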


On Fri, Feb 1, 2013 at 7:12 AM, Tony Burton <TB...@sportingindex.com> wrote:

> Thanks for the reply Alejandro. Using a temp output directory was my first
> guess as well. What’s the best way to proceed? I’ve come across
> FileSystem.rename but it’s consistently returning false for whatever Paths
> I provide. Specifically, I need to copy the following:
>
> s3://<path to data>/<tmp folder>/<object type 1>/part-00000
> …
> s3://<path to data>/<tmp folder>/<object type 1>/part-nnnnn
> s3://<path to data>/<tmp folder>/<object type 2>/part-00000
> …
> s3://<path to data>/<tmp folder>/<object type 2>/part-nnnnn
> …
> s3://<path to data>/<tmp folder>/<object type m>/part-nnnnn
>
> to
>
> s3://<path to data>/<object type 1>/part-00000
> …
> s3://<path to data>/<object type 1>/part-nnnnn
> s3://<path to data>/<object type 2>/part-00000
> …
> s3://<path to data>/<object type 2>/part-nnnnn
> …
> s3://<path to data>/<object type m>/part-nnnnn
>
> without doing a copyToLocal.
>
> Any tips? Are there any better alternatives to FileSystem.rename? Or would
> using the AWS Java SDK be a better solution?
>
> Thanks!
>
> Tony
>
> From: Alejandro Abdelnur [mailto:tucu@cloudera.com]
> Sent: 31 January 2013 18:45
> To: common-user@hadoop.apache.org
> Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
>
> Hi Tony, from what I understand your problem is not with MTOF but with you
> wanting to run 2 jobs using the same output directory; the second job will
> fail because the output dir already exists. My take would be tweaking your
> jobs to use a temp output dir, and moving the results to the required (final)
> location upon completion.
>
> thx
>
> On Thu, Jan 31, 2013 at 8:22 AM, Tony Burton <TB...@sportingindex.com>
> wrote:
>
> Hi everyone,
>
> Some of you might recall this topic, which I worked on with the list's
> help back in August last year - see email trail below. Despite initial
> success of the discovery, I had to shelve the approach as I ended up using
> a different solution (for reasons I forget!) with the implementation that
> was ultimately used for that particular project.
>
> I'm now in a position to be working on a similar new task, where I've
> successfully implemented the combination of LazyOutputFormat and
> MultipleOutputs using hadoop 1.0.3 to write out to multiple custom output
> locations. However, I've hit another snag which I'm hoping you might help
> me work through.
>
> I'm going to be running daily tasks to extract data from XML files
> (specifically, the data stored in certain nodes of the XML), stored on AWS
> S3 using object names with the following format:
>
> s3://inputbucket/data/2013/1/13/<list of xml data files.bz2>
>
> I want to extract items from the XML and write out as follows:
>
> s3://outputbucket/path/<xml node name>/20130113/<output from MR job>
>
> For one day of data, this works fine. I pass in s3://inputbucket/data and
> s3://outputbucket/path as input and output arguments, along with my run
> date (20130113) which gets manipulated and appended where appropriate to
> form the precise read and write locations, for example
>
> FileInputFormat.setInputPaths(job, "s3://inputbucket/data");
> FileOutputFormat.setOutputPath(job, new Path("s3://outputbucket/path"));
>
> Then MultipleOutputs adds on my XML node names underneath
> s3://outputbucket/path automatically.
>
> However, for the next day's run, the job gets to
> FileOutputFormat.setOutputPath and sees that the output path
> (s3://outputbucket/path) already exists, and throws a
> FileAlreadyExistsException from FileOutputFormat.checkOutputSpecs() - even
> though my ultimate subdirectory, to be constructed by MultipleOutputs, does
> not already exist.
>
> Is there any way around this? I'm given hope by this, from
> http://hadoop.apache.org/docs/r1.0.3/api/org/apache/hadoop/fs/FileAlreadyExistsException.html:
> "public class FileAlreadyExistsException extends IOException - Used when
> target file already exists for any operation *and is not configured to be
> overwritten*" (my emphasis). Is it possible to deconfigure the overwrite
> protection?
>
> If not, I suppose one other way ahead is to create my own FileOutputFormat
> where the checkOutputSpecs() is a bit less fussy; another might be to write
> to a "temp" directory and programmatically move it to the desired output
> when the job completes successfully, although this is getting to feel a bit
> "hacky" to me.
>
> Thanks for any feedback!
>
> Tony
>
>
>
>
>
>
>
> ________________________________________
> From: Harsh J [harsh@cloudera.com]
> Sent: 31 August 2012 10:47
> To: user@hadoop.apache.org
> Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
>
> Good finding, that OF slipped my mind. We can mention on the
> MultipleOutputs javadocs for the new API to use the LazyOutputFormat for
> the job-level config. Please file a JIRA for this under MAPREDUCE project
> on the Apache JIRA?
>
> On Fri, Aug 31, 2012 at 2:32 PM, Tony Burton <TB...@sportingindex.com>
> wrote:
> > Hi Harsh,
> >
> > I tried using NullOutputFormat as you suggested, however simply using
> >
> > job.setOutputFormatClass(NullOutputFormat.class);
> >
> > resulted in no output at all. Although I've not tried overriding
> getOutputCommitter in NullOutputFormat as you suggested, I discovered
> LazyOutputFormat which only writes when it has to, "the output file is
> created only when the first record is emitted for a given partition" (from
> "Hadoop: The Definitive Guide").
> >
> > Instead of
> >
> > job.setOutputFormatClass(TextOutputFormat.class);
> >
> > use LazyOutputFormat like this:
> >
> > LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
> >
> > So now my unnamed MultipleOutputs are handling the segmented results, and
> LazyOutputFormat is suppressing the default output. Good job!
> >
> > Tony
> >
> >
> >
> >
> >
> > ________________________________________
> > From: Harsh J [harsh@cloudera.com]
> > Sent: 29 August 2012 17:05
> > To: user@hadoop.apache.org
> > Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
> >
> > Hi Tony,
> >
> > On Wed, Aug 29, 2012 at 9:30 PM, Tony Burton <TB...@sportingindex.com>
> wrote:
> >> Success so far!
> >>
> >> I followed the example given by Tom on the link to the
> MultipleOutputs.html API you suggested.
> >>
> >> I implemented a WordCount MR job using hadoop 1.0.3 and segmented the
> output depending on word length: output to directory "sml" for less than 10
> characters, "med" for between 10 and 20 characters, "lrg" otherwise.
> >>
> >> I used out.write(key, new IntWritable(sum), generateFilename(key,
> >> sum)); to write the output, and generateFileName to create the custom
> >> directory name/filename. You need to provide the start of the
> >> filename as well otherwise your output files will be -r-00000,
> >> -r-00001 etc. (so, for example, return "sml/part"; etc)
> >
> > Thanks for these notes, they should prove helpful for those who search!
> >
> >> Also required: as Tom states, override Reducer.setup() to create the
> MultipleOutputs. However, Tom's puzzle left for the reader is that you also
> need to override Reducer.cleanup() and call close() on your MultipleOutputs
> object. Forget to do this and your segmented files will be empty.
> >
> > Ah yes this is important. Non closure of files would have you wait for
> > an hour for data to get available to readers (open writer lease expiry
> > period).
> >
> >> One observation: although it's not the end of the world, as well as my
> segmented output I also get a zero-size part-r-00000 file in the base of my
> output path. Is there any way to prevent creation of this file?
> >
> > Set the OutputFormat to NullOutputFormat.
> >
> > In case you face issues doing this in new API (you may notice some odd
> > behavior) try to extend NullOutputFormat and in its getOutputCommitter
> > method i.e.
> > http://hadoop.apache.org/common/docs/r1.0.3/api/org/apache/hadoop/mapreduce/lib/output/NullOutputFormat.html#getOutputCommitter(org.apache.hadoop.mapreduce.TaskAttemptContext),
> > return a FileOutputCommitter object. By default it returns a no-op
> > OutputCommitter that may not gel well with a file-based writer such as
> > MultipleOutputs. Then set this new OutputFormat as your job's output
> > format.
> >
> >> Thanks again Harsh for pointing the way.
> >>
> >> Tony
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> -----Original Message-----
> >> From: Tony Burton [mailto:TBurton@SportingIndex.com]
> >> Sent: 29 August 2012 11:38
> >> To: user@hadoop.apache.org
> >> Subject: RE: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
> >>
> >> Thanks Harsh! Will try it out and report back later.
> >>
> >>
> >> -----Original Message-----
> >> From: Harsh J [mailto:harsh@cloudera.com]
> >> Sent: 29 August 2012 11:12
> >> To: user@hadoop.apache.org
> >> Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
> >>
> >> Hi Tony,
> >>
> >> Seeing your new question, I recalled Tom's post to a user once, here:
> >> https://groups.google.com/a/cloudera.org/d/msg/cdh-user/pdyVyydt5Ys/1CaLukt4v1AJ
> >>
> >> This specific call allows you to specify / characters in your name,
> >> which get translated into the creation of directories automatically:
> >> http://hadoop.apache.org/common/docs/stable/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html#write(KEYOUT,%20VALUEOUT,%20java.lang.String)
> >> (The last argument is where you will need to specify the path)
> >>
> >> Try it out and let us know!
> >>
> >> On Tue, Aug 28, 2012 at 7:06 PM, Tony Burton <TB...@sportingindex.com>
> wrote:
> >>> Hi Harsh
> >>>
> >>> Thanks for the reply - my understanding is that with MultipleOutputs I
> can write differently named files into the same target directory. With
> MultipleTextOutputFormat I was able to override the target directory name
> to perform the segmentation, by overriding generateFileNameForKeyValue().
> >>>
> >>> Does the 1.0.3 MultipleOutputs give me the ability to alter the target
> directory name as well as the file name?
> >>>
> >>> Thanks,
> >>>
> >>> Tony
> >>>
> >>>
> >>>
> >>> -----Original Message-----
> >>> From: Harsh J [mailto:harsh@cloudera.com]
> >>> Sent: 28 August 2012 13:44
> >>> To: user@hadoop.apache.org
> >>> Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
> >>>
> >>> The Multiple*OutputFormat have been deprecated in favor of the
> >>> generic MultipleOutputs API. Would using that instead work for you?
> >>>
> >>> On Tue, Aug 28, 2012 at 6:05 PM, Tony Burton <
> TBurton@sportingindex.com> wrote:
> >>>> Hi,
> >>>>
> >>>> I've seen that org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
> is good for writing results into (for example) different directories
> created on the fly. However, now I'm implementing a MapReduce job using
> Hadoop 1.0.3, I see that the new API no longer supports
> MultipleTextOutputFormat. Is there an equivalent that I can use, or will it
> be supported in a future release?
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Tony
> >>>>
> >>>>
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> Harsh J
> >>
> >>
> >>
> >> --
> >> Harsh J
> >
> >
> >
> > --
> > Harsh J
>
>
>
> --
> Harsh J
>
> --
> Alejandro



-- 
Alejandro