Posted to user@tez.apache.org by Wojciech Indyk <wo...@gmail.com> on 2014/05/21 00:19:18 UTC

Sequence file as an output

Hi all!
I use tez-0.4 on HDP 2.1. I tried to save the results of a DAG as a
SequenceFile.
I use:
finalVertex.set(MRJobConfig.OUTPUT_FORMAT_CLASS_ATTR,
SequenceFileOutputFormat.class.getName());
The problem is that the output is still written with TextOutputFormat. Reading
a sequence file as the DAG input works fine (I use SequenceFileInputFormat).

Kind regards
Wojciech Indyk

Re: Sequence file as an output

Posted by Hitesh Shah <hi...@apache.org>.

@Bikas, @Wojciech,

I think 0.5 will likely work without changes, since HDP 2.1 is based on Apache Hadoop 2.4
( http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.1.1/bk_releasenotes_hdp_2.1/content/ch_relnotes-hdp-2.1.1-product.html )

thanks
-- Hitesh

On May 22, 2014, at 2:59 PM, Bikas Saha <bi...@hortonworks.com> wrote:

> If you see issues in your 0.5 build while running on the cluster you may want to follow the latest instructions in BUILDING.txt to target Hadoop 2.2 (HDP 2.1).

RE: Sequence file as an output

Posted by Bikas Saha <bi...@hortonworks.com>.

If you see issues in your 0.5 build while running on the cluster you may
want to follow the latest instructions in BUILDING.txt to target Hadoop 2.2
(HDP 2.1).



*From:* Bikas Saha [mailto:bikas@hortonworks.com]
*Sent:* Thursday, May 22, 2014 2:56 PM
*To:* user@tez.incubator.apache.org
*Subject:* RE: Sequence file as an output

That’s good news. The gains with a larger data set may be lower because the
time is dominated by the actual code that’s doing the work. You may want to
check that.

You can actually build 0.5 and use it on your cluster because Tez is a
client-side application. You only need to have the correct jars on the
local client classpath and at the HDFS location pointed to by TEZ_LIB_URI in
your tez-site.xml.

Bikas

RE: Sequence file as an output

Posted by Bikas Saha <bi...@hortonworks.com>.

That’s good news. The gains with a larger data set may be lower because the
time is dominated by the actual code that’s doing the work. You may want to
check that.



You can actually build 0.5 and use it on your cluster because Tez is a
client-side application. You only need to have the correct jars on the
local client classpath and at the HDFS location pointed to by TEZ_LIB_URI in
your tez-site.xml.
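
Since the client reads the jar location from tez-site.xml, a quick look at
what that file actually points to can help when switching a client to a
locally built 0.5. The sketch below parses a Hadoop-style site file and
returns one property. Assumptions: TEZ_LIB_URI is taken to mean the
tez.lib.uris property, the HDFS path in the sample is invented, and
TezSiteCheck and findProperty are hypothetical names, not part of Tez.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for inspecting a tez-site.xml; not part of Tez.
final class TezSiteCheck {
    // Sample file content. The HDFS path is made up for illustration.
    static final String SAMPLE_TEZ_SITE =
        "<configuration>\n"
      + "  <property>\n"
      + "    <name>tez.lib.uris</name>\n"
      + "    <value>hdfs://namenode:8020/apps/tez-0.5.0/</value>\n"
      + "  </property>\n"
      + "</configuration>\n";

    /** Returns the trimmed value of the named property, or null if absent. */
    static String findProperty(String siteXml, String name) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(siteXml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse site XML", e);
        }
        NodeList props = doc.getElementsByTagName("property");
        for (int i = 0; i < props.getLength(); i++) {
            Element prop = (Element) props.item(i);
            Node nameNode = prop.getElementsByTagName("name").item(0);
            Node valueNode = prop.getElementsByTagName("value").item(0);
            if (nameNode != null && valueNode != null
                    && name.equals(nameNode.getTextContent().trim())) {
                return valueNode.getTextContent().trim();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(findProperty(SAMPLE_TEZ_SITE, "tez.lib.uris"));
    }
}
```

The jars at that location would need to match the Tez version on the local
client classpath, as described above.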



Bikas



*From:* Wojciech Indyk [mailto:wojciechindyk@gmail.com]
*Sent:* Thursday, May 22, 2014 2:33 PM
*To:* user@tez.incubator.apache.org
*Subject:* Re: Sequence file as an output



I wrote my own processors, as in the WordCount example in v0.4.

I initially based my code on WordCount from Tez 0.5. However, I use HDP 2.1,
where Tez 0.4 is installed, and some methods used by the 0.5 WordCount are
missing in Tez 0.4, so I decided to base it on WordCount from 0.4 instead.
It worked fine until the output format problem.

Nevertheless, I made a workaround just to check the performance of Tez with
sessions. I generated the SequenceFile input for each iteration with the
MapReduce version of the algorithm. Then I used this input for the Tez
version of the algorithm (I saved the Tez output in another place). The
results are very promising. With a small dataset (~1 GB) Tez is 3 times
faster. With a ~40 GB dataset Tez is 30% faster.

I don't have time now to work on the problem with SequenceFile as an output.
I would rather rewrite the code according to best practices. I think an
update from Tez 0.4 to 0.5 will also be required.

Kind regards

Wojciech Indyk

Re: Sequence file as an output

Posted by Wojciech Indyk <wo...@gmail.com>.

I wrote my own processors, as in the WordCount example in v0.4.
I initially based my code on WordCount from Tez 0.5. However, I use HDP 2.1,
where Tez 0.4 is installed, and some methods used by the 0.5 WordCount are
missing in Tez 0.4, so I decided to base it on WordCount from 0.4 instead.
It worked fine until the output format problem.
Nevertheless, I made a workaround just to check the performance of Tez with
sessions. I generated the SequenceFile input for each iteration with the
MapReduce version of the algorithm. Then I used this input for the Tez
version of the algorithm (I saved the Tez output in another place). The
results are very promising. With a small dataset (~1 GB) Tez is 3 times
faster. With a ~40 GB dataset Tez is 30% faster.
I don't have time now to work on the problem with SequenceFile as an output.
I would rather rewrite the code according to best practices. I think an
update from Tez 0.4 to 0.5 will also be required.

Kind regards
Wojciech Indyk
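
One possible reading of these two measurements, in line with the point that
larger runs are dominated by the actual work: if a session mainly saves a
fixed launch cost per iteration, the relative gain shrinks as the
per-iteration work grows. The toy model below illustrates that trend only.
The overhead and work values are hypothetical, not measurements from this
thread, and the model is not fitted to the 3x and 30% figures.

```java
// Toy model: a fixed per-iteration overhead plus work that grows with the data.
// All numbers are hypothetical and chosen only to show the trend.
final class SessionSpeedupSketch {
    /** Ratio of baseline time to session time for a single iteration. */
    static double speedup(double baselineOverheadSec,
                          double sessionOverheadSec,
                          double workSec) {
        return (baselineOverheadSec + workSec) / (sessionOverheadSec + workSec);
    }

    public static void main(String[] args) {
        double baselineOverhead = 60.0; // assumed: launch cost paid on every iteration
        double sessionOverhead = 10.0;  // assumed: smaller cost when a session is reused
        for (double work : new double[] {15.0, 490.0}) {
            System.out.printf("work=%.0fs speedup=%.2fx%n",
                    work, speedup(baselineOverhead, sessionOverhead, work));
        }
    }
}
```

With these made-up inputs the ratio drops from 3.0 at 15 s of work to 1.1 at
490 s, even though the absolute saving (50 s per iteration) is the same.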


2014-05-21 19:31 GMT+02:00 Bikas Saha <bi...@hortonworks.com>:

> You are right. In fact, it’s a very interesting use case.
>
> Are you using MapProcessor and ReduceProcessor? Or have you written your
> own processor and are just using Tez inputs/outputs?
>
> If you look at the latest WordCount.java code in the Tez code base, you
> can see the current best practice for using the API. To follow it, you
> should compile against the current master, which tracks the next 0.5
> release. If you are building Tez locally, that is the master branch.
> Otherwise, Maven artifacts (for a dependency on 0.5.0-incubating-SNAPSHOT)
> are at
> https://repository.apache.org/content/groups/snapshots/org/apache/tez
>
> Let us know if this helps!
>
> Bikas

RE: Sequence file as an output

Posted by Bikas Saha <bi...@hortonworks.com>.

You are right. In fact, it’s a very interesting use case.

Are you using MapProcessor and ReduceProcessor? Or have you written your
own processor and are just using Tez inputs/outputs?

If you look at the latest WordCount.java code in the Tez code base, you
can see the current best practice for using the API. To follow it, you
should compile against the current master, which tracks the next 0.5
release. If you are building Tez locally, that is the master branch.
Otherwise, Maven artifacts (for a dependency on 0.5.0-incubating-SNAPSHOT)
are at
https://repository.apache.org/content/groups/snapshots/org/apache/tez

Let us know if this helps!

Bikas



*From:* Wojciech Indyk [mailto:wojciechindyk@gmail.com]
*Sent:* Wednesday, May 21, 2014 1:58 AM
*To:* user@tez.incubator.apache.org
*Subject:* Re: Sequence file as an output



When I remove MRHelpers.doJobClientMagic, a NullPointerException occurs in
the Configuration class.

Could you recommend a reference class (class and branch/release) that shows
good practice in Tez for MapReduce-style jobs? I've rewritten my MR job to
use Counters (not available in MapReduce on TEZ) and Sessions (to improve
iterative processing speed). My job has just a Map and a Reduce phase and
runs in a loop (several iterations), so I think using a session can improve
performance. Am I right?

Kind regards

Wojciech Indyk

Re: Sequence file as an output

Posted by Wojciech Indyk <wo...@gmail.com>.

When I remove MRHelpers.doJobClientMagic, a NullPointerException occurs in
the Configuration class.

Could you recommend a reference class (class and branch/release) that shows
good practice in Tez for MapReduce-style jobs? I've rewritten my MR job to
use Counters (not available in MapReduce on TEZ) and Sessions (to improve
iterative processing speed). My job has just a Map and a Reduce phase and
runs in a loop (several iterations), so I think using a session can improve
performance. Am I right?

Kind regards
Wojciech Indyk

2014-05-21 0:33 GMT+02:00 Siddharth Seth <ss...@apache.org>:

> It's possible that the old Output Format is being used (mapred vs
> mapreduce).
> Could you try forcing this to use the new API with the following.
>     finalVertex.setBoolean("mapred.mapper.new-api", true);
> Also, if you happen to be using MRHelpers.doJobClientMagic - remove that,
> since that could reset this parameter.
>
> This is a little messed up, but we're working on making this much easier
> to use in 0.5.
>
> Thanks
> - Sid

Re: Sequence file as an output

Posted by Siddharth Seth <ss...@apache.org>.

It's possible that the old Output Format is being used (mapred vs
mapreduce).
Could you try forcing this to use the new API with the following.
    finalVertex.setBoolean("mapred.mapper.new-api", true);
Also, if you happen to be using MRHelpers.doJobClientMagic - remove that,
since that could reset this parameter.

This is a little messed up, but we're working on making this much easier to
use in 0.5.

Thanks
- Sid
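
A rough way to picture the explanation above: the new-API flag decides which
configuration key is consulted for the output format, so setting only
MRJobConfig.OUTPUT_FORMAT_CLASS_ATTR has no effect while the old API is
selected. The sketch below models that with a plain Map. It is not Tez or
Hadoop code: the class OutputFormatSelection and its resolve method are made
up for illustration, and the idea that Tez 0.4 follows the mapper flag is
taken from this reply rather than checked against the source. The property
names and default classes are intended to be the standard Hadoop ones.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for the old-vs-new API decision; hypothetical names,
// not an excerpt from Tez or Hadoop.
final class OutputFormatSelection {
    // Hadoop property names. NEW_API_FORMAT_KEY is the value of
    // MRJobConfig.OUTPUT_FORMAT_CLASS_ATTR.
    static final String NEW_API_FLAG = "mapred.mapper.new-api";
    static final String NEW_API_FORMAT_KEY = "mapreduce.job.outputformat.class";
    static final String OLD_API_FORMAT_KEY = "mapred.output.format.class";

    static final String OLD_TEXT = "org.apache.hadoop.mapred.TextOutputFormat";
    static final String NEW_TEXT =
        "org.apache.hadoop.mapreduce.lib.output.TextOutputFormat";
    static final String NEW_SEQ =
        "org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat";

    /** Returns the output format class name this model would pick. */
    static String resolve(Map<String, String> conf) {
        boolean useNewApi =
            Boolean.parseBoolean(conf.getOrDefault(NEW_API_FLAG, "false"));
        if (useNewApi) {
            // New API: read the mapreduce.* key, defaulting to text output.
            return conf.getOrDefault(NEW_API_FORMAT_KEY, NEW_TEXT);
        }
        // Old API: the mapreduce.* key is ignored, so text output is the default.
        return conf.getOrDefault(OLD_API_FORMAT_KEY, OLD_TEXT);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(NEW_API_FORMAT_KEY, NEW_SEQ);
        System.out.println("flag unset: " + resolve(conf));
        conf.put(NEW_API_FLAG, "true");
        System.out.println("flag set:   " + resolve(conf));
    }
}
```

The reducer side has a matching flag, mapred.reducer.new-api; whether it
also needs to be set here is not something this thread settles.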


