Posted to hdfs-user@hadoop.apache.org by Ranjithkumar Gampa <gr...@gmail.com> on 2012/09/28 19:57:28 UTC

context.write() Vs FSDataOutputStream.writeBytes()

Hi,

We are using FSDataOutputStream.writeBytes() from our map/reduce tasks to
write directly to the Hive table path, instead of using context.write().
This has been working fine, and so far we have had no problems with the
approach. We make the file names distinct by appending the taskAttemptId
to them, and we set speculative execution to false so that duplicate
map/reduce attempts won't work on the same data and write inconsistent
output to HDFS. We chose this approach for the reasons below; please let
us know if it has any disadvantages.
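A minimal sketch of the naming scheme described above, using local files to stand in for HDFS (a real mapper would call FileSystem.create() to obtain an org.apache.hadoop.fs.FSDataOutputStream; the table path and attempt id below are hypothetical placeholders):

```java
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class DistinctFileSketch {
    // Build a file name that is unique per task attempt, so two attempts
    // (e.g. a retry after failure) never write to the same file.
    static String outputFileName(String tablePath, String taskAttemptId) {
        return tablePath + "/part-" + taskAttemptId;
    }

    public static void main(String[] args) throws IOException {
        // Temp dir stands in for the Hive table directory on HDFS.
        String tableDir = Files.createTempDirectory("hive-table-path").toString();
        String attemptId = "attempt_201209281957_0001_m_000000_0"; // hypothetical
        String file = outputFileName(tableDir, attemptId);

        // In a real mapper this stream would come from fs.create(new Path(file)).
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(file))) {
            out.writeBytes("row1\tvalue1\n"); // writeBytes emits the low byte of each char
        }
        System.out.println(Files.exists(Paths.get(file))); // prints "true"
    }
}
```

Speculative execution would additionally be disabled with mapred.map.tasks.speculative.execution=false and mapred.reduce.tasks.speculative.execution=false (the pre-YARN property names).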

1) To avoid cleaning up the _SUCCESS and _LOG files created by the
mapper/reducer output, which Hive may not handle well.
2) To write some records directly from the mappers that don't need to
participate in the reducer logic, saving some of the sort-and-shuffle
cost. We are exploring MultipleOutputs, but I think point 1 would still
need to be handled.
3) Our data contains special characters, and we do String manipulation on
it using 'ISO-8859-1' encoding. Using the Text class with context.write()
does not preserve these characters, because Text uses UTF-8 encoding by
default.
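Point 3 can be reproduced with plain JDK classes: a byte that is valid ISO-8859-1 is not, by itself, valid UTF-8, so decoding it as UTF-8 (one way the Text mismatch shows up) replaces it, while DataOutputStream.writeBytes() passes the low byte of each char straight through. A small self-contained demonstration:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) throws IOException {
        byte[] raw = {(byte) 0xE9}; // 'é' in ISO-8859-1

        // Decoding ISO-8859-1 bytes as UTF-8 corrupts them: 0xE9 on its own
        // is an incomplete UTF-8 sequence and becomes U+FFFD.
        String asUtf8 = new String(raw, StandardCharsets.UTF_8);
        String asLatin1 = new String(raw, StandardCharsets.ISO_8859_1);
        System.out.println(asUtf8.equals("\uFFFD"));   // prints "true"
        System.out.println(asLatin1.equals("\u00E9")); // prints "true"

        // writeBytes() writes the low 8 bits of each char, which for
        // characters <= U+00FF is exactly the ISO-8859-1 byte, so the
        // original byte survives the round trip unchanged.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeBytes(asLatin1);
        System.out.println(buf.toByteArray()[0] == (byte) 0xE9); // prints "true"
    }
}
```

Note that re-encoding the decoded string as UTF-8 would produce two bytes (0xC3 0xA9) rather than the original single byte, which is why the on-disk bytes change when such data passes through a UTF-8 writer.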

Please let me know if my understanding is incorrect or if there are other
ways to handle the three points above; I am happy to hear and learn. Our
project uses a mix of Hadoop MR and Hive.

Thanks in advance.

Regards,
Ranjith

Re: context.write() Vs FSDataOutputStream.writeBytes()

Posted by Ranjithkumar Gampa <gr...@gmail.com>.
Hello all,

Has anybody looked into the topic below? Please reply with your views.

Thanks
Ranjith

