Posted to dev@hive.apache.org by roryqi <ro...@apache.org> on 2023/07/11 12:47:43 UTC

Introduce Uniffle : A stability solution of Hive's shuffle

Dear Apache Hive community,


We are delighted to announce support for Tez on Uniffle. Uniffle now
supports Apache Spark, Apache Hadoop MapReduce, and Apache Tez.

Uniffle is a remote shuffle service. It helps in several situations:

   1. If you use AWS spot instances or mixed resources, tasks may be
   preempted. Storing shuffle data in Uniffle, with Uniffle deployed on
   stable resources, improves task stability: if tasks are preempted, their
   shuffle data is still available in Uniffle, so we do not need to
   recompute them.
   2. For large shuffle jobs, Uniffle can reduce random IO and thereby
   improve job performance. For a 1 TB MapReduce TeraSort with 10,000 map
   tasks and 10,000 reduce tasks, job performance improves by 30%.

We welcome pull requests and are eager to see how you might use
Uniffle to make Hive more user-friendly. For more information, see
https://github.com/apache/incubator-uniffle


Best

Rory

Re: Introduce Uniffle : A stability solution of Hive's shuffle

Posted by Sungwoo Park <gl...@gmail.com>.

In addition to the two main benefits summarized by Rory, I would like to
add another benefit of using a remote shuffle service:

3. If you run large jobs in public clouds, the amount of local storage
attached to your instances can sometimes be a limiting factor. By using a
remote shuffle service, you can cut the usage of local storage by half
(because shuffle data is sent to the remote shuffle service, rather than
written to local storage).

Although you still need local storage for the remaining half, using a
remote shuffle service opens new possibilities for further reducing local
storage (e.g., reading directly from the network rather than spilling to
local disk).
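
As a rough back-of-the-envelope sketch of the point above: assume, purely for illustration, that map-side shuffle output and reduce-side spills each take local space equal to the shuffle size. The function name and the equal-split assumption are hypothetical, not measurements from any system.

```python
def local_storage_needed(shuffle_bytes, remote_shuffle=False, reducers_spill=True):
    """Estimate local disk used by shuffle under a simplified, assumed model."""
    # Map side: shuffle output is written locally unless it goes to a remote service.
    map_side = 0 if remote_shuffle else shuffle_bytes
    # Reduce side: fetched data is spilled locally unless it is read straight
    # from the network.
    reduce_side = shuffle_bytes if reducers_spill else 0
    return map_side + reduce_side


TB = 10**12
baseline = local_storage_needed(1 * TB)
with_rss = local_storage_needed(1 * TB, remote_shuffle=True)
no_spill = local_storage_needed(1 * TB, remote_shuffle=True, reducers_spill=False)
```

Under these assumptions, `with_rss` is half of `baseline`, and `no_spill` corresponds to the further reduction mentioned above.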

Thanks,

--- Sungwoo

Re: Introduce Uniffle : A stability solution of Hive's shuffle

Posted by He Qi <ro...@apache.org>.

Thanks. We're testing the Tez integration in a production environment.
The biggest remaining issue is adding the ability to recompute; see
https://github.com/apache/incubator-uniffle/issues/1011

In the future, we want Hive to no longer rely on local disks. The shuffle
server could sort the data and flush it to HDFS, and reducers would then
merge the files on HDFS.
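
To illustrate the merge step only: if each flushed file is already sorted by key, a reducer can combine the files with a streaming k-way merge. The sketch below uses in-memory lists as stand-ins for sorted files on HDFS. It is not Uniffle's or Tez's implementation, and the function name and sample data are hypothetical.

```python
import heapq
from operator import itemgetter


def merge_sorted_runs(runs):
    """Lazily merge key-sorted runs of (key, value) records into one sorted stream."""
    # heapq.merge keeps only one pending record per run in memory at a time.
    return heapq.merge(*runs, key=itemgetter(0))


# Stand-ins for sorted files flushed by shuffle servers (hypothetical data).
run_a = [("apple", 1), ("cherry", 3)]
run_b = [("banana", 2), ("cherry", 4), ("date", 5)]

merged = list(merge_sorted_runs([run_a, run_b]))
```

Because the merge is lazy, a reducer built this way would not need to hold a whole run on local disk or in memory, which is the property the plan above relies on.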

On 2023/07/20 10:08:05 Okumin wrote:
> Hi Rory,
> 
> Let me express my gratitude and positive impression of Uniffle.
> Actually, we also feel the necessity of a shuffle service for our Hive
> deployment, and I've been watching the project. I will check the
> implementation for Tez and send feedback or PRs if I find something.
> 
> Regards,
> Okumin

Re: Introduce Uniffle : A stability solution of Hive's shuffle

Posted by Okumin <ma...@okumin.com>.

Hi Rory,

Let me express my gratitude and positive impression of Uniffle.
Actually, we also feel the necessity of a shuffle service for our Hive
deployment, and I've been watching the project. I will check the
implementation for Tez and send feedback or PRs if I find something.

Regards,
Okumin
