Posted to user@helix.apache.org by Maharajan Nachiappa <ma...@gmail.com> on 2014/09/03 19:49:41 UTC

Fwd: Helix parallelism

Hi Kishore/kanak,

Thanks much for the guidance. I have tested the feature with a minimal number of nodes and it works as expected, but I have not yet done exhaustive testing.

I have a question: is there a way to get resulting data back for consolidation or aggregation in the client, as an optional field alongside the status and info that the TaskResult object already carries? For example, returning 1 or 2 KB of results from each of 5 task participants as an optional data object, similar to the map-reduce concept but on a real-time basis, giving the client the opportunity to consolidate the results.

Regards,
Maha

On Aug 22, 2014, at 8:05 AM, kishore g <g....@gmail.com> wrote:

Not sure if you are subscribed to the mailing list

---------- Forwarded message ----------
From: "Kanak Biscuitwala" <ka...@hotmail.com>
Date: Aug 21, 2014 10:02 AM
Subject: RE: Helix parallelism
To: "user@helix.apache.org" <us...@helix.apache.org>
Cc: 

Yes, you can use the task framework, which hasn't been released yet, but will be soon. For more on the task framework, you can read this blog post: http://engineering.linkedin.com/distributed-systems/ad-hoc-task-management-apache-helix

You can submit a job with 1000 tasks using either Java or YAML.

The YAML specification of this job would look something like:

name: MyWorkflow
jobs:
  - name: RunQueries
    command: RunQuery # The command corresponding to Task callbacks
    jobConfigMap: { # Arbitrary key-value pairs to pass to all tasks in this job
      k1: "v1",
      k2: "v2"
    }
    numConcurrentTasksPerInstance: 200 # Max parallelism per instance
    tasks: # Schedule 1000 tasks, each responsible for aggregating requests for a chunk of partitions
      - taskConfigMap: { # Arbitrary key-value pairs to pass to this task
          query: "query1"
        }
      - taskConfigMap: {
          query: "query2"
        }
      - taskConfigMap: {
          query: "query3"
        }
      # ... repeat for the remaining 997 tasks



You can also see this class for an example of how to build jobs in Java: https://github.com/apache/helix/blob/master/helix-core/src/test/java/org/apache/helix/integration/task/TestIndependentTaskRebalancer.java
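
To make that concrete, here is a minimal sketch of building the same job programmatically. It assumes the task framework's Java builders (TaskConfig.Builder, JobConfig.Builder, Workflow.Builder, TaskDriver); exact method names can vary between Helix versions, and the class name SubmitQueries and the generated query names are just placeholders:

import java.util.ArrayList;
import java.util.List;

import org.apache.helix.HelixManager;
import org.apache.helix.task.JobConfig;
import org.apache.helix.task.TaskConfig;
import org.apache.helix.task.TaskDriver;
import org.apache.helix.task.Workflow;

public class SubmitQueries {
  public static void submit(HelixManager manager) {
    // One TaskConfig per query, mirroring the taskConfigMap entries above.
    List<TaskConfig> taskConfigs = new ArrayList<TaskConfig>();
    for (int i = 1; i <= 1000; i++) {
      taskConfigs.add(new TaskConfig.Builder()
          .setCommand("RunQuery")
          .addConfig("query", "query" + i)
          .build());
    }

    JobConfig.Builder job = new JobConfig.Builder()
        .setCommand("RunQuery")
        .addTaskConfigs(taskConfigs)
        .setNumConcurrentTasksPerInstance(200); // max parallelism per instance

    // Submit the single-job workflow; Helix assigns tasks across live instances.
    Workflow workflow = new Workflow.Builder("MyWorkflow")
        .addJob("RunQueries", job)
        .build();
    new TaskDriver(manager).start(workflow);
  }
}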

Then you just need to implement a Task callback and register it on each of the instances, and Helix will take care of assignment and retries.
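
As a rough sketch of that last step (QueryTask and its method names are placeholders; the general pattern is to implement Task, expose it through a TaskFactory keyed by the job's command string, and register a TaskStateModelFactory under the "Task" state model before connecting the participant):

import java.util.HashMap;
import java.util.Map;

import org.apache.helix.HelixManager;
import org.apache.helix.task.Task;
import org.apache.helix.task.TaskCallbackContext;
import org.apache.helix.task.TaskFactory;
import org.apache.helix.task.TaskResult;
import org.apache.helix.task.TaskStateModelFactory;

public class QueryTask implements Task {
  private final String query;

  public QueryTask(TaskCallbackContext context) {
    // Read the per-task "query" value from the task's config map.
    this.query = context.getTaskConfig().getConfigMap().get("query");
  }

  @Override
  public TaskResult run() {
    // Execute the query here; returning ERROR instead would let Helix retry the task.
    return new TaskResult(TaskResult.Status.COMPLETED, "executed " + query);
  }

  @Override
  public void cancel() {
    // Abort the in-flight query if Helix cancels this task.
  }

  // On each participant, register the factory under the command name "RunQuery"
  // before calling manager.connect().
  public static void register(HelixManager manager) {
    Map<String, TaskFactory> factories = new HashMap<String, TaskFactory>();
    factories.put("RunQuery", new TaskFactory() {
      @Override
      public Task createNewTask(TaskCallbackContext context) {
        return new QueryTask(context);
      }
    });
    manager.getStateMachineEngine().registerStateModelFactory(
        "Task", new TaskStateModelFactory(manager, factories));
  }
}
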
________________________________
Date: Thu, 21 Aug 2014 09:07:11 -0700
Subject: Helix parallelism
From: maharajan.nachi@gmail.com
To: user@helix.apache.org

Hi,

I just started looking at Helix's capability to execute tasks in parallel, spread evenly across the cluster's instances and resources.

I have a requirement to execute a number of different queries in parallel. Can Helix help in this case?

For example:
1. I have some 1000 different queries to be executed.
2. I have 5 nodes configured in the Helix cluster, each capable of executing a set of queries.
3. I need Helix to distribute these 1000 queries evenly across the 5 nodes (200 per node), take care of re-executing any failed queries, and notify the controller when the job is done.

Can someone help me understand how Helix can solve this kind of problem?

Regards,
Maha

RE: Helix parallelism

Posted by Kanak Biscuitwala <ka...@hotmail.com>.
Hi Maha,

The info field is meant to be lightweight progress metadata. If you need to store something more sophisticated, you can use HelixManager#getHelixPropertyStore for small (kilobytes) state, or you can store pointers in the property store to a different store that can handle more data.
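
For illustration, a minimal sketch of that approach. The /RESULTS/MyWorkflow path, the "result" field, and the ResultStore class are made up for this example; the API itself is HelixManager#getHelixPropertyStore, which stores ZNRecord values under the cluster's property store:

import org.apache.helix.AccessOption;
import org.apache.helix.HelixManager;
import org.apache.helix.ZNRecord;
import org.apache.helix.store.HelixPropertyStore;

public class ResultStore {
  // Called from inside Task#run() after computing a small (KB-sized) result.
  public static void write(HelixManager manager, String taskId, String result) {
    HelixPropertyStore<ZNRecord> store = manager.getHelixPropertyStore();
    ZNRecord record = new ZNRecord(taskId);
    record.setSimpleField("result", result); // keep payloads small; ZooKeeper is not a data store
    store.set("/RESULTS/MyWorkflow/" + taskId, record, AccessOption.PERSISTENT);
  }

  // Called by the client to read one task's result back and aggregate.
  public static String read(HelixManager manager, String taskId) {
    HelixPropertyStore<ZNRecord> store = manager.getHelixPropertyStore();
    ZNRecord record = store.get("/RESULTS/MyWorkflow/" + taskId, null, AccessOption.PERSISTENT);
    return record == null ? null : record.getSimpleField("result");
  }
}

For anything larger than a few kilobytes, store only a pointer (for example, a key into an external store) in the record, as suggested above.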

Kanak
