Posted to user@helix.apache.org by Maharajan Nachiappa <ma...@gmail.com> on 2014/09/03 19:49:41 UTC
Fwd: Helix parallelism
Hi Kishore/kanak,
Thanks much for the guidance. I have tested the feature with a minimal number of nodes and it works as expected, but I have not done exhaustive testing yet.
I have a question: is there a way to get the resulting data back to the client for consolidation or aggregation, as an optional field alongside the status and info in the TaskResult object? For example, returning 1 or 2 KB of results from each of 5 task participants as an optional data object. It would be similar to the map-reduce concept, but on a real-time basis, so the client has the opportunity to consolidate the results.
Regards,
Maha
On Aug 22, 2014, at 8:05 AM, kishore g <g....@gmail.com> wrote:
Not sure if you are subscribed to the mailing list
---------- Forwarded message ----------
From: "Kanak Biscuitwala" <ka...@hotmail.com>
Date: Aug 21, 2014 10:02 AM
Subject: RE: Helix parallelism
To: "user@helix.apache.org" <us...@helix.apache.org>
Cc:
Yes, you can use the task framework, which hasn't been released yet, but will be soon. For more on the task framework, you can read this blog post: http://engineering.linkedin.com/distributed-systems/ad-hoc-task-management-apache-helix
You can submit a job with 1000 tasks using either Java or YAML.
The YAML specification of this job would look something like:
name: MyWorkflow
jobs:
  - name: RunQueries
    command: RunQuery  # The command corresponding to Task callbacks
    jobConfigMap:  # Arbitrary key-value pairs to pass to all tasks in this job
      k1: "v1"
      k2: "v2"
    numConcurrentTasksPerInstance: 200  # Max parallelism per instance
    tasks:  # Schedule 1000 tasks, each responsible for aggregating requests for a chunk of partitions
      - taskConfigMap:  # Arbitrary key-value pairs to pass to this task
          query: "query1"
      - taskConfigMap:
          query: "query2"
      - taskConfigMap:
          query: "query3"
      # Repeat for remaining 997 tasks
You can also see this class for an example of how to build jobs in Java: https://github.com/apache/helix/blob/master/helix-core/src/test/java/org/apache/helix/integration/task/TestIndependentTaskRebalancer.java
Then you just need to implement a Task callback and register it on each of the instances, and Helix will take care of assignment and retries.
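The Task callback mentioned above can be sketched roughly as follows. This is a self-contained illustration of the shape of the callback, not the actual Helix API: in Helix itself, Task, TaskResult, and the task factory live in org.apache.helix.task, and the factory is registered through TaskStateModelFactory on each participant's StateMachineEngine. The executeQuery helper here is a hypothetical placeholder for real query execution.

```java
import java.util.Map;

// Minimal stand-ins mirroring the shape of org.apache.helix.task.Task
// and org.apache.helix.task.TaskResult (simplified for illustration).
interface Task {
    TaskResult run();   // invoked once when Helix assigns the task
    void cancel();      // invoked if the task is dropped or rebalanced away
}

class TaskResult {
    enum Status { COMPLETED, FAILED }
    final Status status;
    final String info;  // lightweight progress/result metadata
    TaskResult(Status status, String info) {
        this.status = status;
        this.info = info;
    }
}

// One callback per query; the "query" key comes from the per-task
// taskConfigMap in the YAML job description.
class RunQueryTask implements Task {
    private final String query;
    private volatile boolean cancelled = false;

    RunQueryTask(Map<String, String> taskConfig) {
        this.query = taskConfig.get("query");
    }

    @Override
    public TaskResult run() {
        try {
            String result = executeQuery(query);  // hypothetical query runner
            if (cancelled) {
                return new TaskResult(TaskResult.Status.FAILED, "cancelled");
            }
            return new TaskResult(TaskResult.Status.COMPLETED, result);
        } catch (Exception e) {
            // FAILED signals Helix that this task should be retried
            return new TaskResult(TaskResult.Status.FAILED, e.getMessage());
        }
    }

    @Override
    public void cancel() {
        cancelled = true;
    }

    private String executeQuery(String q) {
        return "ran:" + q;  // placeholder for real query execution
    }
}
```

With the real API, you would map the command name "RunQuery" to a TaskFactory producing instances of this class, and register that map on every participant so Helix can hand out the 1000 tasks.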
Date: Thu, 21 Aug 2014 09:07:11 -0700
Subject: Helix parallelism
From: maharajan.nachi@gmail.com
To: user@helix.apache.org
Hi,
I just started looking at Helix's capability to execute tasks in parallel, evenly across the cluster's instances and resources.
I have a requirement to execute a number of different queries in parallel. Can Helix help in this case?
For example
1. I have some 1000 different queries to be executed.
2. I have 5 nodes configured in the Helix cluster, each capable of executing a set of queries.
3. I need Helix to distribute these 1000 queries equally across the 5 nodes (200 per node), take care of re-executing any failed queries, and notify the controller when the job is done.
Can someone help me understand how Helix can solve this kind of problem?
Regards,
Maha
RE: Helix parallelism
Posted by Kanak Biscuitwala <ka...@hotmail.com>.
Hi Maha,
The info field is meant to be lightweight progress metadata. If you need to store something more sophisticated, you can use HelixManager#getHelixPropertyStore for small (kilobytes) state, or you can store pointers in the property store to a different store that can handle more data.
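The property-store pattern described above might look roughly like the sketch below. This is a hedged illustration, not the Helix API: the real store comes from HelixManager#getHelixPropertyStore, is ZooKeeper-backed, and holds ZNRecord values written with set(path, record, AccessOption.PERSISTENT). Here a plain map stands in for the store so the write-then-aggregate flow is visible; the path layout under /results is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory stand-in for the Helix property store. Each task writes its
// small (kilobytes) result under a known path; the client reads the
// paths back after the job completes and consolidates them.
class PropertyStoreSketch {
    private final Map<String, String> store = new HashMap<>();

    // A task stores its result (or a pointer to a larger external store).
    void set(String path, String value) {
        store.put(path, value);
    }

    String get(String path) {
        return store.get(path);
    }

    // The client aggregates by reading every task's path; with real
    // Helix this would iterate over the store's child nodes instead.
    String aggregate(String prefix, int numTasks) {
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i <= numTasks; i++) {
            String v = store.get(prefix + "/query" + i);
            if (v != null) {
                if (sb.length() > 0) sb.append(',');
                sb.append(v);
            }
        }
        return sb.toString();
    }
}
```

The design choice here is that TaskResult#info stays tiny (status plus a path or short summary), while the actual per-task payload lives in the property store, keeping Helix's own state small.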
Kanak