Posted to dev@helix.apache.org by Kanak Biscuitwala <ka...@hotmail.com> on 2014/03/27 08:42:37 UTC
Helix Meeting transcript 3/26/14

ASFBot was unavailable, so here is a copy-paste meeting log:

23:09
osgigeek
I submit tasks to a task management interface, whatever we choose
23:10
osgigeek
those tasks need to be executed as soon as possible
23:10
osgigeek
and I should get a response
23:10
kishoreg1
they will be
23:10
kishoreg1
thats exactly the behavior
23:10
kanakb
osgigeek, perhaps i should have done a better job separating api from implementation details
23:10
kanakb
but that is exactly what this does
23:11
kanakb
the api is exactly what you're suggesting
23:11
osgigeek
ok but I think I am lost when we say
23:11
osgigeek
controller looks at the resource and the tasks and assigns the tasks to nodes in the cluster
23:11
osgigeek
why are the tasks inherently distributed?
23:12
kanakb
depends on your policy
23:12
osgigeek
ah ok
23:12
kanakb
if you *want* stuff to run on specific nodes, we can make that happen
23:12
kanakb
if you just want your task to run somewhere that has capacity
23:12
kanakb
then we can do that too
23:12
osgigeek
right
23:13
osgigeek
so I am thinking of that specific usecase
23:13
osgigeek
I have it today
23:13
osgigeek
I want to execute tasks on a given node and iff it does not have capacity move it to a node which does
23:13
osgigeek
if you are saying I can do it with a policy then that works
23:13
osgigeek
because I may want the tasks to be executed locally
23:13
osgigeek
I may not want them distributed
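A minimal sketch of the placement policy osgigeek describes: run on the preferred node while it has capacity, otherwise spill over to another node that does. `Node`, `place_task`, and the slot-count notion of capacity are hypothetical illustrations, not Helix APIs, and picking the least-loaded spillover node is just one reasonable choice.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    name: str
    capacity: int      # hypothetical: max concurrent tasks on this node
    running: int = 0   # tasks currently placed here

    def has_capacity(self) -> bool:
        return self.running < self.capacity


def place_task(preferred: Node, cluster: Sequence[Node]) -> Optional[Node]:
    """Prefer the given (e.g. local) node; spill to the least-loaded node with room."""
    if preferred.has_capacity():
        chosen = preferred
    else:
        candidates = [n for n in cluster if n is not preferred and n.has_capacity()]
        if not candidates:
            return None  # no capacity anywhere: the caller keeps the task queued
        chosen = min(candidates, key=lambda n: n.running / n.capacity)
    chosen.running += 1
    return chosen
```

Returning `None` instead of overcommitting leaves the decision to queue or wait with the caller, which is also where failover of queued tasks would hook in.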
23:14
kanakb
if you want to run the task on the same machine as the client
23:14
kanakb
using helix probably doesn't make much sense
23:14
osgigeek
the benefit I see is helix will give me the ability to failover to another node if my tasks are queued and the node fails
23:14
osgigeek
I dont have to manage that myself
23:15
osgigeek
Helix can also maybe manage execution based on capacity
23:15
kishoreg1
there is a whole lot more than that
23:15
kishoreg1
correct
23:15
kishoreg1
and with integration with yarn/ec2 etc
23:15
osgigeek
so yes, I agree there is a whole lot more than that; I am not trivializing the issue
23:15
kishoreg1
we can support SLA
23:16
kishoreg1
u can say run these tasks, i need it to be completed in X hours
23:16
kishoreg1
so now helix can run the tasks, monitor progress and go ask EC2 for more instances
23:17
kishoreg1
and start the tasks there
23:17
kishoreg1
and release EC2 instances after its done
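The SLA idea reduces to simple arithmetic once the controller can measure progress. This sketch assumes throughput scales linearly with instance count and that a per-instance rate (tasks per instance-hour) has been observed; `instances_needed` is a hypothetical helper, not part of Helix or any EC2 API.

```python
import math


def instances_needed(remaining_tasks: int,
                     tasks_per_instance_hour: float,
                     hours_left: float) -> int:
    """Smallest instance count that finishes the remaining tasks before the deadline.

    Assumes throughput grows linearly with the number of instances (a simplification).
    """
    if remaining_tasks <= 0:
        return 0  # all done: every extra instance can be released
    if hours_left <= 0 or tasks_per_instance_hour <= 0:
        raise ValueError("deadline already passed or no measured progress")
    return math.ceil(remaining_tasks / (tasks_per_instance_hour * hours_left))
```

For example, 1000 remaining tasks at 50 tasks per instance-hour with 4 hours left gives 1000 / (50 * 4) = 5 instances; when the result drops below the current count, the surplus instances can be released.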
23:18
osgigeek
sure
23:18
kishoreg1
so yeah, in general the problems are initial distribution, handling faults, and scaling up/down
23:18
osgigeek
but distributing tasks without a policy seems like an optimization
23:18
kishoreg1
these can be applied to either resources or tasks
23:19
kishoreg1
yeah, depends on the use case. some have policy
23:19
kishoreg1
most use cases will simply say run these tasks
23:19
osgigeek
I think the separation of resource and task makes sense in my head but maybe not to you guys
23:19
kishoreg1
i dont care where it runs
23:20
osgigeek
ok lets keep moving I dont want to stop progress
23:20
kishoreg1
it does make a difference
23:20
kishoreg1
maybe u can help us with the terminology there
23:20
osgigeek
sure
23:21
osgigeek
so one question related to tasks
23:21
osgigeek
can I say execute these at midnight every day?
23:21
kishoreg1
basically a resource is an entity that is distributed; it's not transient
23:21
osgigeek
a resource is also not executable and does not have results
23:21
kishoreg1
u want the resource to always exist and in a particular state
23:21
kishoreg1
correct
23:21
kanakb
yes, recurring tasks at a given time are a use case we want to support with the scheduling work
23:22
kishoreg1
task on the other hand is something that can be executed
23:22
osgigeek
correct
23:22
osgigeek
that is why the separation
23:22
kishoreg1
and is probably executed once
23:22
osgigeek
no, a task can be executed repeatedly
23:22
osgigeek
if its recurring
23:23
kishoreg1
dont know if the concepts of partition/replica make sense
23:23
osgigeek
say something like execute this every 4 hrs
23:23
kishoreg1
yeah i take that back
23:23
kishoreg1
so shud we make task a first class citizen
23:23
osgigeek
I think state machine wont make sense for tasks
23:24
kishoreg1
it does
23:24
kishoreg1
task has states
23:24
osgigeek
what states does a task go through?
23:24
kanakb
see the diagram on the wiki page
23:24
kishoreg1
init, start, stop, pause, resume, cancel
23:25
osgigeek
so how will we pause a running thread/task?
23:25
kishoreg1
we will call a method called pause on the task
23:26
kishoreg1
user will have to implement that method to stop doing whatever it was doing
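One way to read the task states listed above: the framework owns the state machine and invokes user-implemented hooks such as `on_pause`. The state names, command names, and allowed transitions below are assumptions for illustration only; the actual diagram lives on the wiki page kanakb mentions.

```python
from enum import Enum


class TaskState(Enum):
    INIT = "INIT"
    RUNNING = "RUNNING"
    PAUSED = "PAUSED"
    STOPPED = "STOPPED"
    CANCELLED = "CANCELLED"


# Hypothetical transition table: command -> (valid source states, target state).
TRANSITIONS = {
    "start": ({TaskState.INIT}, TaskState.RUNNING),
    "pause": ({TaskState.RUNNING}, TaskState.PAUSED),
    "resume": ({TaskState.PAUSED}, TaskState.RUNNING),
    "stop": ({TaskState.RUNNING, TaskState.PAUSED}, TaskState.STOPPED),
    "cancel": ({TaskState.INIT, TaskState.RUNNING, TaskState.PAUSED}, TaskState.CANCELLED),
}


class Task:
    """User code overrides the on_* hooks; the framework drives the state."""

    def __init__(self, name: str):
        self.name = name
        self.state = TaskState.INIT

    def transition(self, command: str) -> None:
        sources, target = TRANSITIONS[command]
        if self.state not in sources:
            raise ValueError(f"cannot {command} {self.name} from {self.state.name}")
        getattr(self, "on_" + command)()  # run the user's hook, then record the new state
        self.state = target

    # Default hooks do nothing; e.g. a real on_pause would stop consuming input.
    def on_start(self): pass
    def on_pause(self): pass
    def on_resume(self): pass
    def on_stop(self): pass
    def on_cancel(self): pass
```

Rejecting out-of-order commands in `transition` means user code never has to guard its own `on_pause` against being called on a task that is not running.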
23:26
osgigeek
ok I guess it does for recurring tasks
23:26
osgigeek
ok
23:26
osgigeek
so tasks have states : check
23:27
osgigeek
what about partition and replica
23:27
osgigeek
I think replica makes sense for failover
23:27
osgigeek
issue is removing it when the task is cancelled or completed
23:27
osgigeek
or errored
23:28
kanakb
so this is actually a problem that needs some thought
23:29
kanakb
we definitely need partition-level granularity for some use cases
23:29
kanakb
and the callbacks are already partition-level
23:30
osgigeek
btw one other difference between tasks and resources is users dont get to define the state model for tasks; it is predefined. Am I correct?
23:30
kanakb
yeah that's all abstracted
23:31
osgigeek
right I was trying to see if there are enough differences to require defining task as a new top level
23:31
osgigeek
or a first class citizen
23:31
kishoreg1
yeah, its probably better to have it
23:33
kishoreg1
the other extreme approach is to get rid of resource
23:33
kishoreg1
and call everything a Task
23:33
kishoreg1
but have attributes on Task
23:34
kishoreg1
longRunning, shortLive
23:34
osgigeek
I think resource has its place for sure; I think it's cleaner to keep them both
23:35
kishoreg1
ok,
23:35
osgigeek
it's hard to get people to think of a database as a Task
23:35
osgigeek
people are used to thinking of those as resources
23:35
kishoreg1
so we will have resource, partition, replica
23:35
kishoreg1
task, subtask
23:36
kanakb
what about workflows?
23:36
osgigeek
do we need a subtask, why cant we model it as a task itself?
23:36
osgigeek
a task with a parent task?
23:36
kishoreg1
we need a logical grouping of multiple tasks
23:37
kishoreg1
probably a taskqueue, task?
23:37
hsaputra
as DAG ?
23:37
osgigeek
TaskCollection?
23:37
kanakb
taskqueue is probably not correct because queue implies sequential execution
23:38
osgigeek
yeah I am with kanakb: on that
23:38
osgigeek
It implies order
23:38
kishoreg1
hsaputra: DAG is above taskcollection
23:39
kishoreg1
for example indexing 100 files under a directory
23:39
hsaputra
ok ...
23:39
kishoreg1
we will have 100 tasks
23:40
kishoreg1
but we want to refer to them with a logical name
23:40
kishoreg1
that's what we are trying to come up with a name for
23:40
kishoreg1
taskgroup, taskcollection
23:40
hsaputra
Tasks 1-to-m of Task instances ?
23:41
kishoreg1
yes
23:41
osgigeek
TaskGroup is probably better
23:41
osgigeek
Ok I like TaskGroup
23:41
kishoreg1
ok.
23:42
osgigeek
so should the user be able to attach listener to TaskGroup as well as Task?
23:42
kishoreg1
yes
23:42
osgigeek
kanakb: what was the concept of workflow?
23:43
kishoreg1
just like you can listen to resource as well as partition
23:43
kanakb
workflow is taskgroups organized into a DAG, enforcing execution order
23:43
kishoreg1
workflow simply defines the order in which tasks shud be run
23:43
osgigeek
the user defines that?
23:43
kanakb
yes
23:43
kishoreg1
yes
23:44
osgigeek
So a workflow traditionally supports branching
23:44
osgigeek
do we have that ?
23:44
osgigeek
fork and join?
23:45
kanakb
conditional execution is something we can add later
23:45
osgigeek
ok so its a sequential and a single branch
23:46
kanakb
sequential with respect to parents
23:46
kanakb
you can have 0 to multiple parents
23:46
kanakb
at arbitrary depth
23:46
osgigeek
multiple parents implies join
23:46
kanakb
right so that is supported
23:46
osgigeek
ok
23:46
kanakb
as is fork, i guess
23:46
osgigeek
yes
23:47
osgigeek
so we will then need an api to allow the user to express this workflow
23:47
kanakb
that's YAML right now
23:47
osgigeek
but we want to add a java one too?
23:47
kishoreg1
but yeah we need an api as well
23:47
kanakb
yeah so right now it's based on a builder
23:47
kanakb
that takes tasks
23:47
kanakb
that's pretty static
23:48
kanakb
ideally it would be nice to make this more dynamic
23:48
kanakb
i could say the same thing about taskgroups actually
23:48
osgigeek
I am thinking we should introduce branch maybe
23:48
kishoreg1
lets start static
23:48
kishoreg1
whats a branch
23:48
osgigeek
or we could do this whole thing as events
23:49
osgigeek
so bear with me on this branch attempt
23:50
osgigeek
so a branch contains tasks
23:50
osgigeek
and we fork branches and join them
23:50
osgigeek
to make up workflows
23:50
osgigeek
so branch1.fork(branch2, branch3)
23:51
osgigeek
branch1.add(task1).add(task2)
23:51
osgigeek
and so on
23:51
osgigeek
a branch inherently executes tasks in order
23:52
osgigeek
fork essentially allows distribution
23:52
osgigeek
across nodes
23:52
osgigeek
branch1 to node1
23:52
osgigeek
sorry, branch2 to node1 and branch3 to node2
23:52
kanakb
this is currently expressed as just specifying the parent tasks for each task
23:52
kanakb
isn't that just as expressive?
23:54
kanakb
it also has the benefit of allowing helix to construct the graph for you instead of trying to specify it all up front
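kanakb's point that listing each task's parents is as expressive as explicit fork/join can be illustrated with a small topological sort (Kahn's algorithm, grouped into waves). A task with several parents is a join; a parent with several children is a fork. `execution_waves` is a hypothetical helper, not Helix's implementation.

```python
from typing import Dict, List, Set


def execution_waves(parents: Dict[str, List[str]]) -> List[List[str]]:
    """Group tasks into waves; every task's parents all appear in earlier waves.

    Tasks in the same wave do not depend on each other, so they can run in
    parallel (a fork); a task with several parents waits for all of them (a join).
    """
    nodes: Set[str] = set(parents)
    for ps in parents.values():
        nodes.update(ps)  # include tasks that only ever appear as parents
    remaining = {n: set(parents.get(n, [])) for n in nodes}
    waves: List[List[str]] = []
    while remaining:
        ready = sorted(n for n, ps in remaining.items() if not ps)
        if not ready:
            raise ValueError("cycle detected among: " + ", ".join(sorted(remaining)))
        waves.append(ready)
        for n in ready:
            del remaining[n]
        for ps in remaining.values():
            ps.difference_update(ready)  # those parents are now satisfied
    return waves
```

osgigeek's branch example, written as parents `{"T1": [], "T2": ["T1"], "T3": ["T1"], "T4": ["T2", "T3"]}`, yields `[["T1"], ["T2", "T3"], ["T4"]]`: T2 and T3 fork after T1 and join at T4, and the graph is derived rather than specified up front.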
23:54
osgigeek
sorry I have not seen the current expression so dont know if its as good
23:54
osgigeek
so the more I think about this, the more it makes sense to loosely couple them
23:55
osgigeek
each task should have task state events
23:55
kishoreg1
yes, i would love to keep them loosely coupled
23:55
osgigeek
and other tasks should simply listen to those events
23:55
kishoreg1
helix controller does not even understand the fork join concepts
23:55
osgigeek
so T1 and T2
23:55
osgigeek
T1 has states already all we need to do is trigger events
23:56
kishoreg1
:)
23:56
osgigeek
so if T2 depends on T1 and say T0 to complete it listens to the complete events for T1 and T0
23:56
kishoreg1
u basically explained how we have implemented it
23:56
osgigeek
lol
23:56
kishoreg1
actually that was our first implementation
23:57
osgigeek
but that way we can keep them loosely coupled
23:57
kishoreg1
the thing is who listens
23:57
osgigeek
and people can add tasks anytime
23:57
osgigeek
a task listens; we only need to know which event-type of which task it listens to
23:57
osgigeek
the task event carries the event-type and task-id
23:58
osgigeek
or task name
23:58
kishoreg1
is it the onus of T1 to listen for T0's completion before T1 starts
23:58
osgigeek
so there should be some entity orchestrating the listening and triggering of T1
23:58
kishoreg1
or controller listens to T0 and starts T1 after T0 has reached complete state
23:58
osgigeek
some manager or maybe like you say controller
23:58
kishoreg1
correct, thats what we are doing
23:58
osgigeek
but the user expresses the listen events on the task
23:58
kishoreg1
right
23:58
osgigeek
so the user should think that the task is listening
23:59
osgigeek
that I think allows pluggability of tasks
23:59
kishoreg1
right, i think its exactly the way u described
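The loosely coupled design agreed on here, where the user declares which task events a task waits for and a controller does the actual listening and triggering, could look roughly like this. `Controller`, `TaskSpec`, `TaskEvent`, and the `COMPLETED` event type are hypothetical names chosen for the sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

# An event is identified by (task id, event type), e.g. ("T0", "COMPLETED").
EventKey = Tuple[str, str]


@dataclass(frozen=True)
class TaskEvent:
    task_id: str
    event_type: str


@dataclass
class TaskSpec:
    task_id: str
    # Events this task "listens" for; it starts once all of them have fired.
    waits_for: Set[EventKey] = field(default_factory=set)


class Controller:
    """Central orchestrator: tasks only declare what they wait for."""

    def __init__(self) -> None:
        self.pending: Dict[str, Set[EventKey]] = {}
        self.started: List[str] = []

    def submit(self, spec: TaskSpec) -> None:
        # Tasks can be added at any time; one with nothing to wait for starts at once.
        # (Events that fired before submission are not replayed in this sketch.)
        self.pending[spec.task_id] = set(spec.waits_for)
        self._start_ready()

    def on_event(self, event: TaskEvent) -> None:
        key = (event.task_id, event.event_type)
        for waiting in self.pending.values():
            waiting.discard(key)
        self._start_ready()

    def _start_ready(self) -> None:
        ready = [t for t, waiting in self.pending.items() if not waiting]
        for t in ready:
            del self.pending[t]
            self.started.append(t)  # a real controller would assign it to a node here
```

With T2 declared as waiting for `("T0", "COMPLETED")` and `("T1", "COMPLETED")`, T2 starts only after the controller has seen both events, while T0 and T1 never need to know T2 exists.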
00:00
kishoreg1
lets wait for kanak to finish his work and see a working demo
00:00
osgigeek
mind you I have not looked at any existing implementation
00:00
osgigeek
so it just validates your design I think
00:01
kishoreg1
yep
00:01
kishoreg1
ok its 12:00
00:01
kishoreg1
shud we wrap up?
00:01
kishoreg1
i badly wanted to talk about admin api
00:01
osgigeek
yes, so what have we decided? can we recap?
00:01
osgigeek
Tasks are first class citizens
00:01
kishoreg1
we will have taskgroup and task as first class citizens
00:02
osgigeek
Tasks are grouped into TaskGroups
00:02
kishoreg1
so we will have taskconfiguration
00:02
kishoreg1
we need a better way to differentiate between task and resource
00:03
kanakb
if taskgroups are first class citizens, where does workflow fit in
00:03
kishoreg1
workflow is just a user concept
00:03
kishoreg1
helix does not understand it internally
00:03
kishoreg1
a task has entryCriteria
00:03
kishoreg1
or exitCriteria
00:04
kanakb
ok
00:04
kishoreg1
that expresses the dependency triggers
00:04
kanakb
a task, not a task group, correct?
00:04
kishoreg1
it can be at both levels
00:04
osgigeek
yeah I think both levels makes sense
00:05
osgigeek
an entire group could be cancelled if a criterion is met
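A group-level exit criterion can be sketched as a predicate over the statuses of every task in the group; once it holds, each task that has not finished is cancelled. The `TaskGroup` class, the `Status` values, and the example criterion are assumptions for illustration.

```python
from enum import Enum
from typing import Callable, Dict, List


class Status(Enum):
    PENDING = "PENDING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    CANCELLED = "CANCELLED"


class TaskGroup:
    def __init__(self, group_id: str, task_ids: List[str],
                 exit_criterion: Callable[[Dict[str, Status]], bool]):
        self.group_id = group_id
        self.status = {t: Status.PENDING for t in task_ids}
        self.exit_criterion = exit_criterion  # evaluated over the whole group

    def report(self, task_id: str, status: Status) -> None:
        self.status[task_id] = status
        if self.exit_criterion(self.status):
            # Cancel every task in the group that has not finished yet.
            for t, s in list(self.status.items()):
                if s is Status.PENDING:
                    self.status[t] = Status.CANCELLED
```

For instance, an indexing group built with `lambda st: sum(s is Status.FAILED for s in st.values()) >= 2` would cancel its remaining file tasks once two of them fail, while already completed tasks keep their status.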
00:05
kishoreg1
correct,
00:05
kishoreg1
so we can have tasks be long running?
00:06
kishoreg1
for example shud a stream processing job be represented as task
00:06
kishoreg1
or resource
00:06
kanakb
probably resource
00:06
kanakb
i don't know though
00:06
osgigeek
I think task
00:06
kishoreg1
i think sandeep will say it shud be task
00:06
osgigeek
lol
00:06
osgigeek
yes its a task
00:06
osgigeek
a long running one albeit without timeout
00:07
kanakb
basically what that implies is that some tasks are not completable
00:07
kishoreg1
i agree, its just that we currently model them as a resource
00:07
kishoreg1
yeah, so we can have them as attribute on the task configuration
00:09
kishoreg1
ok anything else
00:09
osgigeek
callbacks
00:09
osgigeek
callbacks can be on task as well as taskgroup
00:09
kishoreg1
we need to have api's for that
00:09
osgigeek
does a task group have an id?
00:09
osgigeek
or only tasks have ids?
00:09
kishoreg1
yep
00:10
kishoreg1
taskgroup will have id
00:10
kishoreg1
some name
00:10
osgigeek
yes I agree
00:10
osgigeek
yes
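Callbacks at both levels, keyed by task id and task group id, mirror being able to listen to a resource as well as a partition. A minimal registry with hypothetical names, where an event on a task notifies its own listeners and then its group's:

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Listener = Callable[[str, str], None]  # called with (task_id, event_type)


class CallbackRegistry:
    """Listeners can subscribe to a single task or to a whole task group."""

    def __init__(self) -> None:
        self.group_of: Dict[str, str] = {}  # task id -> task group id
        self.task_listeners: DefaultDict[str, List[Listener]] = defaultdict(list)
        self.group_listeners: DefaultDict[str, List[Listener]] = defaultdict(list)

    def add_task(self, group_id: str, task_id: str) -> None:
        self.group_of[task_id] = group_id

    def listen_task(self, task_id: str, fn: Listener) -> None:
        self.task_listeners[task_id].append(fn)

    def listen_group(self, group_id: str, fn: Listener) -> None:
        self.group_listeners[group_id].append(fn)

    def fire(self, task_id: str, event_type: str) -> None:
        # Task-level listeners first, then listeners on the task's group.
        for fn in self.task_listeners[task_id]:
            fn(task_id, event_type)
        for fn in self.group_listeners[self.group_of[task_id]]:
            fn(task_id, event_type)
```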
00:10
osgigeek
ok I hate to do this so late but what about priority?
00:10
osgigeek
task priority?
00:10
osgigeek
or even task group priority?
00:10
kishoreg1
we will need that, good point
00:10
kishoreg1
i overlooked that
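Task and task group priority could both feed a single dispatch order. This sketch assumes lower numbers are more urgent, ranks by group priority before task priority, and breaks ties in submission order; none of these rules were decided in the meeting, and `PriorityTaskQueue` is a hypothetical name.

```python
import heapq
import itertools
from typing import List, Tuple


class PriorityTaskQueue:
    """Dispatch order: group priority, then task priority, then FIFO.

    Lower numbers mean more urgent (an assumption of this sketch).
    """

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, int, str]] = []
        self._seq = itertools.count()  # tie-breaker keeps equal priorities FIFO

    def submit(self, task_id: str, task_priority: int, group_priority: int = 0) -> None:
        entry = (group_priority, task_priority, next(self._seq), task_id)
        heapq.heappush(self._heap, entry)

    def next_task(self) -> str:
        return heapq.heappop(self._heap)[3]

    def __len__(self) -> int:
        return len(self._heap)
```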
00:11
kishoreg1
btw the task can have replicas, right
00:11
osgigeek
yes
00:11
kishoreg1
is replica the right word?
00:12
osgigeek
hmm
00:12
osgigeek
I cant think of any other right now
00:13
kishoreg1
taskredundancy ?
00:15
osgigeek
yeah the concept is that, but as a term does not fit nicely
00:15
kishoreg1
lets go with replica
00:15
osgigeek
yeah lets go with replica
00:15
kishoreg1
so state model still applies
00:15
osgigeek
btw tasks have a temporal nature to them
00:15
osgigeek
yes
00:15
kishoreg1
we can say one task is leader
00:15
osgigeek
tasks *may have a temporal nature to them
00:16
osgigeek
sure
00:16
kishoreg1
other task is in standby or something like that
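Replicas with one leader and the rest on standby give the failover osgigeek mentioned earlier. This hypothetical sketch takes the first live nodes and promotes a standby when the leader's node fails; a real placement would balance load, and the LEADER/STANDBY labels simply follow the terms used in the chat.

```python
from typing import Dict, List


def assign_replicas(live_nodes: List[str], replicas: int) -> Dict[str, str]:
    """Place `replicas` copies of a task: the first node is LEADER, the rest STANDBY."""
    if replicas > len(live_nodes):
        raise ValueError("not enough live nodes for the requested replicas")
    chosen = live_nodes[:replicas]  # naive choice; real placement would balance load
    return {node: ("LEADER" if i == 0 else "STANDBY") for i, node in enumerate(chosen)}


def handle_node_failure(assignment: Dict[str, str], failed: str) -> Dict[str, str]:
    """Drop the failed node; if it held the leader, promote one standby."""
    remaining = {n: s for n, s in assignment.items() if n != failed}
    if remaining and "LEADER" not in remaining.values():
        promoted = sorted(remaining)[0]  # deterministic pick for the sketch
        remaining[promoted] = "LEADER"
    return remaining
```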
00:16
kishoreg1
temporal?
00:16
osgigeek
ah standby is a nicer name
00:16
osgigeek
temporal = some time element
00:16
kishoreg1
yep
00:16
osgigeek
like execute this at 12 midnight
00:16
kishoreg1
yes definitely thats our first use case
00:16
osgigeek
tasks can be repetitive or one shot
00:17
osgigeek
if repetitive user defines how many times
00:18
osgigeek
how many times could be of three types: (1) repeat n times, (2) repeat every 'n' hours, or (3) repeat every 'n' hours until an end date
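The three recurrence types can share one representation: a start time and an interval, optionally bounded by a run count or an end date. `Recurrence` is a hypothetical class for illustration; leaving both bounds unset gives the unbounded "every n hours" case.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterator, Optional


@dataclass
class Recurrence:
    start: datetime
    every: timedelta
    times: Optional[int] = None        # type (1): repeat n times
    until: Optional[datetime] = None   # type (3): stop after an end date
    # Neither bound set: type (2), repeat every interval indefinitely.

    def runs(self) -> Iterator[datetime]:
        """Yield scheduled run times lazily (unbounded for type 2)."""
        n = 0
        t = self.start
        while (self.times is None or n < self.times) and (self.until is None or t <= self.until):
            yield t
            n += 1
            t += self.every
```

"Execute this at 12 midnight every day" would be `Recurrence(datetime(2014, 3, 27), timedelta(days=1))`, and the scheduler would simply pull the next value from `runs()` after each execution.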
00:19
hsaputra
have to leave early guys, will try to catch up with the chat summary, have a good night
00:20
kanakb
thanks for joining in
00:20
kanakb
good night
00:20
hsaputra has left IRC (Quit: Page closed)
00:20
kishoreg1
thanks
00:20
kishoreg1
take care
00:21
osgigeek
anyway that should be all about tasks
00:23
kanakb
ok
00:23
kanakb
should we leave admin for another day?
00:23
osgigeek
lets try to do it tomorrow
00:24
osgigeek
I will be online, whenever you guys come on we can try to tackle it
00:24
kanakb
ok
00:24
osgigeek
tomorrow evening I am thinking just to clarify
00:24
kanakb
sure
00:25
kanakb
ok, let's end here then
00:25
kanakb
thanks guys
00:25
osgigeek
ok sounds good
00:25
osgigeek
thank you
00:31
osgigeek has left ()