Posted to dev@drill.apache.org by "Ramana Inukonda Nagaraj (JIRA)" <ji...@apache.org> on 2015/03/25 01:22:52 UTC

[jira] [Created] (DRILL-2550) Drillbit disconnect from ZK results in drillbit being lost until restart

Ramana Inukonda Nagaraj created DRILL-2550:
----------------------------------------------

             Summary: Drillbit disconnect from ZK results in drillbit being lost until restart
                 Key: DRILL-2550
                 URL: https://issues.apache.org/jira/browse/DRILL-2550
             Project: Apache Drill
          Issue Type: Bug
          Components: Execution - Flow
            Reporter: Ramana Inukonda Nagaraj
            Assignee: Chris Westin
            Priority: Minor


Not quite sure if this is an issue or even if it's important; maybe someone can think of a situation where this might be a bigger issue.

Steps taken to recreate:
1. Start up drillbits on multiple nodes. (They all come up and form an 8-node cluster.)
2. Start executing a long-running query.
3. Use tcpkill to kill all connections between one node and ZooKeeper on port 5181.

Drill seems to behave very gracefully here. I see a nice error message saying:
Query failed: ForemanException: One more more nodes lost connectivity during query. Identified node was atsqa6c61.qa.lab

However, once I start allowing connections again, the node is not brought back into the cluster until the drillbit is restarted.
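
To illustrate the behavior described above, here is a minimal in-memory sketch. It assumes that a node's cluster membership is an entry the coordinator drops once the node's connection is lost, and that the entry only comes back if the node registers again. The report suggests this but does not confirm it. MembershipRegistry and NodeAgent are hypothetical names for illustration, not Drill or ZooKeeper classes.

```java
import java.util.Set;
import java.util.TreeSet;

// In-memory stand-in for the coordinator's membership list
// (hypothetical; not Drill's or ZooKeeper's actual API).
class MembershipRegistry {
    private final Set<String> liveNodes = new TreeSet<>();

    void register(String node) {
        liveNodes.add(node);
    }

    // Models the coordinator dropping a node's entry after its connection is lost.
    void expireSession(String node) {
        liveNodes.remove(node);
    }

    // Returns a copy so callers cannot modify the registry's state.
    Set<String> liveNodes() {
        return new TreeSet<>(liveNodes);
    }
}

// Hypothetical node-side agent. It registers once at startup and, if
// configured to, registers again whenever connectivity returns.
class NodeAgent {
    private final String name;
    private final MembershipRegistry registry;
    private final boolean reRegisterOnReconnect;

    NodeAgent(String name, MembershipRegistry registry, boolean reRegisterOnReconnect) {
        this.name = name;
        this.registry = registry;
        this.reRegisterOnReconnect = reRegisterOnReconnect;
        registry.register(name); // startup registration
    }

    // Called when the connection to the coordinator is re-established.
    void onReconnected() {
        if (reRegisterOnReconnect) {
            registry.register(name);
        }
    }
}
```

In this model, an agent created with reRegisterOnReconnect set to false stays out of the membership list after a disconnect until a new agent is constructed, which corresponds to a restart. An agent created with the flag set to true rejoins as soon as onReconnected() is called.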



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Re: [jira] [Created] (DRILL-2550) Drillbit disconnect from ZK results in drillbit being lost until restart

Posted by Hanifi Gunes <hg...@maprtech.com>.

> If we're going to go in that direction, it'd be nice to do it in a way that
> allows not only for a node to rejoin the cluster, but for new nodes to be
> added while it is already running. Seems like it shouldn't be so different.

Agreed. This would give us the ability to scale up and down elastically,
allowing online node updates without the service going down.

> That's still a ways off from recovering lost work done for queries with
> fragments running on a node that goes down, though.

I can see some trivial cases in which we only need to re-assign the
computation to another node and do not need to replay the data from
upstream, such as when a leaf fragment fails. It gets more complicated
when a fragment's lineage comes into the picture.
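
The trivial case can be sketched roughly as follows. This is only an illustration under the assumption that a leaf fragment has no upstream input and can therefore simply be run again on another node. FragmentRecovery, Fragment, and Plan are hypothetical names, not Drill classes, and the round-robin placement is an arbitrary choice.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical planner that decides what to do with the fragments
// that were running on a node that went down.
class FragmentRecovery {

    // Minimal description of a fragment: an id and whether it is a leaf
    // (assumed here to mean it has no upstream fragments feeding it).
    static final class Fragment {
        final String id;
        final boolean leaf;

        Fragment(String id, boolean leaf) {
            this.id = id;
            this.leaf = leaf;
        }
    }

    static final class Plan {
        // fragment id -> surviving node that should re-run it
        final Map<String, String> reassigned = new LinkedHashMap<>();
        // fragments that cannot be recovered by re-assignment alone
        final List<String> needsLineageReplay = new ArrayList<>();
    }

    static Plan planRecovery(List<Fragment> lostFragments, List<String> survivors) {
        if (survivors.isEmpty()) {
            throw new IllegalArgumentException("no surviving nodes");
        }
        Plan plan = new Plan();
        int next = 0;
        for (Fragment fragment : lostFragments) {
            if (fragment.leaf) {
                // Trivial case: spread lost leaf fragments across survivors round-robin.
                plan.reassigned.put(fragment.id, survivors.get(next % survivors.size()));
                next++;
            } else {
                // Harder case: upstream data would have to be replayed.
                plan.needsLineageReplay.add(fragment.id);
            }
        }
        return plan;
    }
}
```

Everything that ends up in needsLineageReplay is the part that still needs real design work.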

Good food for thought.

-Hanifi


Re: [jira] [Created] (DRILL-2550) Drillbit disconnect from ZK results in drillbit being lost until restart

Posted by Chris Westin <cw...@yahoo.com>.

If we're going to go in that direction, it'd be nice to do it in a way that
allows not only for a node to rejoin the cluster, but for new nodes to be
added while it is already running. Seems like it shouldn't be so different.

That's still a ways off from recovering lost work done for queries with
fragments running on a node that goes down, though.


Re: [jira] [Created] (DRILL-2550) Drillbit disconnect from ZK results in drillbit being lost until restart

Posted by Hanifi Gunes <hg...@maprtech.com>.

I doubt that we do periodic cluster discovery. The current model seems to
rely on a static set of nodes that is queried while bootstrapping. Also, the
"graceful" behavior described above basically tells us that we are not
fault-tolerant. Dynamic cluster discovery is fairly easy to implement via
watchers or the like. However, making Drill fault-tolerant, which is quite
important for long-running queries, needs some serious discussion. As we
mature, I hope to see more discussion of these issues, because failures do
happen.
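
As a rough sketch of what watcher-based discovery could look like, here is an in-memory model. It is not ZooKeeper's API: WatchedMembership and ClusterView are hypothetical names, and a real implementation would also have to deal with watch re-registration and connection loss, which this leaves out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Consumer;

// In-memory stand-in for a coordinator that notifies watchers whenever
// the member set changes (hypothetical; not ZooKeeper's actual API).
class WatchedMembership {
    private final Set<String> members = new TreeSet<>();
    private final List<Consumer<Set<String>>> watchers = new ArrayList<>();

    // Registers a watcher and immediately hands it the current member set.
    void watch(Consumer<Set<String>> watcher) {
        watchers.add(watcher);
        watcher.accept(snapshot());
    }

    void join(String node) {
        if (members.add(node)) {
            notifyWatchers();
        }
    }

    void leave(String node) {
        if (members.remove(node)) {
            notifyWatchers();
        }
    }

    private Set<String> snapshot() {
        return new TreeSet<>(members);
    }

    private void notifyWatchers() {
        for (Consumer<Set<String>> watcher : watchers) {
            watcher.accept(snapshot());
        }
    }
}

// A node's view of the cluster. Instead of reading the member set once
// while bootstrapping, it replaces its view on every notification.
class ClusterView {
    private volatile Set<String> nodes = new TreeSet<>();

    ClusterView(WatchedMembership membership) {
        membership.watch(current -> nodes = current);
    }

    Set<String> nodes() {
        return nodes;
    }
}
```

With this shape, a node that leaves and later joins again, or a brand-new node that joins a running cluster, shows up in every ClusterView without anything being restarted.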

Regards.
-Hanifi
