Posted to dev@phoenix.apache.org by Josh Elser <el...@apache.org> on 2018/02/26 21:57:07 UTC

[DISCUSS] Design for a "query log"

Hiya,

I wanted to share this little design doc with you about some feature 
work we've been thinking about. The following is a Google doc in which 
anyone should be allowed to comment. Feel free to comment there, or here 
on the thread.

https://s.apache.org/phoenix-query-log

The high-level goal is to create a construct in which Phoenix clients 
will automatically serialize information about the queries they run to a 
table for retrospective analysis. Ideally, this information would be 
stored in a Phoenix table. We want this data to help answer questions like:

* What queries are running against my system?
* What specific queries started between 5:35 AM and 6:20 AM two days ago?
* What queries is user "bob" running? (An example lookup is sketched below.)
* Are my users' queries effectively using the indexes in the system?
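
To make the "bob" question concrete, here is the sort of lookup we
imagine running, as plain JDBC against a hypothetical QUERY_LOG table
(the table and column names here are made up for illustration; nothing
is final):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;

public class QueryLogLookup {
    public static void main(String[] args) throws Exception {
        // Standard Phoenix JDBC URL; point it at your ZooKeeper quorum.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
             PreparedStatement ps = conn.prepareStatement(
                 "SELECT START_TIME, SQL_TEXT FROM QUERY_LOG "
                     + "WHERE USER_NAME = ? AND START_TIME BETWEEN ? AND ?")) {
            ps.setString(1, "bob");
            ps.setTimestamp(2, Timestamp.valueOf("2018-02-24 05:35:00"));
            ps.setTimestamp(3, Timestamp.valueOf("2018-02-24 06:20:00"));
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getTimestamp(1) + "  " + rs.getString(2));
                }
            }
        }
    }
}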

Anti-goals include:

* Capturing the cluster impact (computation/memory) of a query
* Slowing query performance to guarantee that all data is serialized
* A third-party service dedicated to ensuring query info is serialized
(in the event of client failure)

Please take a look at the document and let us know what you think. I'm
happy to try to explain this in greater detail.

- Josh (on behalf of myself, Ankit, Rajeshbabu, and Sergey)

Re: [DISCUSS] Design for a "query log"

Posted by Josh Elser <el...@apache.org>.
Yup, totally agree. Will also try to make these more clear in the doc.

We want to queue up the work to "log" and do it in the background. If 
that queue gets too full, we drop it right off that cliff :)
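
In code, the shape I have in mind is roughly the following (the class
name and the capacity are placeholders, not from the doc):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public final class QueryLogBuffer {
    // Bounded buffer between the query path and the background writer.
    private final BlockingQueue<String> pending = new ArrayBlockingQueue<>(10000);

    /** Called on the query path: never blocks, silently drops when full. */
    public void enqueue(String logRecord) {
        // offer() returns false instead of blocking when the queue is full,
        // so a slow background writer can never stall a running query.
        pending.offer(logRecord);
    }

    /** Called by the background writer thread. */
    public String take() throws InterruptedException {
        return pending.take();
    }
}

A single background thread could then drain the queue and batch up
UPSERTs to the log table.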

Thanks, Samarth!


Re: [DISCUSS] Design for a "query log"

Posted by Samarth Jain <sa...@gmail.com>.
A couple more points which I think you alluded to, Josh, but I would still
like to call out:
1) Writing these query logs to a Phoenix table should be best effort,
i.e. a query definitely shouldn't fail because we encountered an issue
while writing its log
2) Writing query logs should happen asynchronously to the flow of the
query, i.e. a query shouldn't incur the cost of the write to the query
log table

Doing 2) will help out with 1)
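
As a toy sketch of 1) and 2) together (the class name is made up): the
write is handed to a background thread and any failure is swallowed, so
a query can neither slow down nor fail because of logging:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public final class BestEffortLogWriter {
    // A single background thread keeps log writes off the query's flow.
    private final ExecutorService writer = Executors.newSingleThreadExecutor();

    public void logAsync(Runnable writeToQueryLogTable) {
        writer.submit(() -> {
            try {
                writeToQueryLogTable.run();
            } catch (Exception e) {
                // Best effort: drop the record; never propagate to the query.
            }
        });
    }
}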






Re: [DISCUSS] Design for a "query log"

Posted by Josh Elser <el...@apache.org>.
Thanks Nick and Andrew! These are great points.

* A TTL out of the box is a must. That's such a good suggestion.
* Sensitivity of the data being stored is also a tricky, serious issue to
consider. We'll want to lock the table down and be able to state very
clearly what data may show up in it.
* I like the "levels" of detail that will be persisted. It will help
break up the development work (e.g. the first impl can just be the INFO
details) and reduces concerns about runtime impact.
* Sampling is a no-brainer for "always-on" situations. I like that too.

I'll work on taking these (and others) and updating the gdoc tonight. 
Thanks again for your feedback!


Re: [DISCUSS] Design for a "query log"

Posted by Andrew Purtell <ap...@apache.org>.
Agree with Nick's points, but let me augment with an additional
suggestion: a tunable/configurable threshold for sampling. In many cases
sampling e.g. 1% of queries gives sufficient coverage, and this would
prune 99% of the actual load from the query log.
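
For illustration, the per-query sampling decision could be as small as
this sketch (the class is hypothetical; the rate would presumably come
from some Phoenix config property):

import java.util.concurrent.ThreadLocalRandom;

public final class QueryLogSampler {
    private final double rate; // e.g. 0.01 to log roughly 1% of queries

    public QueryLogSampler(double rate) {
        this.rate = rate;
    }

    /** Cheap per-query coin flip; a rate >= 1.0 logs everything. */
    public boolean shouldLog() {
        return ThreadLocalRandom.current().nextDouble() < rate;
    }
}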

Also let me underline that compliance requirements will demand either
super-strong controls on the query log if everything is always logged, in
which case it is important that it works well with access control
features to lock it down; or, better, what Nick suggests, where we can
turn off things like logging the values supplied for bound parameters.






-- 
Best regards,
Andrew

Words like orphans lost among the crosstalk, meaning torn from truth's
decrepit hands
   - A23, Crosstalk

Re: [DISCUSS] Design for a "query log"

Posted by Nick Dimiduk <nd...@gmail.com>.
I'm a big fan of this idea. There was a brief discussion on the topic over
on PHOENIX-2715.

My first concern is that the collected information is huge -- easily far
larger than the user data for a busy cluster. For instance, take a couple
tens of GB of stored user data, guideposts set to the default 100 MB, and
salting enabled on a table with an "innocent" value of 10 or 20, and the
collection of RPCs can easily grow into the hundreds for simple queries.
Even if you catalog just the "logical" RPCs - the HBase Client API calls
that Phoenix plans rather than the underlying HBase Client RPCs - this
will be quite large. The guideposts themselves for such a table would be
on the order of 30 MB.

My next concern is about the sensitive query parameters being stored. It's
entirely reasonable to expect a table to store sensitive information that
should not be exposed to operations.

Thus, my suggestions:
* minimize the unbounded nature of this table by truncating all columns
to some max length -- perhaps 5k or 10k.
* enable a default TTL on the schema. 7 days seems like a good starting
point.
* consider controlling which columns are populated via some operational
mechanism. Use Logger level as an example, with INFO the default setting.
Which data is stored at this level? Then at DEBUG, then TRACE. Maybe
timestamp, SQL, and explain are at INFO. DEBUG adds bound parameters and
scan metrics. TRACE adds RPCs, timings, and snapshot metadata. (A rough
sketch of this follows below.)
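
A rough sketch of how the truncation and level-gating suggestions could
combine (the class, column names, and the level-to-column mapping are all
invented for illustration; the table's DDL could additionally carry an
HBase TTL property, e.g. TTL=604800 for the 7-day expiry):

import java.util.Arrays;
import java.util.EnumMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public final class QueryLogColumns {
    public enum Level { INFO, DEBUG, TRACE }

    // Cap every column so a single record can't grow without bound.
    private static final int MAX_COL_LEN = 10000;

    private static final Map<Level, List<String>> COLUMNS_BY_LEVEL =
        new EnumMap<>(Level.class);
    static {
        COLUMNS_BY_LEVEL.put(Level.INFO,
            Arrays.asList("START_TIME", "USER_NAME", "SQL_TEXT", "EXPLAIN_PLAN"));
        COLUMNS_BY_LEVEL.put(Level.DEBUG,
            Arrays.asList("BIND_PARAMS", "SCAN_METRICS"));
        COLUMNS_BY_LEVEL.put(Level.TRACE,
            Arrays.asList("RPC_CALLS", "RPC_TIMINGS", "SNAPSHOT_METADATA"));
    }

    /** Keep only the columns allowed at the configured level, truncated. */
    public static Map<String, String> filter(Level configured,
                                             Map<String, String> collected) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (Level l : Level.values()) {
            if (l.ordinal() > configured.ordinal()) {
                break; // higher levels add columns; stop once past the config
            }
            for (String col : COLUMNS_BY_LEVEL.get(l)) {
                String v = collected.get(col);
                if (v != null) {
                    kept.put(col, v.length() > MAX_COL_LEN
                        ? v.substring(0, MAX_COL_LEN) : v);
                }
            }
        }
        return kept;
    }
}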

Thanks,
Nick


Re: [DISCUSS] Design for a "query log"

Posted by Josh Elser <el...@apache.org>.
My gut reaction would be to avoid storing queries against the system
tables, as they would add more noise than value, but I think this is
something that could be entertained!

On 3/2/18 8:02 AM, Artem Ervits wrote:
> Is the idea here to collect information on all types of queries, even
> against system tables?

Re: [DISCUSS] Design for a "query log"

Posted by Artem Ervits <ar...@gmail.com>.
+1

Is the idea here to collect information on all types of queries, even
against system tables? It would be nice to keep counts of how many times
a query was executed over time. In my MSSQL days we also had the ability
to check when update stats last ran, plus all kinds of impact information
from that.
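
If the log lands in a Phoenix table, execution counts would just be an
aggregate over it. A hypothetical example over an assumed QUERY_LOG table
(table and column names invented):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class QueryLogCounts {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
             Statement stmt = conn.createStatement();
             // Most frequently executed statements first.
             ResultSet rs = stmt.executeQuery(
                 "SELECT SQL_TEXT, COUNT(*) FROM QUERY_LOG "
                     + "GROUP BY SQL_TEXT ORDER BY COUNT(*) DESC")) {
            while (rs.next()) {
                System.out.println(rs.getLong(2) + "  " + rs.getString(1));
            }
        }
    }
}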


Re: [DISCUSS] Design for a "query log"

Posted by "김영우 (YoungWoo Kim)" <yw...@apache.org>.
Hi Josh,

Thanks for starting this discussion. Overall the design looks good to me!

Actually, I have an internal implementation for logging user queries to a
text file via PQS. Let me explain our use cases. We wanted to find out
the following facts using our ugly hacks:
- Which queries are running against PQS
- Which queries are "WRONG"

On our systems and applications, I've realized that it's hard to control
unexpected large scans, full scans, or queries running without indexes
against very large tables on HBase/Phoenix. Sometimes users forget to add
the "WHERE" clause to their queries :-) And we could not find an
effective way to stop or cancel the wrong queries immediately. Therefore,
we had to find the users and heavy queries in the query logs and then
notify those users about the impact of their queries -- please try this
at home, not on a production cluster :-)

Definitely it would be nice to have query logging out of the box! And if
the features are implemented as in your proposal, we can replace our ugly
hacks with the built-in feature on our cluster.

Thanks,
- Youngwoo




Re: [DISCUSS] Design for a "query log"

Posted by Josh Elser <el...@apache.org>.
Any feedback from folks? Not sure if the silence should be interpreted
as ambivalence or plain old being busy :)
