You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by Sergey Shelukhin <se...@hortonworks.com> on 2013/08/21 23:28:43 UTC

LIKE filter pushdown for tables and partitions

Hi.

I think there are issues with the way hive can currently do LIKE
operator JDO pushdown and it the code should be removed for partitions
and tables.
Are there objections to removing LIKE from Filter.g and related areas?
If no I will file a JIRA and do it.

Details:
There's code in metastore that is capable of pushing down LIKE
expression into JDO for string partition keys, as well as tables.
The code for tables doesn't appear used, and partition code definitely
doesn't run in Hive proper because metastore client doesn't send LIKE
expressions to server. It may be used in e.g. HCat and other places,
but after asking some people here, I found out it probably isn't.
I was trying to make it run and noticed some problems:
1) For partitions, Hive sends SQL patterns in a filter for like, e.g.
"%foo%", whereas metastore passes them into matches() JDOQL method
which expects Java regex.
2) Converting the pattern to Java regex via UDFLike method, I found
out that not all regexes appear to work in DN. ".*foo" seems to work
but anything complex (such as escaping the pattern using
Pattern.quote, which UDFLike does) breaks and no longer matches
properly.
3) I tried to implement common cases using JDO methods
startsWith/endsWith/indexOf (I will file a JIRA), but when I run tests
on Derby, they also appear to have problems with some strings (for
example, partition with backslash in the name cannot be matched by
LIKE "%\%" (single backslash in a string), after being converted to
.indexOf(param) where param is "\" (escaping the backslash once again
doesn't work either, and anyway there's no documented reason why it
shouldn't work properly), while other characters match correctly, even
e.g. "%".

For tables, there's no SQL-like, it expects Java regex, but I am not
convinced all Java regexes are going to work.

So, I think that for future correctness sake it's better to remove this code.

-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

Re: LIKE filter pushdown for tables and partitions

Posted by Stephen Sprague <sp...@gmail.com>.
Thanks Sergey for that.  Good stuff.

I can't speak for everybody here obviously but Hive partition elimination
is critical - its gotta happen somehow.  However, if JDOQL method isn't
robust around the edges i'm fine with finding something better.

So if I get you right you're saying by removing the "optimized path"
(getPartitionsByFiltr/JDOQL) the partition elimination logic will default
to the "normal path" which is some other kind of filtering.  To that i
guess i'd have to say what's the risk? It's a little slower?

Thanks for your patience, Sergey!






On Tue, Aug 27, 2013 at 10:35 AM, Sergey Shelukhin
<se...@hortonworks.com>wrote:

> This method is used to prune partitions for the job (separately from
> actually processing data).
> There are a few ways to get partitions from Hive for a query (to avoid
> reading all partitions when filtering involves partition columns)  -
> get-by-filter that I want to modify is one of them. Hive itself uses it as
> a perf optimization; the normal path gets all partition column values (via
> partition names) and applies the filter locally, whereas the optimized path
> converts the filter to JDOQL for DataNucleus (that Hive metastore uses
> internally), which converts it to SQL queries for e.g. MySQL. This normally
> happens before MR job is even run.
>
> Hive uses the latter (JDOQL pushdown) path for a restricted set of filters.
> These are enforced in Hive metastore client, not server; the server
> supports a wider set of filters, but Hive itself doesn't use them. While
> trying to enable Hive to use a wider set I noticed that the LIKE filter
> doesn't work properly - both regex and indexOf/... functions in DN seem to
> have some weird edge cases. It may be sending some things directly to
> datastore which would not actually work.
> However they would work for simple regexes (definition of simple is not
> clear and may be not the same for all datastores).
>
> Given that there's normal path to filter partitions in hive client and
> pre-job perf optimization for like is not that important, I want to remove
> this for Hive,
> I assume that other products using this path must apply filtering on client
> too sometimes (because getPartitionsByFilter doesn't support all filters
> even on server, e.g. such  operators as not, between, etc.).
>
> On Tue, Aug 27, 2013 at 9:13 AM, Stephen Sprague <sp...@gmail.com>
> wrote:
>
> > sorry to be dumb-ass but what does that translate into in the HSQL
> dialect?
> >
> > Judging from the name you use, getPartitionsByFilter, you're saying you
> > want to remove the use case of using like clause on a partition column?
> >
> > if so, um, yeah, i would think that's surely used.
> >
> >
> >
> > On Mon, Aug 26, 2013 at 7:48 PM, Sergey Shelukhin <
> sergey@hortonworks.com
> > >wrote:
> >
> > > Adding user list. Any objections to removing LIKE support from
> > > getPartitionsByFilter?
> > >
> > > On Mon, Aug 26, 2013 at 2:54 PM, Ashutosh Chauhan <
> hashutosh@apache.org
> > > >wrote:
> > >
> > > > Couple of questions:
> > > >
> > > > 1. What about LIKE operator for Hive itself? Will that continue to
> work
> > > > (presumably because there is an alternative path for that).
> > > > 2. This will nonetheless break other direct consumers of metastore
> > client
> > > > api (like HCatalog).
> > > >
> > > > I see your point that we have a buggy implementation, so whats out
> > there
> > > is
> > > > not safe to use. Question than really is shall we remove this code,
> > > thereby
> > > > breaking people for whom current buggy implementation is good enough
> > (or
> > > > you can say salvage them from breaking in future). Or shall we try to
> > fix
> > > > it now?
> > > > My take is if there are no users of this anyways, then there is no
> > point
> > > > fixing it for non-existing users, but if there are we probably have
> > to. I
> > > > will suggest you to send an email to users@hive to ask if there are
> > > users
> > > > for this.
> > > >
> > > > Thanks,
> > > > Ashutosh
> > > >
> > > >
> > > >
> > > > On Mon, Aug 26, 2013 at 2:08 PM, Sergey Shelukhin <
> > > sergey@hortonworks.com
> > > > >wrote:
> > > >
> > > > > Since there's no response I am assuming nobody cares about this
> > code...
> > > > > Jira is HIVE-5134, I will attach a patch with removal this week.
> > > > >
> > > > > On Wed, Aug 21, 2013 at 2:28 PM, Sergey Shelukhin <
> > > > sergey@hortonworks.com
> > > > > >wrote:
> > > > >
> > > > > > Hi.
> > > > > >
> > > > > > I think there are issues with the way hive can currently do LIKE
> > > > > > operator JDO pushdown and it the code should be removed for
> > > partitions
> > > > > > and tables.
> > > > > > Are there objections to removing LIKE from Filter.g and related
> > > areas?
> > > > > > If no I will file a JIRA and do it.
> > > > > >
> > > > > > Details:
> > > > > > There's code in metastore that is capable of pushing down LIKE
> > > > > > expression into JDO for string partition keys, as well as tables.
> > > > > > The code for tables doesn't appear used, and partition code
> > > definitely
> > > > > > doesn't run in Hive proper because metastore client doesn't send
> > LIKE
> > > > > > expressions to server. It may be used in e.g. HCat and other
> > places,
> > > > > > but after asking some people here, I found out it probably isn't.
> > > > > > I was trying to make it run and noticed some problems:
> > > > > > 1) For partitions, Hive sends SQL patterns in a filter for like,
> > e.g.
> > > > > > "%foo%", whereas metastore passes them into matches() JDOQL
> method
> > > > > > which expects Java regex.
> > > > > > 2) Converting the pattern to Java regex via UDFLike method, I
> found
> > > > > > out that not all regexes appear to work in DN. ".*foo" seems to
> > work
> > > > > > but anything complex (such as escaping the pattern using
> > > > > > Pattern.quote, which UDFLike does) breaks and no longer matches
> > > > > > properly.
> > > > > > 3) I tried to implement common cases using JDO methods
> > > > > > startsWith/endsWith/indexOf (I will file a JIRA), but when I run
> > > tests
> > > > > > on Derby, they also appear to have problems with some strings
> (for
> > > > > > example, partition with backslash in the name cannot be matched
> by
> > > > > > LIKE "%\%" (single backslash in a string), after being converted
> to
> > > > > > .indexOf(param) where param is "\" (escaping the backslash once
> > again
> > > > > > doesn't work either, and anyway there's no documented reason why
> it
> > > > > > shouldn't work properly), while other characters match correctly,
> > > even
> > > > > > e.g. "%".
> > > > > >
> > > > > > For tables, there's no SQL-like, it expects Java regex, but I am
> > not
> > > > > > convinced all Java regexes are going to work.
> > > > > >
> > > > > > So, I think that for future correctness sake it's better to
> remove
> > > this
> > > > > > code.
> > > > > >
> > > > >
> > > > > --
> > > > > CONFIDENTIALITY NOTICE
> > > > > NOTICE: This message is intended for the use of the individual or
> > > entity
> > > > to
> > > > > which it is addressed and may contain information that is
> > confidential,
> > > > > privileged and exempt from disclosure under applicable law. If the
> > > reader
> > > > > of this message is not the intended recipient, you are hereby
> > notified
> > > > that
> > > > > any printing, copying, dissemination, distribution, disclosure or
> > > > > forwarding of this communication is strictly prohibited. If you
> have
> > > > > received this communication in error, please contact the sender
> > > > immediately
> > > > > and delete it from your system. Thank You.
> > > > >
> > > >
> > >
> > > --
> > > CONFIDENTIALITY NOTICE
> > > NOTICE: This message is intended for the use of the individual or
> entity
> > to
> > > which it is addressed and may contain information that is confidential,
> > > privileged and exempt from disclosure under applicable law. If the
> reader
> > > of this message is not the intended recipient, you are hereby notified
> > that
> > > any printing, copying, dissemination, distribution, disclosure or
> > > forwarding of this communication is strictly prohibited. If you have
> > > received this communication in error, please contact the sender
> > immediately
> > > and delete it from your system. Thank You.
> > >
> >
>
> --
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity to
> which it is addressed and may contain information that is confidential,
> privileged and exempt from disclosure under applicable law. If the reader
> of this message is not the intended recipient, you are hereby notified that
> any printing, copying, dissemination, distribution, disclosure or
> forwarding of this communication is strictly prohibited. If you have
> received this communication in error, please contact the sender immediately
> and delete it from your system. Thank You.
>

Re: LIKE filter pushdown for tables and partitions

Posted by Sergey Shelukhin <se...@hortonworks.com>.
This method is used to prune partitions for the job (separately from
actually processing data).
There are a few ways to get partitions from Hive for a query (to avoid
reading all partitions when filtering involves partition columns)  -
get-by-filter that I want to modify is one of them. Hive itself uses it as
a perf optimization; the normal path gets all partition column values (via
partition names) and applies the filter locally, whereas the optimized path
converts the filter to JDOQL for DataNucleus (that Hive metastore uses
internally), which converts it to SQL queries for e.g. MySQL. This normally
happens before MR job is even run.

Hive uses the latter (JDOQL pushdown) path for a restricted set of filters.
These are enforced in Hive metastore client, not server; the server
supports a wider set of filters, but Hive itself doesn't use them. While
trying to enable Hive to use a wider set I noticed that the LIKE filter
doesn't work properly - both regex and indexOf/... functions in DN seem to
have some weird edge cases. It may be sending some things directly to
datastore which would not actually work.
However they would work for simple regexes (definition of simple is not
clear and may be not the same for all datastores).

Given that there's normal path to filter partitions in hive client and
pre-job perf optimization for like is not that important, I want to remove
this for Hive,
I assume that other products using this path must apply filtering on client
too sometimes (because getPartitionsByFilter doesn't support all filters
even on server, e.g. such  operators as not, between, etc.).

On Tue, Aug 27, 2013 at 9:13 AM, Stephen Sprague <sp...@gmail.com> wrote:

> sorry to be dumb-ass but what does that translate into in the HSQL dialect?
>
> Judging from the name you use, getPartitionsByFilter, you're saying you
> want to remove the use case of using like clause on a partition column?
>
> if so, um, yeah, i would think that's surely used.
>
>
>
> On Mon, Aug 26, 2013 at 7:48 PM, Sergey Shelukhin <sergey@hortonworks.com
> >wrote:
>
> > Adding user list. Any objections to removing LIKE support from
> > getPartitionsByFilter?
> >
> > On Mon, Aug 26, 2013 at 2:54 PM, Ashutosh Chauhan <hashutosh@apache.org
> > >wrote:
> >
> > > Couple of questions:
> > >
> > > 1. What about LIKE operator for Hive itself? Will that continue to work
> > > (presumably because there is an alternative path for that).
> > > 2. This will nonetheless break other direct consumers of metastore
> client
> > > api (like HCatalog).
> > >
> > > I see your point that we have a buggy implementation, so whats out
> there
> > is
> > > not safe to use. Question than really is shall we remove this code,
> > thereby
> > > breaking people for whom current buggy implementation is good enough
> (or
> > > you can say salvage them from breaking in future). Or shall we try to
> fix
> > > it now?
> > > My take is if there are no users of this anyways, then there is no
> point
> > > fixing it for non-existing users, but if there are we probably have
> to. I
> > > will suggest you to send an email to users@hive to ask if there are
> > users
> > > for this.
> > >
> > > Thanks,
> > > Ashutosh
> > >
> > >
> > >
> > > On Mon, Aug 26, 2013 at 2:08 PM, Sergey Shelukhin <
> > sergey@hortonworks.com
> > > >wrote:
> > >
> > > > Since there's no response I am assuming nobody cares about this
> code...
> > > > Jira is HIVE-5134, I will attach a patch with removal this week.
> > > >
> > > > On Wed, Aug 21, 2013 at 2:28 PM, Sergey Shelukhin <
> > > sergey@hortonworks.com
> > > > >wrote:
> > > >
> > > > > Hi.
> > > > >
> > > > > I think there are issues with the way hive can currently do LIKE
> > > > > operator JDO pushdown and it the code should be removed for
> > partitions
> > > > > and tables.
> > > > > Are there objections to removing LIKE from Filter.g and related
> > areas?
> > > > > If no I will file a JIRA and do it.
> > > > >
> > > > > Details:
> > > > > There's code in metastore that is capable of pushing down LIKE
> > > > > expression into JDO for string partition keys, as well as tables.
> > > > > The code for tables doesn't appear used, and partition code
> > definitely
> > > > > doesn't run in Hive proper because metastore client doesn't send
> LIKE
> > > > > expressions to server. It may be used in e.g. HCat and other
> places,
> > > > > but after asking some people here, I found out it probably isn't.
> > > > > I was trying to make it run and noticed some problems:
> > > > > 1) For partitions, Hive sends SQL patterns in a filter for like,
> e.g.
> > > > > "%foo%", whereas metastore passes them into matches() JDOQL method
> > > > > which expects Java regex.
> > > > > 2) Converting the pattern to Java regex via UDFLike method, I found
> > > > > out that not all regexes appear to work in DN. ".*foo" seems to
> work
> > > > > but anything complex (such as escaping the pattern using
> > > > > Pattern.quote, which UDFLike does) breaks and no longer matches
> > > > > properly.
> > > > > 3) I tried to implement common cases using JDO methods
> > > > > startsWith/endsWith/indexOf (I will file a JIRA), but when I run
> > tests
> > > > > on Derby, they also appear to have problems with some strings (for
> > > > > example, partition with backslash in the name cannot be matched by
> > > > > LIKE "%\%" (single backslash in a string), after being converted to
> > > > > .indexOf(param) where param is "\" (escaping the backslash once
> again
> > > > > doesn't work either, and anyway there's no documented reason why it
> > > > > shouldn't work properly), while other characters match correctly,
> > even
> > > > > e.g. "%".
> > > > >
> > > > > For tables, there's no SQL-like, it expects Java regex, but I am
> not
> > > > > convinced all Java regexes are going to work.
> > > > >
> > > > > So, I think that for future correctness sake it's better to remove
> > this
> > > > > code.
> > > > >
> > > >
> > > > --
> > > > CONFIDENTIALITY NOTICE
> > > > NOTICE: This message is intended for the use of the individual or
> > entity
> > > to
> > > > which it is addressed and may contain information that is
> confidential,
> > > > privileged and exempt from disclosure under applicable law. If the
> > reader
> > > > of this message is not the intended recipient, you are hereby
> notified
> > > that
> > > > any printing, copying, dissemination, distribution, disclosure or
> > > > forwarding of this communication is strictly prohibited. If you have
> > > > received this communication in error, please contact the sender
> > > immediately
> > > > and delete it from your system. Thank You.
> > > >
> > >
> >
> > --
> > CONFIDENTIALITY NOTICE
> > NOTICE: This message is intended for the use of the individual or entity
> to
> > which it is addressed and may contain information that is confidential,
> > privileged and exempt from disclosure under applicable law. If the reader
> > of this message is not the intended recipient, you are hereby notified
> that
> > any printing, copying, dissemination, distribution, disclosure or
> > forwarding of this communication is strictly prohibited. If you have
> > received this communication in error, please contact the sender
> immediately
> > and delete it from your system. Thank You.
> >
>

-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

Re: LIKE filter pushdown for tables and partitions

Posted by Sergey Shelukhin <se...@hortonworks.com>.
This method is used to prune partitions for the job (separately from
actually processing data).
There are a few ways to get partitions from Hive for a query (to avoid
reading all partitions when filtering involves partition columns)  -
get-by-filter that I want to modify is one of them. Hive itself uses it as
a perf optimization; the normal path gets all partition column values (via
partition names) and applies the filter locally, whereas the optimized path
converts the filter to JDOQL for DataNucleus (that Hive metastore uses
internally), which converts it to SQL queries for e.g. MySQL. This normally
happens before MR job is even run.

Hive uses the latter (JDOQL pushdown) path for a restricted set of filters.
These are enforced in Hive metastore client, not server; the server
supports a wider set of filters, but Hive itself doesn't use them. While
trying to enable Hive to use a wider set I noticed that the LIKE filter
doesn't work properly - both regex and indexOf/... functions in DN seem to
have some weird edge cases. It may be sending some things directly to
datastore which would not actually work.
However they would work for simple regexes (definition of simple is not
clear and may be not the same for all datastores).

Given that there's normal path to filter partitions in hive client and
pre-job perf optimization for like is not that important, I want to remove
this for Hive,
I assume that other products using this path must apply filtering on client
too sometimes (because getPartitionsByFilter doesn't support all filters
even on server, e.g. such  operators as not, between, etc.).

On Tue, Aug 27, 2013 at 9:13 AM, Stephen Sprague <sp...@gmail.com> wrote:

> sorry to be dumb-ass but what does that translate into in the HSQL dialect?
>
> Judging from the name you use, getPartitionsByFilter, you're saying you
> want to remove the use case of using like clause on a partition column?
>
> if so, um, yeah, i would think that's surely used.
>
>
>
> On Mon, Aug 26, 2013 at 7:48 PM, Sergey Shelukhin <sergey@hortonworks.com
> >wrote:
>
> > Adding user list. Any objections to removing LIKE support from
> > getPartitionsByFilter?
> >
> > On Mon, Aug 26, 2013 at 2:54 PM, Ashutosh Chauhan <hashutosh@apache.org
> > >wrote:
> >
> > > Couple of questions:
> > >
> > > 1. What about LIKE operator for Hive itself? Will that continue to work
> > > (presumably because there is an alternative path for that).
> > > 2. This will nonetheless break other direct consumers of metastore
> client
> > > api (like HCatalog).
> > >
> > > I see your point that we have a buggy implementation, so whats out
> there
> > is
> > > not safe to use. Question than really is shall we remove this code,
> > thereby
> > > breaking people for whom current buggy implementation is good enough
> (or
> > > you can say salvage them from breaking in future). Or shall we try to
> fix
> > > it now?
> > > My take is if there are no users of this anyways, then there is no
> point
> > > fixing it for non-existing users, but if there are we probably have
> to. I
> > > will suggest you to send an email to users@hive to ask if there are
> > users
> > > for this.
> > >
> > > Thanks,
> > > Ashutosh
> > >
> > >
> > >
> > > On Mon, Aug 26, 2013 at 2:08 PM, Sergey Shelukhin <
> > sergey@hortonworks.com
> > > >wrote:
> > >
> > > > Since there's no response I am assuming nobody cares about this
> code...
> > > > Jira is HIVE-5134, I will attach a patch with removal this week.
> > > >
> > > > On Wed, Aug 21, 2013 at 2:28 PM, Sergey Shelukhin <
> > > sergey@hortonworks.com
> > > > >wrote:
> > > >
> > > > > Hi.
> > > > >
> > > > > I think there are issues with the way hive can currently do LIKE
> > > > > operator JDO pushdown and it the code should be removed for
> > partitions
> > > > > and tables.
> > > > > Are there objections to removing LIKE from Filter.g and related
> > areas?
> > > > > If no I will file a JIRA and do it.
> > > > >
> > > > > Details:
> > > > > There's code in metastore that is capable of pushing down LIKE
> > > > > expression into JDO for string partition keys, as well as tables.
> > > > > The code for tables doesn't appear used, and partition code
> > definitely
> > > > > doesn't run in Hive proper because metastore client doesn't send
> LIKE
> > > > > expressions to server. It may be used in e.g. HCat and other
> places,
> > > > > but after asking some people here, I found out it probably isn't.
> > > > > I was trying to make it run and noticed some problems:
> > > > > 1) For partitions, Hive sends SQL patterns in a filter for like,
> e.g.
> > > > > "%foo%", whereas metastore passes them into matches() JDOQL method
> > > > > which expects Java regex.
> > > > > 2) Converting the pattern to Java regex via UDFLike method, I found
> > > > > out that not all regexes appear to work in DN. ".*foo" seems to
> work
> > > > > but anything complex (such as escaping the pattern using
> > > > > Pattern.quote, which UDFLike does) breaks and no longer matches
> > > > > properly.
> > > > > 3) I tried to implement common cases using JDO methods
> > > > > startsWith/endsWith/indexOf (I will file a JIRA), but when I run
> > tests
> > > > > on Derby, they also appear to have problems with some strings (for
> > > > > example, partition with backslash in the name cannot be matched by
> > > > > LIKE "%\%" (single backslash in a string), after being converted to
> > > > > .indexOf(param) where param is "\" (escaping the backslash once
> again
> > > > > doesn't work either, and anyway there's no documented reason why it
> > > > > shouldn't work properly), while other characters match correctly,
> > even
> > > > > e.g. "%".
> > > > >
> > > > > For tables, there's no SQL-like, it expects Java regex, but I am
> not
> > > > > convinced all Java regexes are going to work.
> > > > >
> > > > > So, I think that for future correctness sake it's better to remove
> > this
> > > > > code.
> > > > >
> > > >
> > > > --
> > > > CONFIDENTIALITY NOTICE
> > > > NOTICE: This message is intended for the use of the individual or
> > entity
> > > to
> > > > which it is addressed and may contain information that is
> confidential,
> > > > privileged and exempt from disclosure under applicable law. If the
> > reader
> > > > of this message is not the intended recipient, you are hereby
> notified
> > > that
> > > > any printing, copying, dissemination, distribution, disclosure or
> > > > forwarding of this communication is strictly prohibited. If you have
> > > > received this communication in error, please contact the sender
> > > immediately
> > > > and delete it from your system. Thank You.
> > > >
> > >
> >
> > --
> > CONFIDENTIALITY NOTICE
> > NOTICE: This message is intended for the use of the individual or entity
> to
> > which it is addressed and may contain information that is confidential,
> > privileged and exempt from disclosure under applicable law. If the reader
> > of this message is not the intended recipient, you are hereby notified
> that
> > any printing, copying, dissemination, distribution, disclosure or
> > forwarding of this communication is strictly prohibited. If you have
> > received this communication in error, please contact the sender
> immediately
> > and delete it from your system. Thank You.
> >
>

-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

Re: LIKE filter pushdown for tables and partitions

Posted by Stephen Sprague <sp...@gmail.com>.
sorry to be dumb-ass but what does that translate into in the HSQL dialect?

Judging from the name you use, getPartitionsByFilter, you're saying you
want to remove the use case of using like clause on a partition column?

if so, um, yeah, i would think that's surely used.



On Mon, Aug 26, 2013 at 7:48 PM, Sergey Shelukhin <se...@hortonworks.com>wrote:

> Adding user list. Any objections to removing LIKE support from
> getPartitionsByFilter?
>
> On Mon, Aug 26, 2013 at 2:54 PM, Ashutosh Chauhan <hashutosh@apache.org
> >wrote:
>
> > Couple of questions:
> >
> > 1. What about LIKE operator for Hive itself? Will that continue to work
> > (presumably because there is an alternative path for that).
> > 2. This will nonetheless break other direct consumers of metastore client
> > api (like HCatalog).
> >
> > I see your point that we have a buggy implementation, so whats out there
> is
> > not safe to use. Question than really is shall we remove this code,
> thereby
> > breaking people for whom current buggy implementation is good enough (or
> > you can say salvage them from breaking in future). Or shall we try to fix
> > it now?
> > My take is if there are no users of this anyways, then there is no point
> > fixing it for non-existing users, but if there are we probably have to. I
> > will suggest you to send an email to users@hive to ask if there are
> users
> > for this.
> >
> > Thanks,
> > Ashutosh
> >
> >
> >
> > On Mon, Aug 26, 2013 at 2:08 PM, Sergey Shelukhin <
> sergey@hortonworks.com
> > >wrote:
> >
> > > Since there's no response I am assuming nobody cares about this code...
> > > Jira is HIVE-5134, I will attach a patch with removal this week.
> > >
> > > On Wed, Aug 21, 2013 at 2:28 PM, Sergey Shelukhin <
> > sergey@hortonworks.com
> > > >wrote:
> > >
> > > > Hi.
> > > >
> > > > I think there are issues with the way hive can currently do LIKE
> > > > operator JDO pushdown and it the code should be removed for
> partitions
> > > > and tables.
> > > > Are there objections to removing LIKE from Filter.g and related
> areas?
> > > > If no I will file a JIRA and do it.
> > > >
> > > > Details:
> > > > There's code in metastore that is capable of pushing down LIKE
> > > > expression into JDO for string partition keys, as well as tables.
> > > > The code for tables doesn't appear used, and partition code
> definitely
> > > > doesn't run in Hive proper because metastore client doesn't send LIKE
> > > > expressions to server. It may be used in e.g. HCat and other places,
> > > > but after asking some people here, I found out it probably isn't.
> > > > I was trying to make it run and noticed some problems:
> > > > 1) For partitions, Hive sends SQL patterns in a filter for like, e.g.
> > > > "%foo%", whereas metastore passes them into matches() JDOQL method
> > > > which expects Java regex.
> > > > 2) Converting the pattern to Java regex via UDFLike method, I found
> > > > out that not all regexes appear to work in DN. ".*foo" seems to work
> > > > but anything complex (such as escaping the pattern using
> > > > Pattern.quote, which UDFLike does) breaks and no longer matches
> > > > properly.
> > > > 3) I tried to implement common cases using JDO methods
> > > > startsWith/endsWith/indexOf (I will file a JIRA), but when I run
> tests
> > > > on Derby, they also appear to have problems with some strings (for
> > > > example, partition with backslash in the name cannot be matched by
> > > > LIKE "%\%" (single backslash in a string), after being converted to
> > > > .indexOf(param) where param is "\" (escaping the backslash once again
> > > > doesn't work either, and anyway there's no documented reason why it
> > > > shouldn't work properly), while other characters match correctly,
> even
> > > > e.g. "%".
> > > >
> > > > For tables, there's no SQL-like, it expects Java regex, but I am not
> > > > convinced all Java regexes are going to work.
> > > >
> > > > So, I think that for future correctness sake it's better to remove
> this
> > > > code.
> > > >
> > >
> > > --
> > > CONFIDENTIALITY NOTICE
> > > NOTICE: This message is intended for the use of the individual or
> entity
> > to
> > > which it is addressed and may contain information that is confidential,
> > > privileged and exempt from disclosure under applicable law. If the
> reader
> > > of this message is not the intended recipient, you are hereby notified
> > that
> > > any printing, copying, dissemination, distribution, disclosure or
> > > forwarding of this communication is strictly prohibited. If you have
> > > received this communication in error, please contact the sender
> > immediately
> > > and delete it from your system. Thank You.
> > >
> >
>
> --
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity to
> which it is addressed and may contain information that is confidential,
> privileged and exempt from disclosure under applicable law. If the reader
> of this message is not the intended recipient, you are hereby notified that
> any printing, copying, dissemination, distribution, disclosure or
> forwarding of this communication is strictly prohibited. If you have
> received this communication in error, please contact the sender immediately
> and delete it from your system. Thank You.
>

Re: LIKE filter pushdown for tables and partitions

Posted by Stephen Sprague <sp...@gmail.com>.
sorry to be dumb-ass but what does that translate into in the HSQL dialect?

Judging from the name you use, getPartitionsByFilter, you're saying you
want to remove the use case of using like clause on a partition column?

if so, um, yeah, i would think that's surely used.



On Mon, Aug 26, 2013 at 7:48 PM, Sergey Shelukhin <se...@hortonworks.com>wrote:

> Adding user list. Any objections to removing LIKE support from
> getPartitionsByFilter?
>
> On Mon, Aug 26, 2013 at 2:54 PM, Ashutosh Chauhan <hashutosh@apache.org
> >wrote:
>
> > Couple of questions:
> >
> > 1. What about LIKE operator for Hive itself? Will that continue to work
> > (presumably because there is an alternative path for that).
> > 2. This will nonetheless break other direct consumers of metastore client
> > api (like HCatalog).
> >
> > I see your point that we have a buggy implementation, so whats out there
> is
> > not safe to use. Question than really is shall we remove this code,
> thereby
> > breaking people for whom current buggy implementation is good enough (or
> > you can say salvage them from breaking in future). Or shall we try to fix
> > it now?
> > My take is if there are no users of this anyways, then there is no point
> > fixing it for non-existing users, but if there are we probably have to. I
> > will suggest you to send an email to users@hive to ask if there are
> users
> > for this.
> >
> > Thanks,
> > Ashutosh
> >
> >
> >
> > On Mon, Aug 26, 2013 at 2:08 PM, Sergey Shelukhin <
> sergey@hortonworks.com
> > >wrote:
> >
> > > Since there's no response I am assuming nobody cares about this code...
> > > Jira is HIVE-5134, I will attach a patch with removal this week.
> > >
> > > On Wed, Aug 21, 2013 at 2:28 PM, Sergey Shelukhin <
> > sergey@hortonworks.com
> > > >wrote:
> > >
> > > > Hi.
> > > >
> > > > I think there are issues with the way hive can currently do LIKE
> > > > operator JDO pushdown and it the code should be removed for
> partitions
> > > > and tables.
> > > > Are there objections to removing LIKE from Filter.g and related
> areas?
> > > > If no I will file a JIRA and do it.
> > > >
> > > > Details:
> > > > There's code in metastore that is capable of pushing down LIKE
> > > > expression into JDO for string partition keys, as well as tables.
> > > > The code for tables doesn't appear used, and partition code
> definitely
> > > > doesn't run in Hive proper because metastore client doesn't send LIKE
> > > > expressions to server. It may be used in e.g. HCat and other places,
> > > > but after asking some people here, I found out it probably isn't.
> > > > I was trying to make it run and noticed some problems:
> > > > 1) For partitions, Hive sends SQL patterns in a filter for like, e.g.
> > > > "%foo%", whereas metastore passes them into matches() JDOQL method
> > > > which expects Java regex.
> > > > 2) Converting the pattern to Java regex via UDFLike method, I found
> > > > out that not all regexes appear to work in DN. ".*foo" seems to work
> > > > but anything complex (such as escaping the pattern using
> > > > Pattern.quote, which UDFLike does) breaks and no longer matches
> > > > properly.
> > > > 3) I tried to implement common cases using JDO methods
> > > > startsWith/endsWith/indexOf (I will file a JIRA), but when I run
> tests
> > > > on Derby, they also appear to have problems with some strings (for
> > > > example, partition with backslash in the name cannot be matched by
> > > > LIKE "%\%" (single backslash in a string), after being converted to
> > > > .indexOf(param) where param is "\" (escaping the backslash once again
> > > > doesn't work either, and anyway there's no documented reason why it
> > > > shouldn't work properly), while other characters match correctly,
> even
> > > > e.g. "%".
> > > >
> > > > For tables, there's no SQL-like, it expects Java regex, but I am not
> > > > convinced all Java regexes are going to work.
> > > >
> > > > So, I think that for future correctness sake it's better to remove
> this
> > > > code.
> > > >
> > >
> > > --
> > > CONFIDENTIALITY NOTICE
> > > NOTICE: This message is intended for the use of the individual or
> entity
> > to
> > > which it is addressed and may contain information that is confidential,
> > > privileged and exempt from disclosure under applicable law. If the
> reader
> > > of this message is not the intended recipient, you are hereby notified
> > that
> > > any printing, copying, dissemination, distribution, disclosure or
> > > forwarding of this communication is strictly prohibited. If you have
> > > received this communication in error, please contact the sender
> > immediately
> > > and delete it from your system. Thank You.
> > >
> >
>
> --
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity to
> which it is addressed and may contain information that is confidential,
> privileged and exempt from disclosure under applicable law. If the reader
> of this message is not the intended recipient, you are hereby notified that
> any printing, copying, dissemination, distribution, disclosure or
> forwarding of this communication is strictly prohibited. If you have
> received this communication in error, please contact the sender immediately
> and delete it from your system. Thank You.
>

Re: LIKE filter pushdown for tables and partitions

Posted by Sergey Shelukhin <se...@hortonworks.com>.
Adding user list. Any objections to removing LIKE support from
getPartitionsByFilter?

On Mon, Aug 26, 2013 at 2:54 PM, Ashutosh Chauhan <ha...@apache.org>wrote:

> Couple of questions:
>
> 1. What about LIKE operator for Hive itself? Will that continue to work
> (presumably because there is an alternative path for that).
> 2. This will nonetheless break other direct consumers of metastore client
> api (like HCatalog).
>
> I see your point that we have a buggy implementation, so whats out there is
> not safe to use. Question than really is shall we remove this code, thereby
> breaking people for whom current buggy implementation is good enough (or
> you can say salvage them from breaking in future). Or shall we try to fix
> it now?
> My take is if there are no users of this anyways, then there is no point
> fixing it for non-existing users, but if there are we probably have to. I
> will suggest you to send an email to users@hive to ask if there are users
> for this.
>
> Thanks,
> Ashutosh
>
>
>
> On Mon, Aug 26, 2013 at 2:08 PM, Sergey Shelukhin <sergey@hortonworks.com
> >wrote:
>
> > Since there's no response I am assuming nobody cares about this code...
> > Jira is HIVE-5134, I will attach a patch with removal this week.
> >
> > On Wed, Aug 21, 2013 at 2:28 PM, Sergey Shelukhin <
> sergey@hortonworks.com
> > >wrote:
> >
> > > Hi.
> > >
> > > I think there are issues with the way hive can currently do LIKE
> > > operator JDO pushdown and it the code should be removed for partitions
> > > and tables.
> > > Are there objections to removing LIKE from Filter.g and related areas?
> > > If no I will file a JIRA and do it.
> > >
> > > Details:
> > > There's code in metastore that is capable of pushing down LIKE
> > > expression into JDO for string partition keys, as well as tables.
> > > The code for tables doesn't appear used, and partition code definitely
> > > doesn't run in Hive proper because metastore client doesn't send LIKE
> > > expressions to server. It may be used in e.g. HCat and other places,
> > > but after asking some people here, I found out it probably isn't.
> > > I was trying to make it run and noticed some problems:
> > > 1) For partitions, Hive sends SQL patterns in a filter for like, e.g.
> > > "%foo%", whereas metastore passes them into matches() JDOQL method
> > > which expects Java regex.
> > > 2) Converting the pattern to Java regex via UDFLike method, I found
> > > out that not all regexes appear to work in DN. ".*foo" seems to work
> > > but anything complex (such as escaping the pattern using
> > > Pattern.quote, which UDFLike does) breaks and no longer matches
> > > properly.
> > > 3) I tried to implement common cases using JDO methods
> > > startsWith/endsWith/indexOf (I will file a JIRA), but when I run tests
> > > on Derby, they also appear to have problems with some strings (for
> > > example, partition with backslash in the name cannot be matched by
> > > LIKE "%\%" (single backslash in a string), after being converted to
> > > .indexOf(param) where param is "\" (escaping the backslash once again
> > > doesn't work either, and anyway there's no documented reason why it
> > > shouldn't work properly), while other characters match correctly, even
> > > e.g. "%".
> > >
> > > For tables, there's no SQL-like, it expects Java regex, but I am not
> > > convinced all Java regexes are going to work.
> > >
> > > So, I think that for future correctness sake it's better to remove this
> > > code.
> > >
> >
> > --
> > CONFIDENTIALITY NOTICE
> > NOTICE: This message is intended for the use of the individual or entity
> to
> > which it is addressed and may contain information that is confidential,
> > privileged and exempt from disclosure under applicable law. If the reader
> > of this message is not the intended recipient, you are hereby notified
> that
> > any printing, copying, dissemination, distribution, disclosure or
> > forwarding of this communication is strictly prohibited. If you have
> > received this communication in error, please contact the sender
> immediately
> > and delete it from your system. Thank You.
> >
>

-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

Re: LIKE filter pushdown for tables and partitions

Posted by Sergey Shelukhin <se...@hortonworks.com>.
Adding user list. Any objections to removing LIKE support from
getPartitionsByFilter?

On Mon, Aug 26, 2013 at 2:54 PM, Ashutosh Chauhan <ha...@apache.org>wrote:

> Couple of questions:
>
> 1. What about LIKE operator for Hive itself? Will that continue to work
> (presumably because there is an alternative path for that).
> 2. This will nonetheless break other direct consumers of metastore client
> api (like HCatalog).
>
> I see your point that we have a buggy implementation, so whats out there is
> not safe to use. Question than really is shall we remove this code, thereby
> breaking people for whom current buggy implementation is good enough (or
> you can say salvage them from breaking in future). Or shall we try to fix
> it now?
> My take is if there are no users of this anyways, then there is no point
> fixing it for non-existing users, but if there are we probably have to. I
> will suggest you to send an email to users@hive to ask if there are users
> for this.
>
> Thanks,
> Ashutosh
>
>
>
> On Mon, Aug 26, 2013 at 2:08 PM, Sergey Shelukhin <sergey@hortonworks.com
> >wrote:
>
> > Since there's no response I am assuming nobody cares about this code...
> > Jira is HIVE-5134, I will attach a patch with removal this week.
> >
> > On Wed, Aug 21, 2013 at 2:28 PM, Sergey Shelukhin <
> sergey@hortonworks.com
> > >wrote:
> >
> > > Hi.
> > >
> > > I think there are issues with the way hive can currently do LIKE
> > > operator JDO pushdown and it the code should be removed for partitions
> > > and tables.
> > > Are there objections to removing LIKE from Filter.g and related areas?
> > > If no I will file a JIRA and do it.
> > >
> > > Details:
> > > There's code in metastore that is capable of pushing down LIKE
> > > expression into JDO for string partition keys, as well as tables.
> > > The code for tables doesn't appear used, and partition code definitely
> > > doesn't run in Hive proper because metastore client doesn't send LIKE
> > > expressions to server. It may be used in e.g. HCat and other places,
> > > but after asking some people here, I found out it probably isn't.
> > > I was trying to make it run and noticed some problems:
> > > 1) For partitions, Hive sends SQL patterns in a filter for like, e.g.
> > > "%foo%", whereas metastore passes them into matches() JDOQL method
> > > which expects Java regex.
> > > 2) Converting the pattern to Java regex via UDFLike method, I found
> > > out that not all regexes appear to work in DN. ".*foo" seems to work
> > > but anything complex (such as escaping the pattern using
> > > Pattern.quote, which UDFLike does) breaks and no longer matches
> > > properly.
> > > 3) I tried to implement common cases using JDO methods
> > > startsWith/endsWith/indexOf (I will file a JIRA), but when I run tests
> > > on Derby, they also appear to have problems with some strings (for
> > > example, partition with backslash in the name cannot be matched by
> > > LIKE "%\%" (single backslash in a string), after being converted to
> > > .indexOf(param) where param is "\" (escaping the backslash once again
> > > doesn't work either, and anyway there's no documented reason why it
> > > shouldn't work properly), while other characters match correctly, even
> > > e.g. "%".
> > >
> > > For tables, there's no SQL-like, it expects Java regex, but I am not
> > > convinced all Java regexes are going to work.
> > >
> > > So, I think that for future correctness sake it's better to remove this
> > > code.
> > >
> >
> > --
> > CONFIDENTIALITY NOTICE
> > NOTICE: This message is intended for the use of the individual or entity
> to
> > which it is addressed and may contain information that is confidential,
> > privileged and exempt from disclosure under applicable law. If the reader
> > of this message is not the intended recipient, you are hereby notified
> that
> > any printing, copying, dissemination, distribution, disclosure or
> > forwarding of this communication is strictly prohibited. If you have
> > received this communication in error, please contact the sender
> immediately
> > and delete it from your system. Thank You.
> >
>

-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

Re: LIKE filter pushdown for tables and partitions

Posted by Ashutosh Chauhan <ha...@apache.org>.
Couple of questions:

1. What about LIKE operator for Hive itself? Will that continue to work
(presumably because there is an alternative path for that).
2. This will nonetheless break other direct consumers of metastore client
api (like HCatalog).

I see your point that we have a buggy implementation, so whats out there is
not safe to use. Question than really is shall we remove this code, thereby
breaking people for whom current buggy implementation is good enough (or
you can say salvage them from breaking in future). Or shall we try to fix
it now?
My take is if there are no users of this anyways, then there is no point
fixing it for non-existing users, but if there are we probably have to. I
will suggest you to send an email to users@hive to ask if there are users
for this.

Thanks,
Ashutosh



On Mon, Aug 26, 2013 at 2:08 PM, Sergey Shelukhin <se...@hortonworks.com>wrote:

> Since there's no response I am assuming nobody cares about this code...
> Jira is HIVE-5134, I will attach a patch with removal this week.
>
> On Wed, Aug 21, 2013 at 2:28 PM, Sergey Shelukhin <sergey@hortonworks.com
> >wrote:
>
> > Hi.
> >
> > I think there are issues with the way hive can currently do LIKE
> > operator JDO pushdown and it the code should be removed for partitions
> > and tables.
> > Are there objections to removing LIKE from Filter.g and related areas?
> > If no I will file a JIRA and do it.
> >
> > Details:
> > There's code in metastore that is capable of pushing down LIKE
> > expression into JDO for string partition keys, as well as tables.
> > The code for tables doesn't appear used, and partition code definitely
> > doesn't run in Hive proper because metastore client doesn't send LIKE
> > expressions to server. It may be used in e.g. HCat and other places,
> > but after asking some people here, I found out it probably isn't.
> > I was trying to make it run and noticed some problems:
> > 1) For partitions, Hive sends SQL patterns in a filter for like, e.g.
> > "%foo%", whereas metastore passes them into matches() JDOQL method
> > which expects Java regex.
> > 2) Converting the pattern to Java regex via UDFLike method, I found
> > out that not all regexes appear to work in DN. ".*foo" seems to work
> > but anything complex (such as escaping the pattern using
> > Pattern.quote, which UDFLike does) breaks and no longer matches
> > properly.
> > 3) I tried to implement common cases using JDO methods
> > startsWith/endsWith/indexOf (I will file a JIRA), but when I run tests
> > on Derby, they also appear to have problems with some strings (for
> > example, partition with backslash in the name cannot be matched by
> > LIKE "%\%" (single backslash in a string), after being converted to
> > .indexOf(param) where param is "\" (escaping the backslash once again
> > doesn't work either, and anyway there's no documented reason why it
> > shouldn't work properly), while other characters match correctly, even
> > e.g. "%".
> >
> > For tables, there's no SQL-like, it expects Java regex, but I am not
> > convinced all Java regexes are going to work.
> >
> > So, I think that for future correctness sake it's better to remove this
> > code.
> >
>
> --
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity to
> which it is addressed and may contain information that is confidential,
> privileged and exempt from disclosure under applicable law. If the reader
> of this message is not the intended recipient, you are hereby notified that
> any printing, copying, dissemination, distribution, disclosure or
> forwarding of this communication is strictly prohibited. If you have
> received this communication in error, please contact the sender immediately
> and delete it from your system. Thank You.
>

Re: LIKE filter pushdown for tables and partitions

Posted by Sergey Shelukhin <se...@hortonworks.com>.
Since there's no response I am assuming nobody cares about this code...
Jira is HIVE-5134, I will attach a patch with removal this week.

On Wed, Aug 21, 2013 at 2:28 PM, Sergey Shelukhin <se...@hortonworks.com>wrote:

> Hi.
>
> I think there are issues with the way hive can currently do LIKE
> operator JDO pushdown and it the code should be removed for partitions
> and tables.
> Are there objections to removing LIKE from Filter.g and related areas?
> If no I will file a JIRA and do it.
>
> Details:
> There's code in metastore that is capable of pushing down LIKE
> expression into JDO for string partition keys, as well as tables.
> The code for tables doesn't appear used, and partition code definitely
> doesn't run in Hive proper because metastore client doesn't send LIKE
> expressions to server. It may be used in e.g. HCat and other places,
> but after asking some people here, I found out it probably isn't.
> I was trying to make it run and noticed some problems:
> 1) For partitions, Hive sends SQL patterns in a filter for like, e.g.
> "%foo%", whereas metastore passes them into matches() JDOQL method
> which expects Java regex.
> 2) Converting the pattern to Java regex via UDFLike method, I found
> out that not all regexes appear to work in DN. ".*foo" seems to work
> but anything complex (such as escaping the pattern using
> Pattern.quote, which UDFLike does) breaks and no longer matches
> properly.
> 3) I tried to implement common cases using JDO methods
> startsWith/endsWith/indexOf (I will file a JIRA), but when I run tests
> on Derby, they also appear to have problems with some strings (for
> example, partition with backslash in the name cannot be matched by
> LIKE "%\%" (single backslash in a string), after being converted to
> .indexOf(param) where param is "\" (escaping the backslash once again
> doesn't work either, and anyway there's no documented reason why it
> shouldn't work properly), while other characters match correctly, even
> e.g. "%".
>
> For tables, there's no SQL-like, it expects Java regex, but I am not
> convinced all Java regexes are going to work.
>
> So, I think that for future correctness sake it's better to remove this
> code.
>

-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.