You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@phoenix.apache.org by Sumit Nigam <su...@yahoo.com> on 2015/10/05 10:49:13 UTC

Skip Scan

Hi,
Would it make any difference if I were to pass non-sorted IDs (secondary indexed) to a huge IN clause? I assume that skip scan optimization would work in either case. 
Also, can any one let me know if there is some limit to beyond how many such IDs in a large IN clause do I get into diminishing returns? Or is it plainly dependent on specific workloads and memory of region servers?
Thanks,Sumit

Re: Skip Scan

Posted by James Taylor <ja...@apache.org>.

Hi Sumit,

No, it doesn't matter whether the values in an IN clause are sorted or not
- we'll sort them internally to ensure they are. We've run perf tests in
the past where we had 250K values in an IN clause over a 1B row data set,
and performance was still good. Take a look at this blog for more detail:
http://phoenix-hbase.blogspot.com/2013/05/demystifying-skip-scan-in-phoenix.html

Thanks,
James

On Mon, Oct 5, 2015 at 2:08 AM, James Heather <ja...@mendeley.com>
wrote:

> I'll leave someone else to comment on the Phoenix specifics.
>
> I recall from some experiments on MySQL that if you have a massive load of
> IDs to pass, it's quicker if you split them into batches of some reasonable
> (but still large) size, and that for this, you would want to sort them
> first. I don't think it made any difference whether the IDs were sorted
> within an individual SQL statement, but you want to split into batches that
> cover disjoint ranges, so the easiest is to sort the whole lot first and
> then split.
>
> This might be MySQL-specific, though. I think each query was being turned
> into a range scan, from the lowest ID in the IN clause to the highest,
> which was why it was useful to get the ranges disjoint and not too huge.
>
> James
>
>
> On 05/10/15 09:49, Sumit Nigam wrote:
>
> Hi,
>
> Would it make any difference if I were to pass non-sorted IDs (secondary
> indexed) to a huge IN clause? I assume that skip scan optimization would
> work in either case.
>
> Also, can any one let me know if there is some limit to beyond how many
> such IDs in a large IN clause do I get into diminishing returns? Or is it
> plainly dependent on specific workloads and memory of region servers?
>
> Thanks,
> Sumit
>
>
>

Re: Skip Scan

Posted by James Heather <ja...@mendeley.com>.

I'll leave someone else to comment on the Phoenix specifics.

I recall from some experiments on MySQL that if you have a massive load 
of IDs to pass, it's quicker if you split them into batches of some 
reasonable (but still large) size, and that for this, you would want to 
sort them first. I don't think it made any difference whether the IDs 
were sorted within an individual SQL statement, but you want to split 
into batches that cover disjoint ranges, so the easiest is to sort the 
whole lot first and then split.

This might be MySQL-specific, though. I think each query was being 
turned into a range scan, from the lowest ID in the IN clause to the 
highest, which was why it was useful to get the ranges disjoint and not 
too huge.

James

On 05/10/15 09:49, Sumit Nigam wrote:
> Hi,
>
> Would it make any difference if I were to pass non-sorted IDs 
> (secondary indexed) to a huge IN clause? I assume that skip scan 
> optimization would work in either case.
>
> Also, can any one let me know if there is some limit to beyond how 
> many such IDs in a large IN clause do I get into diminishing returns? 
> Or is it plainly dependent on specific workloads and memory of region 
> servers?
>
> Thanks,
> Sumit