Posted to oak-dev@jackrabbit.apache.org by Davide Giannella <da...@apache.org> on 2014/08/07 11:22:26 UTC

Speeding-up the OrderedIndex

Hello team,

I have been thinking in the background about how we could speed up the
current ordered index, and I just realised that our main use case is the
indexing of dates.

Currently the index stores the full date translated into a string[0], up
to the 'Z' of the JCR date format[1].

(0) http://goo.gl/e3Gr9V
(1) http://goo.gl/VOtjfi

This produces a huge number of keys in the index: I would say roughly one
for each node. If we truncate the date to the minute or to the second, we
could drastically reduce the number of keys.

The query engine double-checks the conditions anyway when fetching the
results.

My main concern now is understanding what a proper truncation point would
be.

Truncating to the second could give us buckets of up to 1k distinct
millisecond values each, plus time zones.

Truncating to the minute could give us buckets of up to 60k distinct
millisecond values each, plus time zones.
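
To get a feel for the numbers, here is a rough sketch (not Oak code) that
counts the distinct index keys left at each precision. It assumes java.time
is available; the class KeyCardinalitySketch, the method distinctKeys and
the sample of 6000 nodes modified 10 ms apart are made up for illustration.

```java
import java.time.OffsetDateTime;
import java.time.temporal.ChronoUnit;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Oak: counts how many distinct keys an
// index would hold if every date were truncated to the given precision.
final class KeyCardinalitySketch {

    static int distinctKeys(OffsetDateTime start, int nodes, long stepMillis, ChronoUnit precision) {
        Set<OffsetDateTime> keys = new HashSet<>();
        for (int i = 0; i < nodes; i++) {
            // Simulated modification date of the i-th node.
            OffsetDateTime modified = start.plus(i * stepMillis, ChronoUnit.MILLIS);
            // The truncated date stands in for the index key.
            keys.add(modified.truncatedTo(precision));
        }
        return keys.size();
    }

    public static void main(String[] args) {
        OffsetDateTime start = OffsetDateTime.parse("2014-08-07T11:22:00.000+01:00");
        ChronoUnit[] precisions = {ChronoUnit.MILLIS, ChronoUnit.SECONDS, ChronoUnit.MINUTES};
        for (ChronoUnit unit : precisions) {
            System.out.println(unit + ": " + distinctKeys(start, 6000, 10, unit) + " keys");
        }
    }
}
```

With this made-up sample, all 6000 dates fall within the same minute, so the
key count goes from 6000 (milliseconds) to 60 (seconds) to 1 (minutes).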

Thoughts? Am I missing anything here?

thank you
Davide

Re: Speeding-up the OrderedIndex

Posted by Davide Giannella <da...@apache.org>.

On 11/08/2014 21:07, Justin Edelson wrote:
> In my experience, you would use minute precision for range queries and
> second precision for ordering. So I guess this means that second
> precision would be the best default and then allowing minute precision
> to be used as an optimization for an index which you knew was only
> going to be used for selection purposes (or ms for other cases where
> you needed more precise ordering / data selection).
>
> Justin

Thanks, Justin.

https://issues.apache.org/jira/browse/OAK-2028 for the record :)

D.

Re: Speeding-up the OrderedIndex

Posted by Justin Edelson <ju...@justinedelson.com>.

Hi,

On Fri, Aug 8, 2014 at 2:57 AM, Davide Giannella <da...@apache.org> wrote:
> Hello Justin,
>
> On 07/08/2014 15:12, Justin Edelson wrote:
>> Hi Davide,
>> Could this be configurable on the QID?
> I was thinking on the same line. We create the code so almost everything
> is possible. :)
>
> What would it be a good default? Definitely when sorting I don't care of
> the millisecond precision. In my experience when retrieving content
> ordered by dates
>
> select * from [nt:unstructured] order by jcr:lastModified desc

In my experience, you would use minute precision for range queries and
second precision for ordering. So I guess this means that second
precision would be the best default and then allowing minute precision
to be used as an optimization for an index which you knew was only
going to be used for selection purposes (or ms for other cases where
you needed more precise ordering / data selection).

Justin

>
> when presented to the end user a precision of a minute would suffice.
> This means that if you truncate it to the minute you'll have correct
> sorting up to the minute but if you had 1k nodes added within that
> minute you won't be able to predict the order of that bucket.
>
> Thoughts?
>
> D.
>
>

Re: Speeding-up the OrderedIndex

Posted by Davide Giannella <da...@apache.org>.

On 08/08/2014 09:21, Michael Marth wrote:
> Hi,
>
> you mention previously that the Query Engine checks the order anyway. Is my interpretation correct that the Index would return the results unordered within the minute interval to the query engine, then the query engine would do the correct ordering on the returned set?
>
The query engine checks the WHERE conditions anyway, as that is O(n), but
if the index returns a sorted set, the engine won't sort it again.
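
As a rough illustration of why the re-check of the WHERE conditions keeps
range queries correct on truncated keys, here is a sketch (not Oak code). It
assumes dates are non-negative epoch milliseconds and keys are truncated to
the minute; RangeRecheckSketch and its query method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical sketch, not part of Oak: a range lookup on minute-truncated
// keys over-selects, and the exact condition is re-checked on every value.
final class RangeRecheckSketch {

    static final long MINUTE_MILLIS = 60_000L;

    static List<Long> query(List<Long> storedMillis, long lowerBoundMillis) {
        // Build the "index": truncated key -> exact values in that bucket.
        NavigableMap<Long, List<Long>> index = new TreeMap<>();
        for (long value : storedMillis) {
            long key = value - value % MINUTE_MILLIS;
            index.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        }
        // Truncate the bound too, so the bucket containing it is not skipped.
        long fromKey = lowerBoundMillis - lowerBoundMillis % MINUTE_MILLIS;
        List<Long> result = new ArrayList<>();
        for (List<Long> bucket : index.tailMap(fromKey, true).values()) {
            for (long value : bucket) {
                // The re-check: drop values the coarse key let through.
                if (value >= lowerBoundMillis) {
                    result.add(value);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Long> stored = Arrays.asList(59_000L, 61_000L, 90_000L, 125_000L);
        System.out.println(query(stored, 80_000L));
    }
}
```

In this sample, 61000 shares a bucket with 90000 and is returned by the
coarse lookup, but the exact check filters it out.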

D.

Re: Speeding-up the OrderedIndex

Posted by Michael Marth <mm...@adobe.com>.

Hi,

> This means that if you truncate it to the minute you'll have correct
> sorting up to the minute but if you had 1k nodes added within that
> minute you won't be able to predict the order of that bucket.

You mentioned previously that the query engine checks the order anyway. Is
my interpretation correct that the index would return the results unordered
within the minute interval to the query engine, and the query engine would
then do the correct ordering on the returned set?

Michael

Re: Speeding-up the OrderedIndex

Posted by Davide Giannella <da...@apache.org>.

Hello Justin,

On 07/08/2014 15:12, Justin Edelson wrote:
> Hi Davide,
> Could this be configurable on the QID?
I was thinking along the same lines. We write the code, so almost
everything is possible. :)

What would be a good default? When sorting, I definitely don't care about
millisecond precision. In my experience, when retrieving content ordered
by date, for example

select * from [nt:unstructured] order by [jcr:lastModified] desc

and presenting it to the end user, a precision of a minute would suffice.
This means that if you truncate to the minute you'll have correct sorting
up to the minute, but if 1k nodes were added within that minute you won't
be able to predict the order within that bucket.
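
A minimal sketch of that effect, not Oak code: it assumes dates are
non-negative epoch milliseconds, keys are truncated to the minute, and each
bucket keeps its entries in insertion order. BucketOrderSketch and its
methods are hypothetical names, and the sketch iterates in ascending order
for simplicity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch, not part of Oak: buckets are sorted by truncated key,
// but entries inside one bucket come back in insertion order.
final class BucketOrderSketch {

    static final long MINUTE_MILLIS = 60_000L;

    private final TreeMap<Long, List<String>> buckets = new TreeMap<>();

    void add(String path, long lastModifiedMillis) {
        long key = lastModifiedMillis - lastModifiedMillis % MINUTE_MILLIS;
        buckets.computeIfAbsent(key, k -> new ArrayList<>()).add(path);
    }

    List<String> pathsInIndexOrder() {
        List<String> result = new ArrayList<>();
        for (List<String> bucket : buckets.values()) {
            result.addAll(bucket);
        }
        return result;
    }

    public static void main(String[] args) {
        BucketOrderSketch index = new BucketOrderSketch();
        index.add("/c", 120_500L);
        index.add("/a", 120_100L);
        index.add("/b", 60_000L);
        System.out.println(index.pathsInIndexOrder());
    }
}
```

Here /b correctly comes first, but /c is returned before /a although /a was
modified 400 ms earlier, because both share the same minute bucket.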

Thoughts?

D.

Re: Speeding-up the OrderedIndex

Posted by Justin Edelson <ju...@justinedelson.com>.

Hi Davide,
Could this be configurable on the QID?

Justin


Re: Speeding-up the OrderedIndex

Posted by Thomas Mueller <mu...@adobe.com>.

Hi,

+1 to "speeding up the ordered index" (as the subject says)

But:

-1 to _not_ sorting accurately up to the millisecond, when using "order
by". See also my comment on OAK-2028. I consider it a bug if somebody runs
a query with "order by lastModified" and then the result is ordered by a
truncated version of "lastModified" (depending on the index configuration).

Regards,
Thomas






Re: Speeding-up the OrderedIndex

Posted by Davide Giannella <da...@apache.org>.

On 08/08/2014 09:38, Michael Dürig wrote:
>
> Are you sure about the 'Z'? AFAIR (*) the time zone part is left to
> whatever the user specified when setting a date property (e.g.
> 2014-08-08T09:21:55.123+01:00).
You're absolutely right. An actual key in the index is

2014-04-22T10%3A11%3A24.002%2B01%3A00

which, once decoded, is

2014-04-22T10:11:24.002+01:00

and looking at it, we could hit an issue similar to OAK-1763 with range
queries if dates have different time zones. That is unlikely to happen,
though, as the indexing and the dating happen on the server side, and as
long as the server doesn't change the JVM time zone settings all the dates
will be in the same TZ.
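
To make the two points concrete, here is a hedged sketch, not Oak code. It
assumes the key is plain percent-encoding that java.net.URLDecoder can
reverse, and it only illustrates how string order and chronological order
can diverge with mixed offsets; it does not claim to reproduce OAK-1763.
IndexKeySketch and its methods are hypothetical names.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.time.Instant;
import java.time.OffsetDateTime;

// Hypothetical sketch, not part of Oak.
final class IndexKeySketch {

    // Assumption: the index key is a percent-encoded date string.
    static String decode(String encodedKey) {
        try {
            return URLDecoder.decode(encodedKey, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 should always be supported", e);
        }
    }

    // True when comparing the two date strings character by character gives
    // the same order as comparing the instants they denote.
    static boolean stringOrderMatchesTimeOrder(String a, String b) {
        Instant first = OffsetDateTime.parse(a).toInstant();
        Instant second = OffsetDateTime.parse(b).toInstant();
        return Integer.signum(a.compareTo(b)) == Integer.signum(first.compareTo(second));
    }

    public static void main(String[] args) {
        String key = decode("2014-04-22T10%3A11%3A24.002%2B01%3A00");
        System.out.println(key);
        // Same offset: both orders agree.
        System.out.println(stringOrderMatchesTimeOrder(key, "2014-04-22T10:11:25.000+01:00"));
        // Different offsets: 08:00 at -05:00 is later in time but sorts first as a string.
        System.out.println(stringOrderMatchesTimeOrder(key, "2014-04-22T08:00:00.000-05:00"));
    }
}
```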

Leaving the encoding issue (OAK-1763) aside, we could still zero the parts
we don't want, such as the milliseconds. The above date would then become
2014-04-22T10:11:24.000+01:00 if we want to drop the millisecond
precision, or 2014-04-22T10:11:00.000+01:00 if we want to drop the second
precision.
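
A minimal sketch of that zeroing, assuming java.time and a date string
carrying an explicit offset; it is not the Oak implementation, and
DatePrecisionSketch, truncate and the output pattern are hypothetical
choices made for illustration.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

// Hypothetical sketch, not part of Oak: zero everything below the chosen
// precision while keeping the original offset.
final class DatePrecisionSketch {

    // Assumed output layout: always three fraction digits plus the offset.
    // Note that a zero offset is printed as 'Z' by the XXX pattern.
    private static final DateTimeFormatter OUTPUT =
            DateTimeFormatter.ofPattern("uuuu-MM-dd'T'HH:mm:ss.SSSXXX");

    static String truncate(String date, ChronoUnit precision) {
        return OffsetDateTime.parse(date).truncatedTo(precision).format(OUTPUT);
    }

    public static void main(String[] args) {
        String date = "2014-04-22T10:11:24.002+01:00";
        System.out.println(truncate(date, ChronoUnit.SECONDS)); // milliseconds zeroed
        System.out.println(truncate(date, ChronoUnit.MINUTES)); // seconds zeroed as well
    }
}
```

Parsing into an OffsetDateTime rather than cutting the string keeps the
+01:00 suffix intact, so the truncated value still has the same shape as
the original key.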

Assuming we go for a configurable approach, the question still remains:
what would be a good approximation for sorting?

I'm for dropping the millisecond precision by default, as I think that
most use cases won't care if items within the same second are not in
perfect order.

IMO the configurable options would be: none, millis, seconds, but we could
easily provide even coarser precision options.

Thoughts?

D.

Re: Speeding-up the OrderedIndex

Posted by Michael Dürig <md...@apache.org>.


On 7.8.14 11:22 , Davide Giannella wrote:
> Hello team,
>
> I was thinking in background around how we could speed-up the current
> ordered index and just realised the our main use case is the indexing of
> dates.
>
> Currently the index indexes the full date translated into a string[0] up
> to the 'Z' of the JCR date format[1]

Are you sure about the 'Z'? AFAIR (*) the time zone part is left to 
whatever the user specified when setting a date property (e.g. 
2014-08-08T09:21:55.123+01:00).

(*) I vaguely remember that in a very early stage of Oak we normalised 
all date properties to the Z timezone, however this introduced backward 
compatibility issues.

Michael

