Posted to users@jackrabbit.apache.org by Nigel Sim <ni...@gmail.com> on 2009/08/14 06:34:54 UTC

Performance of a large number of small nodes

Hi,

I am using Jackrabbit to store a mixture of scientific data, which includes
files and numerical data. The performance of files is fine, but the
numerical data needs to be extracted as datasets based on attributes such as
observation time, and this appears to be quite slow in comparison to a
native DB (obviously). I would really prefer to keep all this related data
in the same management system, so is there a way to improve the ingestion
and retrieval of many small nodes?

Perhaps I could register a certain backend storage type for my numerical
nodes which have the same 5 or 6 attributes, and would feel more at home in
a single DB table?

My second question: is there an efficient way to query for the latest
observation? I would assume querying for the node type, sorting, and just
retrieving the first result?

Cheers
Nigel

Re: Performance of a large number of small nodes

Posted by Nigel Sim <ni...@gmail.com>.
Hi Bertrand,


2009/8/27 Bertrand Delacretaz <bd...@apache.org>

> Hi Nigel,
>
> On Thu, Aug 27, 2009 at 5:40 AM, Nigel Sim<ni...@gmail.com> wrote:
> > ...In Jackrabbit the path looks like
> /<instrument>/<dataset>/YYYY/MM/DD/<value>
> >
> > I can probably improve the ingest time by an order of magnitude by more
> > intelligent session handling, but the retrieval also needs to be improved
> > and I don't know how. In the production system, using PostgreSQL as the
> back
> > end, with 100,000 points across 50 instruments, it takes about 3 seconds
> to
> > execute the query to retrieve the dataset. This needs to be < 1s at
> worst,
> > as it feeds other systems....
>
> What's the query?
>
> Find all values for a given instrument and dataset in a specific period of
> time?
>

Spot on.

Nigel

Re: Performance of a large number of small nodes

Posted by Bertrand Delacretaz <bd...@apache.org>.
Hi Nigel,

On Thu, Aug 27, 2009 at 1:09 PM, Bertrand
Delacretaz<bd...@apache.org> wrote:
> On Thu, Aug 27, 2009 at 5:40 AM, Nigel Sim<ni...@gmail.com> wrote:
>> ...In Jackrabbit the path looks like /<instrument>/<dataset>/YYYY/MM/DD/<value>
>>
> ...What's the query?
>
> Find all values for a given instrument and dataset in a specific period of time?

Have you tried something like this?

1. Given timestamps T1 and T2, the boundaries of the data to retrieve,
compute the start and end paths P1 and P2 corresponding to T1 and T2.

2. Start at P1, navigate to the value nodes using Node.getNodes(), and
retrieve the value nodes where timestamp >= T1 (try to avoid data
conversions when doing that - maybe store timestamps as a long).

3. Compute the next path and retrieve all of its value nodes using
Node.getNodes() (they are by definition within range).

4. Repeat, and when the next path is P2, check timestamp <= T2 when
retrieving value nodes.

Dunno if that's what you're doing already, and I didn't test the
performance, but that feels intuitively like the fastest way to run
such a query with a large number of results.
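The path-walking part could be sketched in plain Java like this (assuming the /YYYY/MM/DD layout with UTC timestamps; the JCR navigation itself is left out, and the class and method names are illustrative):

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.ArrayList;
import java.util.List;

public class DayPaths {
    // Compute the YYYY/MM/DD path segments covering [t1, t2] (epoch millis, UTC).
    // The first and last day still need a timestamp check on their value nodes;
    // all days in between are fully inside the range by construction.
    static List<String> dayPaths(long t1, long t2) {
        List<String> paths = new ArrayList<>();
        LocalDate day = Instant.ofEpochMilli(t1).atZone(ZoneOffset.UTC).toLocalDate();
        LocalDate last = Instant.ofEpochMilli(t2).atZone(ZoneOffset.UTC).toLocalDate();
        while (!day.isAfter(last)) {
            paths.add(String.format("%04d/%02d/%02d",
                    day.getYear(), day.getMonthValue(), day.getDayOfMonth()));
            day = day.plusDays(1);
        }
        return paths;
    }

    public static void main(String[] args) {
        // 2009-08-26T12:00Z .. 2009-08-28T06:00Z spans three day nodes
        long t1 = Instant.parse("2009-08-26T12:00:00Z").toEpochMilli();
        long t2 = Instant.parse("2009-08-28T06:00:00Z").toEpochMilli();
        System.out.println(dayPaths(t1, t2)); // [2009/08/26, 2009/08/27, 2009/08/28]
    }
}
```

Days strictly between the first and last path can be read wholesale; only the two boundary days need the timestamp comparison.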

-Bertrand

Re: Performance of a large number of small nodes

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Fri, Aug 28, 2009 at 9:16 AM, Jukka Zitting <ju...@gmail.com> wrote:
> On Thu, Aug 27, 2009 at 1:09 PM, Bertrand
> Delacretaz<bd...@apache.org> wrote:
>> Find all values for a given instrument and dataset in a specific period of time?
>
> That's traditionally been a troublesome query for Lucene indexes.
> Luckily there's a recent Lucene feature called TrieRange [1,2] that
> pretty much solves that issue and makes numeric range queries really
> fast.
>
> TrieRange hasn't yet been integrated with Jackrabbit, but if there's
> demand it'll definitely be included in the Jackrabbit 2.x cycle.
> Contributions are of course welcome. :-)

If someone is interested in working on the trie range stuff, see the
related webinar announcement from general@lucene:
http://markmail.org/message/kmbtq66dyrmew6ms

BR,

Jukka Zitting

Re: Performance of a large number of small nodes

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Thu, Aug 27, 2009 at 1:09 PM, Bertrand
Delacretaz<bd...@apache.org> wrote:
> Find all values for a given instrument and dataset in a specific period of time?

That's traditionally been a troublesome query for Lucene indexes.
Luckily there's a recent Lucene feature called TrieRange [1,2] that
pretty much solves that issue and makes numeric range queries really
fast.

TrieRange hasn't yet been integrated with Jackrabbit, but if there's
demand it'll definitely be included in the Jackrabbit 2.x cycle.
Contributions are of course welcome. :-)

[1] http://www.thetaphi.de/share/Schindler-TrieRange.ppt
[2] http://www.lucidimagination.com/blog/2009/05/13/exploring-lucene-and-solrs-trierange-capabilities/
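The speedup comes from indexing each number at several precisions, so a range
query can match whole coarse buckets with single terms instead of enumerating
every value. A toy sketch of the bucketing idea (not Lucene's actual encoding
or API):

```java
public class TrieBuckets {
    // Each value is indexed at full precision AND at a coarser precision
    // (here: with the lowest 4 bits dropped), placing it in a bucket of 16.
    static long coarseTerm(long value) {
        return value >>> 4;
    }

    public static void main(String[] args) {
        // The range [16, 31] exactly covers one coarse bucket, so it can be
        // matched with the single term "1" instead of 16 full-precision terms.
        for (long v = 16; v <= 31; v++) {
            System.out.println(v + " -> bucket " + coarseTerm(v)); // always 1
        }
    }
}
```

Real TrieRange does this at several precision steps and decomposes an
arbitrary range into a handful of such bucket terms plus full-precision terms
at the edges.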

BR,

Jukka Zitting

Re: Performance of a large number of small nodes

Posted by Bertrand Delacretaz <bd...@apache.org>.
Hi Nigel,

On Thu, Aug 27, 2009 at 5:40 AM, Nigel Sim<ni...@gmail.com> wrote:
> ...In Jackrabbit the path looks like /<instrument>/<dataset>/YYYY/MM/DD/<value>
>
> I can probably improve the ingest time by an order of magnitude by more
> intelligent session handling, but the retrieval also needs to be improved
> and I don't know how. In the production system, using PostgreSQL as the back
> end, with 100,000 points across 50 instruments, it takes about 3 seconds to
> execute the query to retrieve the dataset. This needs to be < 1s at worst,
> as it feeds other systems....

What's the query?

Find all values for a given instrument and dataset in a specific period of time?

-Bertrand

Re: Performance of a large number of small nodes

Posted by Nigel Sim <ni...@gmail.com>.
Hi Bertrand,

Some numbers:

Jackrabbit:
Ingest = 43189ms. Retrieve = 580ms.

JPA/Database:
Ingest = 86ms. Retrieve = 33ms.

The data structure looks like this:

Instrument:
  String name;
  String model;
  ...

Dataset:
  Instrument instrument;
  String type;
  String units;
  List<Value> values;

Value:
  Date time;
  Double value;

In Jackrabbit the path looks like /<instrument>/<dataset>/YYYY/MM/DD/<value>
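In JCR terms the Value entries could be declared as a compact node type, e.g.
in CND notation (a sketch; the my: names are hypothetical, and the timestamp
is stored as a long to avoid date conversions on retrieval):

```
[my:value]
- my:time (LONG) mandatory
- my:value (DOUBLE) mandatory
```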

I can probably improve the ingest time by an order of magnitude by more
intelligent session handling, but the retrieval also needs to be improved
and I don't know how. In the production system, using PostgreSQL as the back
end, with 100,000 points across 50 instruments, it takes about 3 seconds to
execute the query to retrieve the dataset. This needs to be < 1s at worst,
as it feeds other systems.

Advice would be gratefully received.

Cheers
Nigel

2009/8/15 Bertrand Delacretaz <bd...@apache.org>

> Hi Nigel,
>
> On Sat, Aug 15, 2009 at 6:32 AM, Nigel Sim<ni...@gmail.com> wrote:
> > ...Thanks for your suggestion. Unfortunately, even in the simplest case
> of 100
> > nodes in the root node, the time taken to retrieve is too long. If I
> could
> > resolve this fundamental speed issue then I could apply your solution to
> > help me scale my system....
>
> How much is too long, and how do you retrieve the nodes?
> I'm curious, as retrieving 100 nodes by navigating the JCR
> parent/child relationships should not be that slow.
>
> > ...I think I just need to bite the bullet and admit my use case doesn't
> really
> > map on Jackrabbit :)...
>
> If you tell us a bit more about your data structure, someone might be
> able to help.
> Did you have a look at http://wiki.apache.org/jackrabbit/DavidsModel ?
> That can help structure things in a JCR-friendly way.
>
> -Bertrand
>



-- 
JCU eResearch Centre
School Of Business (IT)
James Cook University

Re: Performance of a large number of small nodes

Posted by Bertrand Delacretaz <bd...@apache.org>.
Hi Nigel,

On Sat, Aug 15, 2009 at 6:32 AM, Nigel Sim<ni...@gmail.com> wrote:
> ...Thanks for your suggestion. Unfortunately, even in the simplest case of 100
> nodes in the root node, the time taken to retrieve is too long. If I could
> resolve this fundamental speed issue then I could apply your solution to
> help me scale my system....

How much is too long, and how do you retrieve the nodes?
I'm curious, as retrieving 100 nodes by navigating the JCR
parent/child relationships should not be that slow.

> ...I think I just need to bite the bullet and admit my use case doesn't really
> map on Jackrabbit :)...

If you tell us a bit more about your data structure, someone might be
able to help.
Did you have a look at http://wiki.apache.org/jackrabbit/DavidsModel ?
That can help structure things in a JCR-friendly way.

-Bertrand

Re: Performance of a large number of small nodes

Posted by Nigel Sim <ni...@gmail.com>.
Hi Bertrand,

Thanks for your suggestion. Unfortunately, even in the simplest case of 100
nodes in the root node, the time taken to retrieve is too long. If I could
resolve this fundamental speed issue then I could apply your solution to
help me scale my system.

I think I just need to bite the bullet and admit my use case doesn't really
map onto Jackrabbit :)

Thanks
Nigel

2009/8/14 Bertrand Delacretaz <bd...@apache.org>

> Hi,
>
> On Fri, Aug 14, 2009 at 6:34 AM, Nigel Sim<ni...@gmail.com> wrote:
> > ...I am using Jackrabbit to store a mixture of scientific data, which
> includes
> > files and numerical data. The performance of files is fine, but the
> > numerical data needs to be extracted as datasets based on attributes such
> as
> > observation time, and this appears to be quite slow in comparison to a
> > native DB (obviously). I would really prefer to keep all this related
> data
> > in the same management system, so is there a way to improve the ingestion
> > and retrieval of many small nodes?...
>
> Could you take advantage of paths to express the observation time, and
> use that for "queries"?
>
> Storing data under paths like /data/2009/12/24/23/02/58 would allow
> you to find nodes that belong to a specific day, or hour, by
> navigating paths, which might be much more efficient than queries.
>
> > ...My second question, is there an efficient way to query for the latest
> > observation? I would assume querying for the node type, sorting, and just
> > retrieving the first result?...
>
> Paths would also help here, and you could use observation to keep
> track of the path that corresponds to the most recent data, if needed.
>
> -Bertrand
>




Re: Performance of a large number of small nodes

Posted by Bertrand Delacretaz <bd...@apache.org>.
Hi,

On Fri, Aug 14, 2009 at 6:34 AM, Nigel Sim<ni...@gmail.com> wrote:
> ...I am using Jackrabbit to store a mixture of scientific data, which includes
> files and numerical data. The performance of files is fine, but the
> numerical data needs to be extracted as datasets based on attributes such as
> observation time, and this appears to be quite slow in comparison to a
> native DB (obviously). I would really prefer to keep all this related data
> in the same management system, so is there a way to improve the ingestion
> and retrieval of many small nodes?...

Could you take advantage of paths to express the observation time, and
use that for "queries"?

Storing data under paths like /data/2009/12/24/23/02/58 would allow
you to find nodes that belong to a specific day, or hour, by
navigating paths, which might be much more efficient than queries.
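As a concrete sketch, deriving such a path from a timestamp in plain Java
(UTC is assumed so the same instant always yields the same path; the class
and method names are illustrative):

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class ObservationPath {
    // Fixed UTC zone: the same instant must always map to the same path.
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("'/data/'yyyy/MM/dd/HH/mm/ss")
                             .withZone(ZoneOffset.UTC);

    static String pathFor(Instant observedAt) {
        return FMT.format(observedAt);
    }

    public static void main(String[] args) {
        System.out.println(pathFor(Instant.parse("2009-12-24T23:02:58Z")));
        // /data/2009/12/24/23/02/58
    }
}
```

Finding a day's data is then a single getNode() on the /data/yyyy/MM/dd
prefix rather than a query.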

> ...My second question, is there an efficient way to query for the latest
> observation? I would assume querying for the node type, sorting, and just
> retrieving the first result?...

Paths would also help here, and you could use observation to keep
track of the path that corresponds to the most recent data, if needed.
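For reference, the query-based approach you describe would look roughly like
this in JCR 1.0 XPath (my:observation and @my:time are hypothetical names;
only the first row of the result would be read):

```
//element(*, my:observation) order by @my:time descending
```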

-Bertrand