Posted to users@jackrabbit.apache.org by Marcin Nowak <ma...@comarch.com> on 2007/04/23 08:21:26 UTC

eXist

Hi,

Recently I discovered an XML database that is quite similar in its 
general concepts to Jackrabbit. It does not provide versioning or 
references between nodes, but it was really fast when I compared it with 
Jackrabbit, especially in querying and importing nodes. The question is: 
why does Jackrabbit perform so badly in comparison to eXist?

Project webpage:
http://exist.sourceforge.net/

BR,
Marcin Nowak

Re: eXist

Posted by Marcin Nowak <ma...@comarch.com>.

Hi,

First of all, my intention was definitely not to troll - I am looking 
for the best solution for XML storage. My favourite is Jackrabbit, but 
I've found something that in my opinion performs better - I am only 
asking why. I really want to use Jackrabbit, I like its versioning and 
referencing features, but I need it to be a high-performance XML store.

In fact my question was based on short testing, but not just 5 minutes 
:) I created a repository containing three collections nested in each 
other, each with three 4.5 MB XML files. Then I launched a query. (By 
the way, the import times are impressive: a 4.5 MB XML file in ca. 10 
seconds - will you agree? If not, show me how to configure Jackrabbit 
to perform that well; the same import in Jackrabbit took ca. 16 minutes 
on the same machine. Again, please don't take it as trolling - **I really 
want to know how to configure Jackrabbit to be high-performance**.) The 
query was really simple:

for $x in //type where $x='STRING_SINGLE'
return $x

and was performed on the whole DB - correct me if I am wrong. I 
received the query results in less than 4 seconds.
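
To make clear what that query selects: for untyped XML it should match 
the same elements as the XPath 1.0 expression //type[. = 'STRING_SINGLE']. 
Below is a minimal sketch using only the JDK's built-in XPath engine. 
The class name, method name and sample document are made up for 
illustration; this uses neither the eXist nor the Jackrabbit API and 
says nothing about the performance of either.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of eXist or Jackrabbit.
class TypeQuerySketch {

    // Counts <type> elements at any depth whose string value is exactly
    // 'STRING_SINGLE' (assumed XPath 1.0 counterpart of the FLWOR query).
    static int countStringSingle(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList hits = (NodeList) xpath.evaluate(
                    "//type[. = 'STRING_SINGLE']",
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            return hits.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate query", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample document with matches at two different depths.
        String xml = "<root>"
                + "<a><type>STRING_SINGLE</type></a>"
                + "<type>OTHER</type>"
                + "<b><c><type>STRING_SINGLE</type></c></b>"
                + "</root>";
        System.out.println(countStringSingle(xml));
    }
}
```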

I know very well how Jackrabbit performs in its default configuration 
and on Derby, MySQL, and Oracle. You can find the results of my tests 
in this mailing list's archives; I published a detailed report some 
time ago. After that report I ran those tests again because of changes 
made in the Jackrabbit source code. The results were better, but in 
comparison to eXist, again, not too optimistic.

My main question is: is there anything that can speed up Jackrabbit 
enough to get close to the performance achieved by eXist? Please take 
this question seriously - performance is one of the main requirements 
for the XML storage I need.

BR,
Marcin Nowak

Jean-Baptiste Quenot wrote:
> * Marcin Nowak:
>
>   
>> Recently I've  discovered XML database quite  similar in general
>> concepts to Jackrabbit,  in fact it does  not provide versioning
>> and  referencing  between  nodes  but   it  is  really  fast  as
>> I  compared  it  with  Jackrabbit, especially  in  querying  and
>> importing nodes, question is why Jackrabbit performs so badly in
>> comparison to eXist?
>>     
>
> You're asking  for a troll very  obviously, so I won't  comment on
> it, but there are a few things that are worth to mention:
>
> 1. eXist  is  an XML  database,  Jackrabbit  is  not, so  you  are
>    comparing two  unrelated things.   Moreover, even if  the query
>    syntax can look similar, eXist returns XML, whereas JCR returns
>    Java objects.  You need to understand the implications of this,
>    namely parsing the  resulting XML and work with  it can quickly
>    lead to  memory and CPU  starvation, especially when  the query
>    returns a lot of documents.  JCR  plays nicely with this, as it
>    returns an iterator on the data set.
>
> 2. Jackrabbit is  mostly seen  as a Java-API,  whereas eXist  is a
>    standalone beast with specific servlets that talk xmlrpc, REST,
>    and  so  on mostly  accessed  using  HTTP requests  causing  an
>    additional  overhead.  eXist  even  has a  front-end  based  on
>    Cocoon.  A  *lot* of caching is  done on the eXist  side, while
>    with Jackrabbit you will need  a second-level cache in your own
>    code to address that.
>
> 3. In my  book, eXist is not  designed to let you  query the whole
>    database at  once, whereas  Jackrabbit allows  you to  return a
>    sorted  subset  of documents  from  the  whole repository  very
>    efficiently,  by design.   Accessing one  XML document  is very
>    different from querying the whole database with 10k+ documents.
>    Play with eXist more than 5 minutes with a serious data set and
>    you will notice by yourself.
>   
> 4. Jackrabbit's efficiency  at importing nodes depends  largely on
>    the persistence  and filesystem  implementation you  are using.
>    For example I've seen the  BDB storage backend perform 10 times
>    faster than the XML-file-based one.
>
> 5. When  you compare  two approaches  (one XML  database, one  JCR
>    repository) for your own usecase, and moreover when you ask for
>    feedback about  your experiments,  publish the results  of your
>    benchmarks, be very  careful to mention *what*  you tested, and
>    *how*.  You also need to mention of course the numeric figures.
>    Otherwise you're just spreading FUD.
>
> Cheers,
>

Re: eXist

Posted by Jean-Baptiste Quenot <jb...@apache.org>.

* Marcin Nowak:

> Recently I've  discovered XML database quite  similar in general
> concepts to Jackrabbit,  in fact it does  not provide versioning
> and  referencing  between  nodes  but   it  is  really  fast  as
> I  compared  it  with  Jackrabbit, especially  in  querying  and
> importing nodes, question is why Jackrabbit performs so badly in
> comparison to eXist?

This is very obviously a troll, so I won't comment on it, but
there are a few things that are worth mentioning:

1. eXist is an XML database, Jackrabbit is not, so you are
   comparing two unrelated things. Moreover, even if the query
   syntax can look similar, eXist returns XML, whereas JCR returns
   Java objects. You need to understand the implications of this:
   parsing the resulting XML and working with it can quickly
   lead to memory and CPU starvation, especially when the query
   returns a lot of documents. JCR handles this nicely, as it
   returns an iterator over the data set.

2. Jackrabbit is mostly seen as a Java API, whereas eXist is a
   standalone beast with specific servlets that talk XML-RPC, REST,
   and so on, mostly accessed using HTTP requests, which causes
   additional overhead. eXist even has a front-end based on
   Cocoon. A *lot* of caching is done on the eXist side, while
   with Jackrabbit you will need a second-level cache in your own
   code to address that.

3. In my book, eXist is not designed to let you query the whole
   database at once, whereas Jackrabbit allows you to return a
   sorted subset of documents from the whole repository very
   efficiently, by design. Accessing one XML document is very
   different from querying the whole database with 10k+ documents.
   Play with eXist for more than 5 minutes with a serious data set
   and you will notice it yourself.

4. Jackrabbit's efficiency at importing nodes depends largely on
   the persistence and filesystem implementation you are using.
   For example, I've seen the BDB storage backend perform 10 times
   faster than the XML-file-based one.

5. When you compare two approaches (one XML database, one JCR
   repository) for your own use case, and moreover when you ask
   for feedback about your experiments, publish the results of
   your benchmarks and be very careful to mention *what* you
   tested, and *how*. You also need to mention the numeric
   figures, of course. Otherwise you're just spreading FUD.
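
For the numeric figures asked for in point 5, a very small timing
harness might look like the sketch below. It is only an illustration
under stated assumptions: the class and method names are hypothetical,
the workload in main is a placeholder rather than a real import or
query, and the choice of warm-up runs plus a median is just one common
way to get less noisy numbers on a JVM.

```java
import java.util.Arrays;

// Hypothetical helper for reporting repeatable timings.
class TimingSketch {

    // Runs the task 'warmups' times untimed, then 'runs' times timed,
    // and returns the middle sample in nanoseconds (the upper median
    // when 'runs' is even).
    static long medianNanos(Runnable task, int warmups, int runs) {
        if (runs < 1) {
            throw new IllegalArgumentException("runs must be at least 1");
        }
        for (int i = 0; i < warmups; i++) {
            task.run();
        }
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            samples[i] = System.nanoTime() - start;
        }
        Arrays.sort(samples);
        return samples[runs / 2];
    }

    public static void main(String[] args) {
        // Placeholder workload; replace with the import or query under test.
        Runnable placeholder = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append(i);
            }
        };
        long nanos = medianNanos(placeholder, 3, 11);
        System.out.println("median: " + (nanos / 1_000_000.0) + " ms");
    }
}
```

Reporting the data set, the configuration and the number of runs next
to such a figure covers the *what* and the *how*.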

Cheers,
-- 
     Jean-Baptiste Quenot
aka  John Banana   Qwerty
http://caraldi.com/jbq/

Re: eXist

Posted by Marcin Nowak <ma...@comarch.com>.

So you suggest that storing data as attributes could be more efficient 
in Jackrabbit? After the weekend I'll try to provide some results for 
the same test cases, but with another set of XML files that store data 
in attributes. I'll also make some comparison charts.

If you have any, please give me some suggestions on how data should be 
organized to fit best in the Jackrabbit architecture: what should I 
avoid, and where are the limitations (depth, number of subtags on one 
level, etc.)?

Jukka Zitting wrote:
> Hi,
>
> On 4/24/07, Marcin Nowak <ma...@comarch.com> wrote:
>> I can't share those files but I can give you some stats:
>
> Your data set seems to primarily use tags instead of attributes for
> storing content. Jackrabbit nodes are quite a bit "heavier" than DOM
> nodes, which probably explains the difference in performance.
>
> As a rule of thumb I've sometimes used a rough metric that a
> Jackrabbit node is about an order of magnitude more expensive than a
> DOM node. I think we probably could improve this quite a bit.
>
> BR,
>
> Jukka Zitting
>

Re: eXist

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On 4/24/07, Marcin Nowak <ma...@comarch.com> wrote:
> I can't share those files but I can give you some stats:

Your data set seems to primarily use tags instead of attributes for
storing content. Jackrabbit nodes are quite a bit "heavier" than DOM
nodes, which probably explains the difference in performance.

As a rule of thumb I've sometimes used a rough metric that a
Jackrabbit node is about an order of magnitude more expensive than a
DOM node. I think we probably could improve this quite a bit.
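
As a rough illustration of why tag-heavy data is more expensive, the
sketch below counts elements, attributes and non-blank text runs in an
XML string using only the JDK's DOM parser. It assumes the JCR document
view import mapping, where each element becomes a node, each attribute
becomes a property, and each run of text becomes an extra jcr:xmltext
child node. Skipping whitespace-only text is an assumption of this
sketch, and the class name and sample documents are made up.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical helper; it does not call any Jackrabbit API.
class ItemCountEstimate {

    // Returns {elements, attributes, nonBlankTextRuns} for the document.
    static int[] countXml(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            xml.getBytes(StandardCharsets.UTF_8)));
            int[] counts = new int[3];
            count(doc.getDocumentElement(), counts);
            return counts;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Depth-first walk over the DOM tree, updating the three counters.
    private static void count(Node node, int[] counts) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            counts[0]++;
            counts[1] += node.getAttributes().getLength();
        } else if (node.getNodeType() == Node.TEXT_NODE
                && !node.getNodeValue().trim().isEmpty()) {
            counts[2]++;
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            count(c, counts);
        }
    }

    public static void main(String[] args) {
        String tagStyle =
                "<item><name>a</name><type>STRING_SINGLE</type></item>";
        String attrStyle = "<item name=\"a\" type=\"STRING_SINGLE\"/>";
        for (String xml : new String[] {tagStyle, attrStyle}) {
            int[] c = countXml(xml);
            // Assumed mapping: nodes = elements + text runs,
            // properties = attributes.
            System.out.println("estimated nodes=" + (c[0] + c[2])
                    + " properties=" + c[1]);
        }
    }
}
```

Under that assumed mapping, the same two values cost five nodes in the
tag style but one node with two properties in the attribute style.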

BR,

Jukka Zitting

Re: eXist

Posted by Marcin Nowak <ma...@comarch.com>.

I can't share those files but I can give you some stats:

The XML contains 3321 subtags under the root.

There are two types of subtags:

1. Tags containing a text value (2090 tags)

2. Tags containing a structure as follows, where every subtag also 
contains a text value (1231 tags):

document root
 |----->subtag
 |            |------>subtag attrib1 attrib2
 |            |           |------>subtag
 |            |           |------>subtag
 |            |           |------>subtag
 |            |           |------>subtag
 |            |           |            |------>subtag
 |            |           |            |------>subtag
 |            |           |            |------>subtag
 |            |           |            |------>subtag
 |            |           |            |------>subtag
 |            |           |------>subtag
 |            |------>subtag attrib1 attrib2
 |            |           |------>subtag
 |            |           |------>subtag
 |            |           |------>subtag
 |            |           |------>subtag
 |            |           |            |------>subtag
 |            |           |            |------>subtag
 |            |           |            |------>subtag
 |            |           |            |------>subtag
 |            |           |            |------>subtag
 |            |           |------>subtag
 |            |------>subtag attrib1 attrib2
 |            |           |------>subtag
 |            |           |------>subtag
 |            |           |------>subtag
 |            |           |------>subtag
 |            |           |            |------>subtag
 |            |           |            |------>subtag
 |            |           |            |------>subtag
 |            |           |            |------>subtag
 |            |           |            |------>subtag
 |            |           |------>subtag

BR,
Marcin Nowak
         


David Nuescheler wrote:
> hi marcin,
>
>> ... some junk-XML documents of size 4715740 B ...
> is this a valid usecase for your application and therefore
> similar to what you expect from your application to be
> working with?
> ...do you think you can share those xml files with the list aswell?
>
> regards,
> david
>

Re: eXist

Posted by David Nuescheler <da...@gmail.com>.

hi marcin,

> ... some junk-XML documents of size 4715740 B ...
is this a valid usecase for your application and therefore
similar to what you expect from your application to be
working with?
...do you think you can share those xml files with the list aswell?

regards,
david

Re: eXist

Posted by Marcin Nowak <ma...@comarch.com>.

Hi,

For testing Jackrabbit I (in fact we :)) used the attached classes and 
some junk-XML documents of size 4715740 B. Testing eXist was not so 
complex: we used the demo application provided by the authors of eXist 
and imported the same files with the same procedure as we did for 
Jackrabbit. The report on Jackrabbit performance can be found in this 
mailing list archive. As for the results achieved in eXist, I don't 
have a formal report on them now, but you can easily reproduce those 
tests. The Jackrabbit performance report was based on Jackrabbit 
v. 1.1.1. After that we ran the tests again, using the same procedure 
and Jackrabbit v. 1.2.1, and the results were ca. 20% better. In fact 
the tests should now be run again because of the bundle persistence 
manager.

Looking forward to your reply :)

BR,
Marcin Nowak

Jukka Zitting wrote:
> Hi,
>
> On 4/23/07, Marcin Nowak <ma...@comarch.com> wrote:
>> But that is not the point :) anyone have an idea how to configure
>> Jackrabbit to perform like eXist?
>
> Let's see how well we can do. Given a quick look it seems that eXist
> will certainly beat Jackrabbit in the performance comparison, but I'd
> be interested in seeing how close we can get and what are the limiting
> factors we face.
>
> Could you share the test code you are using for both eXist and 
> Jackrabbit?
>
> BR,
>
> Jukka Zitting
>

Re: eXist

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On 4/23/07, Marcin Nowak <ma...@comarch.com> wrote:
> But that is not the point :) anyone have an idea how to configure
> Jackrabbit to perform like eXist?

Let's see how well we can do. Given a quick look it seems that eXist
will certainly beat Jackrabbit in the performance comparison, but I'd
be interested in seeing how close we can get and what are the limiting
factors we face.

Could you share the test code you are using for both eXist and Jackrabbit?

BR,

Jukka Zitting

Re: eXist

Posted by Marcin Nowak <ma...@comarch.com>.

Hi,

FolDeRol wrote:
> Marcin,
>
> I used to work with eXist 2.5 years ago. JCR and XML:DB concepts are
> actually have some common moments. The reason of eXist's performance 
> is, as
> far as I know, the fact that eXists keeps the whole database as in-memory
> DOM model

I've made some tests and I'm not sure where the DB is being stored. I 
did the following:

1. Started the repository and checked its memory usage.
2. Added 30 MB of XML files.
3. Shut down the repository.
4. Started it again and checked memory usage.

It is about the same as in point 1.

But that is not the point :) Does anyone have an idea how to configure 
Jackrabbit to perform like eXist?

BR,
Marcin Nowak

> and, in addition uses advanced indexes like those that allow quick
> processing of XPath expressions like "/x//y".
>
> Regards
>
> On 4/23/07, Marcin Nowak <ma...@comarch.com> wrote:
>>
>> Hi,
>>
>> Recently I've discovered XML database quite similar in general concepts
>> to Jackrabbit, in fact it does not provide versioning and referencing
>> between nodes but it is really fast as I compared it with Jackrabbit,
>> especially in querying and importing nodes, question is why Jackrabbit
>> performs so badly in comparison to eXist?
>>
>> Project webpage:
>> http://exist.sourceforge.net/
>>
>> BR,
>> Marcin Nowak
>>
>

Re: eXist

Posted by FolDeRol <fo...@gmail.com>.

Marcin,

I used to work with eXist 2.5 years ago. The JCR and XML:DB concepts
actually have some things in common. The reason for eXist's performance is,
as far as I know, the fact that eXist keeps the whole database as an
in-memory DOM model and, in addition, uses advanced indexes like those that
allow quick processing of XPath expressions like "/x//y".

Regards

On 4/23/07, Marcin Nowak <ma...@comarch.com> wrote:
>
> Hi,
>
> Recently I've discovered XML database quite similar in general concepts
> to Jackrabbit, in fact it does not provide versioning and referencing
> between nodes but it is really fast as I compared it with Jackrabbit,
> especially in querying and importing nodes, question is why Jackrabbit
> performs so badly in comparison to eXist?
>
> Project webpage:
> http://exist.sourceforge.net/
>
> BR,
> Marcin Nowak
>