Posted to xindice-users@xml.apache.org by Brendan Laing <Br...@biise.com> on 2006/11/27 13:34:45 UTC

What is the implementation architecture regarding collections...?

Hi,

I've been using Xindice for a few weeks now and have started to puzzle
over the following question. If I have many XML documents and store each
in its own collection, I'll have a disk space problem due to the 4MB tbl
files (like a similar user who posted a mail on 2006-11-08 15:24:51
titled 'pagesize/pagecount change in beta4').

However, if I aggregate XML documents into a collection, I'm concerned
about issues such as record locking and performance. To discuss the
issue let's suppose the following:

1) I have 10k XML documents, 5k on domain xyz.com and 5k on abc.com.
Each domain will have a Xindice server running, so at least one
collection per domain would be required.
2) An application sits on xyz and accesses documents via the embedded
interface. Each read or update opens the collection and closes it. If
thread A opens the collection, and thread B tries to access and close
it before A has finished, will we experience locking or
synchronisation issues? Or is locking at node level in the BTree?
3) The application on xyz accesses documents on abc.com over HTTP (via
the xml_rpc interface). We naturally try to reduce network traffic and
bundle updates to improve response times (the cost of the xml_rpc call
exceeds that of the Java applications at each end). However, by using a
single collection (that is continually opened and closed?) versus many
smaller collections, do we incur a penalty for reading and writing a
larger document, parsing it in and out of Xindice?

Any views, opinions?

Brendan

Re: What is the implementation architecture regarding collections...?

Posted by Vadim Gritsenko <va...@reverycodes.com>.
Brendan Laing wrote:
> Vadim, thank you for your reply. I'll make the necessary adjustments
> to accommodate your feedback. The examples in the online developer
> notes strongly recommend that collections are closed after every query
> using finally { col.close(); }. Are you contradicting this by saying the
> collection can be closed on application shutdown? If there's no risk
> or penalty I'd prefer to leave the collection open.

The wording on that page about the finally block is relevant for Xindice 1.0;
Xindice 1.1 is different. It is still recommended to close the collection, but
you do not have to do it right away; you can keep it open for extended periods
of time.

What is important, though, when embedding the Xindice database into another
application, is to call shutdown to close all open collections and flush any
unwritten data before the JVM exits.
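
A minimal sketch of the "flush before the JVM exits" pattern described above. EmbeddedDatabase and its methods are hypothetical stand-ins invented for illustration, not the Xindice API; only Runtime.addShutdownHook is a real JDK call.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical embedded database facade; NOT the Xindice API.
class EmbeddedDatabase {
    private final List<String> unwritten = new ArrayList<>();
    private final List<String> onDisk = new ArrayList<>();
    private boolean running = true;

    // Buffers a record in memory; it is not durable until shutdown() runs.
    synchronized void write(String record) {
        if (!running) {
            throw new IllegalStateException("database is shut down");
        }
        unwritten.add(record);
    }

    // Flushes buffered records and marks the database as stopped.
    // Calling it a second time is harmless.
    synchronized void shutdown() {
        onDisk.addAll(unwritten);
        unwritten.clear();
        running = false;
    }

    synchronized int flushedCount() {
        return onDisk.size();
    }

    synchronized boolean isRunning() {
        return running;
    }
}

class ShutdownDemo {
    public static void main(String[] args) {
        EmbeddedDatabase db = new EmbeddedDatabase();
        // Register the flush first, so it runs when the JVM exits normally.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            db.shutdown();
            System.out.println("flushed " + db.flushedCount() + " records");
        }));
        db.write("<doc id='1'/>");
        db.write("<doc id='2'/>");
    }
}
```

A shutdown hook runs on normal exit and on an ordinary termination signal, but not if the process is killed outright, so an explicit shutdown call in the application's own stop path is still worth having.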


> Could I suggest that the analogy between a collection and an RDBMS table be
> added to the Wiki?

Please do. Wiki gives all the power to the users.

Vadim




Re: What is the implementation architecture regarding collections...?

Posted by Brendan Laing <Br...@biise.com>.
Vadim, thank you for your reply. I'll make the necessary adjustments
to accommodate your feedback. The examples in the online developer
notes strongly recommend that collections are closed after every query
using finally { col.close(); }. Are you contradicting this by saying the
collection can be closed on application shutdown? If there's no risk
or penalty I'd prefer to leave the collection open.

Could I suggest that the analogy between a collection and an RDBMS table be
added to the Wiki?


Re: What is the implementation architecture regarding collections...?

Posted by Vadim Gritsenko <va...@reverycodes.com>.
Brendan Laing wrote:
> Hi,
> 
> I've been using Xindice for a few weeks now and have started to puzzle
> over the following question. If I have many XML documents and store each
> in its own collection, I'll have a disk space problem due to the 4MB tbl
> files (like a similar user who posted a mail on 2006-11-08 15:24:51
> titled 'pagesize/pagecount change in beta4').

Storing one document per collection is the wrong approach. In an XML:DB
database, a collection is intended to store lots of documents. It is similar
to how a single RDBMS table stores multiple records.
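
To put rough numbers on the disk space concern, here is a back-of-envelope sketch. It assumes, as the question describes, that every collection allocates one 4 MB tbl file up front; it ignores the space taken by the documents themselves, and the class and method names are made up for this example.

```java
// Back-of-envelope disk estimate. The 4 MB per-collection figure comes from
// the question in this thread; everything else is illustrative.
class CollectionDiskEstimate {
    static final long TBL_BYTES = 4L * 1024 * 1024; // one 4 MB tbl file per collection

    // Minimum disk footprint in bytes if each collection allocates one tbl file.
    static long minimumBytes(long collections) {
        return collections * TBL_BYTES;
    }

    public static void main(String[] args) {
        long onePerDocument = minimumBytes(10_000); // 10k documents, one collection each
        long onePerDomain = minimumBytes(2);        // one collection per domain
        System.out.println("one collection per document: " + onePerDocument / (1024 * 1024) + " MB");
        System.out.println("one collection per domain:   " + onePerDomain / (1024 * 1024) + " MB");
    }
}
```

Under that assumption the per-document layout reserves about 40,000 MB before any data is stored, against 8 MB for one collection per domain.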


> However, if I aggregate XML documents into a collection, I'm concerned
> about issues such as record locking and performance. To discuss the
> issue let's suppose the following:
> 
> 1) I have 10k XML documents, 5k on domain xyz.com and 5k on abc.com.
> Each domain will have a Xindice server running, so at least one
> collection per domain would be required.

If you have two Xindice servers, they should be two different server
installations, each with its own config.xml file and its own directory for
the database files.

Multiple Xindice servers must never share the same database files.

You should have either one Xindice server with multiple collections, or
multiple servers (each with one or many collections, whatever suits your
needs).


> 2) An application sits on xyz and accesses documents via the embedded
> interface. Each read or update opens the collection and closes it.

You don't have to open/close the collection for each operation. A collection
can be opened once, used by multiple threads, and closed on application
shutdown. Opening/closing a collection in the client API does not cause the
collection to be opened/closed in the database itself.
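
As an illustration of "open once, share across threads, close at shutdown", here is a minimal sketch. SharedCollection is a hypothetical in-memory stand-in, not the Xindice or XML:DB client API; the thread pool calls are standard java.util.concurrent.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a client-side collection handle; NOT the Xindice API.
class SharedCollection implements AutoCloseable {
    private final Map<String, String> documents = new ConcurrentHashMap<>();
    private volatile boolean open = true;

    void store(String key, String xml) {
        if (!open) {
            throw new IllegalStateException("collection is closed");
        }
        documents.put(key, xml);
    }

    int size() {
        return documents.size();
    }

    @Override
    public void close() {
        open = false;
    }
}

class SharedHandleDemo {
    // Opens one handle, lets several worker threads store documents through it,
    // closes it once at the end, and returns how many documents were stored.
    static int run(int threads, int perThread) {
        try (SharedCollection col = new SharedCollection()) {
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            for (int t = 0; t < threads; t++) {
                final int id = t;
                pool.submit(() -> {
                    for (int i = 0; i < perThread; i++) {
                        col.store("doc-" + id + "-" + i, "<doc/>");
                    }
                });
            }
            pool.shutdown();
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
            return col.size();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("documents stored: " + run(4, 250));
    }
}
```

The point of the sketch is the lifetime of the handle: it is created once, no worker opens or closes it, and the single close happens after all work is done.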


> If thread A opens the collection, and thread B tries to access and
> close it before A has finished, will we experience locking or
> synchronisation issues?

No.


> Or is locking at node level in the BTree?

There is no locking implemented in Xindice (one client cannot prevent another
from modifying a document), but there is synchronization (which prevents data
corruption when multiple threads are writing to the database). It is done at
levels deeper than the CollectionImpl classes.
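
The difference between synchronization and locking can be sketched as follows. SynchronizedStore is a hypothetical toy store, not Xindice code: its methods are synchronized so concurrent writers cannot corrupt the map, but nothing stops one client from overwriting another's update.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical document store; NOT Xindice code.
class SynchronizedStore {
    private final Map<String, String> docs = new HashMap<>();

    // synchronized: only one thread touches the map at a time,
    // so the data structure itself stays consistent.
    synchronized void put(String key, String xml) {
        docs.put(key, xml);
    }

    synchronized String get(String key) {
        return docs.get(key);
    }
}

class LastWriterWinsDemo {
    // Replays two clients updating the same document without any lock.
    static String run() {
        SynchronizedStore store = new SynchronizedStore();
        store.put("order-1", "<order status='new'/>");

        String seenByA = store.get("order-1");                    // client A reads
        store.put("order-1", "<order status='paid'/>");           // client B overwrites
        store.put("order-1", seenByA.replace("new", "shipped"));  // A writes back its stale copy

        return store.get("order-1");
    }

    public static void main(String[] args) {
        // Client B's "paid" update is silently lost: the last writer wins.
        System.out.println(run());
    }
}
```

If an application needs to protect a read-modify-write sequence like client A's, it has to coordinate that itself, since the store in this sketch (like the behaviour described above) only guards individual operations.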


> 3) The application on xyz accesses documents on abc.com over HTTP (via
> the xml_rpc interface). We naturally try to reduce network traffic and
> bundle updates to improve response times (the cost of the xml_rpc call
> exceeds that of the Java applications at each end). However, by using a
> single collection (that is continually opened and closed?) versus many
> smaller collections, do we incur a penalty for reading and writing a
> larger document, parsing it in and out of Xindice?

Lots of smaller collections will require more operating system resources (such
as file descriptors). Smaller collections are also harder to query: Xindice
does not implement cross-collection querying. Parsing a document from a small
or a large collection takes exactly the same amount of time.
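
Without cross-collection querying, a client that spreads documents over many collections has to query each one and merge the results itself. A minimal sketch of that fan-out, with each collection modelled as a plain list of strings (an assumption for illustration; none of these names come from Xindice):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

class FanOutQuery {
    // Runs the same filter against every collection and merges the hits,
    // tagging each hit with the collection it came from.
    static List<String> queryAll(Map<String, List<String>> collections,
                                 Predicate<String> matches) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : collections.entrySet()) {
            for (String doc : entry.getValue()) { // one pass per collection
                if (matches.test(doc)) {
                    hits.add(entry.getKey() + ": " + doc);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, List<String>> collections = new LinkedHashMap<>();
        collections.put("orders-2005", Arrays.asList("<order status='paid'/>", "<order status='new'/>"));
        collections.put("orders-2006", Arrays.asList("<order status='paid'/>"));

        for (String hit : queryAll(collections, doc -> doc.contains("paid"))) {
            System.out.println(hit);
        }
    }
}
```

With a single larger collection the same question is one query, which is the practical cost of splitting data across many small collections.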

Vadim