Posted to xindice-users@xml.apache.org by Terry Rosenbaum <Te...@amicas.com> on 2002/09/23 08:25:33 UTC

XPathQuery/ResourceSet Design Limitations?

Hi,

Several people have noted that when attempting to
run a query against a collection containing many documents
and producing many Resources as a result, they encounter
an OutOfMemoryError. Thus, for example, it is not
feasible to run a collection scan over a large database
(e.g. accessing each document). I have noticed this myself.

This is a result of Xindice's implementation of the ResourceSet.
The org.apache.xindice.client.xmldb.embed.CollectionImpl
converts the iterator (NodeSet) returned by the XPathQueryResolver
into a set, buffering the entire contents of the result in memory.
A similar operation occurs on the xmlrpc side of the house.

Undoubtedly, this implementation was necessitated by the
requirements of the xmldb org.xmldb.api.base.ResourceSet implementation.

The ResourceSet interface requires a result count (getSize()),
random access (getResource(long index)), and mutability
(addResource(Resource res) and removeResource(long index)). These
requirements seem to preclude a lazy approach to implementing the
ResourceSet (e.g. as an iterator). Because an XPath query runs against
an undefined data structure, it is not possible to count the result
Resources efficiently, nor to provide random access to them, without
evaluating the entire result set in advance.
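
To make the cost concrete, here is a minimal Java sketch of the difference
between buffering and iterating. It does not use the Xindice or XML:DB
classes; CountingSource and buffer() are names invented for this
illustration, and the source simply counts how many results it was forced
to evaluate.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class EagerVsLazySketch {
    /** Hypothetical stand-in for a query result stream; counts evaluated results. */
    static class CountingSource implements Iterator<String> {
        private final int total;
        int produced = 0;

        CountingSource(int total) { this.total = total; }

        @Override public boolean hasNext() { return produced < total; }

        @Override public String next() {
            if (!hasNext()) throw new NoSuchElementException();
            produced++;
            return "<cd id='" + produced + "'/>";
        }
    }

    /** Eager approach: drain the iterator so that size and random access become possible. */
    static List<String> buffer(Iterator<String> source) {
        List<String> all = new ArrayList<>();
        while (source.hasNext()) all.add(source.next());
        return all;
    }

    public static void main(String[] args) {
        CountingSource eager = new CountingSource(100_000);
        List<String> buffered = buffer(eager);
        System.out.println("eager: size=" + buffered.size() + ", evaluated=" + eager.produced);

        // Lazy approach: only the results actually requested are evaluated.
        CountingSource lazy = new CountingSource(100_000);
        for (int i = 0; i < 3 && lazy.hasNext(); i++) lazy.next();
        System.out.println("lazy: evaluated=" + lazy.produced);
    }
}
```

Answering size() forces every result to be evaluated and held in memory,
while the forward-only loop touches only what it reads.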

I think we need to consider possible solutions to this issue.
One possibility would be to eliminate support for the features
of the ResourceSet necessitating the current implementation (e.g.
the ResourceSet.getSize() method could simply throw an
"unimplemented" exception). This might be a viable option
since the ResourceSet interface definition does not explicitly state
which features must be supported.
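
As a rough sketch of this first option, a result set could refuse the
operations that require full evaluation. The class below is hypothetical
and deliberately independent of org.xmldb.api.base.ResourceSet; it uses
the JDK's UnsupportedOperationException as a stand-in, whereas a real
driver would more likely report the condition through the API's own
XMLDBException.

```java
import java.util.Arrays;
import java.util.Iterator;

class ForwardOnlyResultsSketch {
    /** Hypothetical forward-only view over query results; not the XML:DB ResourceSet. */
    static class ForwardOnlyResults {
        private final Iterator<String> source;

        ForwardOnlyResults(Iterator<String> source) { this.source = source; }

        boolean hasMoreResources() { return source.hasNext(); }

        String nextResource() { return source.next(); }

        /** Counting would require evaluating every result, so it is refused. */
        long getSize() {
            throw new UnsupportedOperationException("getSize() is not implemented for lazy results");
        }

        /** Random access is refused for the same reason. */
        String getResource(long index) {
            throw new UnsupportedOperationException("getResource(index) is not implemented for lazy results");
        }
    }

    public static void main(String[] args) {
        ForwardOnlyResults results =
            new ForwardOnlyResults(Arrays.asList("<cd id='1'/>", "<cd id='2'/>").iterator());
        while (results.hasMoreResources()) System.out.println(results.nextResource());
        try {
            results.getSize();
        } catch (UnsupportedOperationException e) {
            System.out.println("refused: " + e.getMessage());
        }
    }
}
```

The drawback is that existing client code calling getSize() would start
failing at run time rather than at compile time.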

Another option would be to implement an extension such as
a Xindice query service that would provide query results via
a Xindice-specific feature-reduced version of the ResourceSet enabling the
user of the service to visit the result Resources via an iterator
that did not actually retrieve a Resource into memory until
such time as the nextResource() method was invoked. The embedded
version would be fairly simple to build. An xmlrpc version would be a bit
more work (e.g. a tradeoff between network traffic to obtain more
results and the amount of memory utilized on the client side for
buffering results retrieved in chunks - perhaps configurable or
perhaps self-tuning). Aside from more efficient memory utilization,
such an implementation would allow a user to limit the number
of Resources returned by a query.
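
A minimal sketch of what the embedded version of such an iterator might
look like follows. Everything in it is an assumption made for
illustration: LazyResourceIterator is not an existing Xindice class, the
matching results are modeled as a stream of document keys, and the loader
function stands in for whatever actually fetches a document.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Function;

class LazyQuerySketch {
    /** Hypothetical iterator: loads a document only when nextResource() is called. */
    static class LazyResourceIterator {
        private final Iterator<String> keys;            // cheap-to-hold document keys
        private final Function<String, String> loader;  // fetches one document by key
        private long remaining;                         // results still allowed by the limit

        LazyResourceIterator(Iterator<String> keys, Function<String, String> loader, long maxResults) {
            this.keys = keys;
            this.loader = loader;
            this.remaining = maxResults;
        }

        boolean hasMoreResources() { return remaining > 0 && keys.hasNext(); }

        String nextResource() {
            if (!hasMoreResources()) throw new NoSuchElementException();
            remaining--;
            return loader.apply(keys.next());  // the only point where a document is loaded
        }
    }

    public static void main(String[] args) {
        LazyResourceIterator it = new LazyResourceIterator(
            Arrays.asList("doc1", "doc2", "doc3").iterator(),
            key -> "<cd key='" + key + "'/>",
            2);  // caller-imposed limit on the number of results
        while (it.hasMoreResources()) System.out.println(it.nextResource());
    }
}
```

Memory use is then bounded by one Resource at a time, and the limit falls
out of the same mechanism.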

I really am wondering what the designers of the XPathQueryService
and ResourceSet had in mind for working with large collections.

I feel that this is a major issue that limits the usefulness and ultimately
the success of Xindice.

Does anyone else have any thoughts or opinions on this issue?

-Terry

Bastian Fiebig <fi...@reduxnet.de> wrote in xindice-users on 2002-09-17 
at 19:21:05
regarding "Workink with many xml-documents":

> to stress xindice I have a collection with hundreds of thousands small
> xml-documents.
>
> My question is, how to work performant with so many documents.
>
> For example
> // Get all cd elements from the database
> ResourceSet resultSet = service.query("/cd");
> ResourceIterator results = resultSet.getIterator();
>
> I think, there is a problem to receive so many documents at once.
> In JDBC it is possible to limit the size of results. And in JDBC you
> get automatically for example the next 20 results if they are required.
>
> Is there a mechanism who works even?


"Wap Brunei" <br...@hotmail.com> wrote in xindice-dev on 2002-01-02 
at 7:37:08
regarding "java.lang.OutOfMemoryError":

> i have an Person.xml file like this size about 404 byte
>
> <?xml version="1.0" encoding="UTF-8"?>
> <person>
>    <fname>Ken</fname>
>    <lname>Smith</lname>
>    <phone type="work">563-456-7890</phone>
>    <phone type="home">534-567-8901</phone>
>    <email type="home">jsmith@somemail.com</email>
>    <email type="work">john@lovesushi.com</email>
>    <address type="home">34 S. Colon St.</address>
>    <address type="work">9967 W. Shrimp Ave.</address>
> </person>
>
>
> i wrote a java class and insert this to xindice
> for (int i = 1 ; i <= 50000 ; i++)
>      test1.AddDocument("Test"+i);
>
>
> then i make a query
>
> String xpath = "//fname[text()='ken']";
> XPathQueryService service =
>     (XPathQueryService) col.getService("XPathQueryService", "1.0");
> ResourceSet resultSet = service.query(xpath);
> ResourceIterator results = resultSet.getIterator();
> while (results.hasMoreResources()) {
>     Resource res = results.nextResource();
> }
>
> why an error message come out
> java.lang.OutOfMemoryError
>
>     <<no stack trace available>>
>
> Exception in thread "main"


<ma...@yahoo.com> wrote in xindice-dev on 2002-08-28 at 15:56:06
regarding "Out of memory":

> i've tried to do a "load test" over a platform that
> is running using Xindice. If I try to do 100 requests
> (10 ms between each of them) it's ok. But if I
> try the same experiment with 1000 requests
> (following the same distribution) the server crashes
> (out of memory). I have a PC with 512 MB of RAM,
> has anyone found the same problem?

Re: XPathQuery/ResourceSet Design Limitations?

Posted by Jeff Greif <jg...@alumni.princeton.edu>.

Terry,

I think having an extension that returned an iterator rather than a
ResourceSet would be a good idea, and if it works out well, perhaps the
xmldb API should be upgraded to make it standard, rather than an extension.
Having something like an RDBMS cursor is clearly of interest in large-scale
production use.  Perhaps the iterator could be based on the
XPathQueryResolver iterator, and produce the resources on demand.

Once there is an iterator over the result set in a networked environment,
there are the questions about where the state for the iterator is kept, how
it is isolated from other transactions on the collection or documents, and
management of the buffering between client and server (i.e., the user asks
for the next result, but perhaps results come from the server in batches of
100 at a time to reduce network traffic).  Mature RDBMS systems have several
varieties of cursors (client-side vs. server-side, forward-only vs.
browsable, and probably more) for handling various situations that arise
often enough to be worth it to incorporate as a feature.  JDBC tries to make
available a good fraction of these capabilities if the underlying DBMS
supports them.  In the case under discussion, using Java iterators more or
less restricts us to forward-only access (browsable iterators could be
implemented in client code if necessary).  The use case of large result sets
(particularly those which will be only partially viewed a page at a time in
a client UI) suggests some kind of iterator proxy on the client side
accessing an iterator server side, getting batches of results and handing
them to the client one at a time.
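
A minimal Java sketch of such a proxy, with the network layer left out:
ServerCursor and ClientIterator are hypothetical names, fetch() stands in
for one round trip to the server, and the batch size is a parameter the
caller picks. Isolation from concurrent updates and cursor cleanup are
not addressed here.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class BatchingProxySketch {
    /** Hypothetical server-side cursor: holds the iterator state, returns results in batches. */
    static class ServerCursor {
        private final Iterator<String> results;
        int roundTrips = 0;  // counts simulated network calls

        ServerCursor(Iterator<String> results) { this.results = results; }

        List<String> fetch(int batchSize) {
            roundTrips++;
            List<String> batch = new ArrayList<>();
            while (batch.size() < batchSize && results.hasNext()) batch.add(results.next());
            return batch;
        }
    }

    /** Client-side proxy: buffers one batch and hands results out one at a time. */
    static class ClientIterator implements Iterator<String> {
        private final ServerCursor cursor;
        private final int batchSize;
        private final Deque<String> buffer = new ArrayDeque<>();
        private boolean exhausted = false;

        ClientIterator(ServerCursor cursor, int batchSize) {
            if (batchSize < 1) throw new IllegalArgumentException("batchSize must be >= 1");
            this.cursor = cursor;
            this.batchSize = batchSize;
        }

        @Override public boolean hasNext() {
            if (buffer.isEmpty() && !exhausted) {
                List<String> batch = cursor.fetch(batchSize);
                buffer.addAll(batch);
                // A short batch means the server has nothing further to send.
                exhausted = batch.size() < batchSize;
            }
            return !buffer.isEmpty();
        }

        @Override public String next() {
            if (!hasNext()) throw new NoSuchElementException();
            return buffer.poll();
        }
    }

    public static void main(String[] args) {
        List<String> data = new ArrayList<>();
        for (int i = 0; i < 250; i++) data.add("<cd id='" + i + "'/>");
        ServerCursor cursor = new ServerCursor(data.iterator());
        ClientIterator it = new ClientIterator(cursor, 100);
        int seen = 0;
        while (it.hasNext()) { it.next(); seen++; }
        System.out.println("results=" + seen + ", roundTrips=" + cursor.roundTrips);
    }
}
```

The client never holds more than one batch, and a UI that shows only the
first page pays for only the first round trip.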

By the way, the Xindice pages (xml.apache.org/xindice) hide any reference
to the latest sources, or to how to access them through cvs, well enough
that I can't find them today.  You appear to be looking at different
sources than the released 1.0 ones.

Jeff

----- Original Message -----
From: "Terry Rosenbaum" <Te...@amicas.com>
To: "xindice-dev" <xi...@xml.apache.org>;
<xi...@xml.apache.org>
Sent: Sunday, September 22, 2002 11:25 PM
Subject: XPathQuery/ResourceSet Design Limitations?


> Hi,
>
...
> One possibility would be to eliminate support for the features
> of the ResourceSet necessitating the current implementation (e.g.
> the ResourceSet.getSize() method could simply throw an
> "unimplemented" exception). This might be a viable option
> since the ResourceSet interface definition does not explicitly state
> which features must be supported.
>
> Another option would be to implement an extension such as
> a Xindice query service that would provide query results via
> a Xindice-specific feature-reduced version of the ResourceSet enabling the
> user of the service to visit the result Resources via an iterator
> that did not actually retrieve a Resource into memory until
> such time as the nextResource() method was invoked. The embedded
> version would be fairly simple to build. An xmlrpc version would be a bit
> more work (e.g. a tradeoff between network traffic to obtain more
> results and the amount of memory utilized on the client side for
> buffering results retrieved in chunks - perhaps configurable or
> perhaps self-tuning). Aside from more efficient memory utilization,
> such an implementation would allow a user to limit the number
> of Resources returned by a query.
...

Re: XPathQuery/ResourceSet Design Limitations?

Posted by Kimbro Staken <ks...@xmldatabases.org>.

On Sunday, September 22, 2002, at 11:46  PM, Vladimir Bossicard wrote:

>> I feel that this is a major issue that limits the usefulness and
>> ultimately
>> the success of Xindice.
>> Does anyone else have any thoughts or opinions on this issue?
>
> I think that Xindice could send some really valuable feedback to the 
> XMLDb API group.  Maybe the API needs some rethinking.

It does need a lot of work and I've planned to do it for a long time. 
Unfortunately now I don't see it happening since I led that effort and 
I'm really not able to continue it right now.

If there are people who want to pick this up and advance it please join 
the XML:DB API mailing list and start working on it. Once that happens 
we should be able to get people commit access for the XML:DB repository 
to update the necessary pieces. Don't worry about any other people 
there, the project hasn't advanced in a long time.

As for Xindice, it might be a good idea to add a Xindice Java API that 
isn't the XML:DB API. When I was refactoring the XML:DB API drivers and 
adding the embedded driver I had planned to go another step and 
implement the drivers as a thin veneer over a new internal API. I just 
never got around to it.

> But the guys at XMLDb group will know more about the consequences.
>
> my $0.02
>
> -Vladimir
>
> -- 
> Vladimir Bossicard
> www.bossicard.com
>
>
Kimbro Staken
Java and XML Software, Consulting and Writing 
http://www.xmldatabases.org/
Apache Xindice native XML database http://xml.apache.org/xindice
XML:DB Initiative http://www.xmldb.org

Re: XPathQuery/ResourceSet Design Limitations?

Posted by Vladimir Bossicard <vl...@bossicard.com>.

> I feel that this is a major issue that limits the usefulness and ultimately
> the success of Xindice.
> 
> Does anyone else have any thoughts or opinions on this issue?

I think that Xindice could send some really valuable feedback to the 
XMLDb API group.  Maybe the API needs some rethinking.

But the guys at XMLDb group will know more about the consequences.

my $0.02

-Vladimir

-- 
Vladimir Bossicard
www.bossicard.com