You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@jena.apache.org by Claude Warren <cl...@xenei.com> on 2012/08/28 10:47:12 UTC

A modest proposal for a new class of Collection

It seems that one of the major stumbling blocks in building Jena based
applications is that iterators is found in the iterator documentation: “The
behavior of an iterator is unspecified if the underlying collection is
modified while the iteration is in progress in any way other than by
calling this [remove()] method.” For Jena applications this means that
updates to the underlying graph are prohibited while an iterator is in
operation. Iterators are commonly used in query operations.

One work around for this has been to create a copy of the results of a
query operation into a local collection and then iterate across that. The
net result of this work around is excessive consumption of resources to
store the data in flight as it is being iterated over.

A second work around has been to use read and write locks and/or
transactions to prohibit writing to a graph while the graph is being
queried. This solution works well when the number of updates to the graph
is small in comparison to the number of queries. When the number of updates
increases the data store effectively becomes single threaded.

I propose a third work around, a new class of Collection, be considered. I
call this a quantum collection as the items in the collection do not exist
until they are examined. Conceptually, the collection is defined in terms
of constraints on the objects within it. The iterator simply retrieves one
of the objects and presents it. Since the collection does not exist in the
classical sense there is no modification of the collection – save for the
modification of the constraints that define it.

In a Java implementation of the Iterator interface on such a collection
makes the following contract:

hasNext(): Locates an element in the collection to return and returns true
if one was found, false otherwise. This method identifies the element as
part of the collection. The iterator guarantees that the element was in the
collection when the collection was examined. It makes no guarantees that
the element is still in the collection or even that the collection contains
any elements at all.

next(): returns the element located by hasNext() or throws a
NoSuchElementException if there are no more elements. If hasNext() was not
called, next will call hasNext() to locate the object or determine that
none exists.

remove(): not implemented always throws UnsupportedOperationException

Notice that the collection makes no assumptions about the order or
uniqueness of the objects presented by the iterator. The iterator may be
constructed so as to return unique objects, or the constraint on the
collection may introduce ordering or uniqueness.

I believe that such a collection and set of iterators would make streaming
queries of large data sets much easier and faster.   The trade off is that
items that are added or removed over the course of the iteration may or may
not be included in the iterator.  That is, the number of elements returned
by the iterator is indeterminate until all elements have been returned.

Thoughts?

Claude Warren

-- 
I like: Like Like - The likeliest place on the web<http://like-like.xenei.com>
Identity: https://www.identify.nu/user.php?claude@xenei.com
LinkedIn: http://www.linkedin.com/in/claudewarren

Re: A modest proposal for a new class of Collection

Posted by Andy Seaborne <an...@apache.org>.

Claude,

The problem you are highlighting is that wanting to process (query) 
results in a way that needs updates to the data?

ARQ does now have various support classes for streaming intermediate 
results to and from disk - this may be a way to effectively take a copy 
and not run out of system resources (streaming to disk and back isn't 
nearly as bad as random I/O to disk).

Another possibility is the transaction mechanism for TDB allows multiple 
readers AND a single concurrent writer.

In transactions, the single writer can be a thread thread receiving 
changes from other thread currently doing a read transactions (think 
actors).  The readers will see a consistent view of the data - no missed 
or non-deterministic results.  The transaction mechanism uses 
write-ahead-logging - the changes are made to a copy, and flushed back 
later.

A in-memory dataset with those transactional characteristics would be 
nice.  In-memory TDB is not scalable - it exactly follwos the disk 
versions, just swaps in a RAM disk.  It's there for testing in the first 
instance.

That new class of Collection is interesting, do you know of any 
libraries to look at using?

Hwover, it does not pass as particularly "modest" :-) to me.   Jena uses 
java.util Collections for general purpose datastructures and has it's 
own, specialised (and more compact) hash maps for graphs.

Replacing them all would be a significant undertaking, if that is what 
your proposal would entail.  Getting a bug-free, performant 
implementation of hash maps and arrays isn't completely trivial.  I'd 
want to know it did indeed lead to benefits, not just look better.

 > I believe that such a collection and set of iterators would make
 > streaming queries of large data sets much easier and faster.

Can we take this as a scenario to discuss?  Do you have a concrete use 
case we can use to see if the current system is having as best it might?

	Andy

On 28/08/12 09:47, Claude Warren wrote:
> It seems that one of the major stumbling blocks in building Jena based
> applications is that iterators is found in the iterator documentation: “The
> behavior of an iterator is unspecified if the underlying collection is
> modified while the iteration is in progress in any way other than by
> calling this [remove()] method.” For Jena applications this means that
> updates to the underlying graph are prohibited while an iterator is in
> operation. Iterators are commonly used in query operations.
>
> One work around for this has been to create a copy of the results of a
> query operation into a local collection and then iterate across that. The
> net result of this work around is excessive consumption of resources to
> store the data in flight as it is being iterated over.
>
> A second work around has been to use read and write locks and/or
> transactions to prohibit writing to a graph while the graph is being
> queried. This solution works well when the number of updates to the graph
> is small in comparison to the number of queries. When the number of updates
> increases the data store effectively becomes single threaded.
>
> I propose a third work around, a new class of Collection, be considered. I
> call this a quantum collection as the items in the collection do not exist
> until they are examined. Conceptually, the collection is defined in terms
> of constraints on the objects within it. The iterator simply retrieves one
> of the objects and presents it. Since the collection does not exist in the
> classical sense there is no modification of the collection – save for the
> modification of the constraints that define it.
>
> In a Java implementation of the Iterator interface on such a collection
> makes the following contract:
>
> hasNext(): Locates an element in the collection to return and returns true
> if one was found, false otherwise. This method identifies the element as
> part of the collection. The iterator guarantees that the element was in the
> collection when the collection was examined. It makes no guarantees that
> the element is still in the collection or even that the collection contains
> any elements at all.
>
> next(): returns the element located by hasNext() or throws a
> NoSuchElementException if there are no more elements. If hasNext() was not
> called, next will call hasNext() to locate the object or determine that
> none exists.
>
> remove(): not implemented always throws UnsupportedOperationException
>
> Notice that the collection makes no assumptions about the order or
> uniqueness of the objects presented by the iterator. The iterator may be
> constructed so as to return unique objects, or the constraint on the
> collection may introduce ordering or uniqueness.
>
> I believe that such a collection and set of iterators would make streaming
> queries of large data sets much easier and faster.   The trade off is that
> items that are added or removed over the course of the iteration may or may
> not be included in the iterator.  That is, the number of elements returned
> by the iterator is indeterminate until all elements have been returned.
>
> Thoughts?
>
> Claude Warren
>