Posted to dev@jena.apache.org by Claude Warren <cl...@xenei.com> on 2012/09/01 09:15:33 UTC

Re: A modest proposal for a new class of Collection

Andy,


First, sorry for moving this discussion, but I accidentally sent the
original via the old list and I don't seem to be getting any email from it,
so I am reposting the entire discussion here with new comments.


I am thinking more of high-contention environments with large data sets -
for example, a SPARQL server where users can update the data. In the
current environment all the queries have to create copies of the triples
they are working with or block the updates; similarly, updates block the
reads. From what I have seen in the Jena code, much of the query engine is
designed to work with iterators and "streaming" data. However, the current
solution to the issue of graph modification while iterating is to lock out
all updates until the query is complete. In high-volume update environments
this effectively single-threads the application.


Here are a few use cases:


== simultaneous update via polling threads ==


I have a smaller-scale use case at DERI where we are monitoring multiple
devices via web calls. The application spawns threads to periodically poll
the devices and update the graph accordingly. Our SPARQL server uses that
data in virtually every query -- and some of our queries take several
minutes to execute. So at times all our queries are held up while the
device data is updated. In our application we would be perfectly happy
executing against old data, and we would deal with disappearing nodes in
the same way we deal with any other missing data.


== security model ==


A second use case is a graph authorization system (I am working on one)
where the access authorizations for the graphs are maintained in a single
model. That model would have to be quickly readable to support queries on
the front end and updatable to support the addition of information about
new graphs (and potentially triples or nodes) as they are added.


== current iteration error messages ==


A third use case is suggested by the several messages a week dealing with
the issue of modifying the graph while iterating.



Finally, I think that most of these use cases fall outside of the data
interactions defined by relational database transaction isolation levels
and are more in line with the NoSQL concept of eventual consistency. I
suppose that if the model transactions supported the standard RDBMS
isolation levels this could be faked, but that would require that all
queries and graph dumps create duplicate local data for processing. (My
assumption here is that there is only a model-level lock and that locks are
not pushed down to the triple level -- node-level locking would bring up an
entirely new set of issues, as I think it would involve multiple
triple-level locks.) By duplicate local data I mean that internally the
queries have to build collections of triples that answer some part of the
query and then merge those collections in some way (union, join, etc.).
While it is possible to put many of those collections in memory, streaming
them to disk is necessary for large data sets, and for very large data sets
handling all that data multiple times introduces huge overhead.


=== Implementation Note ===


If you think back several versions of Java, before the dawn of iterators
everybody used Enumerations. An Enumeration does not have the same
restriction on concurrent update that an Iterator does -- it also does not
support removal.

So a quick implementation of the quantum collection is to create an
Enumeration on a collection and then drive the Iterator from that.


I hope this hasn't been too rambling a response. Perhaps I should put it
all down in one place, in one cohesive presentation.


-- Claude


Start of quote >>>



Claude,

Is the problem you are highlighting that of wanting to process (query)
results while updates are being made to the data?

ARQ does now have various support classes for streaming intermediate
results to and from disk - this may be a way to effectively take a copy and
not run out of system resources (streaming to disk and back isn't nearly as
bad as random I/O to disk).

Another possibility is the transaction mechanism for TDB, which allows
multiple readers AND a single concurrent writer.

In transactions, the single writer can be a thread receiving changes
from other threads currently doing read transactions (think actors). The
readers will see a consistent view of the data - no missed or
non-deterministic results. The transaction mechanism uses
write-ahead logging - the changes are made to a copy and flushed back
later.

An in-memory dataset with those transactional characteristics would be
nice. In-memory TDB is not scalable - it exactly follows the disk version,
just swapping in a RAM disk. It's there for testing in the first instance.

That new class of Collection is interesting - do you know of any libraries
we could look at using?

However, it does not strike me as particularly "modest" :-) Jena uses
java.util Collections for general-purpose data structures and has its own,
specialised (and more compact) hash maps for graphs.

Replacing them all would be a significant undertaking, if that is what your
proposal would entail. Getting a bug-free, performant implementation of
hash maps and arrays isn't completely trivial. I'd want to know it did
indeed lead to benefits, not just look better.

> I believe that such a collection and set of iterators would make
> streaming queries of large data sets much easier and faster.

Can we take this as a scenario to discuss? Do you have a concrete use case
we can use to see whether the current system is doing as well as it might?

Andy

On 28/08/12 09:47, Claude Warren wrote:

It seems that one of the major stumbling blocks in building Jena-based
applications is the restriction on iterators found in the iterator
documentation: “The behavior of an iterator is unspecified if the
underlying collection is modified while the iteration is in progress in any
way other than by calling this [remove()] method.” For Jena applications
this means that updates to the underlying graph are prohibited while an
iterator is in operation. Iterators are commonly used in query operations.

One workaround for this has been to copy the results of a query operation
into a local collection and then iterate across that. The net result of
this workaround is excessive consumption of resources to store the data in
flight as it is being iterated over.
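
(A minimal sketch of this workaround - the class and method names are
mine; toList() comes from Jena's ExtendedIterator:)

    import java.util.List;

    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.Statement;

    public class SnapshotSketch {
        // Sketch: copy the statements into a local collection first, so
        // later updates to the model cannot invalidate the iteration.
        // The cost is memory proportional to the result size.
        static void processSnapshot(Model model) {
            List<Statement> snapshot = model.listStatements().toList();
            for (Statement s : snapshot) {
                // updating the model here is now safe
            }
        }
    }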

A second workaround has been to use read and write locks and/or
transactions to prohibit writing to a graph while the graph is being
queried. This solution works well when the number of updates to the graph
is small in comparison to the number of queries. As the number of updates
increases, the data store effectively becomes single-threaded.
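
(And a minimal sketch of the lock-based workaround, using Jena's
critical-section API; the wrapper class and method are mine:)

    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.shared.Lock;

    public class LockSketch {
        // Sketch: hold a read lock for the whole iteration; writers take
        // Lock.WRITE and block until every reader has left.
        static void readLocked(Model model) {
            model.enterCriticalSection(Lock.READ);
            try {
                // iterate/query the model here; updates are blocked
            } finally {
                model.leaveCriticalSection();
            }
        }
    }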

I propose that a third workaround, a new class of Collection, be
considered. I call this a quantum collection, as the items in the
collection do not exist until they are examined. Conceptually, the
collection is defined in terms of constraints on the objects within it. The
iterator simply retrieves one of the objects and presents it. Since the
collection does not exist in the classical sense, there is no modification
of the collection – save for the modification of the constraints that
define it.

A Java implementation of the Iterator interface on such a collection makes
the following contract:

hasNext(): Locates an element in the collection to return and returns true
if one was found, false otherwise. This method identifies the element as
part of the collection. The iterator guarantees that the element was in the
collection when the collection was examined. It makes no guarantees that
the element is still in the collection or even that the collection contains
any elements at all.

next(): returns the element located by hasNext() or throws a
NoSuchElementException if there are no more elements. If hasNext() was not
called, next() will call hasNext() to locate the object or determine that
none exists.

remove(): not implemented; always throws UnsupportedOperationException.

Notice that the collection makes no assumptions about the order or
uniqueness of the objects presented by the iterator. The iterator may be
constructed so as to return unique objects, or the constraint on the
collection may introduce ordering or uniqueness.
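
A minimal sketch of that contract (the class name is mine), assuming an
underlying Enumeration-like source that can repeatedly locate one of the
objects satisfying the constraints:

    import java.util.Enumeration;
    import java.util.Iterator;
    import java.util.NoSuchElementException;

    public class QuantumIterator<T> implements Iterator<T> {
        private final Enumeration<T> source;
        private T pending;          // element located by hasNext()
        private boolean hasPending; // true while 'pending' is unconsumed

        public QuantumIterator(Enumeration<T> source) {
            this.source = source;
        }

        // Locates an element and caches it; returns true if one was
        // found. The element was in the collection when examined - no
        // guarantee that it still is.
        public boolean hasNext() {
            if (!hasPending && source.hasMoreElements()) {
                pending = source.nextElement();
                hasPending = true;
            }
            return hasPending;
        }

        // Returns the element located by hasNext(), locating one itself
        // if hasNext() was not called first.
        public T next() {
            if (!hasNext())
                throw new NoSuchElementException();
            hasPending = false;
            return pending;
        }

        // Not implemented: always throws.
        public void remove() {
            throw new UnsupportedOperationException();
        }
    }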

I believe that such a collection and set of iterators would make streaming
queries of large data sets much easier and faster. The trade-off is that
items that are added or removed over the course of the iteration may or may
not be included in the iteration. That is, the number of elements returned
by the iterator is indeterminate until all elements have been returned.

Thoughts?

Claude Warren


<< end of quote
-- 
I like: Like Like - The likeliest place on the web<http://like-like.xenei.com>
Identity: https://www.identify.nu/user.php?claude@xenei.com
LinkedIn: http://www.linkedin.com/in/claudewarren

Re: A modest proposal for a new class of Collection

Posted by Andy Seaborne <an...@apache.org>.
Claude,

I think the key here is to design for high-frequency update in the 
presence of long-running queries.

TDB's transaction implementation design is slanted to that workload, 
rather than to very large updates.  It does this because the changes are 
journalled, unlike some databases which shadow the read data and perform 
undo processing.  TDB's transaction mechanism is not a lock-based one 
(far too hard!).

Queries are not held up by updates.

Oddly, it can mean that updating in transactions is faster than updating 
non-transactionally because of grouped writes - even before TDB adds 
async write-back. (The current code now buffers writebacks - the write 
transaction completes and the app continues before writeback.)

Other RDF systems don't do this - it's all about choices in the design 
space.  All systems are making choices.

The same could be done for an in-memory implementation of DatasetGraph. 
It's not about the choice of collections to implement the DatasetGraph 
but the algorithms.

A simple version is a pair of (DatasetGraph, changes), writing the 
changes back at quiet times.  If there are no quiet times, your system 
has more load than capacity - always a situation that needs dealing with.
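
(A minimal sketch of that shape - the wrapper class is hypothetical, not
a Jena one; only DatasetGraph and Quad are Jena types:)

    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;

    import com.hp.hpl.jena.sparql.core.DatasetGraph;
    import com.hp.hpl.jena.sparql.core.Quad;

    public class BufferedChanges {
        // Readers query 'base' directly; the single writer queues
        // changes; writeBack() applies them at a quiet time.
        private final DatasetGraph base;
        private final Queue<Change> pending = new ConcurrentLinkedQueue<Change>();

        public BufferedChanges(DatasetGraph base) { this.base = base; }

        static final class Change {
            final boolean isAdd; final Quad quad;
            Change(boolean isAdd, Quad quad) { this.isAdd = isAdd; this.quad = quad; }
        }

        public void add(Quad q)    { pending.add(new Change(true, q)); }
        public void delete(Quad q) { pending.add(new Change(false, q)); }

        // Call when there are no active readers.
        public void writeBack() {
            for (Change c; (c = pending.poll()) != null; ) {
                if (c.isAdd) base.add(c.quad); else base.delete(c.quad);
            }
        }
    }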

I don't understand a number of your points:

Re: Eventual consistency

This is a good thing but in the NoSQL world there is something else 
going on as well.  The data model of the storage is related to the data 
model of the application.

Suppose that an app adds two triples, for example, the recorded 
temperature and the timestamp of the measurement. Seeing just the 
temperature, or just the timestamp, or seeing temperatures for two 
events, but no timestamps yet, pushes the burden back onto the 
application to handle more situations.

Concretely, the app has to cope with:

SELECT * { ?x :temperature ?t }

working but

SELECT * { ?x :temperature ?t ;
              :timestamp   ?ts }

not.  That pushes the burden onto the app.

In the NoSQL eventual consistency world, something else is going on. 
Take a key-value store - the (key,value) pair is usually some entity 
that the application is working with, e.g. the whole temperature 
measurement.

Triple level is very fine-grained: the old-fashioned database version of 
this is row-level locking.  While this maximises concurrency, typically 
the overhead of locking a few bytes of user data is prohibitive.

Re: Enumerators vs Iterators

Whether ConcurrentModificationExceptions (CCMEs) arise is not to do with 
the iterator but with the data structure being accessed.  CCMEs were 
added in Java 1.2 so that inconsistency would be signalled.

Old-style Hashtables don't throw CCME - instead they can crash.  The 
implementation of Iterator next() is:

         public T next() {
             // fail-fast check: the Iterator view, added in Java 1.2
             if (modCount != expectedModCount)
                 throw new ConcurrentModificationException();
             // nextElement() is the old Enumeration path - no such check
             return nextElement();
         }

and if you look at nextElement it treads very carefully through the 
implementation structure.  Hashtable does use synchronized.

ConcurrentHashMap is interesting, but implementing a graph or dataset 
means coordinating several of these data structures.


But we come back to the point that it's about the storage, not the 
iterators (or collections - TDB does not use Java collections).  If the 
collections were inconsistent, then a triple would be in one index but 
not another, which might lead to some interesting effects!
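
(A hypothetical illustration, not Jena code: each index below is
thread-safe on its own, yet a reader running between the two add() calls
sees the triple in one index but not the other.)

    import java.util.Collections;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    public class TwoIndexStore {
        // Two independently thread-safe indexes over the same triples.
        private final Set<String> spo =
            Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());
        private final Set<String> pos =
            Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

        public void add(String s, String p, String o) {
            spo.add(s + "|" + p + "|" + o);
            // a concurrent reader here sees the triple in spo but not pos
            pos.add(p + "|" + o + "|" + s);
        }

        public boolean findBySubject(String s, String p, String o) {
            return spo.contains(s + "|" + p + "|" + o);
        }

        public boolean findByPredicate(String s, String p, String o) {
            return pos.contains(p + "|" + o + "|" + s);
        }
    }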

So I agree that supporting high-frequency update/long read is 
interesting but disagree it's a matter of changing the collection 
implementation.  A cleaner and more scalable solution is possible, with 
less application burden.

	Andy

