Posted to dev@jena.apache.org by "ajs6f@virginia.edu" <aj...@virginia.edu> on 2015/09/03 22:15:11 UTC

Re: JENA-624: "Develop a new in-memory RDF Dataset implementation"

> Probably a lot of codesign is possible if the index type is fixed, leading to simpler and better-performing code, but it would be nice to have a pluggable approach so that you can pick the structures that work for your workload. 

This is certainly very true. I am trying to develop some good abstractions to this end.

So far the only sticking point is DatasetGraph::listGraphNodes. For the moment, I have added a listGraphNodes method to the index abstraction (alongside a find method that returns Iterator<Quad> to support DatasetGraph::find and ::findNG), because a map-based approach gives O(1) entry into iteration, and I do want to be able to take advantage of that performance. But I will continue to look for a better approach.
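
Roughly, the abstraction currently looks like this (a simplified sketch; names are provisional and the branch code differs in detail):

    import java.util.Iterator;
    import org.apache.jena.graph.Node;
    import org.apache.jena.sparql.core.Quad;

    interface QuadIndex {
        void add(Quad q);
        void delete(Quad q);

        // any slot may be Node.ANY; this backs DatasetGraph::find and ::findNG
        Iterator<Quad> find(Node g, Node s, Node p, Node o);

        // exposed directly so that a map-based implementation can answer from
        // its top-level keys in O(1) instead of scanning quads
        Iterator<Node> listGraphNodes();
    }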

---
A. Soroka
The University of Virginia Library

On Aug 31, 2015, at 11:20 AM, Paul Houle <on...@gmail.com> wrote:

> The thing about bloom filters is that they are (i) highly dense because
> they are bit packed and (ii) accessed in sequential order so that modern
> processors can fetch ahead and hide RAM latency.
> 
> Most kinds of indexes,  on the other hand,  involve pointer chasing over an
> area that does not fit in the cache,  so your brand new processor is
> performing like a Pentium II.  Even if the O() part is better for some
> other algorithm,  the constant in front is awesome for the bloom filter.
> 
> Probably a lot of codesign is possible if the index type is fixed, leading
> to simpler and better-performing code, but it would be nice to have a
> pluggable approach so that you can pick the structures that work for your
> workload.  Also from a research perspective we could get real questions
> 
> There are a lot of patents in the 2000 or so time period on systems that
> profile the queries and automatically build indexes.  For instance,
> Salesforce.com has a patent,  and they also have an implementation where
> the "master copy" of data is stored in an Oracle table that looks a lot
> like a triple store and then it automatically creates materialized views
> and functional indexes in Oracle to support client workloads.
> 
> (I remember writing a query against one of the larger Salesforce orgs and
> being frustrated when it timed out,  only to run the same query two minutes
> later and have it work perfectly.)
> 
> Patents do expire, and it won't be too long before this technology could
> make it into open source tools.
> 
> On Mon, Aug 31, 2015 at 10:04 AM, Claude Warren <cl...@xenei.com> wrote:
> 
>> step 3 is relatively expensive if there are a lot of quads with the G and S
>> values (e.g. *,G,*,S) but few with G,S,*,*
>> 
>> Step 3 is about removing the false positives from the bloom filter.  It
>> does not require an index; it requires checking the values to ensure a match.
>> 
>> There is a time-to-space trade-off here, but the time is not as much as
>> one would think.  Keep in mind that a bloom filter check is very fast.  On
>> the other hand, the traversal of a b-tree or other index is much more
>> computationally complex.
>> 
>> For example, I know that if you devise a mechanism by which you can
>> index bloom filters such that you have a 2-segment key, and you know that
>> you want to check every key where the compound key is greater than the
>> one you are looking for, it is often faster to do the linear search through
>> the data, even when dealing with 1 million entries.  There are special cases
>> where the index is faster, but for the most part the linear search is
>> faster.
>> 
>> Claude
>> 
>> On Mon, Aug 31, 2015 at 1:53 PM, ajs6f@virginia.edu <aj...@virginia.edu>
>> wrote:
>> 
>>> I'm still a bit confused as to why you don't regard step 3 as being
>>> potentially very expensive. In order to verify a match, we will have to
>>> examine an "exact" index, and that (as Andy remarked) is likely to
>>> require traversal, or else we throw away all the space gains.
>>> 
>>> Is this technique a way to pay a lot of time for a lot of space savings?
>>> Perhaps it is appropriate for an alternative implementation for very
>>> large datasets?
>>> 
>>> ---
>>> A. Soroka
>>> The University of Virginia Library
>>> 
>>> On Aug 31, 2015, at 6:48 AM, Claude Warren <cl...@xenei.com> wrote:
>>> 
>>>> to implement find(G,S,*,*) with bloom filters and return an iterator, you:
>>>> 
>>>>  1. construct a bloom filter with G and S
>>>>  2. scan the list of quads checking for matches.
>>>>  3. for each result that matches, verify that it has G and S (I have
>>>>  done this with an extended iterator in Jena)
>>>> 
>>>> result is an iterator that returns all (G,S,*,*) quads.
>>>> 
>>>> similar tests can be performed for any pattern -- same code used.
>>>> 
>>>> Step 2 is the expensive one.  But the bloom filter check is so efficient
>>>> that it becomes very difficult to perform search operations in less time
>>>> than it takes to scan the list.
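
To make the shape concrete, here is a minimal sketch of steps 1-3 using a 64-bit filter per quad (Claude's POC presumably differs in filter sizing and hashing; quadFilters is assumed to be precomputed with quadFilter, one entry per quad):

    import java.util.List;
    import java.util.stream.IntStream;
    import java.util.stream.Stream;
    import org.apache.jena.graph.Node;
    import org.apache.jena.sparql.core.Quad;

    class BloomScan {
        // two probes per node into a 64-bit filter (illustrative sizing)
        static long bits(Node n) {
            int h = n.hashCode();
            return (1L << (h & 63)) | (1L << ((h >>> 8) & 63));
        }

        // a quad's filter ORs the bits of all four slots
        static long quadFilter(Quad q) {
            return bits(q.getGraph()) | bits(q.getSubject())
                 | bits(q.getPredicate()) | bits(q.getObject());
        }

        // find(G,S,*,*): step 1 builds the pattern filter, step 2 scans with a
        // cheap bit test, step 3 strips false positives by exact comparison
        static Stream<Quad> findGS(List<Quad> quads, long[] quadFilters, Node g, Node s) {
            long pattern = bits(g) | bits(s);
            return IntStream.range(0, quads.size())
                    .filter(i -> (quadFilters[i] & pattern) == pattern)
                    .mapToObj(quads::get)
                    .filter(q -> g.equals(q.getGraph()) && s.equals(q.getSubject()));
        }
    }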
>>>> 
>>>> Claude
>>>> 
>>>> On Mon, Aug 31, 2015 at 11:01 AM, Andy Seaborne <an...@apache.org>
>> wrote:
>>>> 
>>>>> On 29/08/15 14:55, Claude Warren wrote:
>>>>> 
>>>>>> Something I have been thinking about....
>>>>>> 
>>>>>> you could replace GSPO, GOPS, SPOG, OSGP, PGSO, OPSG with a single
>>>>>> bloomfilter implementation.  It means a 2-step process to find matches,
>>>>>> but it might be fast enough and reduce the overhead significantly.
>>>>>> 
>>>>>> I did an in-memory and a relational DB based version recently, but it
>>>>>> was just a quick POC.
>>>>>> 
>>>>>> Claude
>>>>>> 
>>>>> 
>>>>> So we're talking about in-memory, where the items are Java classes.  A
>>>>> quad is 2 slots Java overhead + 4 slots for G, S, P, O pointers.
>>>>> That's 48 bytes if the heap is >32G and 24 bytes otherwise (compressed
>>>>> pointers or 32 bit).
>>>>> 
>>>>> For storage, the key test is "contains" to maintain the "set of"
>>>>> semantics.  Something to stop index traversal for each insert would be
>>>>> great, but the quad is still stored, and storing it once, not up to 6
>>>>> times, would be good. (Note that most data is unique quads.)
>>>>> 
>>>>> The important retrieval operation is find(G,S,P,O), where any of those
>>>>> can be a wildcard, returning (ideally as a stream) all matching quads
>>>>> with a prefix.  The multiple indexes exist to find based on prefix.
>>>>> 
>>>>> How would that work for, say, find(G,S,*,*) with bloom filters and 1b
>>>>> quads?  How does the code go from returning G,S,P1,O1 to the next
>>>>> G,S,P1,O2 without trying every value for the O slot?
>>>>> 
>>>>> For a hash map based hierarchical index G->S->P->O, it's O(1) to find
>>>>> the start of the scan, then datastructure iteration.  A hash-based
>>>>> structure is not necessarily the best choice [*] but it's a baseline
>>>>> to discuss.
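
As a sketch of that baseline (illustrative only; a persistent-map version would follow the same lines):

    import java.util.Collections;
    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.stream.Stream;
    import org.apache.jena.graph.Node;
    import org.apache.jena.sparql.core.Quad;

    class GSPOIndex {
        // G -> S -> P -> set of O
        private final Map<Node, Map<Node, Map<Node, Set<Node>>>> gspo =
                new ConcurrentHashMap<>();

        void add(Quad q) {
            gspo.computeIfAbsent(q.getGraph(), g -> new ConcurrentHashMap<>())
                .computeIfAbsent(q.getSubject(), s -> new ConcurrentHashMap<>())
                .computeIfAbsent(q.getPredicate(), p -> ConcurrentHashMap.newKeySet())
                .add(q.getObject());
        }

        // find(G,S,*,*): two O(1) lookups, then pure datastructure iteration
        Stream<Quad> findGS(Node g, Node s) {
            Map<Node, Set<Node>> po = gspo
                    .getOrDefault(g, Collections.emptyMap())
                    .getOrDefault(s, Collections.emptyMap());
            return po.entrySet().stream().flatMap(e ->
                    e.getValue().stream().map(o -> new Quad(g, s, e.getKey(), o)));
        }
    }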
>>>>> 
>>>>> And in memory, will a bloom filter-based system be faster?  Because of
>>>>> false-positives, isn't a definitive index still needed?  If one is
>>>>> kept, not 6, there could be great space gains, but every quad returned
>>>>> is a top-to-bottom traversal of that index (which is now not a range
>>>>> index).
>>>>> 
>>>>> The design should work for 1+ billion in-memory quads - that's the way
>>>>> the world is going.
>>>>> 
>>>>>> So each quad is reduced to a
>>>>>> single bloom filter comprising 4 items (15 bytes).
>>>>>> 
>>>>> 
>>>>>       Andy
>>>>> 
>>>>> [*] even in memory, it might be worth allocating internal ids and
>>>>> working in longs like a disk based system because it is more compact -
>>>>> naive hashmaps take a lot of space when storing small items like quads.
>>>>> tradeoffs, tradeoffs, ...
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Wed, Aug 26, 2015 at 3:27 PM, A. Soroka <aj...@virginia.edu>
>> wrote:
>>>>>> 
>>>>>> Hey, folks--
>>>>>>> 
>>>>>>> There hasn't been too much feedback on my proposal for a journaling
>>>>>>> DatasetGraph:
>>>>>>> 
>>>>>>> https://github.com/ajs6f/jena/tree/JournalingDatasetgraph
>>>>>>> 
>>>>>>> which was and is to be a step towards JENA-624: Develop a new
>>>>>>> in-memory RDF Dataset implementation. So I'm moving on to look at the
>>>>>>> real problem: an in-memory DatasetGraph with high concurrency, for
>>>>>>> use with modern hardware running many, many threads in large core
>>>>>>> memory.
>>>>>>> 
>>>>>>> I'm beginning to sketch out rough code, and I'd like to run some
>>>>>>> design decisions past the list to get criticism/advice/horrified
>>>>>>> warnings/whatever needs to be said.
>>>>>>> 
>>>>>>> 1) All-transactional action: i.e. no non-transactional operation.
>>>>>>> This is obviously a great thing for simplifying my work, but I hope
>>>>>>> it won't be out of line with the expected uses for this stuff.
>>>>>>> 
>>>>>>> 2) 6 covering indexes in the forms GSPO, GOPS, SPOG, OSGP, PGSO,
>>>>>>> OPSG. I figure to play to the strength of in-core-memory operation:
>>>>>>> raw speed, but obviously this is going to cost space.
>>>>>>> 
>>>>>>> 3) At least for now, all commits succeed.
>>>>>>> 
>>>>>>> 4) The use of persistent datastructures to avoid complex and
>>>>>>> error-prone fine-grained locking regimes. I'm using
>>>>>>> http://pcollections.org/ for now, but I am in no way committed to it
>>>>>>> nor do I claim to have thoroughly vetted it. It's simple but enough
>>>>>>> to get started, and that's all I need to bring the real design
>>>>>>> questions into focus.
>>>>>>> 
>>>>>>> 5) Snapshot isolation. Transactions do not see commits that occur
>>>>>>> during their lifetime. Each works entirely from the state of the
>>>>>>> DatasetGraph at the start of its life.
>>>>>>> 
>>>>>>> 6) Only as many as one transaction per thread, for now. Transactions
>>>>>>> are not thread-safe. These are simplifying assumptions that could be
>>>>>>> relaxed later.
>>>>>>> 
>>>>>>> My current design operates as follows:
>>>>>>> 
>>>>>>> At the start of a transaction, a fresh in-transaction reference is
>>> taken
>>>>>>> atomically from the AtomicReference that points to the index block.
>> As
>>>>>>> operations are performed in the transaction, that in-transaction
>>>>>>> reference
>>>>>>> is progressed (in the sense in which any persistent datastructure is
>>>>>>> progressed) while the operations are recorded. Upon an abort, the
>>>>>>> in-transaction reference and the record are just thrown away. Upon a
>>>>>>> commit, the in-transaction reference is thrown away and the
>> operation
>>>>>>> record is re-run against the main reference (the one that is copied
>> at
>>>>>>> the
>>>>>>> beginning of a transaction). That rerun happens inside an atomic
>>> update
>>>>>>> (hence the use of AtomicReference). This all should avoid the need
>> for
>>>>>>> explicit locking in Jena and should confine any blocking against the
>>>>>>> indexes to the actual duration of a commit.
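
In code terms, that protocol is roughly the following (a sketch with illustrative types; the branch code differs):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.atomic.AtomicReference;
    import java.util.function.UnaryOperator;

    // hypothetical immutable six-index block; every operation returns a new version
    class IndexBlock { /* GSPO, GOPS, SPOG, OSGP, PGSO, OPSG ... */ }

    class TxnView {
        final AtomicReference<IndexBlock> master;
        IndexBlock view;                                  // in-transaction reference
        final List<UnaryOperator<IndexBlock>> record = new ArrayList<>();

        TxnView(AtomicReference<IndexBlock> master) {
            this.master = master;
            this.view = master.get();                     // taken atomically at begin
        }

        void apply(UnaryOperator<IndexBlock> op) {
            view = op.apply(view);                        // progress the persistent structure
            record.add(op);                               // and record the operation
        }

        void abort() { /* the view and the record are simply dropped */ }

        void commit() {
            // replay the record against the main reference inside one atomic
            // update; the operations must be pure, since updateAndGet may
            // invoke the function more than once under contention
            master.updateAndGet(current -> {
                IndexBlock b = current;
                for (UnaryOperator<IndexBlock> op : record)
                    b = op.apply(b);
                return b;
            });
        }
    }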
>>>>>>> 
>>>>>>> What do you guys think?
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> ---
>>>>>>> A. Soroka
>>>>>>> The University of Virginia Library
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> I like: Like Like - The likeliest place on the web
>>>> <http://like-like.xenei.com>
>>>> LinkedIn: http://www.linkedin.com/in/claudewarren
>>> 
>>> 
>> 
>> 
>> --
>> I like: Like Like - The likeliest place on the web
>> <http://like-like.xenei.com>
>> LinkedIn: http://www.linkedin.com/in/claudewarren
>> 
> 
> 
> 
> -- 
> Paul Houle
> 
> *Applying Schemas for Natural Language Processing, Distributed Systems,
> Classification and Text Mining and Data Lakes*
> 
> (607) 539 6254    paul.houle on Skype   ontology2@gmail.com
> 
> :BaseKB -- Query Freebase Data With SPARQL
> http://basekb.com/gold/
> 
> Legal Entity Identifier Lookup
> https://legalentityidentifier.info/lei/lookup/
> <http://legalentityidentifier.info/lei/lookup/>
> 
> Join our Data Lakes group on LinkedIn
> https://www.linkedin.com/grp/home?gid=8267275


Re: JENA-624: "Develop a new in-memory RDF Dataset implementation"

Posted by "A. Soroka" <aj...@virginia.edu>.
I examined the Transactional interface more carefully, and the comment on end() reads "/** Finish the transaction - if a write transaction and commit() has not been called, then abort */".

So it seems that there are definitely multiple ideas about the meaning of end()! {grin}

---
A. Soroka
The University of Virginia Library

> On Sep 25, 2015, at 11:43 AM, Andy Seaborne <an...@apache.org> wrote:
> 
> On 25/09/15 15:07, A. Soroka wrote:
>> Okay, I committed a pair of simple tests in commit() and abort() to
>> see to it that you can’t call them outside a transaction and that
>> made all of AbstractTestTransaction pass.
>> 
>> This worries me, because you suggested that it would probably require
>> a finer-grained lifecycle modeled, and I didn’t do that, so I’m
>> wondering what I’m missing. {grin}
> 
> It'll depend on the status of "end" -- I have considered the transaction to run from begin() to end(), and end() in a W transaction without commit() or abort() is an error.  c.f. an R transaction's begin(R)-end() scoping.
> 
> The idiom is then (and can be made nicer in Java8):
> 
> begin(W)
> try {
> ... reads and writes ...
> } finally { end() ; }
> 
> or variations with an auto-abort on exceptions.  The try{} is a consistency scope.  That does not work with the autocommit.
> 
> In your case begin(W)-end() is a silent abort?
> 
> 
> BTW what happens if you get end-end?  index().end(); is called repeatedly.  Does that matter? (haven't had time to look at that)
> 
>> One remark of yours gives me
>> pause, though: “I managed to add after commit.” That’s definitely
>> intentional, because on your advice [*], I set up
>> DatasetGraphInMemory so that a mutating operation outside of a
>> transaction auto-wraps itself in a transaction. So calling add after
>> commit is actually two legitimate transactions “under-the-hood".
>> 
>> Did I misinterpret what you advised? Did you mean me to add
>> “auto-transact” in a more configurable way?
> 
> So in a multithreaded environment, the autocommit, despite being inside the begin-end, is not consistent with the begin-commit part.  Needs thinking about.
> 
> A general point : we need to decide the fine grain details of the transaction models (or variations) we want to provide. Until now it's been TDB and DatasetGraphWithLock so not a wide range of choices pushing the design boundaries and stuff is implicit.  Great thing about 624 is sorting that out.
> 
> 
> Current TDB works non-transactionally until the first begin, then explicit transactions only.  That's to maximise compatibility.  Autocommit and disk is dire (disk I/O per quad add is going to easily become painful - just look at the many database lists with "why does my simple app go so slow" questions).  In-memory is different.
> 
> TDB non-transactional isn't even an implicit transaction.  Current TDB2 is explicit transaction only because (1) it has no concept of update in-place due to immutable data structures and (2) I haven't spent any time on it.
> 
> I like the Java8 style Txn.executeWrite style (maybe shorter names - details).
> 
> In TDB2, transactions are promotable (read to write), though it is allowed only if no W transaction has run (strictly, started) and started its own version of the DB.  The API is more general than what TDB2 supports. begin() and auto-promote is possible but not, IIRC, done.
> 
> promote is the only time system aborts can happen so it creates interesting app issues.
> 
> My inclination is to suggest that we encourage explicit transactions as the recommended style.  (Rhetorically) Is it too much of an app burden? Don't know. Thoughts?
> 
> 	Andy
> 
>> --- A. Soroka The University of Virginia Library
>> 
>> [*] http://permalink.gmane.org/gmane.comp.apache.jena.devel/10529
>> 
>>> On Sep 24, 2015, at 1:05 PM, Andy Seaborne <an...@apache.org>
>>> wrote:
>>> 
>>> The rest are testing that e.g. begin-abort-commit is an error.  It
>>> looks like the transaction lifecycle is not being tracked.  It
>>> needs finer granularity than DatasetGraphInMemory.isInTransaction.
>>> May be as simple as NOT_TXN -> ACTIVE_TXN -> FINISHING_TXN (->
>>> NOT_TXN). Just in/out isn't enough for, e.g.
>>> begin-commit-add_quad->end , begin-commit-commit.  I managed to add
>>> after commit.
>> 
> 


Re: JENA-624: "Develop a new in-memory RDF Dataset implementation"

Posted by "A. Soroka" <aj...@virginia.edu>.
> On Sep 25, 2015, at 11:43 AM, Andy Seaborne <an...@apache.org> wrote:
> 
> It'll depend on the status of "end" -- I have considered the transaction to run from begin() to end(), and end() in a W transaction without commit() or abort() is an error.

Ah, I failed to understand this correctly. I thought end() in a WRITE transaction without commit() or abort() amounted to abort(). Your semantics are clearer. My current semantics leaves a disconcerting overlap between abort() and end(). But it does leave me confused about the exact intent of end() in a WRITE transaction, because Transactional itself is documented: “The write lifecycle is: begin(WRITE) ... abort() or commit() [end() is optional]”, and if end() is optional, what exactly does it mean in that context?

> c.f. R transaction of begin(R)-end() scoping.
> The idiom is then (and can be made nicer in Java8):
> begin(W)
> try {
> ... reads and writes ...
> } finally { end() ; }
> 
> or variations with an auto-abort on exceptions.  The try{} is a consistency scope.  That does not work with the autocommit.

But that is more or less what I am doing with the auto-transact (and mixed with some Java 8 juice), just wrapped in a test to see whether the thread is already inside a transaction first. I do think that mixing explicit and implicit transactions could be very confusing.
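
In outline, the auto-transact looks something like this (a sketch, not the branch code verbatim; isInTransaction/begin/commit/end are DatasetGraphInMemory's own methods):

    import java.util.function.Supplier;
    import org.apache.jena.query.ReadWrite;

    private <T> T autoTransact(ReadWrite rw, Supplier<T> op) {
        if (isInTransaction())
            return op.get();           // explicit transaction in play: pass through
        begin(rw);                     // otherwise wrap in a one-shot transaction
        try {
            T result = op.get();
            if (rw == ReadWrite.WRITE)
                commit();
            return result;
        } finally {
            end();
        }
    }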

> In your case begin(W)-end() is a silent abort?

Yes, and I now appreciate the problematic nature of that a little better.

> BTW what happens if you get end-end?  index().end(); is called repeatedly.  Does that matter? (haven't had time to look at that)

Not in my implementation, although because I let QuadTable inherit Transactional, I did not write that into the definition of QuadTable::end. It’s now clear to me that I didn’t understand Transactional::end well enough to narrow its semantics correctly anyway.

> So in a multithreaded environment, the autocommit, despite being inside the begin-end, is not consistent with the begin-commit part.  Needs thinking about.

In the 624 branch there is no auto-transact _inside_ a begin-end, or at least, if there is, that’s a bug for me to fix, and I think that the tests in TestDatasetGraphInMemoryThreading show that there is not. Once you have used begin(), you do not get auto-transact in play until you are back out of that transaction, one way or another. Then it reengages. Unless I misunderstand you?

> A general point : we need to decide the fine grain details of the transaction models (or variations) we want to provide. Until now it's been TDB and DatasetGraphWithLock so not a wide range of choices pushing the design boundaries and stuff is implicit.  Great thing about 624 is sorting that out.

That makes me feel a little better about not hitting the contracts quite right. {grin}

> Current TDB works non-transactionally until the first begin, then explicit transactions only.  That's to maximise compatibility.  Autocommit and disk is dire (disk I/O per quad add is going to easily become painful - just look at the many database lists with "why does my simple app go so slow" questions).  In-memory is different.

Right, that’s why I felt pretty safe with the auto-transact thing. A new transaction on the persistent map implementation is just setting a couple of flags and taking off a couple of ThreadLocal references, very cheap. I would think that it would be quite cheap for almost any in-memory implementation.

> TDB non-transactional isn't even an implicit transaction.  Current TDB2 is explicit transaction only because (1) it has no concept of update in-place due to immutable data structures and (2) I haven't spent any time on it.
> I like the Java8 style Txn.executeWrite style (maybe shorter names - details).

I want to be sure that I’m using the words “explicit” and “implicit” the same way you are. I am using them to distinguish between the client controlling transaction boundaries (explicit) and Jena controlling them (implicit). Is that what you mean?

> In TDB2, transactions are promotable (read to write), though it is allowed only if no W transaction has run (strictly, started) and started its own version of the DB.  The API is more general than what TDB2 supports. begin() and auto-promote is possible but not, IIRC, done. promote is the only time system aborts can happen so it creates interesting app issues.

Hm. This makes me realize that in the 624 branch, I have not specified promotion abilities or lack thereof nearly well enough. I think that for 624 the scheme you describe for TDB2 makes a lot of sense. I will begin work on implementing it, unless there is some reason not to that I am missing.

> My inclination is to suggest that we encourage explicit transactions as the recommended style.  (Rhetorically) Is it too much of an app burden? Don't know. Thoughts?

Putting on my Jena user’s hat, I would think it’s doable, but it would put some onus on Jena to a) document and publicize it really well (the way TDB works really confused me as a user at first), and b) make it as easy as possible (using idioms like those in Mantis Txn to which you referred above, maybe adding other ancillary machinery with which to more easily assemble transactions as explicit packages of operations, e.g. fluent builders or the like). It could represent a barrier to adoption for some people if transactions obtruded as a concern in really simple use cases.
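
For example, given a DatasetGraph dsg, a Quad quad, and Nodes g, s, p and o, something in the Txn style keeps the simple cases to one line each (a sketch; exact names may differ from the Mantis WIP):

    // begin/commit/end handled inside the helper
    Txn.executeWrite(dsg, () -> dsg.add(quad)) ;
    Txn.executeRead(dsg, () -> dsg.find(g, s, p, o).forEachRemaining(System.out::println)) ;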

---
A. Soroka
The University of Virginia Library


Re: JENA-624: "Develop a new in-memory RDF Dataset implementation"

Posted by Andy Seaborne <an...@apache.org>.
On 25/09/15 15:07, A. Soroka wrote:
> Okay, I committed a pair of simple tests in commit() and abort() to
> see to it that you can’t call them outside a transaction and that
> made all of AbstractTestTransaction pass.
>
> This worries me, because you suggested that it would probably require
> a finer-grained lifecycle modeled, and I didn’t do that, so I’m
> wondering what I’m missing. {grin}

It'll depend on the status of "end" -- I have considered the transaction 
to run from begin() to end(), and end() in a W transaction without 
commit() or abort() is an error.  c.f. an R transaction's begin(R)-end() 
scoping.

The idiom is then (and can be made nicer in Java8):

begin(W)
try {
... reads and writes ...
} finally { end() ; }

or variations with an auto-abort on exceptions.  The try{} is a 
consistency scope.  That does not work with the autocommit.
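
The Java8 shape would be something like this (a sketch only; names illustrative):

    import org.apache.jena.query.Dataset;
    import org.apache.jena.query.ReadWrite;

    // wraps the begin/try/finally idiom: an exception skips commit() and
    // falls through to end(), which deals with the uncommitted W transaction
    static void executeWrite(Dataset ds, Runnable action) {
        ds.begin(ReadWrite.WRITE) ;
        try {
            action.run() ;
            ds.commit() ;
        } finally {
            ds.end() ;
        }
    }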

In your case begin(W)-end() is a silent abort?


BTW what happens if you get end-end?  index().end(); is called 
repeatedly.  Does that matter? (haven't had time to look at that)

> One remark of yours gives me
> pause, though: “I managed to add after commit.” That’s definitely
> intentional, because on your advice [*], I set up
> DatasetGraphInMemory so that a mutating operation outside of a
> transaction auto-wraps itself in a transaction. So calling add after
> commit is actually two legitimate transactions “under-the-hood".
>
> Did I misinterpret what you advised? Did you mean me to add
> “auto-transact” in a more configurable way?

So in a multithreaded environment, the autocommit, despite being inside 
the begin-end, is not consistent with the begin-commit part.  Needs 
thinking about.

A general point : we need to decide the fine grain details of the 
transaction models (or variations) we want to provide. Until now it's 
been TDB and DatasetGraphWithLock so not a wide range of choices pushing 
the design boundaries and stuff is implicit.  Great thing about 624 is 
sorting that out.


Current TDB works non-transactionally until the first begin, then 
explicit transactions only.  That's to maximise compatibility. 
Autocommit and disk is dire (disk I/O per quad add is going to easily 
become painful - just look at the many database lists with "why does my 
simple app go so slow" questions).  In-memory is different.

TDB non-transactional isn't even an implicit transaction.  Current TDB2 
is explicit transaction only because (1) it has no concept of update 
in-place due to immutable data structures and (2) I haven't spent any 
time on it.

I like the Java8 style Txn.executeWrite style (maybe shorter names - 
details).

In TDB2, transactions are promotable (read to write), though it is allowed 
only if no W transaction has run (strictly, started) and started its 
own version of the DB.  The API is more general than what TDB2 supports. 
begin() and auto-promote is possible but not, IIRC, done.

promote is the only time system aborts can happen so it creates 
interesting app issues.
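
The rule is roughly this (a sketch with illustrative names; the real TDB2 machinery is more involved):

    import java.util.concurrent.atomic.AtomicLong;

    // Each W transaction that starts bumps the version.  A R transaction may
    // promote only if the version is still the one it began against, i.e. no
    // writer has started (and built its own view of the DB) in the meantime.
    class PromotionGuard {
        private final AtomicLong version = new AtomicLong(0) ;

        long beginRead()  { return version.get() ; }
        long beginWrite() { return version.incrementAndGet() ; }

        synchronized boolean tryPromote(long versionAtBegin) {
            if (version.get() != versionAtBegin)
                return false ;              // the system-abort case for the app
            version.incrementAndGet() ;     // the reader becomes the writer
            return true ;
        }
    }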

My inclination is to suggest that we encourage explicit transactions as 
the recommended style.  (Rhetorically) Is it too much of an app burden? 
Don't know. Thoughts?

	Andy

> --- A. Soroka The University of Virginia Library
>
> [*] http://permalink.gmane.org/gmane.comp.apache.jena.devel/10529
>
>> On Sep 24, 2015, at 1:05 PM, Andy Seaborne <an...@apache.org>
>> wrote:
>>
>> The rest are testing that e.g. begin-abort-commit is an error.  It
>> looks like the transaction lifecycle is not being tracked.  It
>> needs finer granularity than DatasetGraphInMemory.isInTransaction.
>> May be as simple as NOT_TXN -> ACTIVE_TXN -> FINISHING_TXN (->
>> NOT_TXN). Just in/out isn't enough for, e.g.
>> begin-commit-add_quad->end , begin-commit-commit.  I managed to add
>> after commit.
>


Re: JENA-624: "Develop a new in-memory RDF Dataset implementation"

Posted by "A. Soroka" <aj...@virginia.edu>.
Okay, I committed a pair of simple tests in commit() and abort() to see to it that you can’t call them outside a transaction and that made all of AbstractTestTransaction pass.
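
They amount to something like this (a sketch, not the committed code verbatim):

    import org.apache.jena.sparql.JenaTransactionException;

    @Override
    public void commit() {
        if (!isInTransaction())
            throw new JenaTransactionException("commit() outside of a transaction");
        // ... apply the operation record, then clear the transaction state ...
    }

    @Override
    public void abort() {
        if (!isInTransaction())
            throw new JenaTransactionException("abort() outside of a transaction");
        // ... discard the in-transaction reference and the record ...
    }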

This worries me, because you suggested that it would probably require a finer-grained lifecycle modeled, and I didn’t do that, so I’m wondering what I’m missing. {grin} One remark of yours gives me pause, though: “I managed to add after commit.” That’s definitely intentional, because on your advice [*], I set up DatasetGraphInMemory so that a mutating operation outside of a transaction auto-wraps itself in a transaction. So calling add after commit is actually two legitimate transactions “under-the-hood".

Did I misinterpret what you advised? Did you mean me to add “auto-transact” in a more configurable way?

---
A. Soroka
The University of Virginia Library

[*] http://permalink.gmane.org/gmane.comp.apache.jena.devel/10529

> On Sep 24, 2015, at 1:05 PM, Andy Seaborne <an...@apache.org> wrote:
> 
> The rest are testing that e.g. begin-abort-commit is an error.  It looks like the transaction lifecycle is not being tracked.  It needs finer granularity than DatasetGraphInMemory.isInTransaction.  May be as simple as NOT_TXN -> ACTIVE_TXN -> FINISHING_TXN (-> NOT_TXN). Just in/out isn't enough for, e.g. begin-commit-add_quad->end , begin-commit-commit.  I managed to add after commit.


Re: JENA-624: "Develop a new in-memory RDF Dataset implementation"

Posted by "A. Soroka" <aj...@virginia.edu>.
I don’t think so, because I was running those from the beginning. Just to be clear, the suite of tests for DatasetGraphInMemory is in TestDatasetGraphInMemory:

https://github.com/ajs6f/jena/blob/jena-624/jena-arq/src/test/java/org/apache/jena/sparql/core/mem/TestDatasetGraphInMemory.java

and it now includes:

AbstractDatasetGraphTests
TestDatasetGraphWithLock < AbstractTestDataset
TestDatasetGraphViewGraphs < AbstractTestGraphOverDataset
AbstractTestTransaction

and of course tests specific to DatasetGraphInMemory, mostly threading tests. I’ve committed to my branch a simple implementation of prefix management using a ConcurrentHashMap<String, PrefixMapping> and a sanity check test in TestDatasetGraphInMemory that prefixes work.
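
The shape of it is roughly this (illustrative names, not the committed code verbatim):

    import java.util.concurrent.ConcurrentHashMap;
    import org.apache.jena.shared.PrefixMapping;
    import org.apache.jena.shared.impl.PrefixMappingImpl;

    private final ConcurrentHashMap<String, PrefixMapping> prefixMappings =
            new ConcurrentHashMap<>();

    // one PrefixMapping per graph name; computeIfAbsent keeps creation atomic
    PrefixMapping prefixesFor(String graphName) {
        return prefixMappings.computeIfAbsent(graphName, name -> new PrefixMappingImpl());
    }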

---
A. Soroka
The University of Virginia Library

> On Sep 25, 2015, at 10:13 AM, Andy Seaborne <an...@apache.org> wrote:
> 
> On 25/09/15 12:53, A. Soroka wrote:
>> Ah, cool— glad you mentioned this. I was about to add a “manages
>> prefixes reasonably" test to AbstractDatasetGraphTests. Instead I’ll
>> leave it in the in-memory-dsg-specific tests, unless you think it
>> worth breaking out into its own class for reuse?
> 
> AbstractTestGraphOverDataset may have some.
> 
> 	Andy
> 
>> 
>> --- A. Soroka The University of Virginia Library
>> 
>>> On Sep 25, 2015, at 7:49 AM, Andy Seaborne <an...@apache.org>
>>> wrote:
>>> 
>>>> 3) It didn’t even occur to me to manage prefixes (prefixen?). I
>>>> didn’t realize that was a Dataset responsibility, although that
>>>> certainly makes sense. Well, that’s why we don’t make
>>>> assumptions! {grin} I’ll fix that based on DatasetPrefixStorage.
>>> 
>>> It's not an absolute requirement but it is nice to do it.
>> 
> 


Re: JENA-624: "Develop a new in-memory RDF Dataset implementation"

Posted by Andy Seaborne <an...@apache.org>.
On 25/09/15 12:53, A. Soroka wrote:
> Ah, cool— glad you mentioned this. I was about to add a “manages
> prefixes reasonably" test to AbstractDatasetGraphTests. Instead I’ll
> leave it in the in-memory-dsg-specific tests, unless you think it
> worth breaking out into its own class for reuse?

AbstractTestGraphOverDataset may have some.

	Andy

>
> --- A. Soroka The University of Virginia Library
>
>> On Sep 25, 2015, at 7:49 AM, Andy Seaborne <an...@apache.org>
>> wrote:
>>
>>> 3) It didn’t even occur to me to manage prefixes (prefixen?). I
>>> didn’t realize that was a Dataset responsibility, although that
>>> certainly makes sense. Well, that’s why we don’t make
>>> assumptions! {grin} I’ll fix that based on DatasetPrefixStorage.
>>
>> It's not an absolute requirement but it is nice to do it.
>


Re: JENA-624: "Develop a new in-memory RDF Dataset implementation"

Posted by "A. Soroka" <aj...@virginia.edu>.
Ah, cool— glad you mentioned this. I was about to add a “manages prefixes reasonably" test to AbstractDatasetGraphTests. Instead I’ll leave it in the in-memory-dsg-specific tests, unless you think it worth breaking out into its own class for reuse?

---
A. Soroka
The University of Virginia Library

> On Sep 25, 2015, at 7:49 AM, Andy Seaborne <an...@apache.org> wrote:
> 
>> 3) It didn’t even occur to me to manage prefixes (prefixen?). I didn’t realize that was a Dataset responsibility, although that certainly makes sense. Well, that’s why we don’t make assumptions! {grin} I’ll fix that based on DatasetPrefixStorage.
> 
> It's not an absolute requirement but it is nice to do it.


Re: JENA-624: "Develop a new in-memory RDF Dataset implementation"

Posted by Andy Seaborne <an...@apache.org>.
On 24/09/15 18:46, A. Soroka wrote:
> Thanks for the feedback, Andy! In order of your points:
>
> 1) Okay, I will get on those transaction-related problems pronto. None of them seem like a big deal to fix, and you’ve already given me the main point (finer lifecycle granularity) and pointers to the Mantis material.

Mantis does not have internal documentation so feel free to ask questions.

The dboe-* modules are a rewrite of TDB for MVCC and putting module 
structure in.

And it would solve the large-delete problems that come up on users@. 
There is a companion "large add problem" where at the moment people have 
to resort to the bulk loader.

In TDB2, doing 100m loads into a live Fuseki database is no big deal and 
it's a write-once update, not the logged two-write of TDB1.

It's usable at the moment if you don't mind the disk space growing.

What is missing is an async compaction process to reduce a live database 
without needing to go off line.

> 2) Yeah, I can see that. I’ll switch it to QuadTable for now, pending further suggestions. You know, I think I actually called that “QuadIndex” in an earlier version!
>
> 3) It didn’t even occur to me to manage prefixes (prefixen?). I didn’t realize that was a Dataset responsibility, although that certainly makes sense. Well, that’s why we don’t make assumptions! {grin} I’ll fix that based on DatasetPrefixStorage.

It's not an absolute requirement but it is nice to do it.

>
> 4) That sounds great. First step: I’ll get these problems squared away and hopefully other people will glance over and find other things to fix.
>
> Thanks again!

	Andy

>
> ---
> A. Soroka
> The University of Virginia Library
>
>> On Sep 24, 2015, at 1:05 PM, Andy Seaborne <an...@apache.org> wrote:
>>
>> On 23/09/15 17:44, A. Soroka wrote:
>>> Following up on this conversation, I now have a branch available
>>> at:
>>>
>>> https://github.com/ajs6f/jena/tree/jena-624
>>>
>>> with a six-way-map-based version of this, advancing from (but not
>>> directly using) the journaling branch already discussed. (Of course I
>>> can separate these if so desired.)
>>>
>>> There is an Index type which hopefully is a start towards abstracting
>>> out index behavior, as Paul Houle suggested. There have already been
>>> interesting suggestions made by Andy and Claude about possible
>>> implementations that are more sophisticated than my simple approach,
>>> so I just hope that this branch will get the ball rolling.
>>>
>>> Comments/advice/criticism eagerly desired!
>>>
>>> --- A. Soroka The University of Virginia Library
>>
>> Hi - I pulled the code down and had a partial look.
>>
>> And it looks very good.
>>
>> (As you probably know by now, having me (test/demo/use/) anywhere near new code is very risky but I broke very little ... including TDB.)
>>
>> 0/ Basic read/write of files into the dataset worked as expected. :-)
>>
>> 1/
>> I tried with existing test harnesses:
>> (see below for code)
>>
>> AbstractDatasetGraphTests
>>   Green line!
>>
>> AbstractTestTransaction
>>   Red line.
>>
>> This has tests for various error conditions.
>>
>> The begin-begin cases: your code throws JenaException.  There is a sub-exception JenaTransactionException and that's what the tests look for.
>>
>> The rest are testing that e.g. begin-abort-commit is an error.  It looks like the transaction lifecycle is not being tracked.  It needs finer granularity than DatasetGraphInMemory.isInTransaction.  May be as simple as NOT_TXN -> ACTIVE_TXN -> FINISHING_TXN (-> NOT_TXN). Just in/out isn't enough for, e.g. begin-commit-add_quad->end , begin-commit-commit.  I managed to add after commit.
>>
>> There are some presumably related things here: a writer that calls .end() without .commit or .abort doesn't indicate an error. It probably should, and a test should be added to the AbstractTestTransaction tests.
>>
>> And plain commit (no begin) isn't caught as wrong.
>>
>>   Dataset ds = DatasetFactory.create(new DatasetGraphInMemory()) ;
>>   ds.commit() ;
>>
>> Ditto .abort.
>>
>> Stray .end()s are probably reasonable - "when in doubt call end()" - so multiple ends() on a transaction, which in effect is end on a non-transaction, is good to allow.
>>
>> (I'll add some tests to AbstractTestTransaction for these cases)
>>
>> 2/
>> I found the use of the name "Index" odd.  Usually an index (in database speak) is a specific lookup pattern: S->P->O->G and needing a prefix of that for partials.
>>
>> Hence, in SQL and NoSQL, having multiple indexes per table, one primary, and 0-N secondary.  The use in your code is more like the whole "table" (the individual components are all covering, so, as in all RDF subsystems, there is no need to have a distinguished "table").
>>
>> org.apache.jena.tdb.index.Index
>> org.apache.jena.tdb.index.RangeIndex
>>
>> Is this really something like "QuadTable"? "QuadStorage"?
>>
>> I am encountering a similar split between the storage and the provision of the interface in TDB2.  There, I want to be able to swap the storage on the fly, so that parallel storage gives a running database a compaction option.  Being on-disk, there isn't a GC to manage the multi-version datastructures.
>>
>> 3/
>> No prefixes on the dataset?  I only got them to work with getDefaultModel etc.
>>
>> TDB uses DatasetPrefixStorage for managing prefixes and then GraphTDB
>> Some of DatasetPrefixesTDB could be extracted for common use.
>>
>> leading to...
>> 4/
>> We should look for commonality between TDB and InMem and pull out a separate framework.  That's long term - getting something working and released should not have "architectural internal reorg" on the critical path.
>>
>>     Andy
>>
>>
>> public class TestDatasetGraphInMem_AFS
>>              extends AbstractDatasetGraphTests {
>>     @Override
>>     protected DatasetGraph emptyDataset() {
>>         return new DatasetGraphInMemory() ;
>>     }
>> }
>>
>> public class TestDatasetGraphTxn_AFS
>>              extends AbstractTestTransaction {
>>     @Override
>>     protected Dataset create() {
>>         return DatasetFactory.create(new DatasetGraphInMemory()) ;
>>     }
>> }
>>
>> ----------------------
>>
>> And if all that begin-commit stuff is boring code ... WIP ...
>>
>> https://github.com/afs/mantis/blob/master/dboe-transaction/src/main/java/org/seaborne/dboe/transaction/Txn.java
>>
>>     Txn.executeRead(dsg, ()-> {
>>         ... query ...
>>     }) ;
>>
>>
>> This includes ThreadTxn - starting a transaction on another thread and executing the body at some later date.  Great for isolation testing.
>>
>> This "Transactional" is a different interface to Jena's (slight change - backwards compatible) but the code should work for Jena.
>


Re: JENA-624: "Develop a new in-memory RDF Dataset implementation"

Posted by "A. Soroka" <aj...@virginia.edu>.
Thanks for the feedback, Andy! In order of your points:

1) Okay, I will get on those transaction-related problems pronto. None of them seem like a big deal to fix, and you’ve already given me the main point (finer lifecycle granularity) and pointers to the Mantis material.

2) Yeah, I can see that. I’ll switch it to QuadTable for now, pending further suggestions. You know, I think I actually called that “QuadIndex” in an earlier version!

3) It didn’t even occur to me to manage prefixes (prefixen?). I didn’t realize that was a Dataset responsibility, although that certainly makes sense. Well, that’s why we don’t make assumptions! {grin} I’ll fix that based on DatasetPrefixStorage.

4) That sounds great. First step: I’ll get these problems squared away and hopefully other people will glance over and find other things to fix.

Thanks again!

---
A. Soroka
The University of Virginia Library

> On Sep 24, 2015, at 1:05 PM, Andy Seaborne <an...@apache.org> wrote:
> 
> On 23/09/15 17:44, A. Soroka wrote:
>> Following up on this conversation, I now have a branch available
>> at:
>> 
>> https://github.com/ajs6f/jena/tree/jena-624
>> 
>> with a six-way-map-based version of this, advancing from (but not
>> directly using) the journaling branch already discussed. (Of course I
>> can separate these if so desired.)
>> 
>> There is an Index type which hopefully is a start towards abstracting
>> out index behavior, as Paul Houle suggested. There have already been
>> interesting suggestions made by Andy and Claude about possible
>> implementations that are more sophisticated than my simple approach,
>> so I just hope that this branch will get the ball rolling.
>> 
>> Comments/advice/criticism eagerly desired!
>> 
>> --- A. Soroka The University of Virginia Library
> 
> Hi - I pulled the code down and had a partial look.
> 
> And it looks very good.
> 
> (As you probably know by now, having me (test/demo/use/) anywhere near new code is very risky but I broke very little ... including TDB.)
> 
> 0/ Basic read/write of files into the dataset worked as expected. :-)
> 
> 1/
> I tried with existing test harnesses:
> (see below for code)
> 
> AbstractDatasetGraphTests
>  Green line!
> 
> AbstractTestTransaction
>  Red line.
> 
> This has tests for various error conditions.
> 
> The begin-begin cases: your code throws JenaException.  There is a sub-exception JenaTransactionException and that's what the tests look for.
> 
> The rest are testing that e.g. begin-abort-commit is an error.  It looks like the transaction lifecycle is not being tracked.  It needs finer granularity than DatasetGraphInMemory.isInTransaction.  May be as simple as NOT_TXN -> ACTIVE_TXN -> FINISHING_TXN (-> NOT_TXN). Just in/out isn't enough for, e.g. begin-commit-add_quad->end , begin-commit-commit.  I managed to add after commit.
> 
> There are some presumably related things here: a writer that calls .end() without .commit or .abort doesn't indicate an error. It probably should, and a test should be added to the AbstractTestTransaction tests.
> 
> And plain commit (no begin) isn't caught as wrong.
> 
>  Dataset ds = DatasetFactory.create(new DatasetGraphInMemory()) ;
>  ds.commit() ;
> 
> Ditto .abort.
> 
> Stray .end()s are probably reasonable - "when in doubt call end()" - so multiple ends() on a transaction, which in effect is end on a non-transaction, is good to allow.
> 
> (I'll add some tests to AbstractTestTransaction for these cases)
> 
> 2/
> I found the use of the name "Index" odd.  Usually an index (in database speak) is a specific lookup pattern: S->P->O->G and needing a prefix of that for partials.
> 
> Hence, in SQL and NoSQL, having multiple indexes per table, one primary, and 0-N secondary.  The use in your code is more like the whole "table" (the individual components are all covering, so, as in all RDF subsystems, there is no need to have a distinguished "table").
> 
> org.apache.jena.tdb.index.Index
> org.apache.jena.tdb.index.RangeIndex
> 
> Is this really something like "QuadTable"? "QuadStorage"?
> 
> I am encountering a similar split between the storage and the provision of the interface in TDB2.  There, I want to be able to swap the storage on the fly, so that parallel storage gives a running database a compaction option.  Being on-disk, there isn't a GC to manage the multi-version datastructures.
> 
> 3/
> No prefixes on the dataset?  I only got them to work with getDefaultModel etc.
> 
> TDB uses DatasetPrefixStorage for managing prefixes and then GraphTDB
> Some of DatasetPrefixesTDB could be extracted for common use.
> 
> leading to...
> 4/
> We should look for commonality between TDB and InMem and pull out a separate framework.  That's long term - getting something working and released should not have "architectural internal reorg" on the critical path.
> 
>    Andy
> 
> 
> public class TestDatasetGraphInMem_AFS
>             extends AbstractDatasetGraphTests {
>    @Override
>    protected DatasetGraph emptyDataset() {
>        return new DatasetGraphInMemory() ;
>    }
> }
> 
> public class TestDatasetGraphTxn_AFS
>             extends AbstractTestTransaction {
>    @Override
>    protected Dataset create() {
>        return DatasetFactory.create(new DatasetGraphInMemory()) ;
>    }
> }
> 
> ----------------------
> 
> And if all that begin-commit stuff is boring code ... WIP ...
> 
> https://github.com/afs/mantis/blob/master/dboe-transaction/src/main/java/org/seaborne/dboe/transaction/Txn.java
> 
>    Txn.executeRead(dsg, ()-> {
>        ... query ...
>    }) ;
> 
> 
> This includes ThreadTxn - starting a transaction on another thread and executing the body at some later date.  Great for isolation testing.
> 
> This "Transactional" is a different interface to Jena's (slight change - backwards compatible) but the code should work for Jena.


Re: JENA-624: "Develop a new in-memory RDF Dataset implementation"

Posted by Andy Seaborne <an...@apache.org>.
On 23/09/15 17:44, A. Soroka wrote:
> Following up on this conversation, I now have a branch available
> at:
>
> https://github.com/ajs6f/jena/tree/jena-624
>
> with a six-way-map-based version of this, advancing from (but not
> directly using) the journaling branch already discussed. (Of course I
> can separate these if so desired.)
>
> There is an Index type which hopefully is a start towards abstracting
> out index behavior, as Paul Houle suggested. There have already been
> interesting suggestions made by Andy and Claude about possible
> implementations that are more sophisticated than my simple approach,
> so I just hope that this branch will get the ball rolling.
>
> Comments/advice/criticism eagerly desired!
>
> --- A. Soroka The University of Virginia Library

Hi - I pulled the code down and had a partial look.

And it looks very good.

(As you probably know by now, having me (test/demo/use/) anywhere near 
new code is very risky but I broke very little ... including TDB.)

0/ Basic read/write of files into the dataset worked as expected. :-)

1/
I tried with existing test harnesses:
(see below for code)

AbstractDatasetGraphTests
   Green line!

AbstractTestTransaction
   Red line.

This has tests for various error conditions.

The begin-begin cases: your code throws JenaException.  There is a 
sub-exception JenaTransactionException and that's what the tests look for.

The rest are testing that e.g. begin-abort-commit is an error.  It looks 
like the transaction lifecycle is not being tracked.  It needs finer 
granularity than DatasetGraphInMemory.isInTransaction.  It may be as 
simple as NOT_TXN -> ACTIVE_TXN -> FINISHING_TXN (-> NOT_TXN). Just 
in/out isn't enough for, e.g., begin-commit-add_quad->end or 
begin-commit-commit.  I managed to add after commit.
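
In outline (a sketch only):

    enum TxnState { NOT_TXN, ACTIVE_TXN, FINISHING_TXN }

    private final ThreadLocal<TxnState> state =
        ThreadLocal.withInitial(() -> TxnState.NOT_TXN) ;

    public void begin(ReadWrite rw) {
        if ( state.get() != TxnState.NOT_TXN )
            throw new JenaTransactionException("begin() inside a transaction") ;
        state.set(TxnState.ACTIVE_TXN) ;
        // ... take the snapshot ...
    }

    public void commit() {
        if ( state.get() != TxnState.ACTIVE_TXN )
            throw new JenaTransactionException("commit() outside an active transaction") ;
        // ... apply the operation record ...
        state.set(TxnState.FINISHING_TXN) ;
    }

    public void end() {
        // stray end()s are tolerated: end on a non-transaction is a no-op
        state.set(TxnState.NOT_TXN) ;
    }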

There are some presumably related things here: a writer that calls .end() 
without .commit or .abort doesn't indicate an error. It probably should, 
and a test should be added to the AbstractTestTransaction tests.

And plain commit (no begin) isn't caught as wrong.

   Dataset ds = DatasetFactory.create(new DatasetGraphInMemory()) ;
   ds.commit() ;

Ditto .abort.

Stray .end()s are probably reasonable - "when in doubt call end()" - so 
multiple ends() on a transaction, which in effect is end on a 
non-transaction, is good to allow.

(I'll add some tests to AbstractTestTransaction for these cases)

2/
I found the use of the name "Index" odd.  Usually an index (in 
database speak) is a specific lookup pattern: S->P->O->G and needing a 
prefix of that for partials.

Hence, in SQL and NoSQL, having multiple indexes per table, one primary, 
and 0-N secondary.  The use in your code is more like the whole "table" 
(the individual components are all covering, so, as in all RDF 
subsystems, there is no need to have a distinguished "table").

org.apache.jena.tdb.index.Index
org.apache.jena.tdb.index.RangeIndex

Is this really something like "QuadTable"? "QuadStorage"?

I am encountering a similar split between the storage and the provision 
of the interface in TDB2.  There, I want to be able to swap the storage 
on the fly, so that parallel storage gives a running database a 
compaction option.  Being on-disk, there isn't a GC to manage the 
multi-version datastructures.

3/
No prefixes on the dataset?  I only got them to work with 
getDefaultModel etc.

TDB uses DatasetPrefixStorage for managing prefixes and then GraphTDB
Some of DatasetPrefixesTDB could be extracted for common use.

leading to...
4/
We should look for commonality between TDB and InMem and pull out a 
separate framework.  That's long term - getting something working and 
released should not have "architectural internal reorg" on the critical 
path.

     Andy


public class TestDatasetGraphInMem_AFS
              extends AbstractDatasetGraphTests {
     @Override
     protected DatasetGraph emptyDataset() {
         return new DatasetGraphInMemory() ;
     }
}

public class TestDatasetGraphTxn_AFS
              extends AbstractTestTransaction {
     @Override
     protected Dataset create() {
         return DatasetFactory.create(new DatasetGraphInMemory()) ;
     }
}

----------------------

And if all that begin-commit stuff is boring code ... WIP ...

https://github.com/afs/mantis/blob/master/dboe-transaction/src/main/java/org/seaborne/dboe/transaction/Txn.java

     Txn.executeRead(dsg, ()-> {
         ... query ...
     }) ;


This includes ThreadTxn - starting a transaction on another thread and 
executing the body at some later date.  Great for isolation testing.

This "Transactional" is a different interface to Jena's (slight change - 
backwards compatible) but the code should work for Jena.

Re: JENA-624: "Develop a new in-memory RDF Dataset implementation"

Posted by "A. Soroka" <aj...@virginia.edu>.
Following up on this conversation, I now have a branch available at:

https://github.com/ajs6f/jena/tree/jena-624

with a six-way-map-based version of this, advancing from (but not directly using) the journaling branch already discussed. (Of course I can separate these if so desired.)

There is an Index type which hopefully is a start towards abstracting out index behavior, as Paul Houle suggested. There have already been interesting suggestions made by Andy and Claude about possible implementations that are more sophisticated than my simple approach, so I just hope that this branch will get the ball rolling.

Comments/advice/criticism eagerly desired!

---
A. Soroka
The University of Virginia Library

> On Sep 3, 2015, at 4:15 PM, ajs6f@virginia.edu <aj...@email.virginia.edu> wrote:
> 
>> Probably a lot of codesign is possible if the index type is fixed, leading to simpler and better-performing code, but it would be nice to have a pluggable approach so that you can pick the structures that work for your workload. 
> 
> This is certainly very true. I am trying to develop some good abstractions to this end.
> 
> So far the only sticking point is DatasetGraph::listGraphNodes. For the moment, I have added a listGraphNodes method to the index abstraction (alongside a find method that returns Iterator<Quad> to support DatasetGraph::find and ::findNG), because a map-based approach gives O(1) entry into iteration, and I do want to be able to take advantage of that performance. But I will continue to look for a better approach.
> 
> ---
> A. Soroka
> The University of Virginia Library
> 
> On Aug 31, 2015, at 11:20 AM, Paul Houle <on...@gmail.com> wrote:
> 
>> The thing about bloom filters is that they are (i) highly dense because
>> they are bit packed and (ii) accessed in sequential order so that modern
>> processors can fetch ahead and hide RAM latency.
>> 
>> Most kinds of indexes,  on the other hand,  involve pointer chasing over an
>> area that does not fit in the cache,  so your brand new processor is
>> performing like a Pentium II.  Even if the O() part is better for some
>> other algorithm,  the constant in front is awesome for the bloom filter.
>> 
>> Probably a lot of codesign is possible if the index type is fixed, leading
>> to simpler and better-performing code, but it would be nice to have a
>> pluggable approach so that you can pick the structures that work for your
>> workload.  Also from a research perspective we could get real questions
>> 
>> There are a lot of patents in the 2000 or so time period on systems that
>> profile the queries and automatically build indexes.  For instance,
>> Salesforce.com has a patent,  and they also have an implementation where
>> the "master copy" of data is stored in an Oracle table that looks a lot
>> like a triple store and then it automatically creates materialized views
>> and functional indexes in Oracle to support client workloads.
>> 
>> (I remember writing a query against one of the larger Salesforce orgs and
>> being frustrated when it timed out,  only to run the same query two minutes
>> later and have it work perfectly.)
>> 
>> Patents do expire, and it won't be too long before this technology could
>> make it into open source tools.
>> 
>> On Mon, Aug 31, 2015 at 10:04 AM, Claude Warren <cl...@xenei.com> wrote:
>> 
>>> step 3 is relatively expensive if there are a lot of quads with the G and S
>>> values (e.g. *,G,*,S) but few with G,S,*,*
>>> 
>>> Step 3 is about removing the false positives from the bloom filter.  It
>>> does not require an index; it requires checking the values to ensure a match.
>>> 
>>> There is a time-to-space trade-off here, but the time is not as much as
>>> one would think.  Keep in mind that a bloom filter check is very fast.  On
>>> the other hand, the traversal of a b-tree or other index is much more
>>> computationally complex.
>>> 
>>> For example, I know that if you devise a mechanism by which you can
>>> index bloom filters such that you have a 2-segment key, and you know that
>>> you want to check every key where the compound key is greater than the
>>> one you are looking for, it is often faster to do the linear search through
>>> the data, even when dealing with 1 million entries.  There are special cases
>>> where the index is faster, but for the most part the linear search is
>>> faster.
>>> 
>>> Claude
>>> 
>>> On Mon, Aug 31, 2015 at 1:53 PM, ajs6f@virginia.edu <aj...@virginia.edu>
>>> wrote:
>>> 
>>>> I'm still a bit confused as to why you don't regard step 3 as being
>>>> potentially very expensive. In order to verify a match, we will have to
>>>> examine an "exact" index, and that (as Andy remarked) is likely to
>>> require
>>>> traversal, or else we throw away all the space gains.
>>>> 
>>>> Is this technique a way to pay a lot of time for a lot of space savings?
>>>> Perhaps it is appropriate for an alternative implementation for very
>>> large
>>>> datasets?
>>>> 
>>>> ---
>>>> A. Soroka
>>>> The University of Virginia Library
>>>> 
>>>> On Aug 31, 2015, at 6:48 AM, Claude Warren <cl...@xenei.com> wrote:
>>>> 
>>>>> to implement find(G,S,*,*) with bloom filters and return an iterator, you:
>>>>> 
>>>>> 1. construct a bloom filter with G and S
>>>>> 2. scan the list of quads checking for matches.
>>>>> 3. for each result that matches, verify that it has G and S (I have
>>>>> done this with an extended iterator in Jena)
>>>>> 
>>>>> result is an iterator that returns all (G,S,*,*) quads.
>>>>> 
>>>>> similar tests can be performed for any pattern -- same code used.
>>>>> 
>>>>> Step 2 is the expensive one.  But the bloom filter check is so
>>> efficient
>>>>> that it becomes very difficult to perform search operations in less
>>> time
>>>>> than it takes to scan the list.
>>>>> 
>>>>> Claude
>>>>> 
>>>>> On Mon, Aug 31, 2015 at 11:01 AM, Andy Seaborne <an...@apache.org>
>>> wrote:
>>>>> 
>>>>>> On 29/08/15 14:55, Claude Warren wrote:
>>>>>> 
>>>>>>> Something I have been thinking about....
>>>>>>> 
>>>>>>> you could replace GSPO, GOPS, SPOG, OSGP, PGSO, OPSG with a single
>>>>>>> bloomfilter implementation.  It means a 2-step process to find
>>>>>>> matches, but it might be fast enough and reduce the overhead
>>>>>>> significantly.
>>>>>>> 
>>>>>>> I did an in-memory and a relational DB based version recently, but it
>>>> was
>>>>>>> just a quick POC.
>>>>>>> 
>>>>>>> Claude
>>>>>>> 
>>>>>> 
>>>>>> So we're talking about in-memory, where the items are java classes.  A
>>>>>> quad is 2 slots java overhead + 4 slots for G, S, P, O pointers.
>>>> That's 48
>>>>>> bytes if the heap is >32G and 24 bytes otherwise (compressed pointers
>>>> or 32
>>>>>> bit).
>>>>>> 
>>>>>> For storage, the key test is "contains" to maintain the "set of"
>>>>>> semantics.  Something to stop index traversal for each insert would be
>>>>>> great, but the quad is still stored, and storing it once, not up to 6
>>>>>> times, would be good. (Note that most data is unique quads.)
>>>>>> 
>>>>>> The important retrieval operation is find(G,S,P,O), where any of those
>>>>>> can be a wildcard, returning (ideally as a stream) all matching quads
>>>>>> with a prefix.  The multiple indexes exist to find based on prefix.
>>>>>> 
>>>>>> How would that work for, say find(G,S,*,*) with bloom filters and 1b
>>>>>> quads?  How does the code go from returning G,S,P1,O1 to the next
>>>> G,S,P1,O2
>>>>>> without trying every value for the O slot?
>>>>>> 
>>>>>> For a hash map based hierarchical index G->S->P->O, it's O(1) to find
>>>>>> the start of the scan, then datastructure iteration.  A hash-based
>>>>>> structure is not necessarily the best choice [*] but it's a baseline
>>>>>> to discuss.
>>>>>> 
>>>>>> And in memory, will a bloom filter-based system be faster?  Because of
>>>>>> false-positives, isn't a definitive index still needed?  If one is
>>>>>> kept, not 6, there could be great space gains, but every quad returned
>>>>>> is a top-to-bottom traversal of that index (which is now not a range
>>>>>> index).
>>>>>> 
>>>>>> The design should work for 1+ billion in-memory quads - that's the way
>>>> the
>>>>>> world is going.
>>>>>> 
>>>>>>> So each quad is reduced to a
>>>>>>> single bloom filter comprising 4 items (15 bytes).
>>>>>>> 
>>>>>> 
>>>>>>      Andy
>>>>>> 
>>>>>> [*] even in memory, it might be worth allocating internal ids and
>>>>>> working in longs like a disk based system because it is more compact -
>>>>>> naive hashmaps take a lot of space when storing small items like quads.
>>>>>> tradeoffs, tradeoffs, ...
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Wed, Aug 26, 2015 at 3:27 PM, A. Soroka <aj...@virginia.edu>
>>> wrote:
>>>>>>> 
>>>>>>> Hey, folks--
>>>>>>>> 
>>>>>>>> There hasn't been too much feedback on my proposal for a journaling
>>>>>>>> DatasetGraph:
>>>>>>>> 
>>>>>>>> https://github.com/ajs6f/jena/tree/JournalingDatasetgraph
>>>>>>>> 
>>>>>>>> which was and is to be a step towards JENA-624: Develop a new
>>>> in-memory
>>>>>>>> RDF Dataset implementation. So I'm moving on to look at the real
>>>> problem:
>>>>>>>> an in-memory  DatasetGraph with high concurrency, for use with
>>> modern
>>>>>>>> hardware running many, many threads in large core memory.
>>>>>>>> 
>>>>>>>> I'm beginning to sketch out rough code, and I'd like to run some
>>>> design
>>>>>>>> decisions past the list to get criticism/advice/horrified
>>>>>>>> warnings/whatever
>>>>>>>> needs to be said.
>>>>>>>> 
>>>>>>>> 1) All-transactional action: i.e. no non-transactional operation.
>>>> This is
>>>>>>>> obviously a great thing for simplifying my work, but I hope it won't
>>>> be
>>>>>>>> out
>>>>>>>> of line with the expected uses for this stuff.
>>>>>>>> 
>>>>>>>> 2) 6 covering indexes in the forms GSPO, GOPS, SPOG, OSGP, PGSO,
>>>> OPSG. I
>>>>>>>> figure to play to the strength of in-core-memory operation: raw
>>> speed,
>>>>>>>> but
>>>>>>>> obviously this is going to cost space.
>>>>>>>> 
>>>>>>>> 3) At least for now, all commits succeed.
>>>>>>>> 
>>>>>>>> 4) The use of persistent datastructures to avoid complex and
>>>> error-prone
>>>>>>>> fine-grained locking regimes. I'm using http://pcollections.org/
>>> for
>>>>>>>> now,
>>>>>>>> but I am in no way committed to it nor do I claim to have thoroughly
>>>>>>>> vetted
>>>>>>>> it. It's simple but enough to get started, and that's all I need to
>>>> bring
>>>>>>>> the real design questions into focus.
>>>>>>>> 
>>>>>>>> 5) Snapshot isolation. Transactions do not see commits that occur
>>>> during
>>>>>>>> their lifetime. Each works entirely from the state of the
>>>> DatasetGraph at
>>>>>>>> the start of its life.
>>>>>>>> 
>>>>>>>> 6) Only as many as one transaction per thread, for now. Transactions
>>>> are
>>>>>>>> not thread-safe. These are simplifying assumptions that could be
>>>> relaxed
>>>>>>>> later.
>>>>>>>> 
>>>>>>>> My current design operates as follows:
>>>>>>>> 
>>>>>>>> At the start of a transaction, a fresh in-transaction reference is
>>>> taken
>>>>>>>> atomically from the AtomicReference that points to the index block.
>>> As
>>>>>>>> operations are performed in the transaction, that in-transaction
>>>>>>>> reference
>>>>>>>> is progressed (in the sense in which any persistent datastructure is
>>>>>>>> progressed) while the operations are recorded. Upon an abort, the
>>>>>>>> in-transaction reference and the record are just thrown away. Upon a
>>>>>>>> commit, the in-transaction reference is thrown away and the
>>> operation
>>>>>>>> record is re-run against the main reference (the one that is copied
>>> at
>>>>>>>> the
>>>>>>>> beginning of a transaction). That rerun happens inside an atomic
>>>> update
>>>>>>>> (hence the use of AtomicReference). This all should avoid the need
>>> for
>>>>>>>> explicit locking in Jena and should confine any blocking against the
>>>>>>>> indexes to the actual duration of a commit.
>>>>>>>> 
>>>>>>>> What do you guys think?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> ---
>>>>>>>> A. Soroka
>>>>>>>> The University of Virginia Library
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> I like: Like Like - The likeliest place on the web
>>>>> <http://like-like.xenei.com>
>>>>> LinkedIn: http://www.linkedin.com/in/claudewarren
>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> I like: Like Like - The likeliest place on the web
>>> <http://like-like.xenei.com>
>>> LinkedIn: http://www.linkedin.com/in/claudewarren
>>> 
>> 
>> 
>> 
>> -- 
>> Paul Houle
>> 
>> *Applying Schemas for Natural Language Processing, Distributed Systems,
>> Classification and Text Mining and Data Lakes*
>> 
>> (607) 539 6254    paul.houle on Skype   ontology2@gmail.com
>> 
>> :BaseKB -- Query Freebase Data With SPARQL
>> http://basekb.com/gold/
>> 
>> Legal Entity Identifier Lookup
>> https://legalentityidentifier.info/lei/lookup/
>> <http://legalentityidentifier.info/lei/lookup/>
>> 
>> Join our Data Lakes group on LinkedIn
>> https://www.linkedin.com/grp/home?gid=8267275
>