Posted to dev@jena.apache.org by Arne Bernhardt <ar...@gmail.com> on 2023/06/12 20:36:15 UTC

Re: Why DatasetGraphInMemory?

Hi Andy

you mentioned RoaringBitmaps. I took the time to experiment with them.
They are really amazing. The performance of #add, #remove and #contains is
comparable to a Java HashSet. RoaringBitmaps are much faster at iterating
over values, and bit operations even between two quite large bitmaps run
like a charm. RoaringBitmaps also need less memory than a Java HashSet
(even less than an optimized integer hash set based on the concepts in
HashCommon).
A first graph implementation was easy to create (albeit with a little help
from ChatGPT, as I had no idea yet how to use RoaringBitmaps).
One only needs an indexed set of all triples, plus three maps keyed by
subject, predicate and object with bitmaps as values.
Each bitmap contains the indices of all triples that contain the
corresponding node in that position.
To find SPO --> use the indexed set of all triples.
To find S__, _P_, or __O --> look up the bitmap in the corresponding map and
iterate over all its indices, mapping them to triples via the indexed set.
To find SP_, S_O, or _PO --> look up the two bitmaps for the given nodes,
perform an "and" operation on the two bitmaps and again iterate over the
resulting indices, mapping them to triples via the indexed set.
The _PO lookup in particular is incredibly fast compared to GraphMem or
similarly structured graphs.
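
To make that layout concrete, here is a stripped-down sketch of the
structure (class, field and method names are just mine for this mail, and
removal handling is left out - an illustration, not the actual
implementation):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

import org.apache.jena.graph.Node;
import org.apache.jena.graph.Triple;
import org.roaringbitmap.RoaringBitmap;

// An indexed list of all triples plus one node->bitmap map per position.
public class RoaringTripleIndexSketch {
    private final List<Triple> triples = new ArrayList<>();              // index -> triple
    private final Map<Node, RoaringBitmap> bySubject   = new HashMap<>();
    private final Map<Node, RoaringBitmap> byPredicate = new HashMap<>();
    private final Map<Node, RoaringBitmap> byObject    = new HashMap<>();

    public void add(Triple t) {
        int index = triples.size();
        triples.add(t);
        bySubject.computeIfAbsent(t.getSubject(), n -> new RoaringBitmap()).add(index);
        byPredicate.computeIfAbsent(t.getPredicate(), n -> new RoaringBitmap()).add(index);
        byObject.computeIfAbsent(t.getObject(), n -> new RoaringBitmap()).add(index);
    }

    // _P_ : one bitmap lookup, then map every set bit back to its triple.
    public void findByPredicate(Node p, Consumer<Triple> sink) {
        RoaringBitmap bits = byPredicate.get(p);
        if (bits != null)
            bits.forEach((org.roaringbitmap.IntConsumer) i -> sink.accept(triples.get(i)));
    }

    // _PO : "and" the predicate and object bitmaps, then map the indices back to triples.
    public void findByPredicateObject(Node p, Node o, Consumer<Triple> sink) {
        RoaringBitmap pBits = byPredicate.get(p);
        RoaringBitmap oBits = byObject.get(o);
        if (pBits == null || oBits == null)
            return;
        RoaringBitmap.and(pBits, oBits)
                     .forEach((org.roaringbitmap.IntConsumer) i -> sink.accept(triples.get(i)));
    }
}
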
Just for fun, I replaced the bitmaps with two sets of integers and
simulated the "and" operation by iterating over the smaller set and
checking its entries against the larger set using #contains --> it is
10-100 times slower than the "and" operation of RoaringBitmaps.
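
For the record, the simulated intersection looked roughly like this (a
sketch of the idea, not the exact benchmark code):

import java.util.HashSet;
import java.util.Set;

// Simulated "and": probe the larger set with every value of the smaller one.
static Set<Integer> naiveAnd(Set<Integer> a, Set<Integer> b) {
    Set<Integer> smaller = a.size() <= b.size() ? a : b;
    Set<Integer> larger  = (smaller == a) ? b : a;
    Set<Integer> result  = new HashSet<>();
    for (int i : smaller)
        if (larger.contains(i))
            result.add(i);
    return result;
}
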
Now I really understand the hype around RoaringBitmaps. It seems absolutely
justified to me.
Smaller graphs with RoaringBitmaps need about twice as much memory for the
indexing structures (triples excluded) as GraphMem.
(The additional memory requirement is not only due to the bitmaps, but also
to the additional indexed set of triples).
For larger graphs (500k triples and above), this gap begins to close. At 1M
triples, the RoaringBitmap variant comes out ahead at 88 MB compared to
106 MB with GraphMem.
After two JVM warmup iterations, loading all the triples from
bsbm-25m.nt.gz into the new graph took only about 18 seconds, and the graph
required only an additional 1941 MB of memory.

I'm not sure how well RoaringBitmaps handle constant updates. I have tried
many #add and #remove calls on larger graphs and they seem to work well.
But two methods caught my attention:
*
https://javadoc.io/doc/org.roaringbitmap/RoaringBitmap/latest/org/roaringbitmap/RoaringBitmap.html#runOptimize()
*
https://javadoc.io/doc/org.roaringbitmap/RoaringBitmap/latest/org/roaringbitmap/RoaringBitmap.html#trim()
I have no idea when it would be a good time to use them.
Removing and adding triples in a graph of size x over y iterations and
measuring the impact on memory and performance could be one way to find
potential problems.
Do you have a scenario in mind that I could use to test whether I ever need
one of these methods?

   Arne

Andy Seaborne <an...@apache.org> wrote on Mon, 22 May 2023, 16:52:

>
>
> On 20/05/2023 17:18, Arne Bernhardt wrote:
> > Hi Andy,
> > thank you, that was very helpful to get the whole picture.
> >
> > Some time ago, I told you that at my workplace we implemented an
> in-memory
> > SPARQL-Server based on a Delta
> > <
> https://jena.apache.org/documentation/javadoc/jena/org.apache.jena.core/org/apache/jena/graph/compose/Delta.html
> >
> > .
> > We started a few years ago, before RDF-patch
> > <https://jena.apache.org/documentation/rdf-patch/>, based on the
> "difference
> > model"
> > <https://lists.w3.org/Archives/Public/www-rdf-interest/2001Mar/0216.html
> >,
> > that has become part of the CGMES standard.
> > For our server, we strictly follow the CQRS with event-sourcing
> > <https://learn.microsoft.com/en-us/azure/architecture/patterns/cqrs>
> > pattern. All transactions are recorded as an event with a list of triples
> > added and a list of triples removed.
> > The events are stored in an RDBMS (Oracle or PostgreSQL). For query
> > execution we need the relevant data to fit into memory but all data and
> > versions are also persisted.
> > To be able to store and load graphs very fast, we use RDF Thrift with LZ4
> > compression and store them in blobs.
> > All queries are executed on projected datasets for the requested version
> > (any previous version) of the data and the requested named graphs.
> > Thanks to the versioning, we fully support MR+SW. We even support
> multiple
> > writers, with a git-like branching and merging approach and optimistic
> > locking.
>
> How does that work for RDF?
>
> Is it at the unit of an "entity"?
>
> > To prevent the chain of deltas from becoming too long, we have several
> > strategies to write full graphs into snapshots.
> >
> > DatasetGraphInMemory seems to be 5-7 times slower than GraphMem, at least
> > when adding triples. So it might be worth trying to meet your constraints
> > while leaving the poor performance behind.
> > My basic idea is:
> > - having two instances "writer graph" and "reader graph" of GraphMem and
> > switching between them
> >     - Unfortunately that would mean a doubling of memory. (but
> > DatasetGraphInMemory seems to use approx. 3 times the memory of GraphMem)
>
> There may be better persistent datastructures than Dexx, have you looked
> around? It seems a good idea to adopt an existing library because getting
> robustness and concurrency safety right is not trivial.
>
> > - implement a dataset that performs the write transactions on the "writer
> > graph" and in deltas
>
> Jena now has BufferingDatasetGraph and BufferingGraph.
>
> These capture changes while leaving the underlying dataset or graph
> untouched until a flush is done.
>
> Only one copy and the diffs.
>
> If you have two copies you can atomically swap to get MR+SW - buffering
> flush is by default a write-txn on the underlying store.
>
> Two copies in memory isn't so bad if the strings (lexical forms, URIs)
> are shared.
>
> A parser run does this with an LRU cache - but it isn't guaranteeing
> uniquely intern'ed strings. (It saves about 30% on long parser runs)
>
> You could use a different FactoryRDFCaching which shares state across
> parser runs and fully interns nodes, i.e. a Node pool for the
> whole JVM.
>
> >    - meanwhile any number of readers could read from the previous
> version by
> > reading the "reader graph"
> >    - when the writer transaction finishes
> >       - if there are no readers using the "reader graph", it can be
> swapped
> > with the "writer graph" and the former "reader graph" can be updated
> using
> > the deltas
>
> What happens if a reader turns up while carrying out the updates? Or are
> your deltas usually "small" so that it does not matter?
>
> > - when the next writer starts, it would again write into the "writer
> graph"
> >    - meanwhile any number of readers could read from previous version(s)
> by
> > reading "reader graph" [+ deltas]
> > - when the last reader of the oldest version closes the transaction
> >      - if there are no writers using the "writer graph", it can be
> reversed
> > using deltas in reverse order. Then "reader graph" can be swapped with
> the
> > "writer graph" for the next read transaction. Then the former "reader
> > graph" needs to be updated using all deltas.
> > - only if there is no point in time without overlapping read and write
> > transactions, a short pause may be needed occasionally to clear the
> deltas
> > that are piling up.
>
> Yes :-) And if the system is constantly loaded
>
> (TDB1 does something similar - it eventually decides it really must
> clear the backlog and locks the DB down - better than running out of
> resources).
>
> > - It is not lock-free
>
> But no latches (locks on parts of the data, not locks controlling
> e.g. switching graphs)
>
> > - there would be no background tasks or scheduling involved, "only"
> having
> > each graph twice and all #add and #remove operations would have to be
> done
> > 3-4 times.
> >
> > The idea is still a bit blurry in my head. What do you think?
>
> Interesting for your use case.
>
> It's not exposed but TDB2 does actually have the previous versions until
> compaction happens. Each index is immutable after an update and a delta
> tree gets created (all the way back to the tree root). The tree roots are
> still in the DB until they are cleared up by compaction.
>
> Sounds like you have the style, but applied to the graph, and can use
> the GC for clearing up.
>
> ---
>
> Another option is to use bitmap indexes. https://roaringbitmap.org/. (I
> don't know what the time/space tradeoff is for RDF usage.)
>
>      Andy
>
> >
> >       Arne
> >
> > On Sat, 20 May 2023 at 15:19, Andy Seaborne <andy@apache.org> wrote:
> >
> >> Hi Arne,
> >>
> >> On 19/05/2023 21:21, Arne Bernhardt wrote:
> >>> Hi,
> >>> in a recent  response
> >>> <https://github.com/apache/jena/issues/1867#issuecomment-1546931793>
> to
> >> an
> >>> issue it was said that   "Fuseki - uses DatasetGraphInMemory mostly"  .
> >>> For my  PR <https://github.com/apache/jena/pull/1865>, I added a JMH
> >>> benchmark suite to the project. So it was easy for me to compare the
> >>> performance of GraphMem with
> >>> "DatasetGraphFactory.createTxnMem().getDefaultGraph()".
> >>> DatasetGraphInMemory is much slower in every discipline tested (#add,
> >>> #delete, #contains, #find, #stream).
> >>> Maybe my approach is too naive?
> >>> I understand very well that the underlying Dexx Collections Framework,
> >> with
> >>> its immutable persistent data structures, makes threading and
> transaction
> >>> handling easy
> >>
> >> DatasetGraphInMemory (TIM = Transactions In Memory) has one big
> advantage.
> >>
> >> It supports multiple-readers and a single-writer (MR+SW) at the same
> >> time - truly concurrent. So does TDB2 (TDB1 is sort of hybrid).
> >>
> >> MR+SW has a cost which is a copy-on-write overhead, a reader-centric
> >> design choice allowing the readers to run latch-free.
> >>
> >> You can't directly use a regular hash map with concurrent updates. (And
> >> no, ConcurrentHashMap does not solve all problems, even for a single
> >> datastructure. A dataset needs to coordinate changes to multiple
> >> datastructures into a single transactional unit.)
> >>
> >> GraphMem cannot do MR+SW - for all storage datasets/graphs that do not
> >> have built-in support for MR+SW, the best that can be done is MRSW -
> >> multiple-readers or a single-writer.
> >>
> >> For MRSW, when a writer starts, the system has to hold up subsequent
> >> readers, let existing ones finish, then let the writer run, then release
> >> any readers held up. (variations possible - whether readers or writers
> >> get priority).
> >>
> >> This is bad in a general concurrent environment. e.g. Fuseki.
> >>
> >> One writer can "accidentally" lock out the dataset.
> >>
> >> Maybe the application isn't doing updates, in which case a memory
> >> dataset focused on read throughput is better, especially with better
> >> triple density in memory.
> >>
> >> Maybe the application is single threaded or can control threads itself
> >> (non-Fuseki).
> >>
> >>> and that there are no issues with consuming iterators or
> >>> streams even after a read transaction has closed.
> >>
> >> Continuing to use an iterator after the end of a transaction should not
> >> be allowed.
> >>
> >>> Is it currently supported for consumers to use iterators and streams
> >> after
> >>> a transaction has been closed?
> >>
> >> Consumers that want this must copy the iterator - it's an explicit
> opt-in.
> >>
> >> Does this happen with Dexx? It may do, because Dexx relies on the
> >> garbage collector so some things just happen.
> >>
> >>> If so, I don't currently see an easy way to
> >>> replace DatasetGraphInMemory with a faster implementation. (although
> >>> transaction-aware iterators that copy the remaining elements into lists
> >>> could be an option).
> >>
> >> copy-iterators are going to be expensive in RAM - a denial of service
> >> issue - and speed (lesser issue, possibly).
> >>
> >>> Are there other reasons why DatasetGraphInMemory is the preferred
> dataset
> >>> implementation for Fuseki?
> >>
> >> MR+SW in an environment where there is no other information about
> >> requirements is the safe choice.
> >>
> >> If an app wants to trade the issues of MRSW for better performance, it
> >> is a choice it needs to make. One case for Fuseki is publishing
> >> relatively static data - e.g. reference data, where changes come from a
> >> known, well-behaved application.
> >>
> >> Both a general purpose TIM and a higher density, faster dataset have
> >> their places.
> >>
> >>       Andy
> >>
> >>>
> >>> Cheers,
> >>> Arne
> >>>
> >>
> >
>

Re: Why DatasetGraphInMemory?

Posted by Arne Bernhardt <ar...@gmail.com>.
Hi Andy,

In the meantime, I've been using my bulk update tests (e.g. updating about
200,000 triples over and over again in a 1M-triple graph, roughly the loop
sketched below) to observe whether it slows down and how memory is freed up
when I call GC between iterations and measure memory. RoaringBitmaps seem
to be quite GC friendly. There was no increase in memory usage. The
RoaringBitmap graph even seems to be more GC friendly than GraphMem in the
same scenario.
So I decided to create https://github.com/apache/jena/issues/1912.
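
For reference, the measurement loop is roughly this (simplified - the graph
and the set of update triples are placeholders, not the real test code):

import java.util.List;
import org.apache.jena.graph.Graph;
import org.apache.jena.graph.Triple;

// Rough shape of the bulk-update memory check.
static void measure(Graph graph, List<Triple> updates, int iterations) {
    Runtime rt = Runtime.getRuntime();
    for (int i = 0; i < iterations; i++) {
        updates.forEach(graph::delete);   // remove the ~200,000 update triples
        updates.forEach(graph::add);      // and add them back
        System.gc();                      // request a GC between iterations
        long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
        System.out.println("iteration " + i + ": ~" + usedMb + " MB used");
    }
}
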

   Arne

On Sat, 17 Jun 2023 at 17:33, Andy Seaborne <an...@apache.org> wrote:

> Just from reading the javadoc - #runOptimize() might be useful for a
> load-and-readonly graph - do a lot of loading work and then switch to the
> more efficient form. It depends on how much space it saves. My instinct is
> that the saving for the overall graph may not be that great because the
> RDF terms take up a lot of the space at scale, so savings on the
> bitmaps might, overall, not be significant.

Re: Why DatasetGraphInMemory?

Posted by Andy Seaborne <an...@apache.org>.

On 12/06/2023 21:36, Arne Bernhardt wrote:
> I'm not sure how well RoaringBitmaps handle constant updates. I have tried
> many #add and #remove calls on larger graphs and they seem to work well.
> But two methods caught my attention:
> *
> https://javadoc.io/doc/org.roaringbitmap/RoaringBitmap/latest/org/roaringbitmap/RoaringBitmap.html#runOptimize()
> *
> https://javadoc.io/doc/org.roaringbitmap/RoaringBitmap/latest/org/roaringbitmap/RoaringBitmap.html#trim()
> I have no idea when it would be a good time to use them.
> Removing and adding triples in a graph of size x over y iterations and
> measuring the impact on memory and performance could be one way to find
> potential problems.
> Do you have a scenario in mind that I could use to test whether I ever need
> one of these methods?

Just from reading the javadoc - #runOptimize() might be useful for a
load-and-readonly graph - do a lot of loading work and then switch to the
more efficient form. It depends on how much space it saves. My instinct is
that the saving for the overall graph may not be that great because the
RDF terms take up a lot of the space at scale, so savings on the
bitmaps might, overall, not be significant.
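
Something along these lines, after the load and before switching to
read-only use (just a sketch - "allBitmaps" stands for whatever per-node
bitmaps the index holds, it is not a real field name):

import java.util.Collection;
import org.roaringbitmap.RoaringBitmap;

// After bulk loading, compact every bitmap once, then serve read-only queries.
static void compact(Collection<RoaringBitmap> allBitmaps) {
    for (RoaringBitmap bm : allBitmaps) {
        bm.runOptimize();   // convert suitable containers to run-length encoding where that saves space
        bm.trim();          // release excess backing-array capacity
    }
}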
