Posted to dev@jena.apache.org by Rob Vesse <rv...@dotnetrdf.org> on 2017/02/02 10:08:52 UTC

Re: Horizontal scalability and limits of TDB

Following up on one of Andy’s comments inline:

On 23/01/2017 19:02, "Andy Seaborne" <an...@apache.org> wrote:

    Hi there - good to hear from you.
    
    I hope all these pointers in the thread are helpful.
    
    On 22/01/17 15:31, A. Soroka wrote:
    > First, to your specific questions:
    >
    >> 1. Atomicity, consistency, isolation and durability of a
    >> transaction on a single TDB database: apart from the limitations in
    >> the documentation of TDB Transactions and Txn, are there current
    >> issues? Edge cases detected and not yet covered?
    >
    > I'm not really sure what we mean by "consistency" once we go beyond a
    > single writer. Without a schema and therefore without any
    > understanding of data dependencies within the database, it's not
    > clear to me how we can automatically understand when a state is
    > consistent. It seems we have to leave that to the applications, for
    > the most part. I'm very interested myself in ways we could "hint" to
    > a triplestore the data dependencies we want it to understand (perhaps
    > something like OWL/ICV), but that's not really a scaling issue.
    >
    > I've recently been investigating the possibility of lock regions more
    > granular than a whole dataset:
    >
    > https://github.com/apache/jena/pull/204
    >
    > for the special case of named graphs as the lock regions. We
    > discussed this about a year ago when Claude Warren (Jena
    > committer/PMC) made up some designs for discussion:
    >
    > https://lists.apache.org/thread.html/916eed68e9847c6f4c0330fecff8b6f416a27344f2d995400e834562@1451744303@%3Cdev.jena.apache.org%3E
    >
    >  and there is a _lot_ more to be thought about there.
    >
    > Jena uses threads as stand-ins for transactions, and there is
    > definitely work to be done to separate those ideas so that more than
    > one thread can participate in a transaction
    
    Even in TDB1, transactions can run on multiple threads, as 
    multiple-reader, single-writer. It's only the API that has the 
    thread-transaction linkage.
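
    For example, a minimal, untested sketch (the dataset location and 
    data are placeholders): each thread runs its own transaction via the 
    Txn helper, so readers proceed concurrently with a single writer.

        import org.apache.jena.query.Dataset;
        import org.apache.jena.system.Txn;
        import org.apache.jena.tdb.TDBFactory;
        import org.apache.jena.vocabulary.RDFS;

        public class TxnThreads {
            public static void main(String[] args) throws InterruptedException {
                Dataset ds = TDBFactory.createDataset("target/tdb-demo");

                // Single writer: the write transaction lives on this thread.
                Thread writer = new Thread(() -> Txn.executeWrite(ds, () ->
                    ds.getDefaultModel()
                      .createResource("http://example/s")
                      .addProperty(RDFS.label, "example")));

                // Multiple readers: each thread opens its own read
                // transaction, so reads run concurrently with the writer.
                Runnable reader = () -> Txn.executeRead(ds, () ->
                    System.out.println(Thread.currentThread().getName()
                        + " sees " + ds.getDefaultModel().size() + " triples"));
                Thread r1 = new Thread(reader);
                Thread r2 = new Thread(reader);

                writer.start(); r1.start(); r2.start();
                writer.join(); r1.join(); r2.join();
            }
        }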
    
    > and so that transactions
    > can be managed independently of threading and low-level concurrency.
    > That would be a pretty major change in the codebase, but Andy has
    > been making some moves that will help set that up by changing from a
    > single class being transactional to several type together composing a
    > transactional thing.
    
    In RDF stores, TDB included, the basic storage and query execution are 
    more decoupled than in an SQL DBMS.  There are improvements to query 
    processing that are separate from the storage.
    
    Interesting things to consider are multithreading (Java fork/join) for 
    query processing, or Apache Spark.
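
    That would be parallelism inside a single query's execution; at the 
    application level, independent queries can already run side by side 
    because each thread holds its own read transaction.  A rough, 
    untested sketch (queries and dataset are placeholders):

        import java.util.List;
        import java.util.concurrent.ForkJoinPool;

        import org.apache.jena.query.Dataset;
        import org.apache.jena.query.DatasetFactory;
        import org.apache.jena.query.QueryExecution;
        import org.apache.jena.query.QueryExecutionFactory;
        import org.apache.jena.query.ResultSetFormatter;
        import org.apache.jena.system.Txn;

        public class ParallelQueries {
            public static void main(String[] args) {
                Dataset ds = DatasetFactory.createTxnMem();   // placeholder dataset
                List<String> queries = List.of(
                    "SELECT (COUNT(*) AS ?n) { ?s ?p ?o }",
                    "SELECT DISTINCT ?p { ?s ?p ?o }");

                // Each task opens its own read transaction, so the
                // queries run side by side on the fork/join common pool.
                ForkJoinPool.commonPool().submit(() ->
                    queries.parallelStream().forEach(q ->
                        Txn.executeRead(ds, () -> {
                            try (QueryExecution qe = QueryExecutionFactory.create(q, ds)) {
                                ResultSetFormatter.out(qe.execSelect());
                            }
                        }))).join();
            }
        }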
    
    >> 2. Are there currently available strategies to achieve a
    >> horizontally scaled TDB database?
    >
    > I'll let Andy speak to this, but I know of none (and I would very much
    > like to!).
    
    Currently - no, not for sideways scale, only (nearly ready!) for High 
    Availability.
    
    >> 3. What do you think of trying to implement horizontal
    >> scalability with DatasetGraph or something else with, let's say,
    >> CockroachDB, VoltDB, PostgreSQL, etc.?
    
    I did a Project Voldemort backend once.  The issue is the amount of data 
    to move between storage and query engine.  It needs careful 
    pattern-sensitive caching.
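
    For anyone wanting to experiment with such a backend, the extension 
    point is small: implement find-by-pattern over the storage and wrap 
    it as a Graph, and ARQ does the rest.  A rough, untested sketch (the 
    KVStore interface is invented for illustration):

        import java.util.stream.Stream;

        import org.apache.jena.graph.Triple;
        import org.apache.jena.graph.impl.GraphBase;
        import org.apache.jena.rdf.model.Model;
        import org.apache.jena.rdf.model.ModelFactory;
        import org.apache.jena.util.iterator.ExtendedIterator;
        import org.apache.jena.util.iterator.WrappedIterator;

        // Invented for this sketch: any storage that can enumerate
        // triples matching a (possibly wildcard) pattern.
        interface KVStore {
            Stream<Triple> match(Triple pattern);
        }

        public class KVGraph extends GraphBase {
            private final KVStore store;

            public KVGraph(KVStore store) { this.store = store; }

            // The one required method: answer a triple pattern from
            // storage.  ARQ drives all SPARQL evaluation through calls
            // like this - hence the need for pattern-sensitive caching.
            @Override
            protected ExtendedIterator<Triple> graphBaseFind(Triple pattern) {
                return WrappedIterator.create(store.match(pattern).iterator());
            }

            // Wrap as a Model so it can be queried with ARQ.
            public static Model asModel(KVStore store) {
                return ModelFactory.createModelForGraph(new KVGraph(store));
            }
        }

    The hard part is not the wrapper but avoiding a storage round trip 
    per pattern.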
    
    Apache Spark looks interesting - there are a few papers around that 
    have looked at it, but I think the starting point is a specific 
    problem to solve.  There are reasonable but different design choices 
    depending on what the problem to be addressed is.  Otherwise, without 
    a focus, it's hard to make choices with confidence.
    
    Apache Spark is also easy to work with on a local machine.
    
    >
    > See Claude's reply about Cassandra. Claude's is not the only work
    > with Cassandra for RDF. There is also:
    >
    > https://github.com/cumulusrdf/cumulusrdf
    >
    > but that does not seem to be a very active project.
    >
    >> 4. If there are some stress tests available, e.g. I read about a
    >> 100M BSBM test, is it included in the src? Or may I have a copy
    >> of it? ... Or, some guidelines, so I can start to create this
    >> stress code. Will it be useful to you also?
    
    BSBM is https://sourceforge.net/projects/bsbmtools/
    
    I've hacked it for my own use to add handling of local databases (i.e. 
    same process), not just remote connections.
    
        https://github.com/afs/BSBM-Local/
    
    Like any benchmark, it emphasises some aspects and not others.  It is 
    filter-focused.
    
    There are commercial benchmarks from the Linked Data Benchmark Council 
    (not just RDF), but they are commercially focused.  The LDBC is trying 
    to be a TPC equivalent, including the fees.
    
    100M is not a huge database these days ... except for the SP2B 
    benchmark, which is all about basic pattern joins.  Even 25M is hard 
    for that.
    
    One to avoid is LUBM - even the originators say it should not be used.
    
    >
    > You will definitely want to know about the work Rob Vesse (Jena
    > committer/PMC) has done on this front:
    >
    > https://github.com/rvesse/sparql-query-bm
    
    This is probably the best starting point - BSBM is not maintained as 
    far as I can see, so a BSBM setup for Rob's project would have more 
    life.  I think (Rob, correct me if I'm wrong) it needs the "random 
    parameters" added - or, more practically, generate a set of queries 
    from templates and then use sparql-query-bm.  BSBMtools does both at 
    once, which makes it inflexible - separating benchmark execution from 
    benchmark generation splits the functional roles and allows each to do 
    its part well.

The framework already includes the ability to run parameterised queries. Currently this support requires a predefined set of parameter sets; however, it would not be a particularly difficult extension to allow plugging in more advanced parameter providers.
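
For the query-generation step Andy describes, stock Jena has ParameterizedSparqlString; a rough, untested sketch (template and values invented) that emits one concrete query per parameter value, ready to feed to a tool like sparql-query-bm:

    import java.util.List;

    import org.apache.jena.query.ParameterizedSparqlString;

    public class TemplateGen {
        public static void main(String[] args) {
            // Invented template: one variable bound per generated query.
            String template = "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
                + "SELECT ?label WHERE { ?product rdfs:label ?label }";
            List<String> products = List.of(
                "http://example/product/1",
                "http://example/product/2");

            for (String iri : products) {
                ParameterizedSparqlString pss = new ParameterizedSparqlString(template);
                pss.setIri("product", iri);          // bind ?product to a concrete IRI
                System.out.println(pss.toString());  // one ready-to-run query per value
            }
        }
    }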
    
    >
    > Modeling workloads for triplestores, in general, is hard because
    > people use them in so many different ways. Also knowing (say) the
    > maximum number of nodes you could put in a dataset might not help you
    > very much if the query time for that dataset with your queries isn't
    > what you need. That's not to discourage you from working on this
    > problem, just to point out that there is a lot of subtlety to even
    > defining and scoping the problem well. It seems to me that most
    > famous benchmarks for RDF stores take up a particular system of use
    > cases and model that.

This is what I have always tried to emphasise with my benchmarking work. Benchmarks always represent a specific usage pattern/case, therefore the most realistic benchmark is one you design yourself using your own data and queries. The standard benchmarks will give you a general comparison between systems but may not emphasise the things that matter to you. Often I see people emphasise throughput, which makes sense if your use case is OLTP, but that is less relevant if your use case is interactive usage, where latency and time to solution are more important.
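
As a toy illustration of that distinction (dataset and query are placeholders), the same query execution can be timed two ways - time to first solution versus time to drain all solutions:

    import org.apache.jena.query.Dataset;
    import org.apache.jena.query.DatasetFactory;
    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.ResultSet;
    import org.apache.jena.vocabulary.RDFS;

    public class LatencyVsThroughput {
        public static void main(String[] args) {
            Dataset ds = DatasetFactory.create();        // placeholder dataset
            ds.getDefaultModel().createResource("http://example/s")
              .addProperty(RDFS.label, "example");       // one triple so results exist
            String q = "SELECT * WHERE { ?s ?p ?o }";    // placeholder query

            long t0 = System.nanoTime();
            try (QueryExecution qe = QueryExecutionFactory.create(q, ds)) {
                ResultSet rs = qe.execSelect();
                long first = -1, count = 0;
                while (rs.hasNext()) {
                    rs.next();
                    if (count++ == 0) first = System.nanoTime() - t0;   // latency
                }
                long total = System.nanoTime() - t0;                    // time to solution
                System.out.println("first solution: " + first / 1_000_000 + "ms, all "
                    + count + " solutions: " + total / 1_000_000 + "ms");
            }
        }
    }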

Rob

    >
    > Otherwise: I've been thinking about scale-out for Jena for a while,
    > too. Particularly I've been inspired by some of the advanced ideas
    > being worked on in RDFox and TriAD [1], [2], and Andy pointed out
    > this [3] blog post from the folks working on the closed-source
    > product Stardog.
    >
    > In fact, I was about to write some questions to the list
    > (particularly Andy) about how we might start thinking about working
    > in ARQ to split queries to partitions in different nodes, perhaps
    > using summary graphs to avoid sending BGPs where they aren't going to
    > find results or even using metadata at the branching nodes of the
    > query tree to do cost accounting and results cardinality bounding. It
    > seems we could at least get basic partitioning with enough time to
    > work on it (he wrote blithely!). We might use something like Apache
    > Zookeeper to manage the partitions and nodes and help figure out
    > where to send different branches of the query. TriAD and RDFox are
    > using clever ways of letting different paths through the query slip
    > asynchronously against each other, but that seems to me like a bridge
    > too far at first. Just getting a distributed approach basically
    > working and giving correct results would be a great start! :grin:
    >
    > --- A. Soroka
    >
    > [1] https://www.cs.ox.ac.uk/ian.horrocks/Publications/download/2016/PoMH16a.pdf
    > [2] http://adrem.ua.ac.be/~tmartin/Gurajada-Sigmod14.pdf
    > [3] http://blog.stardog.com/how-to-read-stardog-query-plans/
    >
    >> On Jan 20, 2017, at 8:38 PM, De Gyves <de...@gmail.com> wrote:
    >>
    >> I'd like to participate on the storage portion of Jena, maybe TDB.
    >> As I have worked many years developing with RDBMS, I'd like to
    >> explore new horizons of persistence; graph-based ones seem very
    >> promising for my next projects, so I'd like to use SPARQL and RDF
    >> with Jena/TDB and see how far I can go.
    >>
    >> So I've spent the last two days exploring the jena-dev mail
    >> archives from August 2015 to January of this year and found some
    >> interesting threads, such as the development of TDB2, the tests
    >> with 100M of BSBM data, a question about horizontal scaling, and
    >> that anything that implements DatasetGraph can be used for a
    >> triple store. Some readings of the Jena doc include: SPARQL, the
    >> RDF API, Txn and TDB transactions.
    >>
    >> What I am looking for is to get a clear perspective of some
    >> requirements which are taken for granted on a traditional RDBMS.
    >> These are:
    >>
    >> 1. Atomicity, consistency, isolation and durability of a
    >> transaction on a single TDB database: apart from the limitations in
    >> the documentation of TDB Transactions and Txn, are there current
    >> issues? Edge cases detected and not yet covered?
    >>
    >> 2. Are there currently available strategies to achieve a
    >> horizontally scaled TDB database?
    >>
    >> 3. What do you think of trying to implement horizontal scalability
    >> with DatasetGraph or something else with, let's say, CockroachDB,
    >> VoltDB, PostgreSQL, etc.?
    >>
    >> 4. If there are some stress tests available, e.g. I read about a
    >> 100M BSBM test, is it included in the src? Or may I have a copy of
    >> it? I'd like to see what the limits of the current TDB are, and
    >> maybe of TDB2: maximum size on disk of a dataset, max number of
    >> nodes in a dataset, of models or graphs in a dataset, the limiting
    >> behavior of a typical read/write transaction vs. the number of
    >> nodes, datasets, etcetera. Or, some guidelines, so I can start to
    >> create this stress code. Will it be useful to you also?
    >>
    >> -- Víctor-Polo de Gyvés Montero. +52 (55) 4926 9478 (cellphone in
    >> Mexico City) Address: Daniel Delgadillo 7 6A, Agricultura
    >> neighborhood, Miguel Hidalgo borough, ZIP: 11360, México City.
    >>
    >> http://degyves.googlepages.com
    >