Posted to dev@marmotta.apache.org by Raffaele Palmieri <ra...@gmail.com> on 2013/05/21 22:50:49 UTC

Bulk load of triples on DB

Hi all,
I would like to propose a small change to the architecture of the Importer Service.
Currently, for every triple, individual SQL commands are invoked from
SailConnectionBase to persist the triple information in the DB. That is probably
one of the major causes of the slow import operation.
I have thought of a way to optimize that operation: building, for example, a csv,
tsv, or *sv file that most RDBMSs are able to import in an
optimized way.
For example, MySQL has the LOAD DATA INFILE command, PostgreSQL has the COPY
command, and H2 has INSERT INTO ... SELECT FROM CSVREAD.
I am checking whether this modification is feasible; it will surely need a
specialization of the SQL dialect depending on the RDBMS used.
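
For illustration, a rough JDBC sketch of the idea (the "triples" table and its
columns are invented, not the actual KiWi schema). Note that PostgreSQL's COPY
FROM a file path requires the file to be readable by the server process; the
CopyManager API with COPY FROM STDIN is the usual JDBC alternative, and H2 would
use INSERT INTO ... SELECT FROM CSVREAD(...) instead:

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

// Illustrative sketch only: table and column names are invented.
public class TsvBulkLoader {

    /** Writes one tab-separated line (subject, predicate, object IDs) per triple. */
    public static Path writeTsv(List<long[]> triples) throws IOException {
        Path file = Files.createTempFile("triples", ".tsv");
        try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            for (long[] t : triples) {
                out.write(t[0] + "\t" + t[1] + "\t" + t[2]);
                out.newLine();
            }
        }
        return file;
    }

    /** Issues the dialect-specific bulk load command for the generated file. */
    public static void bulkLoad(Connection con, String dialect, Path file) throws SQLException {
        String path = file.toAbsolutePath().toString();
        String sql;
        if ("postgresql".equals(dialect)) {
            // COPY's default text format is tab-delimited
            sql = "COPY triples (subject, predicate, object) FROM '" + path + "'";
        } else {
            // MySQL; the default field terminator of LOAD DATA is also the tab character
            sql = "LOAD DATA LOCAL INFILE '" + path + "' INTO TABLE triples (subject, predicate, object)";
        }
        try (Statement stmt = con.createStatement()) {
            stmt.execute(sql);
        }
    }
}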
What do you think? Would it have too much impact?
Regards,
Raffaele.

Re: Bulk load of triples on DB

Posted by Sergio Fernández <wi...@apache.org>.
I find the proposal quite interesting: whatever improves the import 
performance would be nice!

Personally I'm not an expert on RDBMSs, so I registered MARMOTTA-245 for 
checking whether your proposal is actually feasible. Once you get your 
committer rights over the git repository, feel free to open a topic 
branch if you think the changes will need more work before they are usable.

That reminds me that you should take a look at our development 
practices:

   http://marmotta.incubator.apache.org/development.html

For the moment in Marmotta we are using a branching workflow that is 
currently still under discussion/development:

   http://markmail.org/message/ki5n2ren5mkh57hb

All contributions on those aspects are also welcome!

Cheers,


On 21/05/13 22:50, Raffaele Palmieri wrote:
> Hi to all,
> I would propose a little a change to architecture of Importer Service.
> Actually for every triple there are single SQL commands invoked from
> SailConnectionBase that persist triple informations on DB. That's probably
> one of major causes of delay of import operation.
> I thought a way to optimize that operation, building for example a csv,
> tsv, or *sv file that the major part of RDBMS are able to import in an
> optimized way.
> For example, MySQL has Load Data Infile command, Postgresql has Copy
> command, H2 has Insert into ... Select from Csvread.
> I am checking if this modification is feasible; it surely will need a
> specialization of sql dialect depending on used RDBMS.
> What do you think about? would it have too much impacts?
> Regards,
> Raffaele.
>

-- 
Sergio Fernández

Re: Bulk load of triples on DB

Posted by Sergio Fernández <wi...@apache.org>.
Hi,

related to this thread, Rupert has pointed me to a very interesting paper 
that brings back the idea of having a triple store implementation on top 
of a NoSQL database:

Rya: A Scalable RDF Triple Store for the Clouds
http://www.usna.edu/Users/cs/adina/research/Rya_CloudI2012.pdf

I could not find the tools available out there, but it is very well 
described, it would fit in Marmotta (they use Sesame), and the numbers look 
promising enough to at least explore this path too.

Cheers,


On 21/05/13 22:50, Raffaele Palmieri wrote:
> Hi to all,
> I would propose a little a change to architecture of Importer Service.
> Actually for every triple there are single SQL commands invoked from
> SailConnectionBase that persist triple informations on DB. That's probably
> one of major causes of delay of import operation.
> I thought a way to optimize that operation, building for example a csv,
> tsv, or *sv file that the major part of RDBMS are able to import in an
> optimized way.
> For example, MySQL has Load Data Infile command, Postgresql has Copy
> command, H2 has Insert into ... Select from Csvread.
> I am checking if this modification is feasible; it surely will need a
> specialization of sql dialect depending on used RDBMS.
> What do you think about? would it have too much impacts?
> Regards,
> Raffaele.
>

-- 
Sergio Fernández

Re: Bulk load of triples on DB

Posted by Sebastian Schaffert <se...@gmail.com>.
BTW, if you are interested in really high-performance big data imports,
have a look at this Sail implementation:

http://www.systap.com/bigdata.htm

Actually it was one of the reasons for planning the exchangeable backends.
Unfortunately the library is GPL, so it cannot be directly included into
Marmotta, but every user is free to download it himself and use it ;-)

Greetings,

Sebastian


2013/5/23 Sebastian Schaffert <se...@gmail.com>

> Hi Raffaele,
>
> the idea was anyways to allow different backends besides KiWi, because
> each has its advantages and disadvantages (KiWi's advantages are the
> versioning and the reasoner). The issue is documented under
>
> https://issues.apache.org/jira/browse/MARMOTTA-85
>
> and the individual backends have subsequent numbers. See e.g.
>
> https://issues.apache.org/jira/browse/MARMOTTA-89
>
> for the SDB backend implementation.
>
> Changing backends is currently not possible, but it is foreseen in the
> architecture and it would take me about one day of work to change the
> platform in a way that other backends can be used. The main change will be
> in the SesameServiceImpl which sets up the underlying triple store. The
> initialisation method for this service stacks together different sails
> depending on the configuration and is already very modular. The only thing
> that is currently hardcoded there is the initialisation of a new KiWiStore,
> but in principle it could be any Sesame Sail.
>
> But there are some consequences and dependencies, e.g. the
> marmotta-versioning and marmotta-reasoner modules cannot be used if the
> backend is not KiWi, and I need to find a clean way to model these
> dependencies (Maven is unfortunately probably not enough, because several
> backends could be on the classpath and only one backend selected - on the
> other hand we could simply create different backend configurations in Maven
> that only include one backend to be used - we will see).
>
> If you want to try with SDB and TDB, the first step would be to implement
> a clean wrapper that allows accessing Jena through the Sesame SAIL API.
> Peter Ansell has already worked on such adapters:
>
> https://github.com/ansell/JenaSesame
>
> Maybe this would be a good starting point. I will in parallel try to work
> on modularizing the backends. Not sure when I will be able to finish this,
> because other things are currently on my priority list...
>
> Greetings,
>
> Sebastian
>
>
> 2013/5/23 Raffaele Palmieri <ra...@gmail.com>
>
>> Hi Sebastian, below are some considerations that induce me to think that
>> Jena SDB(or TDB) could be a better solution, but I understand that's a big
>> impact on codebase, and so I would go cautious.
>>
>> On 23 May 2013 12:20, Sebastian Schaffert <sebastian.schaffert@gmail.com> wrote:
>>
>> > Hi Raffaele,
>> >
>> >
>> > 2013/5/22 Raffaele Palmieri <ra...@gmail.com>
>> >
>> > > On 22 May 2013 15:04, Andy Seaborne <an...@apache.org> wrote:
>> > >
>> > > > What is the current loading rate?
>> > > >
>> > >
>> > > Tried a test with a graph of 661 nodes and 957 triples: it took about 18
>> > > sec. So, looking at the triples, the average rate is 18.8 ms per triple;
>> > > tested on Tomcat with a maximum size of 1.5 Gb.
>> > >
>> > >
>> > This is a bit too small for a real test, because you will have a high
>> > influence of side effects (like cache initialisation). I have done some
>> > performance comparisons with importing about 10% of GeoNames (about 15
>> > million triples, 1.5 million resources). The test uses a specialised
>> > parallel importer that was configured to run 8 importing threads in
>> > parallel. Here are some figures on different hardware:
>> > - VMWare, 4CPU, 6GB RAM, HDD: 4:20h (avg per 100 resources: 10-13 seconds,
>> >   8 in parallel). In case of VMWare, the CPU is waiting most of the time for
>> >   I/O, so apparently the harddisk is slow. Could also be related to an older
>> >   Linux kernel or the host the instance is running on (might not have 4
>> >   physical CPUs assigned to the instance).
>> > - QEmu, 4CPU (8GHz), 6GB RAM, SSD: 2:10h (avg per 100 resources: 4-5 seconds,
>> >   8 in parallel). The change to SSD does not deliver the expected performance
>> >   gain, the limit is mostly the CPU power (load always between 90-100%).
>> >   However, the variance in the average time for 100 is less, so the results
>> >   are more stable over time.
>> > - Workstation, 8CPU, 24GB RAM, SSD: 0:40h (avg per 100 resources: 1-2
>> >   seconds, 8 in parallel). Running on physical hardware obviously shows the
>> >   highest performance. All 8 CPUs between 85-95% load.
>> >
>> > In this setup, my observation was that about 80% of the CPU time is
>> > actually spent in Postgres, and most of the database time in SELECTs (not
>> > INSERTs) because of checking if a node or triple already exists. So the
>> > highest performance gain will be achieved by reducing the load on the
>> > database. There is already a quite good caching system in place (using
>> > EHCache); unfortunately the caching cannot solve the issue of checking for
>> > non-existence (a cache can only help when checking for existence). This is
>> > why especially the initial import is comparably slow.
>> >
>> >
>> You are right that my test of about 1000 triples was too limited,
>> especially with lower resources than yours; but with the same graph and the
>> same resources Jena SDB offers better performance. However, I agree with you
>> that we will need more benchmarks.
>> In favor of Jena, both SDB and TDB already have a command line tool to
>> access the database directly.
>>
>>
>> > Conceptually, when inserting a triple, the workflow currently looks as
>> > follows:
>> >
>> > 1. for each node of subject, predicate, object, context:
>> > 1.1. check for existence of node
>> > 1.1.a node exists in cache, return its database ID
>> > 1.1.b node does not exist in cache, look in the database if it exists there
>> > (SELECT) and return its ID, or null
>> > 1.2. if the database ID is null:
>> > 1.2.1 query the sequence (H2, PostgreSQL: SELECT nextval(...)) or the
>> > sequence simulation table (MySQL: SELECT) to get the next database ID and
>> > assign it to the node
>> > 1.2.2 store the node in the database (INSERT) and add it to the cache
>> > 2. check for existence of triple:
>> > 2.a triple exists in cache, return its database ID
>> > 2.b triple does not exist in cache, look in the database if it exists there
>> > (SELECT) and return its ID, or null
>> > 3. if the triple ID is null:
>> > 3.1 query the sequence or the sequence simulation table (MySQL) to get the
>> > next database ID for triples and assign it to the triple
>> > 3.2 store the triple in the database (INSERT) and add it to the cache
>> >
>> > So, in the worst case (i.e. all nodes are new and the triple is new, so
>> > nothing can be answered from the cache) you will have:
>> > - 4 INSERT commands (three nodes, 1 triple), these are comparably cheap
>> > - 4 SELECT commands for existence checking (three nodes, 1 triple), these
>> > are comparably expensive
>> > - 4 SELECT from sequence commands in case of PostgreSQL or H2, very cheap, or
>> > 4 SELECT from table commands in case of MySQL, comparably cheap (but not as
>> > good as a real sequence)
>> > What is even worse is that the INSERT and SELECT commands will be
>> > interwoven, i.e. there will be alternating SELECTs and INSERTs, which
>> > databases do not really like.
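
A condensed Java sketch of the per-triple workflow described above (class and
method names are invented for illustration and do not correspond to the actual
KiWi persistence code):

import java.sql.SQLException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; the DB primitives (plain SELECT/INSERT/sequence
// statements) are left abstract.
public abstract class TripleWorkflowSketch {

    private final Map<String, Long> nodeCache = new HashMap<String, Long>();   // node value -> DB id
    private final Map<String, Long> tripleCache = new HashMap<String, Long>(); // "s,p,o,c"  -> DB id

    protected abstract Long selectNodeId(String node) throws SQLException;
    protected abstract void insertNode(long id, String node) throws SQLException;
    protected abstract Long selectTripleId(long s, long p, long o, long c) throws SQLException;
    protected abstract void insertTriple(long id, long s, long p, long o, long c) throws SQLException;
    protected abstract long nextSequenceValue(String sequence) throws SQLException;

    /** Steps 1.1-1.2: resolve or create a node and return its database ID. */
    protected long resolveNodeId(String node) throws SQLException {
        Long id = nodeCache.get(node);                      // 1.1.a cache hit
        if (id == null) id = selectNodeId(node);            // 1.1.b SELECT in the DB
        if (id == null) {
            id = nextSequenceValue("seq_nodes");            // 1.2.1 sequence / sequence table
            insertNode(id, node);                           // 1.2.2 INSERT and cache
            nodeCache.put(node, id);
        }
        return id;
    }

    /** Steps 1-3: resolve all four nodes, then resolve or create the triple. */
    public long addTriple(String s, String p, String o, String c) throws SQLException {
        long sid = resolveNodeId(s), pid = resolveNodeId(p),
             oid = resolveNodeId(o), cid = resolveNodeId(c);
        String key = sid + "," + pid + "," + oid + "," + cid;
        Long tid = tripleCache.get(key);                    // 2.a cache hit
        if (tid == null) tid = selectTripleId(sid, pid, oid, cid); // 2.b SELECT
        if (tid == null) {
            tid = nextSequenceValue("seq_triples");         // 3.1
            insertTriple(tid, sid, pid, oid, cid);          // 3.2 INSERT and cache
            tripleCache.put(key, tid);
        }
        return tid;
    }
}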
>> >
>> >
>> This workflow is for duplicate checking; from the documentation I see that
>> the Jena SDB Loader already performs duplicate suppression.
>>
>>
>> > To optimize the performance, the best options are therefore:
>> > - avoiding alternating SELECTs and INSERTs as much as possible (e.g. at
>> > least batching the node insertions for each triple)
>> > - avoiding the comparably expensive existence checks (e.g. another way of
>> > caching/looking up that supports checking for non-existence)
>> >
>> > If bulk import is then still slow, it might make sense to look into the
>> > database-specific bulk loading commands you suggested.
>> >
>> > If I find some time, I might be able to look into the first optimization
>> > (i.e. avoiding too many alternating SELECT and INSERT commands). Maybe a
>> > certain improvement can already be achieved by optimizing this per triple.
>> >
>> > If you want to try out more sophisticated improvements or completely
>> > alternate ways of bulk loading, I would be very happy to see it. Just make
>> > sure the database schema and integrity constraints are kept as they are and
>> > the rest will work. The main constraint is that nodes are unique (i.e. each
>> > URI or Literal has exactly one database row) and non-deleted triples are
>> > unique (i.e. each non-deleted triple has exactly one database ID).
>> >
>> >
>> > > >
>> > > > The Jena SDB bulk loader may have some ideas you can draw on. It bulk
>> > > > loads a chunk (typically 10K) of triples at a time, using DB temporary tables
>> > > > as working storage. The database layout is a triples+nodes database
>> > > > layout. SDB manipulates those tables in the DB to find new nodes to add to
>> > > > the node table and new triples to add to the triples table as single SQL
>> > > > operations. The original designer may be around on dev@jena.a.o
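
A rough JDBC sketch of that staging-table technique (table and column names are
invented; this is not the actual SDB loader code): insert a chunk into a
temporary table with one batch, then merge only the new rows into the triples
table with a single set-based statement:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

// Sketch of an SDB-style chunked load via a staging table.
public class StagedTripleLoader {

    public void loadChunk(Connection con, List<long[]> chunk) throws SQLException {
        try (Statement ddl = con.createStatement()) {
            ddl.execute("CREATE TEMPORARY TABLE IF NOT EXISTS triples_stage "
                      + "(subject BIGINT, predicate BIGINT, object BIGINT)");
            ddl.execute("DELETE FROM triples_stage");
        }
        try (PreparedStatement ps = con.prepareStatement(
                "INSERT INTO triples_stage (subject, predicate, object) VALUES (?, ?, ?)")) {
            for (long[] t : chunk) {                 // typically ~10K triples per chunk
                ps.setLong(1, t[0]);
                ps.setLong(2, t[1]);
                ps.setLong(3, t[2]);
                ps.addBatch();
            }
            ps.executeBatch();
        }
        try (Statement merge = con.createStatement()) {
            // Single SQL operation that adds only the triples not already present.
            merge.executeUpdate(
                "INSERT INTO triples (subject, predicate, object) "
              + "SELECT DISTINCT s.subject, s.predicate, s.object FROM triples_stage s "
              + "WHERE NOT EXISTS (SELECT 1 FROM triples t WHERE t.subject = s.subject "
              + "AND t.predicate = s.predicate AND t.object = s.object)");
        }
    }
}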
>> > > >
>> > > >
>> > > This design looks interesting and it seems to be a similar approach to my
>> > > idea; it could be investigated. In that case, can we think about using Jena
>> > > SDB in Marmotta?
>> > >
>> > >
>> > This could be implemented by wrapping Jena SDB (or also TDB) in a Sesame
>> > Sail, and actually there is already an issue for this in Jira. However,
>> > when doing this you will lose support for the Marmotta/KiWi Reasoner and
>> > Versioning.
>>
>>
>> That would be a good approach, as it avoids too much refactoring. For the KiWi
>> Reasoner and Versioning, a move to the Jena RDF API could be needed.
>>
>>
>> > My suggestion would instead be to look how Jena SDB is
>> > implementing the bulk import and try a similar solution.
>>
>>
>> We could implement the same approach from scratch (with queues, chunks and
>> threads); combined with the use of JDBC batch processing, we would obtain a
>> better result. But doesn't it make more sense to directly use an already
>> implemented solution?
>>
>>
>> > But if we start
>> > with the optimizations I have already suggested, there might be a huge gain
>> > already. It just has not been in our focus right now, because the scenarios
>> > we are working on do not require bulk-loading huge amounts of data. Data
>> > consistency and parallel access were more important to us. But it would be a
>> > nice feature to be able to run a local copy of GeoNames or DBPedia using
>> > Marmotta ;-)
>>
>>
>>   Yes, it would be nice :)
>>
>>
>> >
>>
>>
>> > Greetings,
>> >
>> > Sebastian
>> >
>>
>> Greetings
>> Raffaele.
>>
>
>

Re: Bulk load of triples on DB

Posted by Peter Ansell <an...@gmail.com>.
On 23 May 2013 23:03, Sebastian Schaffert <se...@gmail.com> wrote:

> Hi Raffaele,
>
> the idea was anyways to allow different backends besides KiWi, because each
> has its advantages and disadvantages (KiWi's advantages are the versioning
> and the reasoner). The issue is documented under
>
> https://issues.apache.org/jira/browse/MARMOTTA-85
>
> and the individual backends have subsequent numbers. See e.g.
>
> https://issues.apache.org/jira/browse/MARMOTTA-89
>
> for the SDB backend implementation.
>
> Changing backends is currently not possible, but it is foreseen in the
> architecture and it would take me about one day of work to change the
> platform in a way that other backends can be used. The main change will be
> in the SesameServiceImpl which sets up the underlying triple store. The
> initialisation method for this service stacks together different sails
> depending on the configuration and is already very modular. The only thing
> that is currently hardcoded there is the initialisation of a new KiWiStore,
> but in principle it could be any Sesame Sail.
>
> But there are some consequences and dependencies, e.g. the
> marmotta-versioning and marmotta-reasoner modules cannot be used if the
> backend is not KiWi, and I need to find a clean way to model these
> dependencies (Maven is unfortunately probably not enough, because several
> backends could be on the classpath and only one backend selected - on the
> other hand we could simply create different backend configurations in Maven
> that only include one backend to be used - we will see).
>
> If you want to try with SDB and TDB, the first step would be to implement a
> clean wrapper that allows accessing Jena through the Sesame SAIL API. Peter
> Ansell has already worked on such adapters:
>
> https://github.com/ansell/JenaSesame
>

That project was created originally by Andy to focus on Jena access to
Sesame Repositories. The sesame-jena module was created by me but it only
wraps Jena Models as Sesame Graphs (that could be updated to Sesame Model
now but I haven't had a chance to do it). I am not sure what would be
required to wrap Jena Models as Sesame Sails/Repositories.
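
For a rough idea of what the value mapping in such a wrapper involves, here is a
naive sketch (not the sesame-jena code; blank node and literal datatype/language
handling are simplified) that copies the statements of a Jena Model into Sesame
statements:

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.RDFNode;
import com.hp.hpl.jena.rdf.model.Statement;
import com.hp.hpl.jena.rdf.model.StmtIterator;
import org.openrdf.model.Resource;
import org.openrdf.model.URI;
import org.openrdf.model.Value;
import org.openrdf.model.ValueFactory;
import org.openrdf.model.impl.ValueFactoryImpl;
import java.util.ArrayList;
import java.util.List;

// Naive value mapping only; a real Sail/Repository wrapper also needs
// transactions, query evaluation, namespaces, etc.
public class JenaToSesameSketch {

    private final ValueFactory vf = ValueFactoryImpl.getInstance();

    public List<org.openrdf.model.Statement> convert(Model model) {
        List<org.openrdf.model.Statement> result = new ArrayList<org.openrdf.model.Statement>();
        StmtIterator it = model.listStatements();
        while (it.hasNext()) {
            Statement stmt = it.nextStatement();
            Resource s = stmt.getSubject().isAnon()
                    ? vf.createBNode(stmt.getSubject().getId().getLabelString())
                    : vf.createURI(stmt.getSubject().getURI());
            URI p = vf.createURI(stmt.getPredicate().getURI());
            Value o = convertObject(stmt.getObject());
            result.add(vf.createStatement(s, p, o));
        }
        return result;
    }

    private Value convertObject(RDFNode node) {
        if (node.isURIResource()) {
            return vf.createURI(node.asResource().getURI());
        } else if (node.isAnon()) {
            return vf.createBNode(node.asResource().getId().getLabelString());
        } else {
            return vf.createLiteral(node.asLiteral().getLexicalForm()); // datatype/lang dropped
        }
    }
}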

I haven't worked with it in production before so I am not sure what the
performance cost is. There is likely to be some noticeable performance cost
with creation of temporary objects as there is no caching at the boundary
right now.

So far I have only been using it to test out the use of the Jena-based
SPIN-API in a Sesame application. Jerven Bolleman has recently started
working on a SPIN-to-SPARQL parser for Sesame [1] so I may not need it
anymore for that usecase.

I am still happy to accept Pull Requests. Ideally Andy will also accept
Pull Requests back to his repository, although the last one stalled for
almost a year so I closed it myself [2].

Cheers,

Peter

[1] https://bitbucket.org/jbollema/sesame/commits/branch/SPIN
[2] https://github.com/afs/JenaSesame/pull/3

Re: Bulk load of triples on DB

Posted by Peter Ansell <an...@gmail.com>.
On 28 May 2013 03:50, Sebastian Schaffert <se...@gmail.com> wrote:

> Hi Raffaele,
>
> first of all: I did a number of database improvements. Maybe you can check
> how in your example this affects performance (check out the latest source
> release in the development branch)?
>
> Most of the refactoring for this change has already been done, the main
> things that are needed are:
> - allow SesameService to inject different implementations of storage
> backends, in the same style it already handles SailProviders
> - create separate modules for the different backends (e.g.
> marmotta-backend-kiwi, marmotta-backend-native, marmotta-backend-bigdata,
> marmotta-backend-tdb, marmotta-backend-sdb)
> - check which other modules are affected by this change, e.g. the current
> versioning and reasoning will only work with the kiwi backend, but other
> backends also support different kinds of reasoning so maybe the
> marmotta-reasoner can be made more generic to support different styles of
> reasoning, or there are different reasoning modules depending on the
> backend like marmotta-reasoner-kiwi, marmotta-reasoner-bigdata, ...
>
> What I still need to think about is how to ensure only one backend is used
> even if several are found on the classpath - because unfortunately Maven
> does not provide a way to define mutual exclusion of dependencies (or does
> it?). Maybe the user would even want to be able to select the backend at
> runtime - but then this has consequences on which other modules are
> available.
>
>
> Since I probably know the architecture best, I will try to provide the
> necessary infrastructure in the coming weeks. In the meantime, if you want
> to work on a Sesame API wrapper for Jena SDB and TDB, this would be a major
> step towards this goal. ;-)
>
>
If you can do this generically, as a Sail, then it would have a large
audience, based on the current interest in the JenaSesame wrapper.

Cheers,

Peter

Fwd: Bulk load of triples on DB

Posted by Raffaele Palmieri <ra...@gmail.com>.
Sorry, forwarding to dev.
Raffaele.

---------- Forwarded message ----------
From: Raffaele Palmieri <ra...@gmail.com>
Date: 28 May 2013 17:32
Subject: Re: Bulk load of triples on DB
To: Sebastian Schaffert <se...@gmail.com>


Hi Sebastian,

On Monday 27 May 2013, Sebastian Schaffert wrote:

 Hi Raffaele,
>
> first of all: I did a number of database improvements. Maybe you can check
> how in your example this affects performance (check out the latest source
> release in the development branch)?
>
>
the time for about 1000 triples is now halved (about 9 sec). :)


> Most of the refactoring for this change has already been done, the main
> things that are needed are:
> - allow SesameService to inject different implementations of storage
> backends, in the same style it already handles SailProviders
> - create separate modules for the different backends (e.g.
> marmotta-backend-kiwi, marmotta-backend-native, marmotta-backend-bigdata,
> marmotta-backend-tdb, marmotta-backend-sdb)
> - check which other modules are affected by this change, e.g. the current
> versioning and reasoning will only work with the kiwi backend, but other
> backends also support different kinds of reasoning so maybe the
> marmotta-reasoner can be made more generic to support different styles of
> reasoning, or there are different reasoning modules depending on the
> backend like marmotta-reasoner-kiwi, marmotta-reasoner-bigdata, ...
>
> What I still need to think about is how to ensure only one backend is used
> even if several are found on the classpath - because unfortunately Maven
> does not provide a way to define mutual exclusion of dependencies (or does
> it?).
>

With the Maven assembly plugin, making different releases with different modules,
documentation, etc. would be possible. I see that this assembly is already
used for producing the launchers; it could be extended for different
releases as well.


> Maybe the user would even want to be able to select the backend at runtime
> - but then this has consequences on which other modules are available.
>
>
>
At that point, the user could choose the release most suitable for him.


> Since I probably know the architecture best, I will try to provide the
> necessary infrastructure in the coming weeks. In the meantime, if you want
> to work on a Sesame API wrapper for Jena SDB and TDB, this would be a major
> step towards this goal. ;-)
>
>
I was looking at Andy's project, and then at Peter's (a bridge between
Sesame and Jena); it's not so trivial and requires a lot of effort with no
guaranteed results.



> Greetings,
>
> Sebastian
>
>
>
Greetings,
Raffaele.



Re: Bulk load of triples on DB

Posted by Sebastian Schaffert <se...@gmail.com>.
Hi Raffaele,

first of all: I did a number of database improvements. Maybe you can check
how in your example this affects performance (check out the latest source
release in the development branch)?

Most of the refactoring for this change has already been done, the main
things that are needed are:
- allow SesameService to inject different implementations of storage
backends, in the same style it already handles SailProviders
- create separate modules for the different backends (e.g.
marmotta-backend-kiwi, marmotta-backend-native, marmotta-backend-bigdata,
marmotta-backend-tdb, marmotta-backend-sdb)
- check which other modules are affected by this change, e.g. the current
versioning and reasoning will only work with the kiwi backend, but other
backends also support different kinds of reasoning so maybe the
marmotta-reasoner can be made more generic to support different styles of
reasoning, or there are different reasoning modules depending on the
backend like marmotta-reasoner-kiwi, marmotta-reasoner-bigdata, ...

What I still need to think about is how to ensure only one backend is used
even if several are found on the classpath - because unfortunately Maven
does not provide a way to define mutual exclusion of dependencies (or does
it?). Maybe the user would even want to be able to select the backend at
runtime - but then this has consequences on which other modules are
available.
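
One possible shape for that runtime selection, sketched with an invented SPI
(the BackendProvider interface and configuration key are not existing Marmotta
API), would be a ServiceLoader lookup keyed by a configuration property:

import java.util.ServiceLoader;
import org.openrdf.sail.Sail;

// Hypothetical sketch: pick exactly one backend at runtime even when several
// provider modules happen to be on the classpath.
public class BackendSelector {

    /** Hypothetical SPI that each marmotta-backend-* module would implement. */
    public interface BackendProvider {
        String getName();          // e.g. "kiwi", "native", "bigdata"
        Sail createSail();         // the store Sail handed to SesameServiceImpl
    }

    public static BackendProvider select(String configuredName) {
        for (BackendProvider provider : ServiceLoader.load(BackendProvider.class)) {
            if (provider.getName().equals(configuredName)) {
                return provider;   // only the configured backend is actually used
            }
        }
        throw new IllegalStateException("no backend named '" + configuredName + "' on the classpath");
    }
}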


Since I probably know the architecture best, I will try to provide the
necessary infrastructure in the coming weeks. In the meantime, if you want
to work on a Sesame API wrapper for Jena SDB and TDB, this would be a major
step towards this goal. ;-)

Greetings,

Sebastian



2013/5/27 Raffaele Palmieri <ra...@gmail.com>

>
>
> Il giorno venerdì 24 maggio 2013, Sergio Fernández ha scritto:
>
> Hi,
>>
>> IMHO just switch to Jena SDB is not a right idea, but porting some ideas
>> to KiWi triple store would be nice.
>
>
> Yes, i agree with you in consequence also of created issues
> Marmotta-85/89; Jena SDB would be only a possibile option, not the only
> backend.
>
>
>>
>> In parallel, when we switched to a pure Sesame-backend, the idea for the
>> mid-long term was to be able to run Marmotta on top of other triple stores.
>> I particularly would like to be able to use Jena TDB. But there are still
>> many part of Marmotta that should be refactored for allowing such.
>>
>>
> So, how do we think to approach this refactoring? Can we identify the
> parts that need refactoring or is it premature and so do we focus on other
> issues?
>
>
>
>> Cheers,
>>
>
> Regards,
> Raffaele.
>
>

Re: Bulk load of triples on DB

Posted by Raffaele Palmieri <ra...@gmail.com>.
On Friday 24 May 2013, Sergio Fernández wrote:

> Hi,
>
> IMHO just switch to Jena SDB is not a right idea, but porting some ideas
> to KiWi triple store would be nice.


Yes, I agree with you, also as a consequence of the created issues MARMOTTA-85/89;
Jena SDB would be only a possible option, not the only backend.


>
> In parallel, when we switched to a pure Sesame-backend, the idea for the
> mid-long term was to be able to run Marmotta on top of other triple stores.
> I particularly would like to be able to use Jena TDB. But there are still
> many part of Marmotta that should be refactored for allowing such.
>
>
So, how do we plan to approach this refactoring? Can we identify the parts
that need refactoring, or is it premature, so we should focus on other issues for now?



> Cheers,
>

Regards,
Raffaele.



Re: Bulk load of triples on DB

Posted by Sergio Fernández <wi...@apache.org>.
Hi,

IMHO just switching to Jena SDB is not the right idea, but porting some ideas 
to the KiWi triple store would be nice.

In parallel, when we switched to a pure Sesame backend, the idea for the 
mid-long term was to be able to run Marmotta on top of other triple 
stores. I particularly would like to be able to use Jena TDB. But there 
are still many parts of Marmotta that would need to be refactored to allow that.

Cheers,



-- 
Sergio Fernández

Re: Bulk load of triples on DB

Posted by Sebastian Schaffert <se...@gmail.com>.
Hi Raffaele,

the idea was anyways to allow different backends besides KiWi, because each
has its advantages and disadvantages (KiWi's advantages are the versioning
and the reasoner). The issue is documented under

https://issues.apache.org/jira/browse/MARMOTTA-85

and the individual backends have subsequent numbers. See e.g.

https://issues.apache.org/jira/browse/MARMOTTA-89

for the SDB backend implementation.

Changing backends is currently not possible, but it is foreseen in the
architecture and it would take me about one day of work to change the
platform in a way that other backends can be used. The main change will be
in the SesameServiceImpl which sets up the underlying triple store. The
initialisation method for this service stacks together different sails
depending on the configuration and is already very modular. The only thing
that is currently hardcoded there is the initialisation of a new KiWiStore,
but in principle it could be any Sesame Sail.
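
For illustration, a minimal sketch of such a sail stack with the plain Sesame
API, using a MemoryStore and an RDFS inferencer as stand-ins for the KiWiStore
and Marmotta's wrapper sails (this is not the actual SesameServiceImpl code):

import org.openrdf.repository.Repository;
import org.openrdf.repository.RepositoryException;
import org.openrdf.repository.sail.SailRepository;
import org.openrdf.sail.NotifyingSail;
import org.openrdf.sail.inferencer.fc.ForwardChainingRDFSInferencer;
import org.openrdf.sail.memory.MemoryStore;

// Minimal illustration of stacking sails; SesameServiceImpl is more elaborate
// and adds wrapper sails from its SailProviders depending on the configuration.
public class SailStackSketch {

    public static Repository createRepository() throws RepositoryException {
        // Base store: a MemoryStore as a stand-in; in Marmotta this is the
        // hardcoded KiWiStore, and this is the spot where another backend
        // Sail could be injected instead.
        NotifyingSail base = new MemoryStore();

        // Example wrapper sail stacked on top of the base store.
        NotifyingSail stacked = new ForwardChainingRDFSInferencer(base);

        Repository repository = new SailRepository(stacked);
        repository.initialize();
        return repository;
    }
}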

But there are some consequences and dependencies, e.g. the
marmotta-versioning and marmotta-reasoner modules cannot be used if the
backend is not KiWi, and I need to find a clean way to model these
dependencies (Maven is unfortunately probably not enough, because several
backends could be on the classpath and only one backend selected - on the
other hand we could simply create different backend configurations in Maven
that only include one backend to be used - we will see).

If you want to try with SDB and TDB, the first step would be to implement a
clean wrapper that allows accessing Jena through the Sesame SAIL API. Peter
Ansell has already worked on such adapters:

https://github.com/ansell/JenaSesame

Maybe this would be a good starting point. I will in parallel try to work
on modularizing the backends. Not sure when I will be able to finish this,
because other things are currently on my priority list...

Greetings,

Sebastian



Re: Bulk load of triples on DB

Posted by Raffaele Palmieri <ra...@gmail.com>.
Hi Sebastian, below are some considerations that lead me to think that
Jena SDB (or TDB) could be a better solution, but I understand that it
would have a big impact on the codebase, so I would proceed cautiously.

On 23 May 2013 12:20, Sebastian Schaffert <se...@gmail.com>wrote:

> Hi Raffaele,
>
>
> 2013/5/22 Raffaele Palmieri <ra...@gmail.com>
>
> > On 22 May 2013 15:04, Andy Seaborne <an...@apache.org> wrote:
> >
> > > What is the current loading rate?
> > >
> >
> > Tried a test with a graph of 661 nodes and 957 triples: it took about 18
> > sec. So, looking the triples the medium rate is 18.8 ms per triple;
> tested
> > on Tomcat with maximum size of 1.5 Gb.
> >
> >
> This is a bit too small for a real test, because you will have a high
> influence of side effects (like cache initialisation). I have done some
> performance comparisons with importing about 10% of GeoNames (about 15
> million triples, 1.5 million resources). The test uses a specialised
> parallel importer that was configured to run 8 importing threads in
> parallel. Here are some figures on different hardware:
> - VMWare, 4CPU, 6GB RAM, HDD: 4:20h (avg per 100 resources: 10-13 seconds,
> 8 in parallel). In case of VMWare, the CPU is waiting most of the time for
> I/O, so apparently the harddisk is slow. Could also be related to an older
> Linux kernel or the host the instance is running on (might not have 4
> physical CPUs assigned to the instance).
> - QEmu, 4CPU (8GHz), 6GB RAM, SSD: 2:10h (avg per 100 resources: 4-5
> seconds, 8 in parallel). The change to SSD does not deliver the expected
> performance gain, the limit is mostly the CPU power (load always between
> 90-100%). However, the variance in the average time for 100 is less, so
> the results are more stable over time.
> - Workstation, 8CPU, 24GB RAM, SSD: 0:40 (avg per 100 resources: 1-2
> seconds, 8 in parallel). Running on physical hardware obviously shows the
> highest performance. All 8 CPUs between 85-95% load.
>
> In this setup, my observation was that about 80% of the CPU time is
> actually spent in Postgres, and most of the database time in SELECTs (not
> INSERTs) because of checking if a node or triple already exists. So the
> highest performance gain will be achieved by reducing the load on the
> database. There is already a quite good caching system in place (using
> EHCache), unfortunately the caching cannot solve the issue of checking for
> non-existance (a cache can only help when checking for existance). This is
> why especially the initial import is comparably slow.
>
>
You are right that my test of about 1000 triples was too limited,
especially with fewer resources than yours; but with the same graph and the
same resources Jena SDB offers better performance. However, I agree with you
that we will need more benchmarks.
In favor of Jena, both SDB and TDB already have command line tools to
access the database directly.


> Conceptually, when inserting a triple, the workflow currently looks as
> follows:
>
> 1. for each node of subject, predicate, object, context:
> 1.1. check for existence of node
> 1.1.a node exists in cache, return its database ID
> 1.1.b node does not exist in cache, look in the database if it exists there
> (SELECT) and return its ID, or null
> 1.2. if the database ID is null:
> 1.2.1 query the sequence (H2, PostgreSQL: SELECT nextval(...)) or the
> sequence simulation table (MySQL: SELECT) to get the next database ID and
> assign it to the node
> 1.2.2 store the node in the database (INSERT) and add it to the cache
> 2. check for existance of triple:
> 2.a triple exists in cache, return its database ID
> 2.b triple does not exist in cache, look in the database if it exists there
> (SELECT) and return its ID, or null
> 3. if the triple ID is null:
> 3.1 query the sequence or the sequence simulation table (MySQL) to get the
> next database ID for triples and assign it to the triple
> 3.2 store the triple in the database (INSERT) and add it to the cache
>
> So, in the worst case (i.e. all nodes are new and the triple is new, so
> nothing can be answered from the cache) you will have:
> - 4 INSERT commands (three nodes, 1 triple), these are comparably cheap
> - 4 SELECT commands for existence checking (three nodes, 1 triple), these
> are comparably expensive
> - 4 SELECT from sequence commands in case of PostgeSQL or H2, very cheap or
> 4 SELECT from table commands in case of MySQL, comparably cheap (but not as
> good as a real sequence)
> what is even worse is that the INSERT and SELECT commands will be
> interwoven, i.e. there will be alternating SELECTs and INSERTS, which
> databases do not really like.
>
>
This workflow is for the duplication check; from the documentation I see
that the Jena SDB Loader already performs duplicate suppression.


> To optimize the performance, the best options are therefore:
> - avoiding alternating SELECTS and INSERTS as much as possible (e.g. at
> least for each triple batch the node insertions)
> - avoiding the comparably expensive existence checks (e.g. other way of
> caching/looking up that supports checking for non-existance)
>
> If bulk import then is still slow, it might make sense looking into the
> database specific bulk loading commands you suggested.
>
> If I find some time, I might be able to look into the first optimization
> (i.e. avoiding too many alternate SELECT and INSERT commands). Maybe a
> certain improvement can already be achieved by optimizing this per triple.
>
> If you want to try out more sophisticated improvements or completely
> alternate ways of bulk loading, I would be very happy to see it. Just make
> sure the database schema and integrity constraints are kept as they are and
> the rest will work. The main constraint is that nodes are unique (i.e. each
> URI or Literal has exactly one database row) and non-deleted triples are
> unique (i.e. each non-deleted triple has exactly one database ID).
>
>
> > >
> > > The Jena SDB bulk loader may have some ideas you can draw on.  It bulk
> > > loads a chunk (typically 10K) of triples at a time uses DB temporary
> > tables
> > > as working storage.  The database layout is a triples+nodes database
> > > layout.  SDB manipulates those tables in the DB to find new nodes to
> add
> > to
> > > the node table and new triples to add to the triples table as single
> SQL
> > > operations.  The original designer may be around on dev@jena.a.o
> > >
> > >
> > This design looks interesting and it seems to be a similar approach to my
> > idea, it could be investigated. In the case, can we think about use of
> Jena
> > SDB in Marmotta?
> >
> >
> This could be implemented by wrapping Jena SDB (or also TDB) in a Sesame
> Sail, and actually there is already an issue for this in Jira. However,
> when doing this you will loose support for the Marmotta/KiWi Reasoner and
> Versioning.


That would be a good way to avoid too much refactoring. For the KiWi
Reasoner and Versioning, a move to the Jena RDF API could be needed.


> My suggestion would instead be to look how Jena SDB is
> implementing the bulk import and try a similar solution.


We could implement the same approach from scratch (with a queue, chunks and
threads) and combine it with JDBC batch processing to obtain a better
result, but wouldn't it make more sense to directly use an already
implemented solution?
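
To make the idea a bit more concrete, here is a rough sketch of what such a
chunked, multi-threaded loader could look like (all class, table and column
names are invented for illustration; this is not existing Marmotta or Jena
code):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import javax.sql.DataSource;

// Each worker takes one chunk of pre-resolved node IDs and writes it with a
// single JDBC batch instead of issuing one INSERT per triple.
public class ChunkedTripleLoader {

    private final ExecutorService workers = Executors.newFixedThreadPool(8);
    private final DataSource dataSource;

    public ChunkedTripleLoader(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    /** Each element is {id, subject, predicate, object, context} as database IDs. */
    public void submitChunk(final List<long[]> chunk) {
        workers.submit(new Runnable() {
            public void run() {
                try (Connection con = dataSource.getConnection();
                     PreparedStatement ps = con.prepareStatement(
                         "INSERT INTO triples (id,subject,predicate,object,context) "
                         + "VALUES (?,?,?,?,?)")) {
                    con.setAutoCommit(false);
                    for (long[] t : chunk) {
                        for (int i = 0; i < t.length; i++) {
                            ps.setLong(i + 1, t[i]);
                        }
                        ps.addBatch();
                    }
                    ps.executeBatch();   // one round trip per chunk, not per triple
                    con.commit();
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }
        });
    }
}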


> But if we start
> with the optimizations I have already suggested, there might be a huge gain
> already. It just has not been in our focus right now, because the scenarios
> we are working on do not require bulk-loading huge amounts of data. Data
> consistency and parallel access was more important to us. But it would be a
> nice feature to be able to run a local copy of GeoNames or DBPedia using
> Marmotta ;-)


  Yes, it would be nice :)


>


> Greetings,
>
> Sebastian
>

Greetings
Raffaele.

Re: Bulk load of triples on DB

Posted by Sebastian Schaffert <se...@gmail.com>.
Hi Raffaele,


2013/5/22 Raffaele Palmieri <ra...@gmail.com>

> On 22 May 2013 15:04, Andy Seaborne <an...@apache.org> wrote:
>
> > What is the current loading rate?
> >
>
> Tried a test with a graph of 661 nodes and 957 triples: it took about 18
> sec. So, looking the triples the medium rate is 18.8 ms per triple; tested
> on Tomcat with maximum size of 1.5 Gb.
>
>
This is a bit too small for a real test, because you will have a high
influence of side effects (like cache initialisation). I have done some
performance comparisons with importing about 10% of GeoNames (about 15
million triples, 1.5 million resources). The test uses a specialised
parallel importer that was configured to run 8 importing threads in
parallel. Here are some figures on different hardware:
- VMWare, 4CPU, 6GB RAM, HDD: 4:20h (avg per 100 resources: 10-13 seconds,
8 in parallel). In case of VMWare, the CPU is waiting most of the time for
I/O, so apparently the harddisk is slow. Could also be related to an older
Linux kernel or the host the instance is running on (might not have 4
physical CPUs assigned to the instance).
- QEmu, 4CPU (8GHz), 6GB RAM, SSD: 2:10h (avg per 100 resources: 4-5
seconds, 8 in parallel). The change to SSD does not deliver the expected
performance gain, the limit is mostly the CPU power (load always between
90-100%). However, the variance in the average time for 100 is less, so
the results are more stable over time.
- Workstation, 8CPU, 24GB RAM, SSD: 0:40 (avg per 100 resources: 1-2
seconds, 8 in parallel). Running on physical hardware obviously shows the
highest performance. All 8 CPUs between 85-95% load.
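
As a rough sanity check of these figures: 1.5 million resources in batches
of 100 gives 15,000 batches; at roughly 1.5 seconds per batch with 8 running
in parallel that is about 15,000 * 1.5 / 8 = ~2,800 seconds, i.e. around 47
minutes, which is in line with the observed 0:40h. For the 15 million
triples this corresponds to a throughput in the order of 5,000-6,000 triples
per second on the workstation.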

In this setup, my observation was that about 80% of the CPU time is
actually spent in Postgres, and most of the database time in SELECTs (not
INSERTs) because of checking if a node or triple already exists. So the
highest performance gain will be achieved by reducing the load on the
database. There is already a quite good caching system in place (using
EHCache), unfortunately the caching cannot solve the issue of checking for
non-existence (a cache can only help when checking for existence). This is
why especially the initial import is comparably slow.

Conceptually, when inserting a triple, the workflow currently looks as
follows:

1. for each node of subject, predicate, object, context:
1.1. check for existence of node
1.1.a node exists in cache, return its database ID
1.1.b node does not exist in cache, look in the database if it exists there
(SELECT) and return its ID, or null
1.2. if the database ID is null:
1.2.1 query the sequence (H2, PostgreSQL: SELECT nextval(...)) or the
sequence simulation table (MySQL: SELECT) to get the next database ID and
assign it to the node
1.2.2 store the node in the database (INSERT) and add it to the cache
2. check for existence of triple:
2.a triple exists in cache, return its database ID
2.b triple does not exist in cache, look in the database if it exists there
(SELECT) and return its ID, or null
3. if the triple ID is null:
3.1 query the sequence or the sequence simulation table (MySQL) to get the
next database ID for triples and assign it to the triple
3.2 store the triple in the database (INSERT) and add it to the cache
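
As a simplified sketch (with invented SQL and table names, not the actual
KiWi persistence code), the per-node part of this workflow is roughly the
following; it shows why every new node costs one SELECT, one sequence lookup
and one INSERT:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Map;

class NodeResolver {

    /** Returns the database ID of the node, creating it if necessary (steps 1.1-1.2). */
    long resolveNode(Connection con, Map<String, Long> cache, String value) throws SQLException {
        Long id = cache.get(value);                                   // 1.1.a cache hit
        if (id != null) {
            return id;
        }
        try (PreparedStatement sel = con.prepareStatement(
                "SELECT id FROM nodes WHERE svalue = ?")) {           // 1.1.b check in the database
            sel.setString(1, value);
            try (ResultSet rs = sel.executeQuery()) {
                if (rs.next()) {
                    id = rs.getLong(1);
                }
            }
        }
        if (id == null) {                                             // 1.2 node is new
            try (Statement seq = con.createStatement();
                 ResultSet rs = seq.executeQuery("SELECT nextval('seq_nodes')")) {   // 1.2.1
                rs.next();
                id = rs.getLong(1);
            }
            try (PreparedStatement ins = con.prepareStatement(
                    "INSERT INTO nodes (id, svalue) VALUES (?, ?)")) {                // 1.2.2
                ins.setLong(1, id);
                ins.setString(2, value);
                ins.executeUpdate();
            }
        }
        cache.put(value, id);
        return id;
    }
}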

So, in the worst case (i.e. all nodes are new and the triple is new, so
nothing can be answered from the cache) you will have:
- 4 INSERT commands (three nodes, 1 triple), these are comparably cheap
- 4 SELECT commands for existence checking (three nodes, 1 triple), these
are comparably expensive
- 4 SELECT from sequence commands in case of PostgreSQL or H2, very cheap or
4 SELECT from table commands in case of MySQL, comparably cheap (but not as
good as a real sequence)
what is even worse is that the INSERT and SELECT commands will be
interwoven, i.e. there will be alternating SELECTs and INSERTS, which
databases do not really like.

To optimize the performance, the best options are therefore:
- avoiding alternating SELECTs and INSERTs as much as possible (e.g. at
least batching the node insertions for each triple)
- avoiding the comparably expensive existence checks (e.g. another way of
caching/looking up that supports checking for non-existence)

If bulk import is then still slow, it might make sense to look into the
database-specific bulk loading commands you suggested.

If I find some time, I might be able to look into the first optimization
(i.e. avoiding too many alternate SELECT and INSERT commands). Maybe a
certain improvement can already be achieved by optimizing this per triple.

If you want to try out more sophisticated improvements or completely
alternate ways of bulk loading, I would be very happy to see it. Just make
sure the database schema and integrity constraints are kept as they are and
the rest will work. The main constraint is that nodes are unique (i.e. each
URI or Literal has exactly one database row) and non-deleted triples are
unique (i.e. each non-deleted triple has exactly one database ID).




> >
> > The Jena SDB bulk loader may have some ideas you can draw on.  It bulk
> > loads a chunk (typically 10K) of triples at a time uses DB temporary
> tables
> > as working storage.  The database layout is a triples+nodes database
> > layout.  SDB manipulates those tables in the DB to find new nodes to add
> to
> > the node table and new triples to add to the triples table as single SQL
> > operations.  The original designer may be around on dev@jena.a.o
> >
> >
> This design looks interesting and it seems to be a similar approach to my
> idea, it could be investigated. In the case, can we think about use of Jena
> SDB in Marmotta?
>
>
This could be implemented by wrapping Jena SDB (or also TDB) in a Sesame
Sail, and actually there is already an issue for this in Jira. However,
when doing this you will lose support for the Marmotta/KiWi Reasoner and
Versioning. My suggestion would instead be to look at how Jena SDB is
implementing the bulk import and try a similar solution. But if we start
with the optimizations I have already suggested, there might be a huge gain
already. It just has not been in our focus right now, because the scenarios
we are working on do not require bulk-loading huge amounts of data. Data
consistency and parallel access was more important to us. But it would be a
nice feature to be able to run a local copy of GeoNames or DBPedia using
Marmotta ;-)


Greetings,

Sebastian

Re: Bulk load of triples on DB

Posted by Andy Seaborne <an...@apache.org>.
On 22/05/13 14:54, Raffaele Palmieri wrote:
> On 22 May 2013 15:04, Andy Seaborne <an...@apache.org> wrote:
>
>> What is the current loading rate?
>>
>
> Tried a test with a graph of 661 nodes and 957 triples: it took about 18
> sec. So, looking the triples the medium rate is 18.8 ms per triple; tested
> on Tomcat with maximum size of 1.5 Gb.
>
>
>>
>> The Jena SDB bulk loader may have some ideas you can draw on.  It bulk
>> loads a chunk (typically 10K) of triples at a time uses DB temporary tables
>> as working storage.  The database layout is a triples+nodes database
>> layout.  SDB manipulates those tables in the DB to find new nodes to add to
>> the node table and new triples to add to the triples table as single SQL
>> operations.  The original designer may be around on dev@jena.a.o
>>
>>
> This design looks interesting and it seems to be a similar approach to my
> idea, it could be investigated. In the case, can we think about use of Jena
> SDB in Marmotta?

Presumably that would have big repercussions across the codebase unless 
there is a clear SPARQL interface layer (i.e. it's 3-tier) and there are no 
additional capabilities of the SPARQL engine being relied on.  Sorry - 
don't know for sure what the answer is.

(I'm a mentor here - you should really remove my name from the 
contributors section in the POM!)


	Andy

>
>
>>          Andy
>>
>>
> Cheers,
> Raffaele.
>
>
>>
>> On 22/05/13 09:00, Sebastian Schaffert wrote:
>>
>>> Hi Raffaele,
>>>
>>> thanks for your ideas. I have been spending a lot of time thinking on how
>>> to improve the performance of bulk imports. There are currently several
>>> reasons why a bulk import is slow:
>>> 1) Marmotta uses (database) transactions to ensure a good behaviour and
>>> consistent data in highly parallel environments; transactions, however,
>>> introduce a big performance impact especially when they get long (because
>>> the database needs to keep a journal and merge it at the end)
>>> 2) Marmotta needs to check before creating a node or triple if this node
>>> or
>>> triple already exists, because you don't want to have duplicates
>>> 3) Marmotta needs to issue a single SQL command for every inserted triple
>>> (because of 2)
>>>
>>> 3) could be addressed as you say, but even the Java JDBC API offers "batch
>>> commands" that would improve performance, i.e. if you manage to run the
>>> same statement in a sequence many times, the performance will be greatly
>>> optimized. Unfortunately, I was not able to do this because I don't have a
>>> good solution for 2). 3) depends on 2) because for every inserted triple I
>>> need to check if the nodes already exist, so there will be select
>>> statements before the insert statements.
>>>
>>> 2) is a really tricky issue, because the check is needed to ensure data
>>> integrity. I have been thinking about different options here. Keep in mind
>>> that two tables are affected (triples and nodes) and both need to be
>>> handled in a different way:
>>> - if you know that the *triples* do not yet exist (e.g. empty database or
>>> the user assures that they do not exist) you can avoid the check for
>>> triple
>>> existance, but the node check is still needed because several triples
>>> might
>>> refer to the same node
>>> - if the dataset is reasonably small, you can implement the node check
>>> using an in-memory hashtable, which would be very fast; unfortunately you
>>> don't know this in advance, and once a node exists the Marmotta caching
>>> backends anyways takes care of it as long as Marmotta has memory, so the
>>> expensive part is checking for non-existance rather than for existance
>>> - you could also implement a persistent hash map (like MapDB) to keep
>>> track
>>> of the node ids, but I doubt it would give you much benefit over the
>>> database lookup once the dataset is big
>>> Even if you implement this solution, you would need a two-pass import to
>>> achieve a bulk-load-behaviour in the database, because two tables are
>>> affected, i.e. in the first pass you would import only the nodes, and in
>>> the second pass only the triples.
>>>
>>> Another possibility is to relax the data integrity constraints a bit (e.g.
>>> allowing the same node to exist with different IDs), but I cannot foresee
>>> the consequences of such a choice - it is against the data model.
>>>
>>>
>>> 1) is easy to solve by putting Marmotta in some kind of "maintenance
>>> mode",
>>> i.e. when bulk importing there is an exclusive lock on the database for
>>> the
>>> import process. Another (similar) solution is to provide a separate
>>> command
>>> line tool for importing into a database while Marmotta is not running at
>>> all.
>>>
>>>
>>> The solution I was going to implement as a result of this thinking is as
>>> follows:
>>> - a separate command-line tool that accesses the database directly
>>> - when importing, all nodes and triples are first only created in-memory
>>> and stored in standard Java data structures (or in a temporary log on the
>>> file system)
>>> - when the import is finished, first all nodes are bulk-inserted and the
>>> Java objects get IDs
>>> - second, all triples are bulk-imported with the proper node ids
>>>
>>>
>>> If you want to try out different solutions, I'd be happy if this problem
>>> can be solved ;-)
>>>
>>>
>>> Greetings,
>>>
>>> Sebastian
>>>
>>>
>>> 2013/5/21 Raffaele Palmieri <ra...@gmail.com>
>>>
>>>   Hi to all,
>>>> I would propose a little a change to architecture of Importer Service.
>>>> Actually for every triple there are single SQL commands invoked from
>>>> SailConnectionBase that persist triple informations on DB. That's
>>>> probably
>>>> one of major causes of delay of import operation.
>>>> I thought a way to optimize that operation, building for example a csv,
>>>> tsv, or *sv file that the major part of RDBMS are able to import in an
>>>> optimized way.
>>>> For example, MySQL has Load Data Infile command, Postgresql has Copy
>>>> command, H2 has Insert into ... Select from Csvread.
>>>> I am checking if this modification is feasible; it surely will need a
>>>> specialization of sql dialect depending on used RDBMS.
>>>> What do you think about? would it have too much impacts?
>>>> Regards,
>>>> Raffaele.
>>>>
>>>>
>>>
>>
>


Re: Bulk load of triples on DB

Posted by Raffaele Palmieri <ra...@gmail.com>.
On 22 May 2013 15:04, Andy Seaborne <an...@apache.org> wrote:

> What is the current loading rate?
>

I tried a test with a graph of 661 nodes and 957 triples: it took about 18
sec. So, looking at the triples, the average rate is 18.8 ms per triple;
tested on Tomcat with a maximum size of 1.5 GB.


>
> The Jena SDB bulk loader may have some ideas you can draw on.  It bulk
> loads a chunk (typically 10K) of triples at a time uses DB temporary tables
> as working storage.  The database layout is a triples+nodes database
> layout.  SDB manipulates those tables in the DB to find new nodes to add to
> the node table and new triples to add to the triples table as single SQL
> operations.  The original designer may be around on dev@jena.a.o
>
>
This design looks interesting and seems to be a similar approach to my
idea; it could be investigated. In that case, can we think about using
Jena SDB in Marmotta?


>         Andy
>
>
Cheers,
Raffaele.


>
> On 22/05/13 09:00, Sebastian Schaffert wrote:
>
>> Hi Raffaele,
>>
>> thanks for your ideas. I have been spending a lot of time thinking on how
>> to improve the performance of bulk imports. There are currently several
>> reasons why a bulk import is slow:
>> 1) Marmotta uses (database) transactions to ensure a good behaviour and
>> consistent data in highly parallel environments; transactions, however,
>> introduce a big performance impact especially when they get long (because
>> the database needs to keep a journal and merge it at the end)
>> 2) Marmotta needs to check before creating a node or triple if this node
>> or
>> triple already exists, because you don't want to have duplicates
>> 3) Marmotta needs to issue a single SQL command for every inserted triple
>> (because of 2)
>>
>> 3) could be addressed as you say, but even the Java JDBC API offers "batch
>> commands" that would improve performance, i.e. if you manage to run the
>> same statement in a sequence many times, the performance will be greatly
>> optimized. Unfortunately, I was not able to do this because I don't have a
>> good solution for 2). 3) depends on 2) because for every inserted triple I
>> need to check if the nodes already exist, so there will be select
>> statements before the insert statements.
>>
>> 2) is a really tricky issue, because the check is needed to ensure data
>> integrity. I have been thinking about different options here. Keep in mind
>> that two tables are affected (triples and nodes) and both need to be
>> handled in a different way:
>> - if you know that the *triples* do not yet exist (e.g. empty database or
>> the user assures that they do not exist) you can avoid the check for
>> triple
>> existance, but the node check is still needed because several triples
>> might
>> refer to the same node
>> - if the dataset is reasonably small, you can implement the node check
>> using an in-memory hashtable, which would be very fast; unfortunately you
>> don't know this in advance, and once a node exists the Marmotta caching
>> backends anyways takes care of it as long as Marmotta has memory, so the
>> expensive part is checking for non-existance rather than for existance
>> - you could also implement a persistent hash map (like MapDB) to keep
>> track
>> of the node ids, but I doubt it would give you much benefit over the
>> database lookup once the dataset is big
>> Even if you implement this solution, you would need a two-pass import to
>> achieve a bulk-load-behaviour in the database, because two tables are
>> affected, i.e. in the first pass you would import only the nodes, and in
>> the second pass only the triples.
>>
>> Another possibility is to relax the data integrity constraints a bit (e.g.
>> allowing the same node to exist with different IDs), but I cannot foresee
>> the consequences of such a choice - it is against the data model.
>>
>>
>> 1) is easy to solve by putting Marmotta in some kind of "maintenance
>> mode",
>> i.e. when bulk importing there is an exclusive lock on the database for
>> the
>> import process. Another (similar) solution is to provide a separate
>> command
>> line tool for importing into a database while Marmotta is not running at
>> all.
>>
>>
>> The solution I was going to implement as a result of this thinking is as
>> follows:
>> - a separate command-line tool that accesses the database directly
>> - when importing, all nodes and triples are first only created in-memory
>> and stored in standard Java data structures (or in a temporary log on the
>> file system)
>> - when the import is finished, first all nodes are bulk-inserted and the
>> Java objects get IDs
>> - second, all triples are bulk-imported with the proper node ids
>>
>>
>> If you want to try out different solutions, I'd be happy if this problem
>> can be solved ;-)
>>
>>
>> Greetings,
>>
>> Sebastian
>>
>>
>> 2013/5/21 Raffaele Palmieri <ra...@gmail.com>
>>
>>  Hi to all,
>>> I would propose a little a change to architecture of Importer Service.
>>> Actually for every triple there are single SQL commands invoked from
>>> SailConnectionBase that persist triple informations on DB. That's
>>> probably
>>> one of major causes of delay of import operation.
>>> I thought a way to optimize that operation, building for example a csv,
>>> tsv, or *sv file that the major part of RDBMS are able to import in an
>>> optimized way.
>>> For example, MySQL has Load Data Infile command, Postgresql has Copy
>>> command, H2 has Insert into ... Select from Csvread.
>>> I am checking if this modification is feasible; it surely will need a
>>> specialization of sql dialect depending on used RDBMS.
>>> What do you think about? would it have too much impacts?
>>> Regards,
>>> Raffaele.
>>>
>>>
>>
>

Re: Bulk load of triples on DB

Posted by Andy Seaborne <an...@apache.org>.
What is the current loading rate?

The Jena SDB bulk loader may have some ideas you can draw on.  It bulk 
loads a chunk (typically 10K) of triples at a time, using DB temporary 
tables as working storage.  The database layout is a triples+nodes 
layout.  SDB manipulates those tables in the DB to find new 
nodes to add to the node table and new triples to add to the triples 
table as single SQL operations.  The original designer may be around on 
dev@jena.a.o
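
A rough illustration of that temp-table pattern (not SDB's actual SQL; the
table and column names here are made up): a chunk is staged in a temporary
table and then merged into the node and triple tables with set-based
statements.

import java.sql.Connection;
import java.sql.Statement;

class TempTableMerge {

    void mergeChunk(Connection con) throws Exception {
        try (Statement st = con.createStatement()) {
            // 1. stage the chunk (filled elsewhere with batched INSERTs, COPY or LOAD DATA)
            st.execute("CREATE TEMPORARY TABLE stage_triples "
                     + "(s VARCHAR(1024), p VARCHAR(1024), o VARCHAR(1024))");

            // 2. add the nodes that are not yet in the node table, as one set operation
            st.execute("INSERT INTO nodes (svalue) "
                     + "SELECT DISTINCT n.v FROM ("
                     + "  SELECT s AS v FROM stage_triples UNION"
                     + "  SELECT p FROM stage_triples UNION"
                     + "  SELECT o FROM stage_triples) n "
                     + "WHERE NOT EXISTS (SELECT 1 FROM nodes x WHERE x.svalue = n.v)");

            // 3. add the triples that are not yet in the triple table, again as one statement
            st.execute("INSERT INTO triples (subject, predicate, object) "
                     + "SELECT ns.id, np.id, nb.id FROM stage_triples t"
                     + "  JOIN nodes ns ON ns.svalue = t.s"
                     + "  JOIN nodes np ON np.svalue = t.p"
                     + "  JOIN nodes nb ON nb.svalue = t.o "
                     + "WHERE NOT EXISTS (SELECT 1 FROM triples tr WHERE tr.subject = ns.id"
                     + "  AND tr.predicate = np.id AND tr.object = nb.id)");

            st.execute("DROP TABLE stage_triples");
        }
    }
}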

	Andy

On 22/05/13 09:00, Sebastian Schaffert wrote:
> Hi Raffaele,
>
> thanks for your ideas. I have been spending a lot of time thinking on how
> to improve the performance of bulk imports. There are currently several
> reasons why a bulk import is slow:
> 1) Marmotta uses (database) transactions to ensure a good behaviour and
> consistent data in highly parallel environments; transactions, however,
> introduce a big performance impact especially when they get long (because
> the database needs to keep a journal and merge it at the end)
> 2) Marmotta needs to check before creating a node or triple if this node or
> triple already exists, because you don't want to have duplicates
> 3) Marmotta needs to issue a single SQL command for every inserted triple
> (because of 2)
>
> 3) could be addressed as you say, but even the Java JDBC API offers "batch
> commands" that would improve performance, i.e. if you manage to run the
> same statement in a sequence many times, the performance will be greatly
> optimized. Unfortunately, I was not able to do this because I don't have a
> good solution for 2). 3) depends on 2) because for every inserted triple I
> need to check if the nodes already exist, so there will be select
> statements before the insert statements.
>
> 2) is a really tricky issue, because the check is needed to ensure data
> integrity. I have been thinking about different options here. Keep in mind
> that two tables are affected (triples and nodes) and both need to be
> handled in a different way:
> - if you know that the *triples* do not yet exist (e.g. empty database or
> the user assures that they do not exist) you can avoid the check for triple
> existance, but the node check is still needed because several triples might
> refer to the same node
> - if the dataset is reasonably small, you can implement the node check
> using an in-memory hashtable, which would be very fast; unfortunately you
> don't know this in advance, and once a node exists the Marmotta caching
> backends anyways takes care of it as long as Marmotta has memory, so the
> expensive part is checking for non-existance rather than for existance
> - you could also implement a persistent hash map (like MapDB) to keep track
> of the node ids, but I doubt it would give you much benefit over the
> database lookup once the dataset is big
> Even if you implement this solution, you would need a two-pass import to
> achieve a bulk-load-behaviour in the database, because two tables are
> affected, i.e. in the first pass you would import only the nodes, and in
> the second pass only the triples.
>
> Another possibility is to relax the data integrity constraints a bit (e.g.
> allowing the same node to exist with different IDs), but I cannot foresee
> the consequences of such a choice - it is against the data model.
>
>
> 1) is easy to solve by putting Marmotta in some kind of "maintenance mode",
> i.e. when bulk importing there is an exclusive lock on the database for the
> import process. Another (similar) solution is to provide a separate command
> line tool for importing into a database while Marmotta is not running at
> all.
>
>
> The solution I was going to implement as a result of this thinking is as
> follows:
> - a separate command-line tool that accesses the database directly
> - when importing, all nodes and triples are first only created in-memory
> and stored in standard Java data structures (or in a temporary log on the
> file system)
> - when the import is finished, first all nodes are bulk-inserted and the
> Java objects get IDs
> - second, all triples are bulk-imported with the proper node ids
>
>
> If you want to try out different solutions, I'd be happy if this problem
> can be solved ;-)
>
>
> Greetings,
>
> Sebastian
>
>
> 2013/5/21 Raffaele Palmieri <ra...@gmail.com>
>
>> Hi to all,
>> I would propose a little a change to architecture of Importer Service.
>> Actually for every triple there are single SQL commands invoked from
>> SailConnectionBase that persist triple informations on DB. That's probably
>> one of major causes of delay of import operation.
>> I thought a way to optimize that operation, building for example a csv,
>> tsv, or *sv file that the major part of RDBMS are able to import in an
>> optimized way.
>> For example, MySQL has Load Data Infile command, Postgresql has Copy
>> command, H2 has Insert into ... Select from Csvread.
>> I am checking if this modification is feasible; it surely will need a
>> specialization of sql dialect depending on used RDBMS.
>> What do you think about? would it have too much impacts?
>> Regards,
>> Raffaele.
>>
>


Re: Bulk load of triples on DB

Posted by Sebastian Schaffert <se...@gmail.com>.
Hi Raffaele,

thanks for your ideas. I have been spending a lot of time thinking on how
to improve the performance of bulk imports. There are currently several
reasons why a bulk import is slow:
1) Marmotta uses (database) transactions to ensure a good behaviour and
consistent data in highly parallel environments; transactions, however,
introduce a big performance impact especially when they get long (because
the database needs to keep a journal and merge it at the end)
2) Marmotta needs to check before creating a node or triple if this node or
triple already exists, because you don't want to have duplicates
3) Marmotta needs to issue a single SQL command for every inserted triple
(because of 2)

3) could be addressed as you say, but even the Java JDBC API offers "batch
commands" that would improve performance, i.e. if you manage to run the
same statement in a sequence many times, the performance will be greatly
optimized. Unfortunately, I was not able to do this because I don't have a
good solution for 2). 3) depends on 2) because for every inserted triple I
need to check if the nodes already exist, so there will be select
statements before the insert statements.
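
For reference, the JDBC batching mentioned here looks roughly like this
(standard java.sql API; the table and the INSERT statement are placeholders
only):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;

class BatchInsertExample {

    void insertNodes(Connection con, List<String> values) throws Exception {
        try (PreparedStatement ps = con.prepareStatement(
                "INSERT INTO nodes (svalue) VALUES (?)")) {
            for (String v : values) {
                ps.setString(1, v);
                ps.addBatch();       // queue the statement instead of executing it
            }
            ps.executeBatch();       // send all queued statements in one round trip
        }
    }
}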

2) is a really tricky issue, because the check is needed to ensure data
integrity. I have been thinking about different options here. Keep in mind
that two tables are affected (triples and nodes) and both need to be
handled in a different way:
- if you know that the *triples* do not yet exist (e.g. empty database or
the user assures that they do not exist) you can avoid the check for triple
existence, but the node check is still needed because several triples might
refer to the same node
- if the dataset is reasonably small, you can implement the node check
using an in-memory hashtable, which would be very fast; unfortunately you
don't know this in advance, and once a node exists the Marmotta caching
backend anyway takes care of it as long as Marmotta has memory, so the
expensive part is checking for non-existence rather than for existence
- you could also implement a persistent hash map (like MapDB) to keep track
of the node ids, but I doubt it would give you much benefit over the
database lookup once the dataset is big
Even if you implement this solution, you would need a two-pass import to
achieve a bulk-load-behaviour in the database, because two tables are
affected, i.e. in the first pass you would import only the nodes, and in
the second pass only the triples.

Another possibility is to relax the data integrity constraints a bit (e.g.
allowing the same node to exist with different IDs), but I cannot foresee
the consequences of such a choice - it is against the data model.


1) is easy to solve by putting Marmotta in some kind of "maintenance mode",
i.e. when bulk importing there is an exclusive lock on the database for the
import process. Another (similar) solution is to provide a separate command
line tool for importing into a database while Marmotta is not running at
all.


The solution I was going to implement as a result of this thinking is as
follows:
- a separate command-line tool that accesses the database directly
- when importing, all nodes and triples are first only created in-memory
and stored in standard Java data structures (or in a temporary log on the
file system)
- when the import is finished, first all nodes are bulk-inserted and the
Java objects get IDs
- second, all triples are bulk-imported with the proper node ids
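
A minimal sketch of this two-pass approach (hypothetical code, simplified by
assuming an empty store so that IDs can be assigned up front; a real tool
would also flush the batches periodically instead of keeping everything
queued):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TwoPassImporter {

    /** maps node value -> database id, filled in pass 1 */
    private final Map<String, Long> nodeIds = new LinkedHashMap<String, Long>();

    void run(Connection con, List<String[]> triples) throws Exception {
        long nextId = 1;
        // pass 1: collect the distinct nodes and bulk-insert them
        try (PreparedStatement ps = con.prepareStatement(
                "INSERT INTO nodes (id, svalue) VALUES (?, ?)")) {
            for (String[] t : triples) {
                for (String value : t) {
                    if (!nodeIds.containsKey(value)) {
                        nodeIds.put(value, nextId);
                        ps.setLong(1, nextId);
                        ps.setString(2, value);
                        ps.addBatch();
                        nextId++;
                    }
                }
            }
            ps.executeBatch();
        }
        // pass 2: bulk-insert the triples using the node ids from pass 1
        try (PreparedStatement ps = con.prepareStatement(
                "INSERT INTO triples (subject, predicate, object) VALUES (?, ?, ?)")) {
            for (String[] t : triples) {
                ps.setLong(1, nodeIds.get(t[0]));
                ps.setLong(2, nodeIds.get(t[1]));
                ps.setLong(3, nodeIds.get(t[2]));
                ps.addBatch();
            }
            ps.executeBatch();
        }
    }
}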


If you want to try out different solutions, I'd be happy if this problem
can be solved ;-)


Greetings,

Sebastian


2013/5/21 Raffaele Palmieri <ra...@gmail.com>

> Hi to all,
> I would propose a little a change to architecture of Importer Service.
> Actually for every triple there are single SQL commands invoked from
> SailConnectionBase that persist triple informations on DB. That's probably
> one of major causes of delay of import operation.
> I thought a way to optimize that operation, building for example a csv,
> tsv, or *sv file that the major part of RDBMS are able to import in an
> optimized way.
> For example, MySQL has Load Data Infile command, Postgresql has Copy
> command, H2 has Insert into ... Select from Csvread.
> I am checking if this modification is feasible; it surely will need a
> specialization of sql dialect depending on used RDBMS.
> What do you think about? would it have too much impacts?
> Regards,
> Raffaele.
>