Posted to users@marmotta.apache.org by Adam Flinton <af...@ihtsdo.org> on 2014/07/03 15:39:45 UTC

Load times/back ends etc

Dear All@marmotta,

We have a fairly large dataset (1.7 GB in XML, 800 MB as TriG) which loads in
a few minutes into Jena using the tdbloader.

Loading into Marmotta (with the default H2 database backend) seems to take
many, many hours.

Would it be any quicker using the client library?

Are there any tips & tricks e.g. turning off things like versioning which
are not required for the initial load?

I am looking to use PostgreSQL with versioning, as this is what we need,
rather than Jena/TDB.


Adam Flinton

Re: Load times/back ends etc

Posted by Sebastian Schaffert <se...@gmail.com>.

Hi all,

For documentation purposes: the main reason why loading *over the platform*
is slow is that Marmotta needs to make sure all data stays consistent under
concurrent access (i.e. two clients trying to add triples for the same
resource, or even the same triple), which is the assumed default case for
all Web data. If you know in advance that no one else will access the
triple store (e.g. because you have shut Marmotta down), you can use the
KiWiLoader, as pointed out by Raffaele and Sergio. The KiWiLoader applies
many typical performance improvements for database bulk loading: dropping
indexes before the import, assuming no one else will create the same triple
or node during the import, and keeping large in-memory batches before
writing them out. With the KiWiLoader we have managed to import both DBPedia
and Freebase in reasonable time.
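
To illustrate the general idea (this is not the KiWiLoader's code, just a
minimal sketch of the same bulk-loading tricks), here is a small Python
example using the standard sqlite3 module. The table name "triples", the
index name "idx_spo" and the function bulk_load are made up for the example.
It assumes exclusive access to the database, which is exactly the
precondition mentioned above.

```python
import sqlite3
from itertools import islice


def bulk_load(conn, triples, batch_size=10_000):
    """Insert (subject, predicate, object) tuples with the index dropped.

    Assumes no other client writes to the database during the import.
    """
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS triples (s TEXT, p TEXT, o TEXT)")
    # Drop the index first so inserts do not pay for index maintenance.
    cur.execute("DROP INDEX IF EXISTS idx_spo")

    it = iter(triples)
    total = 0
    while True:
        # Collect a large batch in memory, then write it out in one call.
        batch = list(islice(it, batch_size))
        if not batch:
            break
        cur.executemany("INSERT INTO triples VALUES (?, ?, ?)", batch)
        total += len(batch)

    # Rebuild the index once, after all rows are in place.
    cur.execute("CREATE INDEX idx_spo ON triples (s, p, o)")
    conn.commit()
    return total


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    data = (
        (f"http://example.org/s{i}", "http://example.org/p", str(i))
        for i in range(25_000)
    )
    print(bulk_load(conn, data))
```

A loader that has to tolerate concurrent writers cannot take these
shortcuts, since it has to check for existing triples and keep the indexes
usable while it writes.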

For the KiWi backend, the actual invocation is java -jar
marmotta-loader-kiwi.jar (the documentation on the web page is a bit
oversimplified). To a somewhat limited extent, the loader also exists for
other backends like Titan, HBase and BerkeleyDB.

Greetings,

Sebastian

Re: Load times/back ends etc

Posted by Sergio Fernández <se...@salzburgresearch.at>.

Hi Adam,

On 03/07/14 15:39, Adam Flinton wrote:
> Loading into marmotta (with the default h2 db backend) seems to take many
> many hours.

H2 is just for demo purposes. For such an amount of data you have to switch
to a proper database; PostgreSQL is the recommended one.

> Would it be any quicker using the client library?

In addition, use a direct loader: http://marmotta.apache.org/kiwi/loader

> Are there any tips & tricks e.g. turning off things like versioning which
> are not required for the initial load?

Of course versioning has an impact on importing, but not a very significant
one. Leave it enabled if you need it.

Cheers,

-- 
Sergio Fernández
Senior Researcher
Knowledge and Media Technologies
Salzburg Research Forschungsgesellschaft mbH
Jakob-Haringer-Straße 5/3 | 5020 Salzburg, Austria
T: +43 662 2288 318 | M: +43 660 2747 925
sergio.fernandez@salzburgresearch.at
http://www.salzburgresearch.at

Re: Load times/back ends etc

Posted by Raffaele Palmieri <ra...@gmail.com>.

Hi Adam,

For bulk loading of huge graphs, there is a command line utility that you
can run. It greatly improves load performance.
The usage documentation is at this link:
http://marmotta.apache.org/kiwi/loader.html
If you use it, I suggest you do not run this process in parallel with other
processes or threads on the same triple store.

Cheers,
Raffaele.
