Posted to dev@marmotta.apache.org by Sebastian Schaffert <ss...@apache.org> on 2013/11/22 15:38:27 UTC

Bulk Loading: Introducing KiWiHandler and KiWiLoader

Dear all (especially Raffaele),

I have now finished implementing a bulk-loading API for quickly dumping
big RDF datasets into a KiWi/Marmotta triplestore. All code is located
in libraries/kiwi/kiwi-loader. There is a command line tool (implemented
mostly by Jakob) called KiWiLoader as well as an API implementation.
Since it is probably more relevant for the rest of you, I'll explain the
API implementation in the following:

Bulk loading is implemented as a Sesame RDFHandler, bypassing the whole
Repository and SAIL API to avoid additional overhead when importing. This
means that you can use the bulk-loading API with any Sesame component
that takes an RDFHandler, most importantly the RIO API. The code in
KiWiHandlerTest illustrates its use:

    KiWiHandler handler;
    if(dialect instanceof PostgreSQLDialect) {
        handler = new KiWiPostgresHandler(store, new KiWiLoaderConfiguration());
    } else if(dialect instanceof MySQLDialect) {
        handler = new KiWiMySQLHandler(store, new KiWiLoaderConfiguration());
    } else {
        handler = new KiWiHandler(store, new KiWiLoaderConfiguration());
    }

    // bulk import
    RDFParser parser = Rio.createParser(RDFFormat.RDFXML);
    parser.setRDFHandler(handler);
    parser.parse(in, baseUri);
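
Since the handler is just a plain RDFHandler, the same pattern works for
any input RIO can parse. As a minimal sketch (the file name and base URI
are placeholders; GZIPInputStream is standard Java, Rio/RDFFormat/RDFParser
come from Sesame), loading a gzip-compressed Turtle dump would look
roughly like this:

    // sketch only: placeholder file name and base URI;
    // uses java.io.FileInputStream, java.util.zip.GZIPInputStream
    // and org.openrdf.rio.{Rio, RDFFormat, RDFParser}
    InputStream turtleIn = new GZIPInputStream(new FileInputStream("dataset.ttl.gz"));
    try {
        RDFParser parser = Rio.createParser(RDFFormat.TURTLE);
        parser.setRDFHandler(handler);   // the same KiWiHandler as above
        parser.parse(turtleIn, "http://example.org/");
    } finally {
        turtleIn.close();
    }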

KiWiHandler implementations process statements in a streaming manner and
dump them directly into the database. Since, for performance reasons, this
is done without some of the checks implemented in the normal repository,
you should not run this process in parallel with other processes or
threads operating on the same triple store (I may implement database
locking or something similar to prevent this later).

As you can also see, there are specialized handler implementations for
PostgreSQL and MySQL. These make use of the bulk-loading constructs
offered by PostgreSQL (COPY IN) and MySQL (LOAD DATA LOCAL INFILE) and
are implemented by generating an in-memory CSV stream that is sent
directly to the database connection. They also disable the indexes
before the import and re-enable them once the import is finished. To the
best of my knowledge, this is the fastest we can get with the current
data model.
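
If you are curious what this looks like at the JDBC level, the general
idea is roughly the sketch below. This is not the actual handler code:
the table and column names are invented, and CopyManager/PGConnection
come from the PostgreSQL JDBC driver.

    // rough sketch: stream an in-memory CSV into PostgreSQL via COPY;
    // "nodes (id, uri)" is a made-up table, not the real KiWi schema.
    // Uses org.postgresql.PGConnection and org.postgresql.copy.CopyManager.
    Connection conn = DriverManager.getConnection(
            "jdbc:postgresql://localhost:5432/marmotta", "marmotta", "marmotta");
    CopyManager copy = conn.unwrap(PGConnection.class).getCopyAPI();
    String csv = "1,http://example.org/subject\n2,http://example.org/predicate\n";
    long rows = copy.copyIn("COPY nodes (id, uri) FROM STDIN WITH CSV",
            new StringReader(csv));
    conn.close();

The specialized handlers generate such a CSV stream on the fly from the
incoming statements, and in addition disable the indexes before the
import and re-enable them afterwards.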

I currently have a process running that imports the whole Freebase dump
into a PostgreSQL database backend. The server is powerful but uses an
ordinary hard disk (no SSD). So far it seems to work reliably: after 26
hours and 600 million triples, the average throughput is 6,000-7,000
triples/sec, with peaks of up to 30,000 triples/sec. Unlike other triple
stores such as BigData, throughput remains more or less constant even as
the size of the database increases, which shows the power of the
relational database backend. For your information, I have attached a
diagram showing some performance statistics over time.

Take these figures with a grain of salt: such statistics depend heavily
on how the import data is structured, e.g. whether it has high locality
(triples ordered by subject) and how large the literals are on average.
Freebase typically includes quite large literals (e.g. Wikipedia
abstracts).

I'd be glad if those of you with big datasets (Raffaele?) could play
around a bit with this implementation, especially for MySQL, which I did
not test extensively (just the unit test).

A typical call of KiWiLoader would be:

java -jar target/kiwi-loader-3.2.0-SNAPSHOT.one-jar.jar \
    -S /tmp/loader.png \
    -D postgresql \
    -d "jdbc:postgresql://localhost:5432/freebase?prepareThreshold=3" \
    -U marmotta -P marmotta \
    -z -f text/turtle \
    -i freebase-rdf-2013-11-10-00-00_fixed.gz

The option "-S" enables statistics sampling and creates diagrams similar
to the one I attached. "-z"/"-j" select gzip/bzip2 compressed input,
"-f" specifies input format, "-i" the input file, "-D" the database
dialect. Rest is self-explanatory ;-)
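
For MySQL, an analogous call could look roughly like the following;
treat it as a sketch: the database name, credentials and input file are
placeholders, and the dialect name passed to "-D" is assumed to be
"mysql", mirroring "postgresql" above:

java -jar target/kiwi-loader-3.2.0-SNAPSHOT.one-jar.jar \
    -S /tmp/loader-mysql.png \
    -D mysql \
    -d "jdbc:mysql://localhost:3306/marmotta" \
    -U marmotta -P marmotta \
    -z -f application/rdf+xml \
    -i dataset.rdf.gz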

My next step is to look again at querying (API and especially SPARQL) to
see how we can improve performance once such big datasets have been
loaded ;-)

Greetings,

Sebastian

-- 
Dr. Sebastian Schaffert
Chief Technology Officer
Redlink GmbH


Re: Bulk Loading: Introducing KiWiHandler and KiWiLoader

Posted by Raffaele Palmieri <ra...@gmail.com>.
Hi all,
after some fixes and tests I noticed that there is a problem related to
MARMOTTA-352 <https://issues.apache.org/jira/browse/MARMOTTA-352>, which is
attached to MARMOTTA-245.
From my tests with MySQL, the procedure starts well and after less than a
minute performs the first commit with the following statistics:
"imported 100,0 K triples; statistics: 2.036/sec, 2.314/sec (last min),
2.314/sec (last hour)"
I am using jamendo.rdf.gz (http://dbtune.org/jamendo/) for my tests.
After that it slows down, most likely because of very poor node cache hits;
I suspect that there is a problem with EhCache.
The statistics diagram is attached,
cheers,
Raffaele.



Re: Bulk Loading: Introducing KiWiHandler and KiWiLoader

Posted by Raffaele Palmieri <ra...@gmail.com>.
Dear all,
first of all, I'm happy that Marmotta is now a top-level Apache project;
congratulations to all of you who worked very hard over the last months.
For Sebastian: I can certainly test the MySQL implementation. From your
tests I gather that the relational database backend is better for storage
than the BigData triple store, so the first part of issue MARMOTTA-245
could be closed, at least for the storage part using the databases' native
bulk-import methods, once the tests with the supported DBs are positive.
I don't see any attached performance diagram; is it elsewhere?
Best,
Raffaele.



Re: Bulk Loading: Introducing KiWiHandler and KiWiLoader

Posted by Sergio Fernández <se...@salzburgresearch.at>.

On 22/11/13 15:38, Sebastian Schaffert wrote:
> I have now finished implementing a bulk-loading API for quickly dumping
> big RDF datasets into a KiWi/Marmotta triplestore. All code is located
> in libraries/kiwi/kiwi-loader. There is a command line tool (implemented
> mostly by Jakob) called KiWiLoader as well as an API implementation.
> Since it is probably more relevant for the rest of you, I'll explain in
> the following the API implementation:
>
> (...)
>
> A typical call of KiWiLoader would be:
>
> java -jar target/kiwi-loader-3.2.0-SNAPSHOT.one-jar.jar
> -S /tmp/loader.png -D postgresql -d
> jdbc:postgresql://localhost:5432/freebase?prepareThreshold=3 -U marmotta
> -P marmotta -z -f text/turtle -i freebase-rdf-2013-11-10-00-00_fixed.gz
>
> The option "-S" enables statistics sampling and creates diagrams similar
> to the one I attached. "-z"/"-j" select gzip/bzip2 compressed input,
> "-f" specifies input format, "-i" the input file, "-D" the database
> dialect. Rest is self-explanatory ;-)

It'd be nice to keep the documentation updated with these latest changes:

http://wiki.apache.org/marmotta/ImportData#Import_data_directly_to_the_KiWi_triple_store

;-)

-- 
Sergio Fernández
Senior Researcher
Knowledge and Media Technologies
Salzburg Research Forschungsgesellschaft mbH
Jakob-Haringer-Straße 5/3 | 5020 Salzburg, Austria
T: +43 662 2288 318 | M: +43 660 2747 925
sergio.fernandez@salzburgresearch.at
http://www.salzburgresearch.at