Posted to users@jena.apache.org by Laura Morales <la...@mail.com> on 2017/04/04 04:33:49 UTC

Jena + HDT

From rdfhdt.org [1]: "We provide Jena Integration to have a Jena Model on top of a given HDT file."

My question is... would this model offer any advantage over TDB for larger graphs? Or any advantage at all? I'm mostly interested in performance and scalability.

----
[1] http://www.rdfhdt.org/what-is-hdt/
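
(For context: a minimal sketch of what the hdt-jena integration looks like
from Java. The class names follow the rdfhdt.org examples and the file name
is a placeholder, so treat this as an illustration rather than code from
this thread.)

--hdt-jena sketch (Java)--
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;
import org.rdfhdt.hdtjena.HDTGraph;

public class HdtModelExample {
    public static void main(String[] args) throws Exception {
        // Map the HDT file (and its side-car index, if one exists) rather
        // than loading everything into the Java heap.
        HDT hdt = HDTManager.mapIndexedHDT("dataset.hdt", null);

        // Expose it as a (read-only) Jena Model.
        Model model = ModelFactory.createModelForGraph(new HDTGraph(hdt));
        System.out.println("Triples: " + model.size());

        hdt.close();
    }
}
--hdt-jena sketch (Java)--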

Re: Jena + HDT

Posted by Andy Seaborne <an...@apache.org>.
From what I know of HDT, it supports one access pattern very well 
(linked data fragments), and it is good for publishing large datasets on 
the web. If there is enough memory to hold some of the access 
structures it would be OK for SPARQL. Without all the indexes, SPARQL 
performance will degrade as the RAM/disk ratio gets worse, if the 
queries are not of the right shape.

Note that hdt-jena is LGPL3 licensed.

In terms of sheer speed, parsing RDF Binary is 800K triples/s (for me); 
N-Triples is 200K triples/s.


> a 30M triple NT file. Loading that into TDB would probably take hours.

So about 3M triples? Or is that the compressed size? (My rule of thumb 
is x8-x10 for N-Triples.)

The loading rate for TDB tdbloader should be about 30-50K triples/s (the 
rate drops with increasing size); faster for smaller datasets - 50-70K 
triples/s.

TDB loading is doing work "up front" so that any access pattern in a 
query is well served.

The easiest way to speed it up would be to remove some of the indexes. 
TDB internally can cope with any combination of indexes, if there is at 
least one (the primary).

If you are getting much slower "something is wrong".  Running on a 
laptop with a rotating disk will be slower, especially if you are using 
the machine at the same time.  Laptop SSDs aren't always that great 
(it's a cost thing about how they interface to the system).

It is faster to work with RDF Binary - pure parsing is 800K triples/s 
and it is fast to write. Files are large, like N-Triples, but they 
compress (gzip) well (x8-x10).

gzip compression gets some of the benefits of HDT - it finds common 
symbols and has a dictionary - but it does not have path access. It is 
universally available.
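
(As a rough illustration of that kind of round trip: file names here are 
placeholders, and RDF Binary is written via RIOT's Lang.RDFTHRIFT, which is 
an assumption about the API rather than something stated in this thread.)

--RDF Binary sketch (Java)--
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.GZIPOutputStream;

import org.apache.jena.rdf.model.Model;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFDataMgr;

public class RdfBinaryExample {
    public static void main(String[] args) throws Exception {
        // Parse N-Triples; RIOT picks the parser from the file extension.
        Model model = RDFDataMgr.loadModel("data.nt");

        // Write RDF Binary (RDF Thrift), gzipped to get back the x8-x10
        // compression mentioned above.
        try (OutputStream out = new GZIPOutputStream(
                Files.newOutputStream(Paths.get("data.rt.gz")))) {
            RDFDataMgr.write(out, model, Lang.RDFTHRIFT);
        }
        // The file can be parsed back with RDFDataMgr, passing Lang.RDFTHRIFT
        // explicitly if the extension is not recognised.
    }
}
--RDF Binary sketch (Java)--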

     Andy

TDB2 is faster to load (though not radically), especially when loading 
into an existing dataset.


On 04/04/17 10:56, Osma Suominen wrote:
> Hi,
>
> I have some experience using HDT with Jena. I think HDT is an amazing
> technology and I've so far been happy with the performance, but as Rob
> said, the use case matters a lot and benchmarking is recommended.
>
> In my case I have a conversion pipeline [1] that converts a set of MARC
> bibliographic records into a 30M triple NT file. Loading that into TDB
> would probably take hours. Instead I'm converting it to HDT using the
> hdt-cpp toolkit [2] (it's faster than the Java version and uses less
> memory) and create an index file alongside the main HDT file. The
> HDT+index files are a fraction of the size of the NT file (4GB NT file
> vs. less than 500MB for the HDT+index).
>
> I can then start up a version of Fuseki that exposes the data in the HDT
> file as a read-only SPARQL endpoint. In my experience, query performance
> is very reasonable, though I haven't benchmarked it against TDB. Since
> the HDT file + index are rather small, they will soon be held mostly in
> the disk cache, so although the technology is disk based, in practice
> the disk will not be used very much unless you are  extremely low on
> memory.
>
> Running the conversion from NT to HDT, creating the index file, and
> starting up Fuseki altogether take less than 5 minutes and SPARQL
> queries can then be run immediately. In that time the TDB loader would
> have barely started.
>
> As an alternative to Fuseki, SPARQL queries can be run directly on the
> HDT file using the hdtsparql command line tool from the hdt-jena toolkit.
>
> -Osma
>
> [1] https://github.com/NatLibFi/bib-rdf-pipeline
>
> [2] https://github.com/rdfhdt/hdt-cpp
>
>
>
> 04.04.2017, 12:27, Rob Vesse kirjoitti:
>> HDT is primarily on disk. Whether it is query-able depends on the
>> exact encoding, there is one encoding designed primarily for
>> transportation of data and another designed for querying called
>> HDT-FoQ aka focused on querying
>>
>> In either case, there will be some memory usage as they do perform
>> some caching. They may also take advantage of memory mapped files
>> similar to what TDB does.
>>
>>  As far as comparisons with TDB I have never done any myself. For
>> simplistic queries, I would expect that HDT performs ok since from
>> what I remember the indexing is suitable for simple scans. However,
>> for queries with any kind of complexity i.e. Filters, negations, Joins
>> etc. I would expect TDB to outperform it and will scale far better.
>>
>> But as I always point out on these kinds of questions your use case
>> will matter. If you think one solution will be better than the other
>> for your use case then you should benchmark that yourself. Generic
>> benchmarking will only tell you so much and give you a general
>> indication of comparative performance.
>>
>> Rob
>>
>> On 04/04/2017 07:03, "Lorenz B." <bu...@informatik.uni-leipzig.de>
>> wrote:
>>
>>     Well, I'm not that familiar with HDT, thus, I'm probably wrong. And I
>>     saw right now that they also provide some kind of indexing concept.
>>
>>     Let's wait for response from Andy and/or Rob.
>>
>>     (In the meantime, I'll play around with HDT and Jena today to get
>> some
>>     more insights. )
>>
>>     >> Jena HDT is in-memory, right?
>>     > Is it? I thought it was a on-disk, compressed, and query-able
>> list of quads...
>>     >
>>     --
>>     Lorenz Bühmann
>>     AKSW group, University of Leipzig
>>     Group: http://aksw.org - semantic web research center
>>
>>
>>
>>
>>
>>
>
>

Re: Jena + HDT

Posted by Osma Suominen <os...@helsinki.fi>.
Hi,

I made an apples-to-apples comparison using my bibliographic data set.

The starting point is an NT file with 30M triples (unfortunately not yet 
available to the public), gzipped into a 400MB file (uncompressed it 
would be 4GB). I used my i3-2330M laptop with 8GB RAM and an SSD.


Converting the dataset to HDT using rdf2hdt took 5 minutes and 15 
seconds. Top memory usage was about 1.3GB. During the conversion the 
rdf2hdt process used one CPU core at 100%, and the gzip process an 
additional 20-30% of another core. The resulting HDT file is 250MB.

Creating the index file using hdtSearch took a further 25 seconds. 
Memory usage was about 300MB with one CPU core at 100%. The index file 
is 160MB.

I ran an example query that calculates the top 20 subjects (the ones 
with the most works about them). The query is included below. I ran it a 
few times using hdtsparql and the execution time was 15.3-16.8 seconds.

Total wall clock time: 5:40 minutes
Total disk usage: 410MB
Fastest query: 15.3 seconds


Loading the same dataset into TDB using tdbloader2 took 11 minutes. CPU 
usage was 110-180% and top memory usage was 1.3GB for the Java process 
that does the initial loading. Then came the sort processes, which took 
300% CPU and used up to 3.5GB of memory. The resulting TDB directory 
size is 2.7GB.

The example query took 12.9-14.7 seconds.

Total wall clock time: 11 minutes
Total disk usage: 2.7GB
Fastest query: 12.9 seconds


To summarize, generating an HDT file with its index is about twice as 
fast as loading the data into TDB, and uses less memory and CPU. Disk 
usage is only 15% of what TDB uses. Query performance for this 
particular query is about 20% slower with HDT than with TDB.

-Osma


--example query--
PREFIX schema: <http://schema.org/>

SELECT ?ysoc (COUNT(DISTINCT ?w) AS ?count) WHERE {
   ?w schema:about ?ysoc .
   FILTER(STRSTARTS(STR(?ysoc), 'http://www.yso.fi/onto/yso/'))
}
GROUP BY ?ysoc
ORDER BY DESC(?count)
LIMIT 20
--example query--


04.04.2017, 13:29, Osma Suominen kirjoitti:
> 04.04.2017, 13:10, Dave Reynolds kirjoitti:
>> Not to detract from HDT in anyway but we routinely load 25M triple file
>> sets (Turtle) to TDB in around 10mins on modest cloud VMs and rather
>> faster on local desktops with modern SSDs. So HDT might still have some
>> load speed benefits but at that scale it is less than 2x and not hours
>> v.s. minutes.
>
> Right, sorry, I made a mistake in my estimate.
>
> With my modest laptop (i3-2330M, SSD), loading the Geonames dataset
> (173M triples NT file) into TDB using tdbloader2 takes about 70 minutes,
> so the loading rate is about 40k triples per second. The size of the
> resulting TDB is 16GB.
>
> -Osma
>


-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: Jena + HDT

Posted by Osma Suominen <os...@helsinki.fi>.
04.04.2017, 13:10, Dave Reynolds kirjoitti:
> Not to detract from HDT in anyway but we routinely load 25M triple file
> sets (Turtle) to TDB in around 10mins on modest cloud VMs and rather
> faster on local desktops with modern SSDs. So HDT might still have some
> load speed benefits but at that scale it is less than 2x and not hours
> v.s. minutes.

Right, sorry, I made a mistake in my estimate.

With my modest laptop (i3-2330M, SSD), loading the Geonames dataset 
(173M triples NT file) into TDB using tdbloader2 takes about 70 minutes, 
so the loading rate is about 40k triples per second. The size of the 
resulting TDB is 16GB.

-Osma

-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: Jena + HDT

Posted by Dave Reynolds <da...@gmail.com>.
Not to detract from HDT in any way, but we routinely load 25M triple 
file sets (Turtle) into TDB in around 10 minutes on modest cloud VMs, 
and rather faster on local desktops with modern SSDs. So HDT might still 
have some load-speed benefit, but at that scale it is less than 2x, and 
not hours vs. minutes.

Dave

On 04/04/17 10:56, Osma Suominen wrote:
> Hi,
>
> I have some experience using HDT with Jena. I think HDT is an amazing
> technology and I've so far been happy with the performance, but as Rob
> said, the use case matters a lot and benchmarking is recommended.
>
> In my case I have a conversion pipeline [1] that converts a set of MARC
> bibliographic records into a 30M triple NT file. Loading that into TDB
> would probably take hours. Instead I'm converting it to HDT using the
> hdt-cpp toolkit [2] (it's faster than the Java version and uses less
> memory) and create an index file alongside the main HDT file. The
> HDT+index files are a fraction of the size of the NT file (4GB NT file
> vs. less than 500MB for the HDT+index).
>
> I can then start up a version of Fuseki that exposes the data in the HDT
> file as a read-only SPARQL endpoint. In my experience, query performance
> is very reasonable, though I haven't benchmarked it against TDB. Since
> the HDT file + index are rather small, they will soon be held mostly in
> the disk cache, so although the technology is disk based, in practice
> the disk will not be used very much unless you are  extremely low on
> memory.
>
> Running the conversion from NT to HDT, creating the index file, and
> starting up Fuseki altogether take less than 5 minutes and SPARQL
> queries can then be run immediately. In that time the TDB loader would
> have barely started.
>
> As an alternative to Fuseki, SPARQL queries can be run directly on the
> HDT file using the hdtsparql command line tool from the hdt-jena toolkit.
>
> -Osma
>
> [1] https://github.com/NatLibFi/bib-rdf-pipeline
>
> [2] https://github.com/rdfhdt/hdt-cpp
>
>
>
> 04.04.2017, 12:27, Rob Vesse kirjoitti:
>> HDT is primarily on disk. Whether it is query-able depends on the
>> exact encoding, there is one encoding designed primarily for
>> transportation of data and another designed for querying called
>> HDT-FoQ aka focused on querying
>>
>> In either case, there will be some memory usage as they do perform
>> some caching. They may also take advantage of memory mapped files
>> similar to what TDB does.
>>
>>  As far as comparisons with TDB I have never done any myself. For
>> simplistic queries, I would expect that HDT performs ok since from
>> what I remember the indexing is suitable for simple scans. However,
>> for queries with any kind of complexity i.e. Filters, negations, Joins
>> etc. I would expect TDB to outperform it and will scale far better.
>>
>> But as I always point out on these kinds of questions your use case
>> will matter. If you think one solution will be better than the other
>> for your use case then you should benchmark that yourself. Generic
>> benchmarking will only tell you so much and give you a general
>> indication of comparative performance.
>>
>> Rob
>>
>> On 04/04/2017 07:03, "Lorenz B." <bu...@informatik.uni-leipzig.de>
>> wrote:
>>
>>     Well, I'm not that familiar with HDT, thus, I'm probably wrong. And I
>>     saw right now that they also provide some kind of indexing concept.
>>
>>     Let's wait for response from Andy and/or Rob.
>>
>>     (In the meantime, I'll play around with HDT and Jena today to get
>> some
>>     more insights. )
>>
>>     >> Jena HDT is in-memory, right?
>>     > Is it? I thought it was a on-disk, compressed, and query-able
>> list of quads...
>>     >
>>     --
>>     Lorenz Bühmann
>>     AKSW group, University of Leipzig
>>     Group: http://aksw.org - semantic web research center
>>
>>
>>
>>
>>
>>
>
>

Re: Jena + HDT

Posted by Osma Suominen <os...@helsinki.fi>.
Hi,

I have some experience using HDT with Jena. I think HDT is an amazing 
technology and I've so far been happy with the performance, but as Rob 
said, the use case matters a lot and benchmarking is recommended.

In my case I have a conversion pipeline [1] that converts a set of MARC 
bibliographic records into a 30M triple NT file. Loading that into TDB 
would probably take hours. Instead I'm converting it to HDT using the 
hdt-cpp toolkit [2] (it's faster than the Java version and uses less 
memory) and creating an index file alongside the main HDT file. The 
HDT+index files are a fraction of the size of the NT file (4GB NT file 
vs. less than 500MB for the HDT+index).

I can then start up a version of Fuseki that exposes the data in the HDT 
file as a read-only SPARQL endpoint. In my experience, query performance 
is very reasonable, though I haven't benchmarked it against TDB. Since 
the HDT file + index are rather small, they will soon be held mostly in 
the disk cache, so although the technology is disk-based, in practice 
the disk will not be used very much unless you are extremely low on memory.

Running the conversion from NT to HDT, creating the index file, and 
starting up Fuseki altogether take less than 5 minutes and SPARQL 
queries can then be run immediately. In that time the TDB loader would 
have barely started.

As an alternative to Fuseki, SPARQL queries can be run directly on the 
HDT file using the hdtsparql command line tool from the hdt-jena toolkit.
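
(A rough programmatic equivalent of that: open the HDT file as a Jena Model 
and run a query through ARQ. The file name and query are placeholders, and 
the HDTManager/HDTGraph class names are assumed from the rdfhdt.org examples.)

--hdtsparql-style sketch (Java)--
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ResultSetFormatter;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;
import org.rdfhdt.hdtjena.HDTGraph;

public class HdtSparqlExample {
    public static void main(String[] args) throws Exception {
        // Map the HDT file; the side-car index is used if it is present.
        HDT hdt = HDTManager.mapIndexedHDT("data.hdt", null);
        Model model = ModelFactory.createModelForGraph(new HDTGraph(hdt));

        String query = "SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }";
        try (QueryExecution qe = QueryExecutionFactory.create(query, model)) {
            // Print the result table to stdout, much like hdtsparql does.
            ResultSetFormatter.out(System.out, qe.execSelect());
        }
        hdt.close();
    }
}
--hdtsparql-style sketch (Java)--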

-Osma

[1] https://github.com/NatLibFi/bib-rdf-pipeline

[2] https://github.com/rdfhdt/hdt-cpp



04.04.2017, 12:27, Rob Vesse kirjoitti:
> HDT is primarily on disk. Whether it is query-able depends on the exact encoding, there is one encoding designed primarily for transportation of data and another designed for querying called HDT-FoQ aka focused on querying
>
> In either case, there will be some memory usage as they do perform some caching. They may also take advantage of memory mapped files similar to what TDB does.
>
>  As far as comparisons with TDB I have never done any myself. For simplistic queries, I would expect that HDT performs ok since from what I remember the indexing is suitable for simple scans. However, for queries with any kind of complexity i.e. Filters, negations, Joins etc. I would expect TDB to outperform it and will scale far better.
>
> But as I always point out on these kinds of questions your use case will matter. If you think one solution will be better than the other for your use case then you should benchmark that yourself. Generic benchmarking will only tell you so much and give you a general indication of comparative performance.
>
> Rob
>
> On 04/04/2017 07:03, "Lorenz B." <bu...@informatik.uni-leipzig.de> wrote:
>
>     Well, I'm not that familiar with HDT, thus, I'm probably wrong. And I
>     saw right now that they also provide some kind of indexing concept.
>
>     Let's wait for response from Andy and/or Rob.
>
>     (In the meantime, I'll play around with HDT and Jena today to get some
>     more insights. )
>
>     >> Jena HDT is in-memory, right?
>     > Is it? I thought it was a on-disk, compressed, and query-able list of quads...
>     >
>     --
>     Lorenz Bühmann
>     AKSW group, University of Leipzig
>     Group: http://aksw.org - semantic web research center
>
>
>
>
>
>


-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: Jena + HDT

Posted by Rob Vesse <rv...@dotnetrdf.org>.
HDT is primarily on disk. Whether it is queryable depends on the exact encoding: there is one encoding designed primarily for transporting data, and another designed for querying, called HDT-FoQ (aka "Focused on Querying").

In either case, there will be some memory usage as they do perform some caching. They may also take advantage of memory-mapped files, similar to what TDB does.

As far as comparisons with TDB go, I have never done any myself. For simple queries, I would expect HDT to perform OK since, from what I remember, the indexing is suitable for simple scans. However, for queries with any kind of complexity (i.e. filters, negation, joins, etc.) I would expect TDB to outperform it and scale far better.

But as I always point out on these kinds of questions your use case will matter. If you think one solution will be better than the other for your use case then you should benchmark that yourself. Generic benchmarking will only tell you so much and give you a general indication of comparative performance.

Rob

On 04/04/2017 07:03, "Lorenz B." <bu...@informatik.uni-leipzig.de> wrote:

    Well, I'm not that familiar with HDT, thus, I'm probably wrong. And I
    saw right now that they also provide some kind of indexing concept.
    
    Let's wait for response from Andy and/or Rob.
    
    (In the meantime, I'll play around with HDT and Jena today to get some
    more insights. )
    
    >> Jena HDT is in-memory, right?
    > Is it? I thought it was a on-disk, compressed, and query-able list of quads...
    >
    -- 
    Lorenz Bühmann
    AKSW group, University of Leipzig
    Group: http://aksw.org - semantic web research center
    
    





Re: Jena + HDT

Posted by "Lorenz B." <bu...@informatik.uni-leipzig.de>.
Well, I'm not that familiar with HDT, so I'm probably wrong. And I
saw just now that they also provide some kind of indexing concept.

Let's wait for a response from Andy and/or Rob.

(In the meantime, I'll play around with HDT and Jena today to get some
more insights.)

>> Jena HDT is in-memory, right?
> Is it? I thought it was a on-disk, compressed, and query-able list of quads...
>
-- 
Lorenz Bühmann
AKSW group, University of Leipzig
Group: http://aksw.org - semantic web research center


Re: Jena + HDT

Posted by Laura Morales <la...@mail.com>.
> Jena HDT is in-memory, right?

Is it? I thought it was an on-disk, compressed, and queryable list of quads...

Re: Jena + HDT

Posted by "Lorenz B." <bu...@informatik.uni-leipzig.de>.
Jena HDT is in-memory, right? So at first you would need enough memory
to load the data. I think this would also mean that there are no indexes
created, and thus the query optimizer works differently.

For sure Andy S. and I guess also Rob V. (as far as I know he worked
with HDT) can probably say more. I couldn't find benchmarks right now.
But feel free to do some experiments and publish the results here - I'm
always interested in such things.
> From rdfhdt.org [1]: "We provide Jena Integration to have a Jena Model on top of a given HDT file."
>
> My question is... would this model offer any advantage over TDB for larger graphs? Or any advantage at all? I'm mostly interested about performance and scalability issues.
>
> ----
> [1] http://www.rdfhdt.org/what-is-hdt/
>
-- 
Lorenz Bühmann
AKSW group, University of Leipzig
Group: http://aksw.org - semantic web research center