Posted to users@jena.apache.org by Lorenz Buehmann <bu...@informatik.uni-leipzig.de> on 2022/08/26 14:03:09 UTC
TDB2 bulk loader - multiple files into different graph per file
Hi all,
is there any option to use TDB2 bulk loader (tdb2.xloader or just
tdb2.loader) to load multiple files into multiple different named
graphs? Like
tdb2.loader --loc ./tdb2/dataset --graph <g1> file1 --graph <g2> file2 ...
I'm asking because I thought the initial loading is way faster than
iterating over multiple (graph, file) pairs and running the TDB2 loader
for each pair?
By the way, the tdb2.xloader has no option for named graphs at all?
Cheers,
Lorenz
Re: TDB2 bulk loader - multiple files into different graph per file
Posted by Andy Seaborne <an...@apache.org>.
On 26/08/2022 19:50, Dan Brickley wrote:
> On Fri, 26 Aug 2022 at 16:27, Andy Seaborne <an...@apache.org> wrote:
>
>>
>>
>> On 26/08/2022 15:03, Lorenz Buehmann wrote:
>>> I'm asking because I thought the initial loading is way faster than
>>> iterating over multiple (graph, file) pairs and running the TDB2 loader
>>> for each pair?
>>
>> Yes. It is faster when loading from empty in a single run of a loader.
>>
>> The loaders do some straight-to-index work which makes proper
>> transactions impossible, and so if a load has a parse error, a bypass of
>> transactions would, at best, break the database with half a load, or, at
>> worst, break the database.
>
>
> Is it possible to load into new and dedicated named graphs so that such
> partial loads could be easily cleaned up / reverted? Or the corruption is
> deeper in the underlying data structures (index etc.)?
What sort of errors are you thinking of?
Loaders are one step of the pipeline for getting data from some 3rd party
and into the database. Their role is to get data in as fast as possible within
the hardware constraints.
A syntax error will be detected by the parser, and when the parser
aborts the whole load aborts. Bulk loading is multiphase - load triples
to get a node table, the primary index (SPO, GSPO), then build the other
indexes. It is faster this way - and can have parallelism. Several
loaders have various degrees of parallelism.
If it aborts, there is, at best, a partial SPO table, no other indexes.
The rest of the system assumes a valid database.
Syntax errors should be caught by checking first with 'riot' if you
can't trust the source.
The single-threaded loaders are transactional and will abort the load
transaction. No data loaded, database is in the state as when the load
started. They also work on already-existing databases.
Schema validation (SHACL, ShEx) works on valid RDF, and all loaders will
work. The loaders "only" need syntactically valid RDF.
Schema fixup comes later.
Andy
>
> Dan
>
>
>> Andy
>>
>
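The advice above (check untrusted input with riot first, then bulk load in a single run) could be scripted roughly as follows. This is a minimal Python sketch, not part of Jena: the function name validate_then_load and its injectable run parameter are made up for illustration, and it assumes that riot and tdb2.tdbloader are on the PATH and that riot --validate exits with a non-zero status on a syntax error.

```python
import subprocess


def validate_then_load(files, db_dir, run=subprocess.run):
    """Parse-check every file with riot, then bulk load them in a single run.

    `run` is injectable so the command sequence can be inspected without
    Jena installed; by default the real commands are executed.
    """
    for path in files:
        # Assumption: riot exits non-zero on a syntax error, so check=True
        # raises CalledProcessError before the loader touches the database.
        run(["riot", "--validate", path], check=True)
    # One loader invocation for all files, since loading from empty in a
    # single run is the fast path.
    run(["tdb2.tdbloader", "--loc", db_dir, *files], check=True)
```

Checking first costs an extra parse of every file, which is the price of never starting a non-transactional load on input that cannot be parsed.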
Re: TDB2 bulk loader - multiple files into different graph per file
Posted by Dan Brickley <da...@danbri.org>.
On Fri, 26 Aug 2022 at 16:27, Andy Seaborne <an...@apache.org> wrote:
>
>
> On 26/08/2022 15:03, Lorenz Buehmann wrote:
> > I'm asking because I thought the initial loading is way faster than
> > iterating over multiple (graph, file) pairs and running the TDB2 loader
> > for each pair?
>
> Yes. It is faster when loading from empty in a single run of a loader.
>
> The loaders do some straight-to-index work which makes proper
> transactions impossible, and so if a load has a parse error, a bypass of
> transactions would, at best, break the database with half a load, or, at
> worst, break the database.
Is it possible to load into new and dedicated named graphs so that such
partial loads could be easily cleaned up / reverted? Or is the corruption
deeper in the underlying data structures (index etc.)?
Dan
> Andy
>
Re: TDB2 bulk loader - multiple files into different graph per file
Posted by Andy Seaborne <an...@apache.org>.
On 26/08/2022 15:03, Lorenz Buehmann wrote:
> I'm asking because I thought the initial loading is way faster than
> iterating over multiple (graph, file) pairs and running the TDB2 loader
> for each pair?
Yes. It is faster when loading from empty in a single run of a loader.
The loaders do some straight-to-index work which makes proper
transactions impossible, and so if a load has a parse error, a bypass of
transactions would, at best, break the database with half a load, or, at
worst, break the database.
Andy
Re: Re: TDB2 bulk loader - multiple files into different graph per file
Posted by Martynas Jusevičius <ma...@atomgraph.com>.
On Sun, Aug 28, 2022 at 11:00 AM Lorenz Buehmann
<bu...@informatik.uni-leipzig.de> wrote:
>
> Hi Andy,
>
> thanks for fast response.
>
> I see - the only drawback with wrapping the streams into TriG is when we
> have Turtle syntax files (or let's say any non-N-Triples format) - afaik,
> prefixes aren't allowed inside graphs, i.e. at that point you're lost.
> What I did now is to pipe those files into riot first, which then
> generates N-Triples, which can then be wrapped in TriG graphs. Indeed, we
> have the riot overhead here, i.e. the data is parsed twice. Still faster
> though than loading graphs in separate TDB loader calls, so I guess I
> can live with this.
I had a similar question a few years ago, and Claus responded:
https://stackoverflow.com/questions/63467067/converting-rdf-triples-to-quads-from-command-line/63716278
>
> Having a follow up question:
>
> I could see a huge difference between read compressed (Bzip) vs
> uncompressed file:
>
> I put the output until the triples have been loaded here, as the index
> creation should not be affected by the compression:
>
>
> # uncompressed with tdb2.tdbloader
>
> 14:24:40 INFO loader :: Add: 163,000,000
> river_planet-latest.osm.pbf.ttl (Batch: 144,320 / Avg: 140,230)
> 14:24:42 INFO loader :: Finished:
> output/river_planet-latest.osm.pbf.ttl: 163,310,838 tuples in 1165.30s
> (Avg: 140,145)
>
>
> # compressed with tdb2.tdbloader
>
> 17:37:37 INFO loader :: Add: 163,000,000
> river_planet-latest.osm.pbf.ttl.bz2 (Batch: 19,424 / Avg: 16,050)
> 17:37:40 INFO loader :: Finished:
> output/river_planet-latest.osm.pbf.ttl.bz2: 163,310,838 tuples in
> 10158.16s (Avg: 16,076)
>
>
> So loading the compressed file is ~9x slower than the uncompressed one.
> Can we consider this as expected? Note, here we have a geospatial
> dataset with millions of geometry literals. Not sure if this is also
> something that makes things worse.
>
> What are your experiences with loading compressed vs uncompressed data?
>
>
> Cheers,
>
> Lorenz
>
>
> On 26.08.22 17:02, Andy Seaborne wrote:
> > Hi Lorenz,
> >
> > No - there isn't an option.
> >
> > The way to do it is to prepare the load as quads by, for example,
> > wrapping in TriG syntax around the files or adding the G to N-triples.
> >
> > This can be done streaming and piped into the loader (with --syntax=
> > if not N-quads).
> >
> > > By the way, the tdb2.xloader has no option for named graphs at all?
> >
> > The input needs to be prepared as quads.
> >
> > Andy
> >
> > On 26/08/2022 15:03, Lorenz Buehmann wrote:
> >> Hi all,
> >>
> >> is there any option to use TDB2 bulk loader (tdb2.xloader or just
> >> tdb2.loader) to load multiple files into multiple different named
> >> graphs? Like
> >>
> >> tdb2.loader --loc ./tdb2/dataset --graph <g1> file1 --graph <g2>
> >> file2 ...
> >>
> >> I'm asking because I thought the initial loading is way faster than
> >> iterating over multiple (graph, file) pairs and running the TDB2
> >> loader for each pair?
> >>
> >>
> >> By the way, the tdb2.xloader has no option for named graphs at all?
> >>
> >>
> >> Cheers,
> >>
> >> Lorenz
> >>
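The "adding the G to N-triples" route mentioned in the quoted reply can be sketched as a small stream filter. This is an illustrative Python sketch, not a Jena tool: the function name to_nquads and the example IRIs are hypothetical, and it assumes one complete triple per line with no trailing comment after the final dot (as a serializer normally writes N-Triples).

```python
def to_nquads(lines, graph_iri):
    """Yield N-Quads lines by adding <graph_iri> as the fourth term of each triple."""
    for line in lines:
        stripped = line.rstrip()
        # Blank lines and comment lines carry no triple; pass them through.
        if not stripped or stripped.startswith("#"):
            yield stripped
            continue
        if not stripped.endswith("."):
            raise ValueError(f"not a complete N-Triples statement: {stripped!r}")
        # Drop the terminating '.', then re-terminate after the graph term.
        body = stripped[:-1].rstrip()
        yield f"{body} <{graph_iri}> ."


# Small demonstration with made-up data.
sample = ['<http://example/s> <http://example/p> "a . b" .\n']
for quad in to_nquads(sample, "http://example/g1"):
    print(quad)
```

Because it works line by line, the filter can sit in a pipe between a converter that emits N-Triples and the loader, one invocation per (graph, file) pair, with all outputs concatenated into a single quad stream.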
Re: Re: TDB2 bulk loader - multiple files into different graph per file
Posted by Lorenz Buehmann <bu...@informatik.uni-leipzig.de>.
Yep, I already realized that I forgot to mention the hardware and other details:
- file size compressed: 5.9G
- file size uncompressed: 23G
- Server:
  - AMD EPYC 7443P 24-Core Processor
  - 256GB RAM
  - 4 x 8TB SSD Samsung_SSD_870 as a ZFS raid, i.e. ~30TB
- Jena version (latest release, 4.6.0):
TDB2: VERSION: 4.6.0
TDB2: BUILD_DATE: 2022-08-20T08:22:47Z
- TDB2 loader is the default one, i.e. it should be 'phased'?
- I reran the loader, phased vs parallel, on compressed vs uncompressed data:
https://gist.github.com/LorenzBuehmann/27f232a1fd2c2a95600115b18958458b
-> compressed one degrades immediately to an avg of 16,000/s vs
140,000/s on the uncompressed data - looks horrible
And yes, I also tend to decompress via an OS tool before loading.
On 28.08.22 13:55, Andy Seaborne wrote:
>
>
> On 28/08/2022 09:58, Lorenz Buehmann wrote:
>> Hi Andy,
>>
>> thanks for fast response.
>>
>> I see - the only drawback with wrapping the streams into TriG is when
>> we have Turtle syntax files (or let's say any non-N-Triples format) -
>> afaik, prefixes aren't allowed inside graphs, i.e. at that point
>> you're lost.
>>
>> What I did now is to pipe those files into riot first, which then
>> generates N-Triples, which can then be wrapped in TriG graphs. Indeed,
>> we have the riot overhead here, i.e. the data is parsed twice. Still
>> faster though than loading graphs in separate TDB loader calls, so I
>> guess I can live with this.
>
>
> Exercise in text processing :-)
>
> Spit out the prefixes into a separate TTL file (grep!) and load that
> file as well.
>
>>
>> Having a follow up question:
>>
>> I could see a huge difference between read compressed (Bzip) vs
>> uncompressed file:
>>
>> I put the output until the triples have been loaded here, as the index
>> creation should not be affected by the compression:
>>
>>
>> # uncompressed with tdb2.tdbloader
>
> Which loader?
> And what hardware?
>
> (--loader=parallel may not make much of a difference at 100m)
>
>
>> 14:24:40 INFO loader :: Add: 163,000,000
>> river_planet-latest.osm.pbf.ttl (Batch: 144,320 / Avg: 140,230)
>> 14:24:42 INFO loader :: Finished:
>> output/river_planet-latest.osm.pbf.ttl: 163,310,838 tuples in
>> 1165.30s (Avg: 140,145)
>>
>>
>> # compressed with tdb2.tdbloader
>>
>> 17:37:37 INFO loader :: Add: 163,000,000
>> river_planet-latest.osm.pbf.ttl.bz2 (Batch: 19,424 / Avg: 16,050)
>> 17:37:40 INFO loader :: Finished:
>> output/river_planet-latest.osm.pbf.ttl.bz2: 163,310,838 tuples in
>> 10158.16s (Avg: 16,076)
>
> That is bad!
> Was it consistently slow through the load?
>
> If you are relying on Jena to do the bz2 decompress, then it is using
> Commons Compress.
>
> gz is done (via Commons Compress) in native code. I use gz and if I
> get a bz2 file, I decompress it with OS tools.
>
>> So loading the compressed file is ~9x slower than the uncompressed one.
>> Can we consider this as expected? Note, here we have a geospatial
>> dataset with millions of geometry literals. Not sure if this is also
>> something that makes things worse.
>>
>> What are your experiences with loading compressed vs uncompressed data?
>
> bz2 is expensive - it focuses on max compression. Coupled with
> being Java (not so much the Java, as not being highly tuned
> decompression code) it could be a factor.
>
> Usually (gz) there is a slight slow down if using SSD as source. HDD
> can be either way.
>
> Andy
>
>>
>>
>> Cheers,
>>
>> Lorenz
>>
>>
>> On 26.08.22 17:02, Andy Seaborne wrote:
>>> Hi Lorenz,
>>>
>>> No - there isn't an option.
>>>
>>> The way to do it is to prepare the load as quads by, for example,
>>> wrapping in TriG syntax around the files or adding the G to N-triples.
>>>
>>> This can be done streaming and piped into the loader (with --syntax=
>>> if not N-quads).
>>>
>>> > By the way, the tdb2.xloader has no option for named graphs at all?
>>>
>>> The input needs to be prepared as quads.
>>>
>>> Andy
>>>
>>> On 26/08/2022 15:03, Lorenz Buehmann wrote:
>>>> Hi all,
>>>>
>>>> is there any option to use TDB2 bulk loader (tdb2.xloader or just
>>>> tdb2.loader) to load multiple files into multiple different named
>>>> graphs? Like
>>>>
>>>> tdb2.loader --loc ./tdb2/dataset --graph <g1> file1 --graph <g2>
>>>> file2 ...
>>>>
>>>> I'm asking because I thought the initial loading is way faster than
>>>> iterating over multiple (graph, file) pairs and running the TDB2
>>>> loader for each pair?
>>>>
>>>>
>>>> By the way, the tdb2.xloader has no option for named graphs at all?
>>>>
>>>>
>>>> Cheers,
>>>>
>>>> Lorenz
>>>>
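One possible reading of the "text processing" hint quoted above (pull the prefix declarations out, since they are not allowed inside a TriG graph block) is sketched below in Python. It is an illustration only: the function name wrap_turtle_as_trig is hypothetical, and it assumes each prefix or base declaration sits on its own line, that no multi-line literal contains a line that looks like a declaration, and that a base declaration, if present, applies to the whole file.

```python
import re

# Turtle-style (@prefix, @base) and SPARQL-style (PREFIX, BASE) declarations.
DIRECTIVE = re.compile(r"\s*(@prefix|@base|prefix|base)\s", re.IGNORECASE)


def wrap_turtle_as_trig(turtle_lines, graph_iri):
    """Return TriG text: declarations first, then the triples inside one graph block."""
    directives, body = [], []
    for line in turtle_lines:
        target = directives if DIRECTIVE.match(line) else body
        target.append(line.rstrip("\n"))
    return "\n".join(directives + [f"<{graph_iri}> {{"] + body + ["}"]) + "\n"


# Small demonstration with made-up data.
print(wrap_turtle_as_trig(
    ["@prefix ex: <http://example/> .\n", "ex:s ex:p ex:o .\n"],
    "http://example/g1",
))
```

This avoids the second parse that the riot-to-N-Triples detour needs, at the cost of relying on the layout assumptions above; for files of unknown layout the riot route is the safer one.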
Re: TDB2 bulk loader - multiple files into different graph per file
Posted by Andy Seaborne <an...@apache.org>.
On 29/08/2022 14:53, Simon Bin wrote:
> I was asked to try it on my system (samsung 970 evo+ nvme, intel
> 11850h), but I used a slightly smaller data set (river_europe); it is
> not quite as bad as on Lorenz' but the buffering would help
> nevertheless:
>
> main : river_europe-latest.osm.pbf.ttl.bz2 : 815.14 sec : 72,098,221 Triples : 88,449.21 per second : 0 errors : 10 warnings
> fix/bzip2 : river_europe-latest.osm.pbf.ttl.bz2 : 376.64 sec : 72,098,221 Triples : 191,424.76 per second : 0 errors : 10 warnings
> pbzip2 -dc river_europe-latest.osm.pbf.ttl.bz2 | : 155.24 sec : 72,098,221 Triples : 464,442.66 per second : 0 errors : 10 warnings
> river_europe-latest.osm.pbf.ttl : 136.92 sec : 72,098,221 Triples : 526,587.26 per second : 0 errors : 10 warnings
Are these two datasets (this dataset and
river_planet-latest.osm.pbf.ttl) publicly available?
Different datasets have different performance characteristics.
I'm not surprised BSBM is slower - it has a lot of large literals so
there is a lot of basic byte shifting.
I also tried on a laptop, which typically has slower buses. (I had a
hardware crash a couple of weeks ago so I don't have the desktop I was
using for comparison but, from memory, the 8yo desktop is faster for riot
parsing than the 1yo laptop.)
Andy
If you want excellent figures, use LUBM. It has a small node/triple ratio
(there are fewer bytes to shift) and high locality of URI use (better
memory cache usage). It is unrealistic for parsing and loading.
>
> Cheers,
>
> On Mon, 2022-08-29 at 13:09 +0200, Lorenz Buehmann wrote:
>> In addition I used the OS tool in a pipe:
>>
>> bunzip2 -c river_planet-latest.osm.pbf.ttl.bz2 | riot --time --count
>> --syntax "Turtle"
>>
>> Triples = 163,310,838
>> stdin : 717.78 sec : 163,310,838 Triples : 227,523.09 per
>> second : 0 errors : 31 warnings
>>
>>
>> unsurprisingly more or less exactly the time of decompression + the
>> parsing time of the uncompressed file - still way faster than the
>> Apache
>> Commons one, even with my suggested fix the OS variant is ~5min
>> faster
>>
>>
>> On 29.08.22 11:24, Lorenz Buehmann wrote:
>>> riot --time --count river_planet-latest.osm.pbf.ttl
>>>
>>> Triples = 163,310,838
>>> 351.00 sec : 163,310,838 Triples : 465,271.72 per second : 0 errors
>>> :
>>> 31 warnings
>>>
>>>
>>> riot --time --count river_planet-latest.osm.pbf.ttl.gz
>>>
>>> Triples = 163,310,838
>>> 431.74 sec : 163,310,838 Triples : 378,258.50 per second : 0 errors
>>> :
>>> 31 warnings
>>>
>>>
>>> riot --time --count river_planet-latest.osm.pbf.ttl.bz2
>>>
>>> Triples = 163,310,838
>>> 9,948.17 sec : 163,310,838 Triples : 16,416.17 per second : 0
>>> errors :
>>> 31 warnings
>>>
>>>
>>> Takes ages with Bzip2 ... there must be something going wrong ...
>>>
>>>
>>> We checked code and the Apache Commons Compress docs, a colleague
>>> spotted the hint at
>>> https://commons.apache.org/proper/commons-compress/examples.html#Buffering
>>>
>>> :
>>>
>>>> The stream classes all wrap around streams provided by the
>>>> calling
>>>> code and they work on them directly without any additional
>>>> buffering.
>>>> On the other hand most of them will benefit from buffering so it
>>>> is
>>>> highly recommended that users wrap their stream in
>>>> Buffered(In|Out)putStreams before using the Commons Compress API.
>>> we were curious about this statement, checked
>>> org.apache.jena.atlas.io.IO class and added one line in openFileEx
>>>
>>> in = new BufferedInputStream(in);
>>>
>>> which wraps the file stream before its passed to the decompressor
>>> streams
>>>
>>>
>>> Run again the parsing:
>>>
>>>
>>> riot --time --count river_planet-latest.osm.pbf.ttl.bz2 (Jena
>>> 4.7.0-SNAPSHOT fork with a BufferedInputStream wrapping the file
>>> stream in IO class)
>>>
>>> Triples = 163,310,838
>>> 1,004.68 sec : 163,310,838 Triples : 162,550.10 per second : 0
>>> errors
>>> : 31 warnings
>>>
>>>
>>> What do you think?
>>>
>>>
>>> On 28.08.22 14:22, Andy Seaborne wrote:
>>>>
>>>>>
>>>>> If you are relying on Jena to do the bz2 decompress, then it is
>>>>> using Commons Compress.
>>>>>
>>>>> gz is done (via Commons Compress) in native code. I use gz and
>>>>> if I
>>>>> get a bz2 file, I decompress it with OS tools.
>>>>
>>>> Could you try an experiment please?
>>>>
>>>> Run on the same hardware as the loader was run:
>>>>
>>>> riot --time --count river_planet-latest.osm.pbf.ttl
>>>> riot --time --count river_planet-latest.osm.pbf.ttl.bz2
>>>>
>>>> Andy
>>>>
>>>> gz vs plain: NVMe m2 SSD : Dell XPS 13 9310
>>>>
>>>> riot --time --count .../BSBM/bsbm-25m.nt.gz
>>>> Triples = 24,997,044
>>>> 118.02 sec : 24,997,044 Triples : 211,808.84 per second
>>>>
>>>> riot --time --count .../BSBM/bsbm-25m.nt
>>>> Triples = 24,997,044
>>>> 109.97 sec : 24,997,044 Triples : 227,314.05 per second
>>
>
Re: Re: Re: Re: Re: TDB2 bulk loader - multiple files into different graph per file
Posted by Lorenz Buehmann <bu...@informatik.uni-leipzig.de>.
I spotted an interesting difference in performance gap/gain when using a
smaller dataset for Europe:
On the server we have
- the ZFS raid with less powerful hard-disks, i.e. only SATA with 4 x
Samsung 870 QVO
- a 2TB NVMe mounted separately
On the ZFS raid:
with Jena 4.6.0:
Triples = 54,821,333
3,047.89 sec : 54,821,333 Triples : 17,986.64 per second : 0
errors : 10 warnings
with Jena 4.7.0 patched with the BufferedInputStream wrapper:
Triples = 54,821,333
308.05 sec : 54,821,333 Triples : 177,963.61 per second : 0
errors : 10 warnings
On the NVMe
with Jena 4.6.0:
Triples = 54,821,333
824.11 sec : 54,821,333 Triples : 66,521.62 per second : 0
errors : 10 warnings
with Jena 4.7.0 patched with the BufferedInputStream wrapper:
Triples = 54,821,333
303.07 sec : 54,821,333 Triples : 180,888.49 per second : 0
errors : 10 warnings
Observation:
- the difference on the ZFS raid is a factor of ~10
- on the NVMe disk it is "only" ~3x faster with the buffered stream
Looks like the bzip2 implementation of Apache Commons Compress is doing
lots of IO, which is why it suffers way more from not having the
buffered stream on the ZFS raid compared to the faster NVMe disk.
Nevertheless, it's always worth using the buffered stream.
On 29.08.22 15:53, Simon Bin wrote:
> I was asked to try it on my system (samsung 970 evo+ nvme, intel
> 11850h), but I used a slightly smaller data set (river_europe); it is
> not quite as bad as on Lorenz' but the buffering would help
> nevertheless:
>
> main : river_europe-latest.osm.pbf.ttl.bz2 : 815.14 sec : 72,098,221 Triples : 88,449.21 per second : 0 errors : 10 warnings
> fix/bzip2 : river_europe-latest.osm.pbf.ttl.bz2 : 376.64 sec : 72,098,221 Triples : 191,424.76 per second : 0 errors : 10 warnings
> pbzip2 -dc river_europe-latest.osm.pbf.ttl.bz2 | : 155.24 sec : 72,098,221 Triples : 464,442.66 per second : 0 errors : 10 warnings
> river_europe-latest.osm.pbf.ttl : 136.92 sec : 72,098,221 Triples : 526,587.26 per second : 0 errors : 10 warnings
>
> Cheers,
>
> On Mon, 2022-08-29 at 13:09 +0200, Lorenz Buehmann wrote:
>> In addition I used the OS tool in a pipe:
>>
>> bunzip2 -c river_planet-latest.osm.pbf.ttl.bz2 | riot --time --count
>> --syntax "Turtle"
>>
>> Triples = 163,310,838
>> stdin : 717.78 sec : 163,310,838 Triples : 227,523.09 per
>> second : 0 errors : 31 warnings
>>
>>
>> unsurprisingly more or less exactly the time of decompression + the
>> parsing time of the uncompressed file - still way faster than the
>> Apache
>> Commons one, even with my suggested fix the OS variant is ~5min
>> faster
>>
>>
>> On 29.08.22 11:24, Lorenz Buehmann wrote:
>>> riot --time --count river_planet-latest.osm.pbf.ttl
>>>
>>> Triples = 163,310,838
>>> 351.00 sec : 163,310,838 Triples : 465,271.72 per second : 0 errors
>>> :
>>> 31 warnings
>>>
>>>
>>> riot --time --count river_planet-latest.osm.pbf.ttl.gz
>>>
>>> Triples = 163,310,838
>>> 431.74 sec : 163,310,838 Triples : 378,258.50 per second : 0 errors
>>> :
>>> 31 warnings
>>>
>>>
>>> riot --time --count river_planet-latest.osm.pbf.ttl.bz2
>>>
>>> Triples = 163,310,838
>>> 9,948.17 sec : 163,310,838 Triples : 16,416.17 per second : 0
>>> errors :
>>> 31 warnings
>>>
>>>
>>> Takes ages with Bzip2 ... there must be something going wrong ...
>>>
>>>
>>> We checked code and the Apache Commons Compress docs, a colleague
>>> spotted the hint at
>>> https://commons.apache.org/proper/commons-compress/examples.html#Buffering
>>>
>>> :
>>>
>>>> The stream classes all wrap around streams provided by the
>>>> calling
>>>> code and they work on them directly without any additional
>>>> buffering.
>>>> On the other hand most of them will benefit from buffering so it
>>>> is
>>>> highly recommended that users wrap their stream in
>>>> Buffered(In|Out)putStreams before using the Commons Compress API.
>>> we were curious about this statement, checked
>>> org.apache.jena.atlas.io.IO class and added one line in openFileEx
>>>
>>> in = new BufferedInputStream(in);
>>>
>>> which wraps the file stream before its passed to the decompressor
>>> streams
>>>
>>>
>>> Run again the parsing:
>>>
>>>
>>> riot --time --count river_planet-latest.osm.pbf.ttl.bz2 (Jena
>>> 4.7.0-SNAPSHOT fork with a BufferedInputStream wrapping the file
>>> stream in IO class)
>>>
>>> Triples = 163,310,838
>>> 1,004.68 sec : 163,310,838 Triples : 162,550.10 per second : 0
>>> errors
>>> : 31 warnings
>>>
>>>
>>> What do you think?
>>>
>>>
>>> On 28.08.22 14:22, Andy Seaborne wrote:
>>>>> If you are relying on Jena to do the bz2 decompress, then it is
>>>>> using Commons Compress.
>>>>>
>>>>> gz is done (via Commons Compress) in native code. I use gz and
>>>>> if I
>>>>> get a bz2 file, I decompress it with OS tools.
>>>> Could you try an experiment please?
>>>>
>>>> Run on the same hardware as the loader was run:
>>>>
>>>> riot --time --count river_planet-latest.osm.pbf.ttl
>>>> riot --time --count river_planet-latest.osm.pbf.ttl.bz2
>>>>
>>>> Andy
>>>>
>>>> gz vs plain: NVMe m2 SSD : Dell XPS 13 9310
>>>>
>>>> riot --time --count .../BSBM/bsbm-25m.nt.gz
>>>> Triples = 24,997,044
>>>> 118.02 sec : 24,997,044 Triples : 211,808.84 per second
>>>>
>>>> riot --time --count .../BSBM/bsbm-25m.nt
>>>> Triples = 24,997,044
>>>> 109.97 sec : 24,997,044 Triples : 227,314.05 per second
>
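The two factors in the observation at the top of this message follow directly from the riot wall-clock times reported there. A short Python check of the arithmetic (the helper name speedup is made up for this sketch; the numbers are the ones from the message):

```python
def speedup(seconds_before, seconds_after):
    """Return how many times faster the second run is than the first."""
    return seconds_before / seconds_after


# riot times from the message: Jena 4.6.0 vs the patched 4.7.0 build.
runs = {
    "ZFS raid": (3047.89, 308.05),
    "NVMe": (824.11, 303.07),
}
for disk, (before, after) in runs.items():
    print(f"{disk}: {speedup(before, after):.1f}x faster with the buffered stream")
```

This works out to roughly 9.9x on the ZFS raid and 2.7x on the NVMe disk, matching the "factor 10" and "3x" quoted in the observation.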
Re: Re: Re: Re: TDB2 bulk loader - multiple files into different graph per file
Posted by Simon Bin <sb...@informatik.uni-leipzig.de>.
I was asked to try it on my system (samsung 970 evo+ nvme, intel
11850h), but I used a slightly smaller data set (river_europe); it is
not quite as bad as on Lorenz' but the buffering would help
nevertheless:
main : river_europe-latest.osm.pbf.ttl.bz2 : 815.14 sec : 72,098,221 Triples : 88,449.21 per second : 0 errors : 10 warnings
fix/bzip2 : river_europe-latest.osm.pbf.ttl.bz2 : 376.64 sec : 72,098,221 Triples : 191,424.76 per second : 0 errors : 10 warnings
pbzip2 -dc river_europe-latest.osm.pbf.ttl.bz2 | : 155.24 sec : 72,098,221 Triples : 464,442.66 per second : 0 errors : 10 warnings
river_europe-latest.osm.pbf.ttl : 136.92 sec : 72,098,221 Triples : 526,587.26 per second : 0 errors : 10 warnings
Cheers,
On Mon, 2022-08-29 at 13:09 +0200, Lorenz Buehmann wrote:
> In addition I used the OS tool in a pipe:
>
> bunzip2 -c river_planet-latest.osm.pbf.ttl.bz2 | riot --time --count
> --syntax "Turtle"
>
> Triples = 163,310,838
> stdin : 717.78 sec : 163,310,838 Triples : 227,523.09 per
> second : 0 errors : 31 warnings
>
>
> unsurprisingly more or less exactly the time of decompression + the
> parsing time of the uncompressed file - still way faster than the
> Apache
> Commons one, even with my suggested fix the OS variant is ~5min
> faster
>
>
> On 29.08.22 11:24, Lorenz Buehmann wrote:
> > riot --time --count river_planet-latest.osm.pbf.ttl
> >
> > Triples = 163,310,838
> > 351.00 sec : 163,310,838 Triples : 465,271.72 per second : 0 errors
> > :
> > 31 warnings
> >
> >
> > riot --time --count river_planet-latest.osm.pbf.ttl.gz
> >
> > Triples = 163,310,838
> > 431.74 sec : 163,310,838 Triples : 378,258.50 per second : 0 errors
> > :
> > 31 warnings
> >
> >
> > riot --time --count river_planet-latest.osm.pbf.ttl.bz2
> >
> > Triples = 163,310,838
> > 9,948.17 sec : 163,310,838 Triples : 16,416.17 per second : 0
> > errors :
> > 31 warnings
> >
> >
> > Takes ages with Bzip2 ... there must be something going wrong ...
> >
> >
> > We checked code and the Apache Commons Compress docs, a colleague
> > spotted the hint at
> > https://commons.apache.org/proper/commons-compress/examples.html#Buffering
> >
> > :
> >
> > > The stream classes all wrap around streams provided by the
> > > calling
> > > code and they work on them directly without any additional
> > > buffering.
> > > On the other hand most of them will benefit from buffering so it
> > > is
> > > highly recommended that users wrap their stream in
> > > Buffered(In|Out)putStreams before using the Commons Compress API.
> > we were curious about this statement, checked
> > org.apache.jena.atlas.io.IO class and added one line in openFileEx
> >
> > in = new BufferedInputStream(in);
> >
> > which wraps the file stream before its passed to the decompressor
> > streams
> >
> >
> > Run again the parsing:
> >
> >
> > riot --time --count river_planet-latest.osm.pbf.ttl.bz2 (Jena
> > 4.7.0-SNAPSHOT fork with a BufferedInputStream wrapping the file
> > stream in IO class)
> >
> > Triples = 163,310,838
> > 1,004.68 sec : 163,310,838 Triples : 162,550.10 per second : 0
> > errors
> > : 31 warnings
> >
> >
> > What do you think?
> >
> >
> > On 28.08.22 14:22, Andy Seaborne wrote:
> > >
> > > >
> > > > If you are relying on Jena to do the bz2 decompress, then it is
> > > > using Commons Compress.
> > > >
> > > > gz is done (via Commons Compress) in native code. I use gz and
> > > > if I
> > > > get a bz2 file, I decompress it with OS tools.
> > >
> > > Could you try an experiment please?
> > >
> > > Run on the same hardware as the loader was run:
> > >
> > > riot --time --count river_planet-latest.osm.pbf.ttl
> > > riot --time --count river_planet-latest.osm.pbf.ttl.bz2
> > >
> > > Andy
> > >
> > > gz vs plain: NVMe m2 SSD : Dell XPS 13 9310
> > >
> > > riot --time --count .../BSBM/bsbm-25m.nt.gz
> > > Triples = 24,997,044
> > > 118.02 sec : 24,997,044 Triples : 211,808.84 per second
> > >
> > > riot --time --count .../BSBM/bsbm-25m.nt
> > > Triples = 24,997,044
> > > 109.97 sec : 24,997,044 Triples : 227,314.05 per second
>
Re: Re: Re: TDB2 bulk loader - multiple files into different graph per file
Posted by Lorenz Buehmann <bu...@informatik.uni-leipzig.de>.
In addition I used the OS tool in a pipe:
bunzip2 -c river_planet-latest.osm.pbf.ttl.bz2 | riot --time --count
--syntax "Turtle"
Triples = 163,310,838
stdin : 717.78 sec : 163,310,838 Triples : 227,523.09 per
second : 0 errors : 31 warnings
Unsurprisingly, this is more or less exactly the decompression time plus the
parsing time of the uncompressed file. It is still way faster than the Apache
Commons one; even with my suggested fix, the OS variant is ~5 min faster.
On 29.08.22 11:24, Lorenz Buehmann wrote:
> riot --time --count river_planet-latest.osm.pbf.ttl
>
> Triples = 163,310,838
> 351.00 sec : 163,310,838 Triples : 465,271.72 per second : 0 errors :
> 31 warnings
>
>
> riot --time --count river_planet-latest.osm.pbf.ttl.gz
>
> Triples = 163,310,838
> 431.74 sec : 163,310,838 Triples : 378,258.50 per second : 0 errors :
> 31 warnings
>
>
> riot --time --count river_planet-latest.osm.pbf.ttl.bz2
>
> Triples = 163,310,838
> 9,948.17 sec : 163,310,838 Triples : 16,416.17 per second : 0 errors :
> 31 warnings
>
>
> Takes ages with Bzip2 ... there must be something going wrong ...
>
>
> We checked code and the Apache Commons Compress docs, a colleague
> spotted the hint at
> https://commons.apache.org/proper/commons-compress/examples.html#Buffering
> :
>
>> The stream classes all wrap around streams provided by the calling
>> code and they work on them directly without any additional buffering.
>> On the other hand most of them will benefit from buffering so it is
>> highly recommended that users wrap their stream in
>> Buffered(In|Out)putStreams before using the Commons Compress API.
> we were curious about this statement, checked
> org.apache.jena.atlas.io.IO class and added one line in openFileEx
>
> in = new BufferedInputStream(in);
>
> which wraps the file stream before it's passed to the decompressor streams
>
>
> Run again the parsing:
>
>
> riot --time --count river_planet-latest.osm.pbf.ttl.bz2 (Jena
> 4.7.0-SNAPSHOT fork with a BufferedInputStream wrapping the file
> stream in IO class)
>
> Triples = 163,310,838
> 1,004.68 sec : 163,310,838 Triples : 162,550.10 per second : 0 errors
> : 31 warnings
>
>
> What do you think?
>
>
> On 28.08.22 14:22, Andy Seaborne wrote:
>>
>>>
>>> If you are relying on Jena to do the bz2 decompress, then it is
>>> using Commons Compress.
>>>
>>> gz is done (via Commons Compress) in native code. I use gz and if I
>>> get a bz2 file, I decompress it with OS tools.
>>
>> Could you try an experiment please?
>>
>> Run on the same hardware as the loader was run:
>>
>> riot --time --count river_planet-latest.osm.pbf.ttl
>> riot --time --count river_planet-latest.osm.pbf.ttl.bz2
>>
>> Andy
>>
>> gz vs plain: NVMe m2 SSD : Dell XPS 13 9310
>>
>> riot --time --count .../BSBM/bsbm-25m.nt.gz
>> Triples = 24,997,044
>> 118.02 sec : 24,997,044 Triples : 211,808.84 per second
>>
>> riot --time --count .../BSBM/bsbm-25m.nt
>> Triples = 24,997,044
>> 109.97 sec : 24,997,044 Triples : 227,314.05 per second
Re: TDB2 bulk loader - multiple files into different graph per file
Posted by Andy Seaborne <an...@apache.org>.
On 29/08/2022 18:58, Andy Seaborne wrote:
>
>
> On 29/08/2022 10:24, Lorenz Buehmann wrote:
> ...
>
>> We checked code and the Apache Commons Compress docs, a colleague
>> spotted the hint at
>> https://commons.apache.org/proper/commons-compress/examples.html#Buffering
>> :
>>
>>> The stream classes all wrap around streams provided by the calling
>>> code and they work on them directly without any additional buffering.
>>> On the other hand most of them will benefit from buffering so it is
>>> highly recommended that users wrap their stream in
>>> Buffered(In|Out)putStreams before using the Commons Compress API.
>> we were curious about this statement, checked
>> org.apache.jena.atlas.io.IO class and added one line in openFileEx
>>
>> in = new BufferedInputStream(in);
>>
>> which wraps the file stream before it's passed to the decompressor streams
>>
>>
>> Run again the parsing:
>>
>>
>> riot --time --count river_planet-latest.osm.pbf.ttl.bz2 (Jena
>> 4.7.0-SNAPSHOT fork with a BufferedInputStream wrapping the file
>> stream in IO class)
>>
>> Triples = 163,310,838
>> 1,004.68 sec : 163,310,838 Triples : 162,550.10 per second : 0 errors
>> : 31 warnings
>>
>>
>> What do you think?
>
> Yes.
>
> IO.ensureBuffered.
>
> It buffers if not already buffered and if not a ByteArrayInputStream.
> It also makes all buffering findable in the IDE.
>
> RIOT buffers (128K char buffer) so calls down to chars-UTF8-bytes are in
> chunks. Your observation indicates BZip2CompressorInputStream is not
> exploiting read(byte[] dest) calls ... yep - it's a loop calling
> the internal one-byte "read0".
>
> GZIPInputStream has a default 512 byte buffer - maybe a bigger one there
> will help a bit.
A quick test on BSBM-25 million...
Adding buffering in gzip caused a 0.1% slow-down. (Data from SSD)
Andy
Re: TDB2 bulk loader - multiple files into different graph per file
Posted by Andy Seaborne <an...@apache.org>.
On 29/08/2022 10:24, Lorenz Buehmann wrote:
...
> We checked code and the Apache Commons Compress docs, a colleague
> spotted the hint at
> https://commons.apache.org/proper/commons-compress/examples.html#Buffering
> :
>
>> The stream classes all wrap around streams provided by the calling
>> code and they work on them directly without any additional buffering.
>> On the other hand most of them will benefit from buffering so it is
>> highly recommended that users wrap their stream in
>> Buffered(In|Out)putStreams before using the Commons Compress API.
> we were curious about this statement, checked
> org.apache.jena.atlas.io.IO class and added one line in openFileEx
>
> in = new BufferedInputStream(in);
>
> which wraps the file stream before it's passed to the decompressor streams
>
>
> Ran the parsing again:
>
>
> riot --time --count river_planet-latest.osm.pbf.ttl.bz2 (Jena
> 4.7.0-SNAPSHOT fork with a BufferedInputStream wrapping the file stream
> in IO class)
>
> Triples = 163,310,838
> 1,004.68 sec : 163,310,838 Triples : 162,550.10 per second : 0 errors :
> 31 warnings
>
>
> What do you think?
Yes.
IO.ensureBuffered.
It buffers if not already buffered and if not a ByteArrayInputStream.
It also makes all buffering findable in the IDE.
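As a rough illustration of the rule just described (buffer unless the stream is already buffered or is a ByteArrayInputStream), here is a minimal Java sketch. It is not a copy of Jena's IO.ensureBuffered; the class name Buffering and the 128 KiB buffer size are assumptions made for this example.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.InputStream;

final class Buffering {
    // Hypothetical buffer size chosen for this sketch; not taken from Jena.
    static final int BUFFER_SIZE = 128 * 1024;

    /** Wraps the stream in a BufferedInputStream unless that would be redundant. */
    static InputStream ensureBuffered(InputStream in) {
        // Already buffered: wrapping again would only add a copy step.
        if (in instanceof BufferedInputStream) return in;
        // Already backed by an in-memory array: nothing to gain from buffering.
        if (in instanceof ByteArrayInputStream) return in;
        return new BufferedInputStream(in, BUFFER_SIZE);
    }
}
```

Funnelling every wrap through one helper like this is also what makes the buffering easy to find from an IDE.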
RIOT buffers (128K char buffer) so calls down to chars-UTF8-bytes are in
chunks. Your observation indicates BZip2CompressorInputStream is not
exploiting read(byte[] dest) calls ... yep - it's a loop calling the
internal one-byte "read0".
GZIPInputStream has a default 512 byte buffer - maybe a bigger one there
will help a bit.
SnappyCompressorInputStream has a 32k buffer.
So it is bz2 needing IO.ensureBuffered, the others may benefit - or may
go slower.
Andy
Re: Re: TDB2 bulk loader - multiple files into different graph per file
Posted by Lorenz Buehmann <bu...@informatik.uni-leipzig.de>.
riot --time --count river_planet-latest.osm.pbf.ttl
Triples = 163,310,838
351.00 sec : 163,310,838 Triples : 465,271.72 per second : 0 errors : 31
warnings
riot --time --count river_planet-latest.osm.pbf.ttl.gz
Triples = 163,310,838
431.74 sec : 163,310,838 Triples : 378,258.50 per second : 0 errors : 31
warnings
riot --time --count river_planet-latest.osm.pbf.ttl.bz2
Triples = 163,310,838
9,948.17 sec : 163,310,838 Triples : 16,416.17 per second : 0 errors :
31 warnings
Takes ages with Bzip2 ... there must be something going wrong ...
We checked code and the Apache Commons Compress docs, a colleague
spotted the hint at
https://commons.apache.org/proper/commons-compress/examples.html#Buffering :
> The stream classes all wrap around streams provided by the calling
> code and they work on them directly without any additional buffering.
> On the other hand most of them will benefit from buffering so it is
> highly recommended that users wrap their stream in
> Buffered(In|Out)putStreams before using the Commons Compress API.
We were curious about this statement, checked the
org.apache.jena.atlas.io.IO class and added one line in openFileEx:
in = new BufferedInputStream(in);
which wraps the file stream before it's passed to the decompressor streams.
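To show why that one line can matter, here is a small self-contained Java sketch that counts how many read calls reach the underlying stream when a consumer pulls one byte at a time, with and without a BufferedInputStream in between. It uses only the JDK and an in-memory source, so it illustrates the general effect rather than Commons Compress internals or real file I/O; the class names CountingInputStream and BufferingDemo are made up for the example.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Counts how often the wrapped (underlying) stream is asked for data. */
final class CountingInputStream extends FilterInputStream {
    long calls = 0;

    CountingInputStream(InputStream in) { super(in); }

    @Override public int read() throws IOException {
        calls++;
        return super.read();
    }

    @Override public int read(byte[] b, int off, int len) throws IOException {
        calls++;
        return super.read(b, off, len);
    }
}

final class BufferingDemo {
    /** Drains the stream one byte at a time, like a decoder's inner loop. */
    static long drainByteWise(InputStream in) throws IOException {
        long n = 0;
        while (in.read() != -1) n++;
        return n;
    }

    /** Returns the number of read calls that reached the underlying stream. */
    static long underlyingCalls(int size, boolean buffered) {
        CountingInputStream source =
            new CountingInputStream(new ByteArrayInputStream(new byte[size]));
        InputStream in = buffered ? new BufferedInputStream(source) : source;
        try {
            drainByteWise(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return source.calls;
    }

    public static void main(String[] args) {
        int size = 1 << 20; // 1 MiB of dummy data
        System.out.println("unbuffered calls: " + underlyingCalls(size, false));
        System.out.println("buffered calls:   " + underlyingCalls(size, true));
    }
}
```

Without the buffer, every single-byte read goes straight to the source; on a file stream each of those would be a separate trip to the operating system, which is the kind of overhead the wrapper avoids.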
Ran the parsing again:
riot --time --count river_planet-latest.osm.pbf.ttl.bz2 (Jena
4.7.0-SNAPSHOT fork with a BufferedInputStream wrapping the file stream
in IO class)
Triples = 163,310,838
1,004.68 sec : 163,310,838 Triples : 162,550.10 per second : 0 errors :
31 warnings
What do you think?
Re: TDB2 bulk loader - multiple files into different graph per file
Posted by Andy Seaborne <an...@apache.org>.
>
> If you are relying on Jena to do the bz2 decompress, then it is using
> Commons Compress.
>
> gz is done (via Commons Compress) in native code. I use gz and if I get
> a bz2 file, I decompress it with OS tools.
Could you try an experiment please?
Run on the same hardware as the loader was run:
riot --time --count river_planet-latest.osm.pbf.ttl
riot --time --count river_planet-latest.osm.pbf.ttl.bz2
Andy
gz vs plain: NVMe m2 SSD : Dell XPS 13 9310
riot --time --count .../BSBM/bsbm-25m.nt.gz
Triples = 24,997,044
118.02 sec : 24,997,044 Triples : 211,808.84 per second
riot --time --count .../BSBM/bsbm-25m.nt
Triples = 24,997,044
109.97 sec : 24,997,044 Triples : 227,314.05 per second
Re: TDB2 bulk loader - multiple files into different graph per file
Posted by Andy Seaborne <an...@apache.org>.
On 28/08/2022 09:58, Lorenz Buehmann wrote:
> Hi Andy,
>
> thanks for the fast response.
>
> I see - the only drawback with wrapping the streams into TriG is when we
> have Turtle syntax files (or let's say any non N-Triples format) - afaik,
> prefixes aren't allowed inside graphs, i.e. at that point you're lost.
>
> What I did now is to pipe those files into riot first which then
> generates N-Triples which then can be wrapped in TriG graphs. Indeed, we
> have the riot overhead here, i.e. the data is parsed twice. Still faster
> though than loading graphs in separate TDB loader calls, so I guess I
> can live with this.
Exercise in text processing :-)
Spit out the prefixes into a separate TTL file (grep!) and load that
file as well.
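For anyone who would rather not shell out to grep, the same idea can be sketched in a few lines of Java. This is a line-based heuristic with the same limits as grep: it assumes each prefix or base directive starts on its own line, and a multi-line literal containing such a line would be picked up by mistake. The class name PrefixExtractor is made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class PrefixExtractor {
    // @prefix/@base are case-sensitive in Turtle; SPARQL-style PREFIX/BASE are not.
    private static final Pattern DIRECTIVE =
        Pattern.compile("\\s*(?:@prefix|@base|(?i:prefix|base))\\s");

    /** Returns the lines of the Turtle text that start a prefix or base directive. */
    static List<String> extractDirectives(String turtle) {
        List<String> directives = new ArrayList<>();
        for (String line : turtle.split("\\R")) {
            if (DIRECTIVE.matcher(line).lookingAt()) {
                directives.add(line);
            }
        }
        return directives;
    }

    public static void main(String[] args) {
        String turtle = String.join("\n",
            "@prefix ex: <http://example/> .",
            "PREFIX foaf: <http://xmlns.com/foaf/0.1/>",
            "ex:a foaf:knows ex:b .");
        // Prints only the two directive lines.
        extractDirectives(turtle).forEach(System.out::println);
    }
}
```

The collected lines can be written to a small .ttl file and loaded alongside the data.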
>
> Having a follow up question:
>
> I could see a huge difference between reading a compressed (bzip2) vs an
> uncompressed file:
>
> I put the output up to the point where the triples have been loaded here,
> as the index creation should not be affected by the compression:
>
>
> # uncompressed with tdb2.tdbloader
Which loader?
And what hardware?
(--loader=parallel may not make much of a difference at 100m)
> 14:24:40 INFO loader :: Add: 163,000,000
> river_planet-latest.osm.pbf.ttl (Batch: 144,320 / Avg: 140,230)
> 14:24:42 INFO loader :: Finished:
> output/river_planet-latest.osm.pbf.ttl: 163,310,838 tuples in 1165.30s
> (Avg: 140,145)
>
>
> # compressed with tdb2.tdbloader
>
> 17:37:37 INFO loader :: Add: 163,000,000
> river_planet-latest.osm.pbf.ttl.bz2 (Batch: 19,424 / Avg: 16,050)
> 17:37:40 INFO loader :: Finished:
> output/river_planet-latest.osm.pbf.ttl.bz2: 163,310,838 tuples in
> 10158.16s (Avg: 16,076)
That is bad!
Was it consistently slow through the load?
If you are relying on Jena to do the bz2 decompress, then it is using
Commons Compress.
gz is done (via Commons Compress) in native code. I use gz and if I get
a bz2 file, I decompress it with OS tools.
> So loading the compressed file is ~9x slower than the uncompressed one.
> Can we consider this as expected? Note, here we have a geospatial
> dataset with millions of geometry literals. Not sure if this is also
> something that makes things worse.
>
> What are your experiences with loading compressed vs uncompressed data?
bz2 is expensive - it focuses on max compression. Coupled with being
Java (not so much the Java itself, as the decompression code not being
highly tuned) it could be a factor.
Usually (gz) there is a slight slow-down if using SSD as the source. HDD
can go either way.
Andy
Re: Re: TDB2 bulk loader - multiple files into different graph per file
Posted by Lorenz Buehmann <bu...@informatik.uni-leipzig.de>.
Hi Andy,
thanks for the fast response.
I see - the only drawback with wrapping the streams into TriG is when we
have Turtle syntax files (or let's say any non N-Triples format) - afaik,
prefixes aren't allowed inside graphs, i.e. at that point you're lost.
What I did now is to pipe those files into riot first which then
generates N-Triples which then can be wrapped in TriG graphs. Indeed, we
have the riot overhead here, i.e. the data is parsed twice. Still faster
though than loading graphs in separate TDB loader calls, so I guess I
can live with this.
I have a follow-up question:
I could see a huge difference between reading a compressed (bzip2) vs an
uncompressed file:
I put the output up to the point where the triples have been loaded here,
as the index creation should not be affected by the compression:
# uncompressed with tdb2.tdbloader
14:24:40 INFO loader :: Add: 163,000,000
river_planet-latest.osm.pbf.ttl (Batch: 144,320 / Avg: 140,230)
14:24:42 INFO loader :: Finished:
output/river_planet-latest.osm.pbf.ttl: 163,310,838 tuples in 1165.30s
(Avg: 140,145)
# compressed with tdb2.tdbloader
17:37:37 INFO loader :: Add: 163,000,000
river_planet-latest.osm.pbf.ttl.bz2 (Batch: 19,424 / Avg: 16,050)
17:37:40 INFO loader :: Finished:
output/river_planet-latest.osm.pbf.ttl.bz2: 163,310,838 tuples in
10158.16s (Avg: 16,076)
So loading the compressed file is ~9x slower than the uncompressed one.
Can we consider this as expected? Note, here we have a geospatial
dataset with millions of geometry literals. Not sure if this is also
something that makes things worse.
What are your experiences with loading compressed vs uncompressed data?
Cheers,
Lorenz
Re: TDB2 bulk loader - multiple files into different graph per file
Posted by Andy Seaborne <an...@apache.org>.
Hi Lorenz,
No - there isn't an option.
The way to do it is to prepare the load as quads by, for example,
wrapping TriG syntax around the files or adding the graph (G) to each
N-Triples line to make N-Quads.
This can be done streaming and piped into the loader (with --syntax= if
not N-Quads).
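A minimal Java sketch of the "adding the G to N-Triples" variant follows. It streams line by line and inserts a graph IRI before the closing dot of each triple, producing N-Quads. It assumes one triple per line with no trailing comment after the final dot; the class name NTriplesToNQuads and the example IRI in the comments are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

final class NTriplesToNQuads {
    /** Copies N-Triples from input to output, adding graphIri as the fourth term. */
    static void addGraph(Reader input, Writer output, String graphIri) throws IOException {
        BufferedReader reader = new BufferedReader(input);
        Writer writer = new BufferedWriter(output);
        String line;
        while ((line = reader.readLine()) != null) {
            String t = line.trim();
            if (t.isEmpty() || t.startsWith("#")) continue; // skip blank and comment lines
            if (!t.endsWith(".")) {
                throw new IOException("Not a complete N-Triples line: " + line);
            }
            // Drop the terminating dot, then re-add it after the graph term.
            String triple = t.substring(0, t.length() - 1).trim();
            writer.write(triple + " <" + graphIri + "> .\n");
        }
        writer.flush();
    }

    /** Convenience wrapper for small in-memory inputs. */
    static String addGraph(String nTriples, String graphIri) {
        StringWriter out = new StringWriter();
        try {
            addGraph(new StringReader(nTriples), out, graphIri);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    /** Reads N-Triples on stdin and writes N-Quads to stdout; args[0] is the graph IRI. */
    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: NTriplesToNQuads <graph-iri>");
            System.exit(2);
        }
        addGraph(new InputStreamReader(System.in, StandardCharsets.UTF_8),
                 new OutputStreamWriter(System.out, StandardCharsets.UTF_8),
                 args[0]);
    }
}
```

Run once per (graph, file) pair, the outputs can be concatenated into a single quads stream for one loader run.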
> By the way, the tdb2.xloader has no option for named graphs at all?
The input needs to be prepared as quads.
Andy