Posted to users@jena.apache.org by Sarven Capadisli <in...@csarven.ca> on 2012/02/29 06:43:59 UTC

tdbloader's info on batch count

Hi, I was hoping you guys could clarify some of these questions for me:

When I import data into my TDB Triple Store using tdb.tdbloader, I get 
information like the following:

Add: 4,150,000 triples (Batch: 2,380 / Avg: 4,684)
Add: 4,200,000 triples (Batch: 29,620 / Avg: 4,732)

What is batch exactly?

Why does it differ from one step to another?

Is there a way to set the batch number?

Is there a way to configure TDB in order to perform faster importing?

Thanks,

-Sarven

Re: tdbloader's info on batch count

Posted by Sarven Capadisli <in...@csarven.ca>.
On 12-02-29 07:12 AM, Paolo Castagna wrote:
> Sarven Capadisli wrote:
>> On 12-02-29 05:09 AM, Damian Steer wrote:
>>> At a guess, other stuff happening on the same host? A batch might
>>> include a sync to disk too. I wouldn't have thought GC would be an issue.
>>
>> Not to my knowledge. I get the feeling that the disk falls asleep.
>> Hence, I'm investigating with what I have right now.
>>
>>> Loading from empty using tdbloader2 is the usual advice. Paolo has
>>> been working on a cross platform version of this.
>>
>> Is it possible to use it on an existing store?
>
> Hi Sarven,
> a pure Java version of tdbloader2 (named tdbloader3) is available
> as an *experimental* prototype, but it is just for bulk loads on an
> initial empty store (as I think is the case for the existing
> tdbloader2, right?).

Hi Paolo, thanks a lot for that info. I will give tdbloader2/3 a go on 
a separate store.

> Code here:
> https://svn.apache.org/repos/asf/incubator/jena/Scratch/PC/tdbloader3/trunk/
>
> JIRA issue here:
> https://issues.apache.org/jira/browse/JENA-117

Great thank you!

> How many triples is the RDF dataset you are trying to load?

I don't know the count at the moment, but I've mentioned the size in 
my reply to Andy's email: ~5,100 RDF/XML files, ~35 GB in total, of 
different sizes. The largest files are less than 15 MB.

-Sarven

Re: tdbloader's info on batch count

Posted by Paolo Castagna <ca...@googlemail.com>.
Sarven Capadisli wrote:
> On 12-02-29 05:09 AM, Damian Steer wrote:
>> At a guess, other stuff happening on the same host? A batch might
>> include a sync to disk too. I wouldn't have thought GC would be an issue.
> 
> Not to my knowledge. I get the feeling that the disk falls asleep.
> Hence, I'm investigating with what I have right now.
> 
>> Loading from empty using tdbloader2 is the usual advice. Paolo has
>> been working on a cross platform version of this.
> 
> Is it possible to use it on an existing store?

Hi Sarven,
a pure Java version of tdbloader2 (named tdbloader3) is available
as an *experimental* prototype, but it is just for bulk loads on an
initial empty store (as I think is the case for the existing
tdbloader2, right?).

Code here:
https://svn.apache.org/repos/asf/incubator/jena/Scratch/PC/tdbloader3/trunk/

JIRA issue here:
https://issues.apache.org/jira/browse/JENA-117

How many triples is the RDF dataset you are trying to load?

Paolo

> 
> -Sarven


Re: tdbloader's info on batch count

Posted by Sarven Capadisli <in...@csarven.ca>.
On 12-02-29 08:09 AM, Andy Seaborne wrote:
> On 29/02/12 11:34, Sarven Capadisli wrote:
>> On 12-02-29 06:26 AM, Sarven Capadisli wrote:
>>> On 12-02-29 05:09 AM, Damian Steer wrote:
>>>> At a guess, other stuff happening on the same host? A batch might
>>>> include a sync to disk too. I wouldn't have thought GC would be an
>>>> issue.
>>>
>>> Not to my knowledge. I get the feeling that the disk falls asleep.
>>> Hence, I'm investigating with what I have right now.
>>
>> On that note, actually what I find absurd is that, if I want to get
>> tdbloader back to action (to work faster), I do some large disk writing
>> on another screen window. This was an accidental find, and I don't have
>> a technical explanation for it. Somehow that causes the Batch numbers to go
>> up to 20,000+, where they may have been stuck below 1,000.
>>
>> -Sarven
>
> Interesting but I'm not completely shocked.
>
> The batch speed (yes, triples per second for the last time interval)
> tends to shoot up at the start (JIT presumably), hit some peak, then
> very slowly decline. With exceptions. Sometimes it declines for a bit,
> then starts going faster even on a machine that is doing nothing else,
> which is a bit odd.
>
> I think the occasional one-off drop in batch is a major, non-incremental
> GC happening.
>
> The "doing work elsewhere" makes it go faster might be because the OS is
> knocked into a more efficient policy for the disk cache but I'm guessing
> here.
>
>> Add: 4,150,000 triples (Batch: 2,380 / Avg: 4,684)
>> Add: 4,200,000 triples (Batch: 29,620 / Avg: 4,732)
>
> That's pretty slow.
>
> Usual questions:
> How much data overall?

I don't have a triple count right now, but about 35 GB across 5,100 
or so RDF/XML files of different sizes.

> Many long literals? Other unusual data features?

Not that many literals even. And they are usually short.

> What's the machine?

Ubuntu x86_64 GNU/Linux
Memory: 16 GB
Disk swap: 16 GB
Filesystem: ext4

> An incremental version is quite possible. It could load to a dataset,
> ensuring the ids are right, then do index-merging.

I was loading them incrementally, then I merged a good chunk of the 
files into N-Triples and tried importing them. It seems to go 
slightly better, but it is hard to tell for sure.
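
(For reference, a minimal sketch of that RDF/XML-to-N-Triples conversion 
step using the Jena Model API. The file names are placeholders, and each 
file is read fully into memory, which is fine for files under ~15 MB as 
above but not for arbitrarily large inputs.)

import java.io.FileInputStream;
import java.io.FileOutputStream;

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;

// Sketch only: convert one RDF/XML file to N-Triples ahead of bulk loading.
public class RdfXmlToNTriples {
    public static void main(String[] args) throws Exception {
        Model m = ModelFactory.createDefaultModel();
        m.read(new FileInputStream("input.rdf"), null, "RDF/XML");   // parse the RDF/XML file
        m.write(new FileOutputStream("output.nt"), "N-TRIPLE");      // write it back out as N-Triples
        m.close();
    }
}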


Thanks Andy!

-Sarven

Re: tdbloader's info on batch count

Posted by Andy Seaborne <an...@apache.org>.
On 29/02/12 13:18, Paolo Castagna wrote:
> Andy Seaborne wrote:
>> An incremental version is quite possible.  It could load to a dataset,
>> ensuring the ids are right, then do index-merging.
>
> Hi Andy,
> can you expand a little bit on "ensuring the ids are right"
> and "index-merging" bits? ;-)
>
> To "ensure ids are right" the incremental loader would need
> to re-use the same node table of the existing db, right?

Yes.

(Hash-ids don't remove the need, but they would change the problem to 
allowing two independent databases to be merged by messing around with 
the lowest-level data structures.)

> I have been thinking on how to merge two TDB indexes, but
> it does not seem a trivial problem to me... not with the
> current node ids.

The indexes are just a stream of sorted numbers (OK - the numbers are 
192 bits long but that's what computers are for :-)  It's a plain merge 
of two already sorted streams, with duplicate removal, using the B+Tree 
rebuilder.
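
(A minimal sketch of that merge, for the curious. Plain Longs stand in 
for the real 192-bit index records and a List stands in for the B+Tree 
rebuilder sink, so none of the names below are actual TDB classes.)

import java.util.Iterator;
import java.util.List;

// Sketch only: merge two already-sorted streams of index records,
// dropping duplicates. Longs stand in for TDB's 192-bit id tuples.
public class SortedIndexMerge {

    private static Long pull(Iterator<Long> it) {
        return it.hasNext() ? it.next() : null;    // null marks an exhausted stream
    }

    public static void merge(Iterator<Long> a, Iterator<Long> b, List<Long> out) {
        Long x = pull(a), y = pull(b);
        Long last = null;                          // last record emitted, for duplicate removal
        while (x != null || y != null) {
            Long v;
            if (y == null || (x != null && x.compareTo(y) <= 0)) {
                v = x; x = pull(a);                // take the smaller head, from stream a
            } else {
                v = y; y = pull(b);                // take the smaller head, from stream b
            }
            if (!v.equals(last)) {                 // emit each distinct record once
                out.add(v);
                last = v;
            }
        }
    }
}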

>
> Paolo

	Andy


Re: tdbloader's info on batch count

Posted by Paolo Castagna <ca...@googlemail.com>.
Andy Seaborne wrote:
> An incremental version is quite possible.  It could load to a dataset,
> ensuring the ids are right, then do index-merging.

Hi Andy,
can you expand a little bit on "ensuring the ids are right"
and "index-merging" bits? ;-)

To "ensure ids are right" the incremental loader would need
to re-use the same node table of the existing db, right?

I have been thinking on how to merge two TDB indexes, but
it does not seem a trivial problem to me... not with the
current node ids.

Paolo




Re: tdbloader's info on batch count

Posted by Andy Seaborne <an...@apache.org>.
On 10/03/12 08:47, Sarven Capadisli wrote:
> On 12-02-29 08:09 AM, Andy Seaborne wrote:
>> On 29/02/12 11:34, Sarven Capadisli wrote:
>>> On 12-02-29 06:26 AM, Sarven Capadisli wrote:
>>>> On 12-02-29 05:09 AM, Damian Steer wrote:
>>>>> At a guess, other stuff happening on the same host? A batch might
>>>>> include a sync to disk too. I wouldn't have thought GC would be an
>>>>> issue.
>>>>
>>>> Not to my knowledge. I get the feeling that the disk falls asleep.
>>>> Hence, I'm investigating with what I have right now.
>>>
>>> On that note, actually what I find absurd is that, if I want to get
>>> tdbloader back to action (to work faster), I do some large disk writing
>>> on another screen window. This was an accidental find, and I don't have
>>> a technical explanation for it. Somehow that causes the Batch numbers to go
>>> up to 20,000+, where they may have been stuck below 1,000.
>>>
>>> -Sarven
>>
>> Interesting but I'm not completely shocked.
>>
>> The batch speed (yes, triples per second for the last time interval)
>> tends to shoot up at the start (JIT presumably), hit some peak, then
>> very slowly decline. With exceptions. Sometimes it declines for a bit,
>> then starts going faster even on a machine that is doing nothing else,
>> which is a bit odd.
>>
>> I think the occasional one-off drop in batch is a major, non-incremental
>> GC happening.
>>
>> The "doing work elsewhere" makes it go faster might be because the OS is
>> knocked into a more efficient policy for the disk cache but I'm guessing
>> here.
>
> By observation, I can pretty much confirm that when loading slows down,
> the way to speed it up is to get some disk writing going on elsewhere.

What is the state of the machine at the time?
(e.g. what is the process size at the time?)

>
> Is it possible to write some magic code that can detect these slowdowns
> and nudge the OS to get things going?

Probably yes.  Maybe just having another process doing a little writing 
all the time the load is on may be all that is needed.
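
(An untested sketch of that workaround: a trivial side process that 
writes and syncs a small scratch file every few seconds while the load 
runs. The path and interval here are arbitrary.)

import java.io.RandomAccessFile;

// Sketch only: keep the disk lightly busy while a TDB load runs.
public class KeepDiskBusy {
    public static void main(String[] args) throws Exception {
        RandomAccessFile f = new RandomAccessFile("/tmp/keep-disk-busy.tmp", "rw");
        byte[] block = new byte[4096];             // one small block is enough
        while (true) {
            f.seek(0);
            f.write(block);                        // small write...
            f.getFD().sync();                      // ...forced out to the device
            Thread.sleep(5000);                    // every five seconds
        }
    }
}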

	Andy

>
> -Sarven


Re: tdbloader's info on batch count

Posted by Sarven Capadisli <in...@csarven.ca>.
On 12-02-29 08:09 AM, Andy Seaborne wrote:
> On 29/02/12 11:34, Sarven Capadisli wrote:
>> On 12-02-29 06:26 AM, Sarven Capadisli wrote:
>>> On 12-02-29 05:09 AM, Damian Steer wrote:
>>>> At a guess, other stuff happening on the same host? A batch might
>>>> include a sync to disk too. I wouldn't have thought GC would be an
>>>> issue.
>>>
>>> Not to my knowledge. I get the feeling that the disk falls asleep.
>>> Hence, I'm investigating with what I have right now.
>>
>> On that note, actually what I find absurd is that, if I want to get
>> tdbloader back to action (to work faster), I do some large disk writing
>> on another screen window. This was an accidental find, and I don't have
>> a technical explanation for it. Somehow that causes the Batch numbers to go
>> up to 20,000+, where they may have been stuck below 1,000.
>>
>> -Sarven
>
> Interesting but I'm not completely shocked.
>
> The batch speed (yes, triples per second for the last time interval)
> tends to shoot up at the start (JIT presumably), hit some peak, then
> very slowly decline. With exceptions. Sometimes it declines for a bit,
> then starts going faster even on a machine that is doing nothing else,
> which is a bit odd.
>
> I think the occasional one-off drop in batch is a major, non-incremental
> GC happening.
>
> The "doing work elsewhere" makes it go faster might be because the OS is
> knocked into a more efficient policy for the disk cache but I'm guessing
> here.

By observation, I can pretty much confirm that when loading slows down, 
the way to speed it up is to get some disk writing going on elsewhere.

Is it possible to write some magic code that can detect these slowdowns 
and nudge the OS to get things going?

-Sarven

Re: tdbloader's info on batch count

Posted by Andy Seaborne <an...@apache.org>.
On 29/02/12 11:34, Sarven Capadisli wrote:
> On 12-02-29 06:26 AM, Sarven Capadisli wrote:
>> On 12-02-29 05:09 AM, Damian Steer wrote:
>>> At a guess, other stuff happening on the same host? A batch might
>>> include a sync to disk too. I wouldn't have thought GC would be an
>>> issue.
>>
>> Not to my knowledge. I get the feeling that the disk falls asleep.
>> Hence, I'm investigating with what I have right now.
>
> On that note, actually what I find absurd is that, if I want to get
> tdbloader back to action (to work faster), I do some large disk writing
> on another screen window. This was an accidental find, and I don't have
> a technical explanation for it. Somehow that causes the Batch numbers to go
> up to 20,000+, where they may have been stuck below 1,000.
>
> -Sarven

Interesting but I'm not completely shocked.

The batch speed (yes, triples per second for the last time interval) 
tends to shoot up at the start (JIT presumably), hit some peak, then 
very slowly decline.   With exceptions.  Sometimes it declines for a 
bit, then starts going faster even on a machine that is doing nothing 
else, which is a bit odd.

I think the occasional one-off drop in batch is a major, non-incremental 
GC happening.

The "doing work elsewhere" makes it go faster might be because the OS is 
knocked into a more efficient policy for the disk cache but I'm guessing 
here.

> Add: 4,150,000 triples (Batch: 2,380 / Avg: 4,684)
> Add: 4,200,000 triples (Batch: 29,620 / Avg: 4,732)

That's pretty slow.

Usual questions:
   How much data overall?
   Many long literals?  Other unusual data features?
   What's the machine?

An incremental version is quite possible.  It could load to a dataset, 
ensuring the ids are right, then do index-merging.

	Andy

Re: tdbloader's info on batch count

Posted by Sarven Capadisli <in...@csarven.ca>.
On 12-02-29 06:26 AM, Sarven Capadisli wrote:
> On 12-02-29 05:09 AM, Damian Steer wrote:
>> At a guess, other stuff happening on the same host? A batch might
>> include a sync to disk too. I wouldn't have thought GC would be an issue.
>
> Not to my knowledge. I get the feeling that the disk falls asleep.
> Hence, I'm investigating with what I have right now.

On that note, actually what I find absurd is that, if I want to get 
tdbloader back to action (to work faster), I do some large disk writing 
on another screen window. This was an accidental find, and I don't have 
a technical explanation for it. Somehow that causes the Batch numbers to go 
up to 20,000+, where they may have been stuck below 1,000.

-Sarven

Re: tdbloader's info on batch count

Posted by Sarven Capadisli <in...@csarven.ca>.
On 12-02-29 05:09 AM, Damian Steer wrote:
> At a guess, other stuff happening on the same host? A batch might include a sync to disk too. I wouldn't have thought GC would be an issue.

Not to my knowledge. I get the feeling that the disk falls asleep. 
Hence, I'm investigating with what I have right now.

> Loading from empty using tdbloader2 is the usual advice. Paolo has been working on a cross platform version of this.

Is it possible to use it on an existing store?

-Sarven

Re: tdbloader's info on batch count

Posted by Damian Steer <d....@bristol.ac.uk>.
On 29 Feb 2012, at 10:38, Ian Dickinson wrote:

> On 29/02/12 10:19, Damian wrote:
>> 
>> On 29 Feb 2012, at 10:09, Damian Steer wrote:
>>>> 
>>>> What is batch exactly?
>>> 
>>> Batch is the time taken to load the last batch, that is the last
>>> 50,000 triples.
>> 
>> Sorry, brain/finger issue. That should be: time average for the last
>> batch.
> Are they times or triples-per-second scores? (I was never sure, but always guessed the latter)

The latter.
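
(To make that concrete with the numbers from the original mail, 
assuming the 50,000-triple reporting interval mentioned above: 
"Batch: 29,620" means the last 50,000 triples loaded at ~29,620 
triples/s, i.e. in roughly 50,000 / 29,620 = ~1.7 seconds, while 
"Avg: 4,732" is the overall rate, so 4,200,000 triples at ~4,732 
triples/s is roughly 890 seconds of loading so far.)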

Damian

Re: tdbloader's info on batch count

Posted by Ian Dickinson <ia...@epimorphics.com>.
On 29/02/12 10:19, Damian wrote:
>
> On 29 Feb 2012, at 10:09, Damian Steer wrote:
>>>
>>> What is batch exactly?
>>
>> Batch is the time taken to load the last batch, that is the last
>> 50,000 triples.
>
> Sorry, brain/finger issue. That should be: time average for the last
> batch.
Are they times or triples-per-second scores? (I was never sure, but 
always guessed the latter)

Ian

Re: tdbloader's info on batch count

Posted by Damian Steer <d....@bristol.ac.uk>.
On 29 Feb 2012, at 10:09, Damian Steer wrote:
>> 
>> What is batch exactly?
> 
> Batch is the time taken to load the last batch, that is the last 50,000 triples.

Sorry, brain/finger issue. That should be: time average for the last batch.

Damian

Re: tdbloader's info on batch count

Posted by Damian Steer <d....@bristol.ac.uk>.
On 29 Feb 2012, at 05:43, Sarven Capadisli wrote:

> Hi, I was hoping you guys could clarify some of these questions for me:
> 
> When I import data into my TDB Triple Store using tdb.tdbloader, I get information like the following:
> 
> Add: 4,150,000 triples (Batch: 2,380 / Avg: 4,684)
> Add: 4,200,000 triples (Batch: 29,620 / Avg: 4,732)
> 
> What is batch exactly?

Batch is the time taken to load the last batch, that is the last 50,000 triples.

> Why does it differ from one step to another?

At a guess, other stuff happening on the same host? A batch might include a sync to disk too. I wouldn't have thought GC would be an issue.

> Is there a way to set the batch number?
> 
> Is there a way to configure TDB in order to perform faster importing?

Loading from empty using tdbloader2 is the usual advice. Paolo has been working on a cross platform version of this.

Damian