Posted to users@jena.apache.org by Daniel Hernandez <da...@degu.cl> on 2021/02/13 11:00:59 UTC

Understanding the output of Jena TDB Loader

Hi,

I am loading an n-triples file using tdbloader2.  I am curious about
what is the meaning of the numbers in the loader output.  The loading
output started as follows:

 09:54:15 INFO -- TDB Bulk Loader Start
 09:54:15 INFO Data Load Phase
 09:54:15 INFO Got 1 data files to load
 09:54:15 INFO Data file 1: /home/ubuntu/dataset.nq.gz
09:54:59 INFO  loader          :: Load: /home/ubuntu/dataset.nq.gz -- 2021/02/08 09:54:59 UTC
09:55:01 INFO  loader          :: Add: 50,000 Data (Batch: 19,912 / Avg: 19,912)
09:55:03 INFO  loader          :: Add: 100,000 Data (Batch: 23,288 / Avg: 21,468)
09:55:05 INFO  loader          :: Add: 150,000 Data (Batch: 26,123 / Avg: 22,824)
09:55:07 INFO  loader          :: Add: 200,000 Data (Batch: 24,987 / Avg: 23,329)
09:55:09 INFO  loader          :: Add: 250,000 Data (Batch: 25,641 / Avg: 23,757)
09:55:11 INFO  loader          :: Add: 300,000 Data (Batch: 25,100 / Avg: 23,971)
09:55:13 INFO  loader          :: Add: 350,000 Data (Batch: 24,213 / Avg: 24,005)
09:55:15 INFO  loader          :: Add: 400,000 Data (Batch: 24,461 / Avg: 24,061)
09:55:17 INFO  loader          :: Add: 450,000 Data (Batch: 25,667 / Avg: 24,230)
09:55:19 INFO  loader          :: Add: 500,000 Data (Batch: 25,879 / Avg: 24,385)
09:55:19 INFO  loader          ::   Elapsed: 20.50 seconds [2021/02/08 09:55:19 UTC]
09:55:21 INFO  loader          :: Add: 550,000 Data (Batch: 25,484 / Avg: 24,481)
09:55:23 INFO  loader          :: Add: 600,000 Data (Batch: 23,419 / Avg: 24,389)
09:55:25 INFO  loader          :: Add: 650,000 Data (Batch: 25,012 / Avg: 24,436)
09:55:27 INFO  loader          :: Add: 700,000 Data (Batch: 25,201 / Avg: 24,489)
09:55:29 INFO  loader          :: Add: 750,000 Data (Batch: 26,288 / Avg: 24,601)
09:55:31 INFO  loader          :: Add: 800,000 Data (Batch: 25,960 / Avg: 24,682)
09:55:33 INFO  loader          :: Add: 850,000 Data (Batch: 24,330 / Avg: 24,661)
09:55:35 INFO  loader          :: Add: 900,000 Data (Batch: 25,813 / Avg: 24,722)
09:55:37 INFO  loader          :: Add: 950,000 Data (Batch: 26,164 / Avg: 24,794)
09:55:39 INFO  loader          :: Add: 1,000,000 Data (Batch: 26,357 / Avg: 24,868)
09:55:39 INFO  loader          ::   Elapsed: 40.21 seconds [2021/02/08 09:55:39 UTC]

My first questions are:

1) I guess that 600,000 is the amount of data loaded by 09:55:23.  What
does "Data" mean?  Does it mean bytes or triples?

2) What are the numbers Batch: 23,419 and Avg: 24,389?  I guess they are
associated with the loading speed.

After some days of loading the output shows different numbers:

10:21:45 INFO  loader          ::   Elapsed: 433,606.84 seconds [2021/02/13 10:21:45 UTC]
10:21:48 INFO  loader          :: Add: 505,550,000 Data (Batch: 18,348 / Avg: 1,165)
10:21:51 INFO  loader          :: Add: 505,600,000 Data (Batch: 18,656 / Avg: 1,166)
10:22:55 INFO  loader          :: Add: 505,650,000 Data (Batch: 781 / Avg: 1,165)
10:36:12 INFO  loader          :: Add: 505,700,000 Data (Batch: 62 / Avg: 1,163)
10:36:14 INFO  loader          :: Add: 505,750,000 Data (Batch: 17,543 / Avg: 1,164)
10:36:17 INFO  loader          :: Add: 505,800,000 Data (Batch: 17,385 / Avg: 1,164)
10:36:20 INFO  loader          :: Add: 505,850,000 Data (Batch: 17,998 / Avg: 1,164)
10:36:23 INFO  loader          :: Add: 505,900,000 Data (Batch: 17,170 / Avg: 1,164)
10:37:12 INFO  loader          :: Add: 505,950,000 Data (Batch: 1,025 / Avg: 1,164)
10:37:14 INFO  loader          :: Add: 506,000,000 Data (Batch: 18,301 / Avg: 1,164)
10:37:14 INFO  loader          ::   Elapsed: 434,535.94 seconds [2021/02/13 10:37:14 UTC]

Now the numbers Batch and Avg are smaller.  Also, it is taking longer to
load each 500,000 data.  At the beginning it took 20 seconds to load
500,000 data; now it is taking 929 seconds.  Why has the load speed
degraded?  In my experience loading big datasets in Jena, I always see
loading get slower the more data has already been loaded.

Best regards,
Daniel

Re: Understanding the output of Jena TDB Loader

Posted by Daniel Hernandez <da...@degu.cl>.
Hi,

Andy Seaborne writes:

> On 23/02/2021 17:55, Daniel Hernandez wrote:
>> Hi,
>>
>>>> The disk where I was loading the data was a local rotating disk at
>>>> 7200 rpm. The machine also has an SSD, but it is too small for the
>>>> experiment.
>>>
>>> tdbloader2 may be the right choice for that setup - it was written
>>> with disks in mind. It uses Unix sort(1). What it needs is tuning of
>>> the parameters for the runs of "sort".
>> Thanks, this information is very useful.
>>
>>> Wolfgang Fahl has loaded large datasets (several billion triples)
>>>
>>> https://issues.apache.org/jira/browse/JENA-1909
>>>
>>> and his notes are at:
>>>
>>>   http://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData
>> I have also loaded Wikidata on a very small virtual machine with a
>> single core and a rotating non-local disk. I remember it took more
>> than a week. I did not save the log, because the machine was running
>> other jobs at the same time. The next time I load a big dataset I will
>> share the machine specification and loading log.
>>
>>>> I wonder if it is better to load the data using a fast disk, a lot of
>>>> RAM, or a lot of cores.
>>>
>>> A few years ago, I ran load tests on two machines: one 32G + SATA SSD,
>>> one 16G + 1TB M.2 SSD.  The 16G machine with the faster SSD was quicker
>>> overall.
>> That is interesting.  I am considering getting a machine with an NVMe
>> SSD for the next load.
>>
>>> Database directories can be copied across machines after they have
>>> been built.
>> tdbloader2 generates some files with the .tmp extension. The file
>> data-triples.tmp can be very big. The name suggests that it is a
>> temporary file. Can I delete that file after the loading ends?
>
> Yes.
>
> The files are the triple ids from the parse/load-nodes stage.
>
> Then comes the indexing, which is multiple passes over the tmp files,
> once per index to sort, using an external sort (in both senses:
> external program and external to disk), then building the indexes in a
> single pass per index.
>
> This reuses the external sort's capability of sorting data much
> larger than RAM.  sort(1) needs its parameters tuned for the machine.
>
> I found a previous load script (from when Wikidata was 2.2B triples, IIRC)
>
> Setting SORT_ARGS
>
> ------------------
> #!/bin/bash
>
> echo "== $(date)"
>
> export TOOL_DIR="$PWD"
> export JENA_HOME="$HOME/jlib/apache-jena-3.5.0"
> export JVM_ARGS=""
> export GZIP="--fast"
> #export SORT_ARGS="--parallel=2 --compress-program=/bin/gzip --temporary-directory=$PWD/tmp --buffer-size=75%"
>
> export SORT_ARGS="--temporary-directory=$PWD/tmp"
>
> ## -k : keep work files.
>
> # Logger:org.apache.jena.riot
>
> PHASE="--phase index"
> ARGS="--keep-work $PHASE --loc db2-all"
>
> tdbloader2 $ARGS "$@"
> echo "== $(date)"
> ------------------
> IIRC not all sort(1) implementations had "--parallel" back then.
>
>
> I also found a script replacement for the "sort" command in the scripts:
> ------------------
> #!/bin/bash
> # Special.
> ## mysort $KEYS "$DATA" "$WORK"
>
> KEYS="$1"
> DATA="$2"
> WORK="$3"
>
> SORT_ARGS="--compress-program=/bin/gzip --temporary-directory=$PWD/tmp --buffer-size=80%"
> gzip -d < "$DATA.gz" | sort $SORT_ARGS -u $KEYS > "$WORK"
> ------------------
>
> 	HTH
> 	Andy

Thanks Andy, you helped me a lot!

Best,
Daniel

Re: Understanding the output of Jena TDB Loader

Posted by Andy Seaborne <an...@apache.org>.

On 23/02/2021 17:55, Daniel Hernandez wrote:
> 
> Hi,
> 
>>> The disk where I was loading the data was a local rotating disk at
>>> 7200 rpm. The machine also has an SSD, but it is too small for the
>>> experiment.
>>
>> tdbloader2 may be the right choice for that setup - it was written
>> with disks in mind. It uses Unix sort(1). What it needs is tuning of
>> the parameters for the runs of "sort".
> 
> Thanks, this information is very useful.
> 
>> Wolfgang Fahl has loaded large datasets (several billion triples)
>>
>> https://issues.apache.org/jira/browse/JENA-1909
>>
>> and his notes are at:
>>
>>   http://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData
> 
> I have also loaded Wikidata on a very small virtual machine with a
> single core and a rotating non-local disk. I remember it took more
> than a week. I did not save the log, because the machine was running
> other jobs at the same time. The next time I load a big dataset I will
> share the machine specification and loading log.
> 
>>> I wonder if it is better to load the data using a fast disk, a lot of
>>> RAM, or a lot of cores.
>>
>> A few years ago, I ran load tests on two machines: one 32G + SATA SSD,
>> one 16G + 1TB M.2 SSD.  The 16G machine with the faster SSD was quicker
>> overall.
> 
> That is interesting.  I am considering getting a machine with an NVMe
> SSD for the next load.
> 
>> Database directories can be copied across machines after they have
>> been built.
> 
> tdbloader2 generates some files with the .tmp extension. The file
> data-triples.tmp can be very big. The name suggests that it is a
> temporary file. Can I delete that file after the loading ends?

Yes.

The files are the triple ids from the parse/load-nodes stage.

Then comes the indexing, which is multiple passes over the tmp files,
once per index to sort, using an external sort (in both senses:
external program and external to disk), then building the indexes in a
single pass per index.

This reuses the external sort's capability of sorting data much larger
than RAM.  sort(1) needs its parameters tuned for the machine.

I found a previous load script (from when Wikidata was 2.2B triples, IIRC)

Setting SORT_ARGS

------------------
#!/bin/bash

echo "== $(date)"

export TOOL_DIR="$PWD"
export JENA_HOME="$HOME/jlib/apache-jena-3.5.0"
export JVM_ARGS=""
export GZIP="--fast"
#export SORT_ARGS="--parallel=2 --compress-program=/bin/gzip --temporary-directory=$PWD/tmp --buffer-size=75%"

export SORT_ARGS="--temporary-directory=$PWD/tmp"

## -k : keep work files.

# Logger:org.apache.jena.riot

PHASE="--phase index"
ARGS="--keep-work $PHASE --loc db2-all"

tdbloader2 $ARGS "$@"
echo "== $(date)"
------------------
IIRC not all sort(1) implementations had "--parallel" back then.


I also found a script replacement for the "sort" command in the scripts:
------------------
#!/bin/bash
# Special.
## mysort $KEYS "$DATA" "$WORK"

KEYS="$1"
DATA="$2"
WORK="$3"

SORT_ARGS="--compress-program=/bin/gzip --temporary-directory=$PWD/tmp --buffer-size=80%"
gzip -d < "$DATA.gz" | sort $SORT_ARGS -u $KEYS > "$WORK"
------------------

	HTH
	Andy

> 
> Best,
> Daniel
> 

Re: Understanding the output of Jena TDB Loader

Posted by Daniel Hernandez <da...@degu.cl>.
Hi,

>> The disk where I was loading the data was a local rotating disk at
>> 7200 rpm. The machine also has an SSD, but it is too small for the
>> experiment.
>
> tdbloader2 may be the right choice for that setup - it was written
> with disks in mind. It uses Unix sort(1). What it needs is tuning of
> the parameters for the runs of "sort".

Thanks, this information is very useful.

> Wolfgang Fahl has loaded large datasets (several billion triples)
>
> https://issues.apache.org/jira/browse/JENA-1909
>
> and his notes are at:
>
>  http://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData

I have also loaded Wikidata on a very small virtual machine with a
single core and a rotating non-local disk. I remember it took more
than a week. I did not save the log, because the machine was running
other jobs at the same time. The next time I load a big dataset I will
share the machine specification and loading log.

>> I wonder if it is better to load the data using a fast disk, a lot of
>> RAM, or a lot of cores.
>
> A few years ago, I ran load tests on two machines: one 32G + SATA SSD,
> one 16G + 1TB M.2 SSD.  The 16G machine with the faster SSD was quicker
> overall.

That is interesting.  I am considering getting a machine with an NVMe
SSD for the next load.

> Database directories can be copied across machines after they have
> been built.

tdbloader2 generates some files with the .tmp extension. The file
data-triples.tmp can be very big. The name suggests that it is a
temporary file. Can I delete that file after the loading ends?

Best,
Daniel

Re: Understanding the output of Jena TDB Loader

Posted by Andy Seaborne <an...@apache.org>.

On 16/02/2021 09:20, Daniel Hernandez wrote:
> 
> Hi,
> 
>>>> tdbloader2 may not be the right choice. It is a bit niche, but if you
>>>> have much less RAM than total data it can be better than tdbloader, and
>>>> it is better if there is a rotating disk, not an SSD. It has been
>>>> reported to be the right choice for several billion triples on SSD.
>>> I have an SSD, a machine with 256 GB of RAM, and 32 cores. Do
>>> you recommend using tdbloader in this setting?
>>
>> The rate you were getting seems low even for tdbloader2 - is it all SSD
>> or could /tmp be on a disk? And is the SSD local or remote (e.g. EBS)?
>>
>> As a general point, because the hardware matters, it is a case of
>> trying a few setups and seeing.
> 
> Sorry, I was confused. The disk where I was loading the data was a
> local rotating disk at 7200 rpm. The machine also has an SSD, but it is
> too small for the experiment.

tdbloader2 may be the right choice for that setup - it was written with
disks in mind. It uses Unix sort(1). What it needs is tuning of the
parameters for the runs of "sort".

Wolfgang Fahl has loaded large datasets (several billion triples)

https://issues.apache.org/jira/browse/JENA-1909

and his notes are at:

  http://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData

>> Does it have to be TDB1? "tdb2.tdbloader --loader=parallel" is the
>> most aggressive loader. For TDB1, I'm not sure if "tdbloader2" or
>> "tdbloader" will be faster end-to-end.
> 
> I have run some queries using TDB1 before, so I want to compare the
> performance under similar conditions. Otherwise, I would have to run the
> queries again for TDB2. So I have to evaluate which option is better.
> 
>> I'd be interested in what you found out. It's been a while since I had
>> access to a large machine (which was on AWS ~240G RAM, local SSD). I
>> used tdb2.tdbloader (i.e. TDB2).
> 
> I am sorry that my machine was not so good

Mine neither :-)

And I don't have access to additional hardware at the moment.

> because it has a rotating
> disk. I have another machine with a 1 TB local SSD, but with only 64
> GB of RAM. I am going to test the loading speed on that machine (when
> it finishes the jobs it is doing). I wonder if it is better to load
> the data using a fast disk, a lot of RAM, or a lot of cores.

A few years ago, I ran load tests on two machines: one 32G + SATA SSD,
one 16G + 1TB M.2 SSD.  The 16G machine with the faster SSD was quicker
overall.

Database directories can be copied across machines after they have been 
built.
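
For example, a minimal sketch (hypothetical paths and host; make sure
no server or loader has the database open while copying):

------------------
# Copy a finished TDB database directory to another machine.
rsync -a --progress db2-all/ user@otherhost:/srv/jena/db2-all/
------------------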

     Andy

> 
> Best,
> Daniel
> 

Re: Understanding the output of Jena TDB Loader

Posted by Daniel Hernandez <da...@degu.cl>.
Hi,

>>> tdbloader2 may not be the right choice. It is a bit niche, but if you
>>> have much less RAM than total data it can be better than tdbloader, and
>>> it is better if there is a rotating disk, not an SSD. It has been
>>> reported to be the right choice for several billion triples on SSD.
>> I have an SSD, a machine with 256 GB of RAM, and 32 cores. Do
>> you recommend using tdbloader in this setting?
>
> The rate you were getting seems low even for tdbloader2 - is it all SSD
> or could /tmp be on a disk? And is the SSD local or remote (e.g. EBS)?
>
> As a general point, because the hardware matters, it is a case of
> trying a few setups and seeing.

Sorry, I was confused. The disk where I was loading the data was a
local rotating disk at 7200 rpm. The machine also has an SSD, but it is
too small for the experiment.

> Does it have to be TDB1? "tdb2.tdbloader --loader=parallel" is the
> most aggressive loader. For TDB1, I'm not sure if "tdbloader2" or
> "tdbloader" will be faster end-to-end.

I have run some queries using TDB1 before, so I want to compare the
performance under similar conditions. Otherwise, I would have to run the
queries again for TDB2. So I have to evaluate which option is better.

> I'd be interested in what you found out. It's been a while since I had
> access to a large machine (which was on AWS ~240G RAM, local SSD). I
> used tdb2.tdbloader (i.e. TDB2).

I am sorry that my machine was not so good because it has a rotating
disk. I have another machine with a 1 TB local SSD, but with only 64 GB
of RAM. I am going to test the loading speed on that machine (when it
finishes the jobs it is doing). I wonder if it is better to load the
data using a fast disk, a lot of RAM, or a lot of cores.

Best,
Daniel

Re: Understanding the output of Jena TDB Loader

Posted by Andy Seaborne <an...@apache.org>.

On 13/02/2021 17:17, Daniel Hernandez wrote:
> 
> Hi,
> 
> Andy Seaborne writes:
>> How much data are you loading?
> 
> I am loading a billion triples.
> 
>> Heap is only used for the node table cache, not index work, which is
>> out of heap in memory-mapped files mapped by the virtual memory of the
>> OS process, so caching is done by the OS filesystem cache machinery.
>> It can make the OS process look very large even if the heap is only
>> 1.2G.
> 
> So is it better not to modify the Xms parameter?

Xms does not matter. Personally, I'd set -Xmx to 4G, which is larger
than normal and plenty.

Don't set it too high - that can slow things down. If the heap grows, it
is taking space away from the OS and, as the data size grows, file I/O
on indexes is the dominant speed factor.  So caching and I/O hardware
matter. For example, on AWS, EBS SSD vs local SSD has different speed
characteristics.
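
As a minimal sketch (hypothetical paths; JVM_ARGS is the variable the
Jena command scripts pass to the JVM, as in the load script quoted
earlier in this thread):

------------------
# Cap the loader's heap at 4G; the rest of RAM stays available to the
# OS filesystem cache for the memory-mapped index files.
export JVM_ARGS="-Xmx4G"
tdbloader2 --loc /data/tdb dataset.nq.gz
------------------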

>> tdbloader2 may not be the right choice. It is a bit niche, but if you
>> have much less RAM than total data it can be better than tdbloader, and
>> it is better if there is a rotating disk, not an SSD. It has been
>> reported to be the right choice for several billion triples on SSD.
> 
> I have an SSD, a machine with 256 GB of RAM, and 32 cores. Do you
> recommend using tdbloader in this setting?

The rate you were getting seems low even for tdbloader2 - is it all SSD
or could /tmp be on a disk? And is the SSD local or remote (e.g. EBS)?

As a general point, because the hardware matters, it is a case of trying
a few setups and seeing.

Does it have to be TDB1? "tdb2.tdbloader --loader=parallel" is the most
aggressive loader. For TDB1, I'm not sure if "tdbloader2" or "tdbloader"
will be faster end-to-end.


I'd be interested in what you found out. It's been a while since I had 
access to a large machine (which was on AWS ~240G RAM, local SSD). I 
used tdb2.tdbloader (i.e. TDB2).

     Andy

> 
> Best regards,
> Daniel
> 

Re: Understanding the output of Jena TDB Loader

Posted by Daniel Hernandez <da...@degu.cl>.
Hi,

Andy Seaborne writes:
> How much data are you loading?

I am loading a billion triples.

> Heap is only used for the node table cache, not index work, which is
> out of heap in memory-mapped files mapped by the virtual memory of the
> OS process, so caching is done by the OS filesystem cache machinery. It
> can make the OS process look very large even if the heap is only 1.2G.

So is it better not to modify the Xms parameter?

> tdbloader2 may not be the right choice. It is a bit niche, but if you
> have much less RAM than total data it can be better than tdbloader, and
> it is better if there is a rotating disk, not an SSD. It has been
> reported to be the right choice for several billion triples on SSD.

I have an SSD, a machine with 256 GB of RAM, and 32 cores. Do you
recommend using tdbloader in this setting?

Best regards,
Daniel

Re: Understanding the output of Jena TDB Loader

Posted by Andy Seaborne <an...@apache.org>.

On 13/02/2021 13:21, Daniel Hernandez wrote:
> 
> Hi,
> 
> Thanks Lorenz for your answer.  Regarding a possible spill to disk, my
> machine has 256 GB of RAM, and the Java process is taking only 20 G.  I
> am not sure if changing the -Xmx Java argument would speed up the
> loading process.  I see that tdbloader2 started the process with the
> argument -Xmx1200M.  Maybe it is too conservative.

Hi there,

How much data are you loading?

Heap is only used for the node table cache, not index work, which is
out of heap in memory-mapped files mapped by the virtual memory of the
OS process, so caching is done by the OS filesystem cache machinery. It
can make the OS process look very large even if the heap is only 1.2G.
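
A rough way to see this on Linux (illustrative commands; the PID
lookup assumes the loader is the only matching process):

------------------
pid=$(pgrep -f tdbloader | head -1)       # find the loader JVM
grep -E 'VmSize|VmRSS' /proc/$pid/status  # virtual size vs resident set
pmap -x "$pid" | tail -3                  # mapped files dominate the total
------------------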


tdbloader2 may not be the right choice. It is a bit niche, but if you
have much less RAM than total data it can be better than tdbloader, and
it is better if there is a rotating disk, not an SSD. It has been
reported to be the right choice for several billion triples on SSD.

--

tdb2.tdbloader is the same speed as, or faster than, the TDB1 loaders.

If you give it --loader=parallel, it'll max out on the I/O subsystem.

And don't use the machine for other I/O at the same time.
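
A hypothetical invocation (example paths; --loader=parallel and --loc
are the relevant options):

------------------
tdb2.tdbloader --loader=parallel --loc /data/tdb2-db dataset.nq.gz
------------------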

---

Loading slows down because:

* Indexes get deeper (see the sketch below)
* I/O is jumping around and RAM caching is less effective.
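
A back-of-the-envelope sketch of the first point (the branching factor
of 200 entries per B+tree node is an assumption for illustration):

------------------
# Index depth grows with the log of the triple count; each extra level
# is another page access per insert.
awk 'BEGIN { b = 200;
  for (n = 1e6; n <= 1e9; n *= 1000)
    printf "%.0e triples -> depth ~%d\n", n, int(log(n)/log(b)) + 1 }'
------------------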

---
Beware of ulimits on the Linux process - you may have a large machine 
but can one process make use of it?
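
Quick checks before a big load (built-in bash limits; values vary per
system and user):

------------------
ulimit -n   # max open file descriptors - TDB opens many index files
ulimit -v   # max virtual memory - matters for memory-mapped files
------------------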

     Andy

> 
> Best,
> Daniel
> 
> Lorenz Buehmann writes:
> 
>> On 13.02.21 12:00, Daniel Hernandez wrote:
>>> Hi,
>>>
>>> I am loading an n-triples file using tdbloader2.  I am curious about
>>> what is the meaning of the numbers in the loader output.  The loading
>>> output started as follows:
>>>
>>>   09:54:15 INFO -- TDB Bulk Loader Start
>>>   09:54:15 INFO Data Load Phase
>>>   09:54:15 INFO Got 1 data files to load
>>>   09:54:15 INFO Data file 1: /home/ubuntu/dataset.nq.gz
>>> 09:54:59 INFO  loader          :: Load: /home/ubuntu/dataset.nq.gz -- 2021/02/08 09:54:59 UTC
>>> 09:55:01 INFO  loader          :: Add: 50,000 Data (Batch: 19,912 / Avg: 19,912)
>>> 09:55:03 INFO  loader          :: Add: 100,000 Data (Batch: 23,288 / Avg: 21,468)
>>> 09:55:05 INFO  loader          :: Add: 150,000 Data (Batch: 26,123 / Avg: 22,824)
>>> 09:55:07 INFO  loader          :: Add: 200,000 Data (Batch: 24,987 / Avg: 23,329)
>>> 09:55:09 INFO  loader          :: Add: 250,000 Data (Batch: 25,641 / Avg: 23,757)
>>> 09:55:11 INFO  loader          :: Add: 300,000 Data (Batch: 25,100 / Avg: 23,971)
>>> 09:55:13 INFO  loader          :: Add: 350,000 Data (Batch: 24,213 / Avg: 24,005)
>>> 09:55:15 INFO  loader          :: Add: 400,000 Data (Batch: 24,461 / Avg: 24,061)
>>> 09:55:17 INFO  loader          :: Add: 450,000 Data (Batch: 25,667 / Avg: 24,230)
>>> 09:55:19 INFO  loader          :: Add: 500,000 Data (Batch: 25,879 / Avg: 24,385)
>>> 09:55:19 INFO  loader          ::   Elapsed: 20.50 seconds [2021/02/08 09:55:19 UTC]
>>> 09:55:21 INFO  loader          :: Add: 550,000 Data (Batch: 25,484 / Avg: 24,481)
>>> 09:55:23 INFO  loader          :: Add: 600,000 Data (Batch: 23,419 / Avg: 24,389)
>>> 09:55:25 INFO  loader          :: Add: 650,000 Data (Batch: 25,012 / Avg: 24,436)
>>> 09:55:27 INFO  loader          :: Add: 700,000 Data (Batch: 25,201 / Avg: 24,489)
>>> 09:55:29 INFO  loader          :: Add: 750,000 Data (Batch: 26,288 / Avg: 24,601)
>>> 09:55:31 INFO  loader          :: Add: 800,000 Data (Batch: 25,960 / Avg: 24,682)
>>> 09:55:33 INFO  loader          :: Add: 850,000 Data (Batch: 24,330 / Avg: 24,661)
>>> 09:55:35 INFO  loader          :: Add: 900,000 Data (Batch: 25,813 / Avg: 24,722)
>>> 09:55:37 INFO  loader          :: Add: 950,000 Data (Batch: 26,164 / Avg: 24,794)
>>> 09:55:39 INFO  loader          :: Add: 1,000,000 Data (Batch: 26,357 / Avg: 24,868)
>>> 09:55:39 INFO  loader          ::   Elapsed: 40.21 seconds [2021/02/08 09:55:39 UTC]
>>>
>>> My first questions are:
>>>
>>> 1) I guess that 600,000 is the amount of data loaded by 09:55:23.  What
>>> does "Data" mean?  Does it mean bytes or triples?
>> #triples
>>>
>>> 2) What are the numbers Batch: 23,419 and Avg: 24,389?  I guess they are
>>> associated with the loading speed.
>>
>> Batch means the number of triples per second over the last 50K triples
>>
>> Avg means the number of triples per second over all triples parsed so far
>>
>>>
>>> After some days of loading the output shows different numbers:
>>>
>>> 10:21:45 INFO  loader          ::   Elapsed: 433,606.84 seconds [2021/02/13 10:21:45 UTC]
>>> 10:21:48 INFO  loader          :: Add: 505,550,000 Data (Batch: 18,348 / Avg: 1,165)
>>> 10:21:51 INFO  loader          :: Add: 505,600,000 Data (Batch: 18,656 / Avg: 1,166)
>>> 10:22:55 INFO  loader          :: Add: 505,650,000 Data (Batch: 781 / Avg: 1,165)
>>> 10:36:12 INFO  loader          :: Add: 505,700,000 Data (Batch: 62 / Avg: 1,163)
>>> 10:36:14 INFO  loader          :: Add: 505,750,000 Data (Batch: 17,543 / Avg: 1,164)
>>> 10:36:17 INFO  loader          :: Add: 505,800,000 Data (Batch: 17,385 / Avg: 1,164)
>>> 10:36:20 INFO  loader          :: Add: 505,850,000 Data (Batch: 17,998 / Avg: 1,164)
>>> 10:36:23 INFO  loader          :: Add: 505,900,000 Data (Batch: 17,170 / Avg: 1,164)
>>> 10:37:12 INFO  loader          :: Add: 505,950,000 Data (Batch: 1,025 / Avg: 1,164)
>>> 10:37:14 INFO  loader          :: Add: 506,000,000 Data (Batch: 18,301 / Avg: 1,164)
>>> 10:37:14 INFO  loader          ::   Elapsed: 434,535.94 seconds [2021/02/13 10:37:14 UTC]
>>>
>>> Now the numbers Batch and Avg are smaller.  Also, it is taking longer to
>>> load each 500,000 data.  At the beginning it took 20 seconds to load
>>> 500,000 data; now it is taking 929 seconds.  Why has the load speed
>>> degraded?  In my experience loading big datasets in Jena, I always see
>>> loading get slower the more data has already been loaded.
>> Andy Seaborne will explain better, but in general I/O and memory pressure
>> increase, spills to disk occur, etc. - those heavy drops to 781 and 62 can
>> also stem from other workloads on your system
>>>
>>> Best regards,
>>> Daniel

Re: Understanding the output of Jena TDB Loader

Posted by Daniel Hernandez <da...@degu.cl>.
Hi,

Thanks Lorenz for your answer.  Regarding a possible spill to disk, my
machine has 256 GB of RAM, and the Java process is taking only 20 G.  I
am not sure if changing the -Xmx Java argument would speed up the
loading process.  I see that tdbloader2 started the process with the
argument -Xmx1200M.  Maybe it is too conservative.

Best,
Daniel

Lorenz Buehmann writes:

> On 13.02.21 12:00, Daniel Hernandez wrote:
>> Hi,
>>
>> I am loading an n-triples file using tdbloader2.  I am curious about
>> what is the meaning of the numbers in the loader output.  The loading
>> output started as follows:
>>
>>  09:54:15 INFO -- TDB Bulk Loader Start
>>  09:54:15 INFO Data Load Phase
>>  09:54:15 INFO Got 1 data files to load
>>  09:54:15 INFO Data file 1: /home/ubuntu/dataset.nq.gz
>> 09:54:59 INFO  loader          :: Load: /home/ubuntu/dataset.nq.gz -- 2021/02/08 09:54:59 UTC
>> 09:55:01 INFO  loader          :: Add: 50,000 Data (Batch: 19,912 / Avg: 19,912)
>> 09:55:03 INFO  loader          :: Add: 100,000 Data (Batch: 23,288 / Avg: 21,468)
>> 09:55:05 INFO  loader          :: Add: 150,000 Data (Batch: 26,123 / Avg: 22,824)
>> 09:55:07 INFO  loader          :: Add: 200,000 Data (Batch: 24,987 / Avg: 23,329)
>> 09:55:09 INFO  loader          :: Add: 250,000 Data (Batch: 25,641 / Avg: 23,757)
>> 09:55:11 INFO  loader          :: Add: 300,000 Data (Batch: 25,100 / Avg: 23,971)
>> 09:55:13 INFO  loader          :: Add: 350,000 Data (Batch: 24,213 / Avg: 24,005)
>> 09:55:15 INFO  loader          :: Add: 400,000 Data (Batch: 24,461 / Avg: 24,061)
>> 09:55:17 INFO  loader          :: Add: 450,000 Data (Batch: 25,667 / Avg: 24,230)
>> 09:55:19 INFO  loader          :: Add: 500,000 Data (Batch: 25,879 / Avg: 24,385)
>> 09:55:19 INFO  loader          ::   Elapsed: 20.50 seconds [2021/02/08 09:55:19 UTC]
>> 09:55:21 INFO  loader          :: Add: 550,000 Data (Batch: 25,484 / Avg: 24,481)
>> 09:55:23 INFO  loader          :: Add: 600,000 Data (Batch: 23,419 / Avg: 24,389)
>> 09:55:25 INFO  loader          :: Add: 650,000 Data (Batch: 25,012 / Avg: 24,436)
>> 09:55:27 INFO  loader          :: Add: 700,000 Data (Batch: 25,201 / Avg: 24,489)
>> 09:55:29 INFO  loader          :: Add: 750,000 Data (Batch: 26,288 / Avg: 24,601)
>> 09:55:31 INFO  loader          :: Add: 800,000 Data (Batch: 25,960 / Avg: 24,682)
>> 09:55:33 INFO  loader          :: Add: 850,000 Data (Batch: 24,330 / Avg: 24,661)
>> 09:55:35 INFO  loader          :: Add: 900,000 Data (Batch: 25,813 / Avg: 24,722)
>> 09:55:37 INFO  loader          :: Add: 950,000 Data (Batch: 26,164 / Avg: 24,794)
>> 09:55:39 INFO  loader          :: Add: 1,000,000 Data (Batch: 26,357 / Avg: 24,868)
>> 09:55:39 INFO  loader          ::   Elapsed: 40.21 seconds [2021/02/08 09:55:39 UTC]
>>
>> My first questions are:
>>
>> 1) I guess that 600,000 is the amount of data loaded by 09:55:23.  What
>> does "Data" mean?  Does it mean bytes or triples?
> #triples
>>
>> 2) What are the numbers Batch: 23,419 and Avg: 24,389?  I guess they are
>> associated with the loading speed.
>
> Batch means the number of triples per second over the last 50K triples
>
> Avg means the number of triples per second over all triples parsed so far
>
>>
>> After some days of loading the output shows different numbers:
>>
>> 10:21:45 INFO  loader          ::   Elapsed: 433,606.84 seconds [2021/02/13 10:21:45 UTC]
>> 10:21:48 INFO  loader          :: Add: 505,550,000 Data (Batch: 18,348 / Avg: 1,165)
>> 10:21:51 INFO  loader          :: Add: 505,600,000 Data (Batch: 18,656 / Avg: 1,166)
>> 10:22:55 INFO  loader          :: Add: 505,650,000 Data (Batch: 781 / Avg: 1,165)
>> 10:36:12 INFO  loader          :: Add: 505,700,000 Data (Batch: 62 / Avg: 1,163)
>> 10:36:14 INFO  loader          :: Add: 505,750,000 Data (Batch: 17,543 / Avg: 1,164)
>> 10:36:17 INFO  loader          :: Add: 505,800,000 Data (Batch: 17,385 / Avg: 1,164)
>> 10:36:20 INFO  loader          :: Add: 505,850,000 Data (Batch: 17,998 / Avg: 1,164)
>> 10:36:23 INFO  loader          :: Add: 505,900,000 Data (Batch: 17,170 / Avg: 1,164)
>> 10:37:12 INFO  loader          :: Add: 505,950,000 Data (Batch: 1,025 / Avg: 1,164)
>> 10:37:14 INFO  loader          :: Add: 506,000,000 Data (Batch: 18,301 / Avg: 1,164)
>> 10:37:14 INFO  loader          ::   Elapsed: 434,535.94 seconds [2021/02/13 10:37:14 UTC]
>>
>> Now the numbers Batch and Avg are smaller.  Also, it is taking longer to
>> load each 500,000 data.  At the beginning it took 20 seconds to load
>> 500,000 data; now it is taking 929 seconds.  Why has the load speed
>> degraded?  In my experience loading big datasets in Jena, I always see
>> loading get slower the more data has already been loaded.
> Andy Seaborne will explain better, but in general I/O and memory pressure
> increase, spills to disk occur, etc. - those heavy drops to 781 and 62 can
> also stem from other workloads on your system
>>
>> Best regards,
>> Daniel

Re: Understanding the output of Jena TDB Loader

Posted by Lorenz Buehmann <bu...@informatik.uni-leipzig.de>.
On 13.02.21 12:00, Daniel Hernandez wrote:
> Hi,
>
> I am loading an n-triples file using tdbloader2.  I am curious about
> what is the meaning of the numbers in the loader output.  The loading
> output started as follows:
>
>  09:54:15 INFO -- TDB Bulk Loader Start
>  09:54:15 INFO Data Load Phase
>  09:54:15 INFO Got 1 data files to load
>  09:54:15 INFO Data file 1: /home/ubuntu/dataset.nq.gz
> 09:54:59 INFO  loader          :: Load: /home/ubuntu/dataset.nq.gz -- 2021/02/08 09:54:59 UTC
> 09:55:01 INFO  loader          :: Add: 50,000 Data (Batch: 19,912 / Avg: 19,912)
> 09:55:03 INFO  loader          :: Add: 100,000 Data (Batch: 23,288 / Avg: 21,468)
> 09:55:05 INFO  loader          :: Add: 150,000 Data (Batch: 26,123 / Avg: 22,824)
> 09:55:07 INFO  loader          :: Add: 200,000 Data (Batch: 24,987 / Avg: 23,329)
> 09:55:09 INFO  loader          :: Add: 250,000 Data (Batch: 25,641 / Avg: 23,757)
> 09:55:11 INFO  loader          :: Add: 300,000 Data (Batch: 25,100 / Avg: 23,971)
> 09:55:13 INFO  loader          :: Add: 350,000 Data (Batch: 24,213 / Avg: 24,005)
> 09:55:15 INFO  loader          :: Add: 400,000 Data (Batch: 24,461 / Avg: 24,061)
> 09:55:17 INFO  loader          :: Add: 450,000 Data (Batch: 25,667 / Avg: 24,230)
> 09:55:19 INFO  loader          :: Add: 500,000 Data (Batch: 25,879 / Avg: 24,385)
> 09:55:19 INFO  loader          ::   Elapsed: 20.50 seconds [2021/02/08 09:55:19 UTC]
> 09:55:21 INFO  loader          :: Add: 550,000 Data (Batch: 25,484 / Avg: 24,481)
> 09:55:23 INFO  loader          :: Add: 600,000 Data (Batch: 23,419 / Avg: 24,389)
> 09:55:25 INFO  loader          :: Add: 650,000 Data (Batch: 25,012 / Avg: 24,436)
> 09:55:27 INFO  loader          :: Add: 700,000 Data (Batch: 25,201 / Avg: 24,489)
> 09:55:29 INFO  loader          :: Add: 750,000 Data (Batch: 26,288 / Avg: 24,601)
> 09:55:31 INFO  loader          :: Add: 800,000 Data (Batch: 25,960 / Avg: 24,682)
> 09:55:33 INFO  loader          :: Add: 850,000 Data (Batch: 24,330 / Avg: 24,661)
> 09:55:35 INFO  loader          :: Add: 900,000 Data (Batch: 25,813 / Avg: 24,722)
> 09:55:37 INFO  loader          :: Add: 950,000 Data (Batch: 26,164 / Avg: 24,794)
> 09:55:39 INFO  loader          :: Add: 1,000,000 Data (Batch: 26,357 / Avg: 24,868)
> 09:55:39 INFO  loader          ::   Elapsed: 40.21 seconds [2021/02/08 09:55:39 UTC]
>
> My first questions are:
>
> 1) I guess that 600,000 is the amount of data loaded by 09:55:23.  What
> does "Data" mean?  Does it mean bytes or triples?
#triples
>
> 2) What are the numbers Batch: 23,419 and Avg: 24,389?  I guess they are
> associated with the loading speed.

Batch means the number of triples per second over the last 50K triples

Avg means the number of triples per second over all triples parsed so far
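
As a worked check against the log above: the 600,000 line arrived about
2 seconds after the 550,000 line, so Batch ~ 50,000 / 2.1 s ~ 23,419
triples/s; the load started at 09:54:59, so Avg ~ 600,000 triples /
~24.6 s ~ 24,389 triples/s. A rough recomputation from a log file
(assuming the exact format above and whole-second timestamps; the file
name loader.log is hypothetical):

------------------
# Recompute the per-batch rate from timestamps of consecutive "Add:" lines.
awk '/Add:/ {
  split($1, t, ":");                 # HH:MM:SS timestamp
  now = t[1]*3600 + t[2]*60 + t[3];
  if (prev && now > prev)
    printf "%s  ~%d triples/s\n", $6, 50000/(now - prev);
  prev = now;
}' loader.log
------------------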

>
> After some days of loading the output shows different numbers:
>
> 10:21:45 INFO  loader          ::   Elapsed: 433,606.84 seconds [2021/02/13 10:21:45 UTC]
> 10:21:48 INFO  loader          :: Add: 505,550,000 Data (Batch: 18,348 / Avg: 1,165)
> 10:21:51 INFO  loader          :: Add: 505,600,000 Data (Batch: 18,656 / Avg: 1,166)
> 10:22:55 INFO  loader          :: Add: 505,650,000 Data (Batch: 781 / Avg: 1,165)
> 10:36:12 INFO  loader          :: Add: 505,700,000 Data (Batch: 62 / Avg: 1,163)
> 10:36:14 INFO  loader          :: Add: 505,750,000 Data (Batch: 17,543 / Avg: 1,164)
> 10:36:17 INFO  loader          :: Add: 505,800,000 Data (Batch: 17,385 / Avg: 1,164)
> 10:36:20 INFO  loader          :: Add: 505,850,000 Data (Batch: 17,998 / Avg: 1,164)
> 10:36:23 INFO  loader          :: Add: 505,900,000 Data (Batch: 17,170 / Avg: 1,164)
> 10:37:12 INFO  loader          :: Add: 505,950,000 Data (Batch: 1,025 / Avg: 1,164)
> 10:37:14 INFO  loader          :: Add: 506,000,000 Data (Batch: 18,301 / Avg: 1,164)
> 10:37:14 INFO  loader          ::   Elapsed: 434,535.94 seconds [2021/02/13 10:37:14 UTC]
>
> Now the numbers Batch and Avg are smaller.  Also, it is taking longer to
> load each 500,000 data.  At the beginning it took 20 seconds to load
> 500,000 data; now it is taking 929 seconds.  Why has the load speed
> degraded?  In my experience loading big datasets in Jena, I always see
> loading get slower the more data has already been loaded.
Andy Seaborne will explain better, but in general I/O and memory pressure
increase, spills to disk occur, etc. - those heavy drops to 781 and 62 can
also stem from other workloads on your system
>
> Best regards,
> Daniel