Posted to dev@jena.apache.org by Øyvind Gjesdal <oy...@gmail.com> on 2021/12/11 21:45:08 UTC
Testing tdb2.xloader
I'm trying out tdb2.xloader on an openstack vm, loading the wikidata truthy
dump downloaded 2021-12-09.
The instance is a vm created on the Norwegian Research and Education Cloud,
an openstack cloud provider.
Instance type:
32 GB memory
4 CPU
The storage used for the dump + temp files is a separate 900 GB volume
mounted on /var/fuseki/databases. The type of storage is described as:
> *mass-storage-default*: Storage backed by spinning hard drives,
available to everybody and is the default type.
It is formatted with ext4. At the moment I don't have access to the faster
volume type mass-storage-ssd. CPU and memory are not dedicated, and can be
overcommitted.
OS for the instance is a clean Rocky Linux image, with no services except
jena/fuseki installed. The systemd service set up for fuseki is stopped.
The Jena and Fuseki version is 4.3.0.
openjdk 11.0.13 2021-10-19 LTS
OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode, sharing)
I'm running from a tmux session to avoid connectivity issues and to capture
the output. I think the output is stored in memory and not on disk.
On the first run I tried to put the tmpdir on the root partition, to separate
the temp dir and the data dir, but with only 19 GB free, the tmpdir soon
filled the disk. For the second (current) run all directories are under
/var/fuseki/databases.
$JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy --tmpdir
/var/fuseki/databases/tmp latest-truthy.nt.gz
The import is currently at the "ingest data" stage, where it has slowed
down considerably.
Current output is:
20:03:43 INFO Data :: Add: 502,000,000 Data (Batch: 3,356 / Avg: 7,593)
See full log so far at
https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab
Some notes:
* There is a gap in the output log between the end of 'parse' and the
start of 'index' for Terms. It is unclear to me what is happening in the
1 hour 13 minutes between these lines.
22:33:46 INFO Terms :: Elapsed: 50,720.20 seconds [2021/12/10 22:33:46 CET]
22:33:52 INFO Terms :: == Parse: 50726.071 seconds : 6,560,468,631 triples/quads 129,331 TPS
23:46:13 INFO Terms :: Add: 1,000,000 Index (Batch: 237,755 / Avg: 237,755)
* The import slows down dramatically at the "ingest data" stage: at the
current rate, if I calculated correctly, PKG.CmdxIngestData has about 10
days left before it finishes.
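For the record, that estimate can be reproduced with quick arithmetic from
the log lines above (a sketch: the total is taken from the Parse line, and
it assumes the ingest stage must process all parsed triples at the average
rate reported so far):

```java
/** Back-of-envelope check of the "10 days left" estimate.
 *  All numbers are taken from the quoted log lines. */
public class IngestEstimate {
    // Remaining items divided by the average rate, converted to days.
    static double daysLeft(long total, long done, double avgPerSec) {
        return (total - done) / avgPerSec / 86_400.0;
    }

    public static void main(String[] args) {
        long total = 6_560_468_631L;  // triples reported by the Parse line
        long done  = 502_000_000L;    // Data items ingested so far
        double avg = 7_593;           // Avg rate from the ingest log line
        System.out.printf("%.1f days left%n", daysLeft(total, done, avg));
    }
}
```

At the average rate this gives roughly 9 days; at the recent batch rate of
~3,356/s the same arithmetic gives about 21 days, so the 10-day figure is
the optimistic end.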
* When I saw sort running in the background for the first parts of the job,
I looked at the `sort` command. I noticed from some online sources that
setting the environment variable LC_ALL=C improves speed for `sort`. Could
this be set on the ProcessBuilder for the `sort` process? Could it
break/change something? I see the warning from the man page for `sort`.
*** WARNING *** The locale specified by the environment affects
sort order. Set LC_ALL=C to get the traditional sort order that
uses native byte values.
Links:
https://access.redhat.com/solutions/445233
https://unix.stackexchange.com/questions/579251/how-to-use-parallel-to-speed-up-sort-for-big-files-fitting-in-ram
https://stackoverflow.com/questions/7074430/how-do-we-sort-faster-using-unix-sort
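On the ProcessBuilder question: an environment variable can be set on the
child process only, without affecting the parent JVM. A hypothetical sketch
(not Jena's actual code) of spawning `sort` with LC_ALL=C:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch (not Jena's code): run an external `sort`
 *  with LC_ALL=C so it uses plain byte-value ordering instead of
 *  locale-dependent collation. */
public class SortLocale {
    static List<String> externalSort(List<String> lines) throws Exception {
        ProcessBuilder pb = new ProcessBuilder("sort");
        pb.environment().put("LC_ALL", "C"); // only affects the child process
        Process p = pb.start();
        try (var out = p.getOutputStream()) {
            out.write(String.join("\n", lines).concat("\n")
                            .getBytes(StandardCharsets.UTF_8));
        }
        List<String> sorted = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = r.readLine()) != null) sorted.add(line);
        }
        p.waitFor();
        return sorted;
    }

    public static void main(String[] args) throws Exception {
        // With LC_ALL=C, uppercase ASCII sorts before lowercase.
        System.out.println(externalSort(List.of("b", "B", "a", "A")));
    }
}
```

Whether forcing the C locale could break anything depends on whether later
steps expect the sorted order to match a particular collation; if they only
need a consistent total order, byte order should be safe, but that is a
question for the xloader code itself.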
Best regards,
Øyvind
Re: Testing tdb2.xloader
Posted by Marco Neumann <ma...@gmail.com>.
The more tests we have on different machines the better. :)
Personally I'd say if you have a choice go for a PCIe 4.0 NVMe SSD and
stay away from SATA < III SSDs. Also, for these tests an SSD RAID isn't necessary.
These components have become extremely affordable in recent years and
really should be part of a fast pipeline imo in particular for
tdb2.tdbloader in parallel mode.
But as Andy has emphasized, he designed the tdb2.xloader process to
be spinning-disk friendly, so SSDs are not a prerequisite for xloader.
--
---
Marco Neumann
KONA
Re: Testing tdb2.xloader
Posted by Andy Seaborne <an...@apache.org>.
On 14/12/2021 10:38, Øyvind Gjesdal wrote:
> Hi Marco,
>
> Very useful to compare with your log on the different runs. Still working
> with configuration to see if I can get the ingest data stage to be usable
> for hdd. It looks like I get close to the performance of your run on the
> earlier stages, while ingest data is still very much too slow. Having to
> use SSD may be necessary, for a real world large import to complete? I'lI
> request some ssd storage as well, and hope there's a quota for me :)
The access patterns should (tm!) be spinning-disk friendly. There is no
random IO updating B+trees directly.
All the B+trees are written "bottom-up" by writing blocks of
the right layout directly to disk, not via the B+tree runtime code, which
would be top-down via "add record" style access.
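A toy sketch of that bottom-up construction (illustrative only: Jena writes
fixed-layout blocks straight to disk, while this just builds the equivalent
tree shape in memory from already-sorted keys, with a tiny fanout):

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative bottom-up B+tree bulk load (not Jena's code): pack
 *  already-sorted keys into sequential leaf blocks, then build each
 *  parent level from the first key of every node below, until a
 *  single root remains. No per-record top-down insertion happens. */
public class BulkLoadSketch {
    static final int FANOUT = 4; // entries per block; tiny for illustration

    record Node(List<Integer> keys, List<Node> children) {}

    // Assumes a non-empty, sorted input list.
    static Node bulkLoad(List<Integer> sortedKeys) {
        // Level 0: slice the sorted input into sequential leaf blocks.
        List<Node> level = new ArrayList<>();
        for (int i = 0; i < sortedKeys.size(); i += FANOUT) {
            int end = Math.min(i + FANOUT, sortedKeys.size());
            level.add(new Node(new ArrayList<>(sortedKeys.subList(i, end)), List.of()));
        }
        // Build parents from separator keys until one root remains.
        while (level.size() > 1) {
            List<Node> parents = new ArrayList<>();
            for (int i = 0; i < level.size(); i += FANOUT) {
                int end = Math.min(i + FANOUT, level.size());
                List<Node> kids = new ArrayList<>(level.subList(i, end));
                List<Integer> seps = new ArrayList<>();
                for (Node k : kids) seps.add(firstKey(k));
                parents.add(new Node(seps, kids));
            }
            level = parents;
        }
        return level.get(0);
    }

    static int firstKey(Node n) {
        return n.children().isEmpty() ? n.keys().get(0)
                                      : firstKey(n.children().get(0));
    }

    public static void main(String[] args) {
        List<Integer> keys = new ArrayList<>();
        for (int i = 0; i < 20; i++) keys.add(i);
        System.out.println(bulkLoad(keys).keys()); // separator keys of the root
    }
}
```

Because the input is already sorted, every block is written once and
sequentially, which is what makes the approach friendly to spinning disks.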
Andy
Re: Testing tdb2.xloader
Posted by Øyvind Gjesdal <oy...@gmail.com>.
Hi Marco,
Very useful to compare with your log on the different runs. I'm still working
with the configuration to see if I can get the ingest data stage to be usable
on HDD. It looks like I get close to the performance of your run on the
earlier stages, while ingest data is still much too slow. Using SSDs may be
necessary for a real-world large import to complete? I'll request some SSD
storage as well, and hope there's a quota for me :)
Maybe I could also test different distros, to see if some of the default OS
settings affect the import.
Best regards,
Øyvind
Re: Testing tdb2.xloader
Posted by Marco Neumann <ma...@gmail.com>.
Øyvind, looks like the above was the wrong log from a prior sharding
experiment.
This is the correct log file for the truthy dataset.
http://www.lotico.com/temp/LOG-98085
log4j implications [Was: Testing tdb2.xloader]
Posted by Andy Seaborne <an...@apache.org>.
4.3.1 will contain the fixed log4j 2.15.0. No special mitigations necessary.
Jena uses log4j2 via the slf4j adapter from Apache Logging
(log4j-slf4j-impl). 2.15.0 should be compatible in Jena usage with
2.14.* for Jena 4.x.
From the download, replace log4j-(api|core|log4j-slf4j-impl) with the
2.15.0 versions in lib/
User applications already decide which logging to use - log4j2 is
an optional dependency of Jena, so it isn't pulled in by a Jena
POM/Gradle entry. The application developer adds their choice of slf4j
provider [*].
Fuseki however shades dependencies into a combined jar.
We could say "set the system property '-Dlog4j2.formatMsgNoLookups=true'",
but it isn't always easy to change scripts etc., for at least two reasons:
(1) The teams running the production deployment may not be the
person/people producing the deployment package.
(2) Some of our users are data people, not programmer-developers. This
exploit is live now - making the fix easy seems like a good thing.
A new release (even if 4.3.0 only lasted a day!) is a routine upgrade
procedure.
And it's easier to say "upgrade to 4.3.1" than to explain the steps needed
to secure 4.3.0, even if we achieve 4.4.0 at the earlier point of January
with the reworked UI using up-to-date JS dependencies.
Andy
[*] The patch server in RDF Delta doesn't use log4j at all - it uses
java.util.logging (JUL). Client side, Delta does use log4j.
On 12/12/2021 13:45, Marco Neumann wrote:
> Does 4.3.1 already contain the mitigation for the Log4j2 vulnerability?
Re: Testing tdb2.xloader
Posted by Marco Neumann <ma...@gmail.com>.
Does 4.3.1 already contain the mitigation for the Log4j2 vulnerability?
On Sun, Dec 12, 2021 at 1:24 PM Marco Neumann <ma...@gmail.com>
wrote:
> As Andy mentioned, I will give the 4.3.1 xloader a try with the new 4TB
> SSD drive and an old laptop.
>
> I also have a contact who has just set up a new datacenter in Ireland. I
> may be able to run a few tests on much bigger machines as well. Otherwise I
> am very happy with the iron in Finland, as long as they are dedicated
> machines.
>
> On Sun, Dec 12, 2021 at 12:44 PM Andy Seaborne <an...@apache.org> wrote:
>
>>
>>
>> On 11/12/2021 22:02, Marco Neumann wrote:
>> > Thank you Øyvind for sharing, great to see more tests in the wild.
>> >
>> > I did the test with a 1TB SSD / RAID1 / 64GB / ubuntu and the truthy
>> > dataset and quickly ran out of disk space. It finished the job but did
>> not
>> > write any of the indexes to disk due to lack of space. no error
>> messages.
>>
>> The 4.3.1 xloader should hopefully address the space issue.
>>
>> Andy
>>
>> >
>> > http://www.lotico.com/temp/LOG-95239
>> >
>> > I have now ordered a new 4TB SSD drive to rerun the test possibly with
>> the
>> > full wikidata dataset,
>> >
>> > I personally had the best experience with dedicated hardware so far
>> (can be
>> > in the data center), shared or dedicated virtual compute engines did not
>> > deliver as expected. And I have not seen great benefits from data center
>> > grade multicore cpus. But I think they will during runtime in multi user
>> > settings (eg fuseki).
>> >
>> > Best,
>> > Marco
>> >
>> > On Sat, Dec 11, 2021 at 9:45 PM Øyvind Gjesdal <oy...@gmail.com>
>> wrote:
>> >
>> >> I'm trying out tdb2.xloader on an openstack vm, loading the wikidata
>> truthy
>> >> dump downloaded 2021-12-09.
>> >>
>> >> The instance is a vm created on the Norwegian Research and Education
>> Cloud,
>> >> an openstack cloud provider.
>> >>
>> >> Instance type:
>> >> 32 GB memory
>> >> 4 CPU
>> >>
>> >> The storage used for dump + temp files is mounted as a separate 900GB
>> >> volume and is mounted on /var/fuseki/databases
>> >> .The type of storage is described as
>> >>> *mass-storage-default*: Storage backed by spinning hard drives,
>> >> available to everybody and is the default type.
>> >> with ext4 configured. At the moment I don't have access to the faster
>> >> volume type mass-storage-ssd. CPU and memory are not dedicated, and
>> can be
>> >> overcommitted.
>> >>
>> >> OS for the instance is a clean Rocky Linux image, with no services
>> except
>> >> jena/fuseki installed. The systemd service set up for fuseki is
>> stopped.
>> >> jena and fuseki version is 4.3.0.
>> >>
>> >> openjdk 11.0.13 2021-10-19 LTS
>> >> OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
>> >> OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode,
>> sharing)
>> >>
>> >> I'm running from a tmux session to avoid connectivity issues and to
>> capture
>> >> the output. I think the output is stored in memory and not on disk.
>> >> On First run I tried to have the tmpdir on the root partition, to
>> separate
>> >> temp dir and data dir, but with only 19 GB free, the tmpdir soon was
>> disk
>> >> full. For the second (current run) all directories are under
>> >> /var/fuseki/databases.
>> >>
>> >> $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy
>> --tmpdir
>> >> /var/fuseki/databases/tmp latest-truthy.nt.gz
>> >>
>> >> The import is so far at the "ingest data" stage where it has really
>> slowed
>> >> down.
>> >>
>> >> Current output is:
>> >>
>> >> 20:03:43 INFO Data :: Add: 502,000,000 Data (Batch: 3,356 /
>> >> Avg: 7,593)
>> >>
>> >> See full log so far at
>> >>
>> https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab
>> >>
>> >> Some notes:
>> >>
>> >> * There is a (time/info) lapse in the output log between the end of
>> >> 'parse' and the start of 'index' for Terms. It is unclear to me what
>> is
>> >> happening in the 1h13 minutes between the lines.
>> >>
>> >> 22:33:46 INFO Terms :: Elapsed: 50,720.20 seconds
>> [2021/12/10
>> >> 22:33:46 CET]
>> >> 22:33:52 INFO Terms :: == Parse: 50726.071 seconds :
>> >> 6,560,468,631 triples/quads 129,331 TPS
>> >> 23:46:13 INFO Terms :: Add: 1,000,000 Index (Batch: 237,755
>> /
>> >> Avg: 237,755)
>> >>
>> >> * The ingest data step really slows down on the "ingest data stage":
>> At the
>> >> current rate, if I calculated correctly, it looks like
>> PKG.CmdxIngestData
>> >> has 10 days left before it finishes.
>> >>
>> >> * When I saw sort running in the background for the first parts of the
>> job,
>> >> I looked at the `sort` command. I noticed from some online sources that
>> >> setting the environment variable LC_ALL=C improves speed for `sort`.
>> Could
>> >> this be set on the ProcessBuilder for the `sort` process? Could it
>> >> break/change something? I see the warning from the man page for `sort`.
>> >>
>> >> *** WARNING *** The locale specified by the environment affects
>> >> sort order. Set LC_ALL=C to get the traditional sort order
>> that
>> >> uses native byte values.
>> >>
>> >> Links:
>> >> https://access.redhat.com/solutions/445233
>> >>
>> >>
>> https://unix.stackexchange.com/questions/579251/how-to-use-parallel-to-speed-up-sort-for-big-files-fitting-in-ram
>> >>
>> >>
>> https://stackoverflow.com/questions/7074430/how-do-we-sort-faster-using-unix-sort
>> >>
>> >> Best regards,
>> >> Øyvind
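On the LC_ALL question quoted above: setting an environment variable on a child process is straightforward with ProcessBuilder's environment map. A standalone sketch, not Jena's actual loader code (whether byte-order sorting is safe for xloader's temporary files is a separate question for the developers):

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.io.IOException;
import java.util.List;

public class SortWithCLocale {
    public static void main(String[] args) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(List.of("sort"));
        // Byte-value ordering: bypasses locale-aware collation, which is
        // the speedup the sort(1) man page warning refers to.
        pb.environment().put("LC_ALL", "C");
        pb.redirectErrorStream(true);
        Process p = pb.start();
        // Feed a few unsorted lines to the child's stdin, then close it
        // so sort can finish.
        try (OutputStream stdin = p.getOutputStream()) {
            stdin.write("b\na\nc\n".getBytes("UTF-8"));
        }
        // Read the sorted result back from the child's stdout.
        try (InputStream stdout = p.getInputStream()) {
            System.out.print(new String(stdout.readAllBytes(), "UTF-8"));
        }
        p.waitFor();
    }
}
```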
>> >>
>> >
>> >
>>
>
>
> --
>
>
> ---
> Marco Neumann
> KONA
>
>
--
---
Marco Neumann
KONA
Re: Testing tdb2.xloader
Posted by Marco Neumann <ma...@gmail.com>.
As Andy mentioned, I will give the 4.3.1 xloader a try with the new 4TB SSD
drive and an old laptop.
I also have a contact who has just set up a new datacenter in Ireland. I
may be able to run a few tests on much bigger machines as well. Otherwise I
am very happy with the iron in Finland, as long as they are dedicated
machines.
On Sun, Dec 12, 2021 at 12:44 PM Andy Seaborne <an...@apache.org> wrote:
>
>
> On 11/12/2021 22:02, Marco Neumann wrote:
> > Thank you Øyvind for sharing, great to see more tests in the wild.
> >
> > I did the test with a 1TB SSD / RAID1 / 64GB / ubuntu and the truthy
> > dataset and quickly ran out of disk space. It finished the job but did
> not
> > write any of the indexes to disk due to lack of space. no error messages.
>
> The 4.3.1 xloader should hopefully address the space issue.
>
> Andy
>
> >
> > http://www.lotico.com/temp/LOG-95239
> >
> > I have now ordered a new 4TB SSD drive to rerun the test possibly with
> the
> > full wikidata dataset,
> >
> > I personally had the best experience with dedicated hardware so far (can
> be
> > in the data center), shared or dedicated virtual compute engines did not
> > deliver as expected. And I have not seen great benefits from data center
> > grade multicore cpus. But I think they will during runtime in multi user
> > settings (eg fuseki).
> >
> > Best,
> > Marco
> >
--
---
Marco Neumann
KONA
Re: Testing tdb2.xloader
Posted by Andy Seaborne <an...@apache.org>.
On 11/12/2021 22:02, Marco Neumann wrote:
> Thank you Øyvind for sharing, great to see more tests in the wild.
>
> I did the test with a 1TB SSD / RAID1 / 64GB / ubuntu and the truthy
> dataset and quickly ran out of disk space. It finished the job but did not
> write any of the indexes to disk due to lack of space. no error messages.
The 4.3.1 xloader should hopefully address the space issue.
Andy
>
> http://www.lotico.com/temp/LOG-95239
>
> I have now ordered a new 4TB SSD drive to rerun the test possibly with the
> full wikidata dataset,
>
> I personally had the best experience with dedicated hardware so far (can be
> in the data center), shared or dedicated virtual compute engines did not
> deliver as expected. And I have not seen great benefits from data center
> grade multicore cpus. But I think they will during runtime in multi user
> settings (eg fuseki).
>
> Best,
> Marco
>
Re: Testing tdb2.xloader
Posted by Marco Neumann <ma...@gmail.com>.
Thank you Øyvind for sharing, great to see more tests in the wild.
I did the test with a 1TB SSD / RAID1 / 64GB / Ubuntu and the truthy
dataset and quickly ran out of disk space. It finished the job but did not
write any of the indexes to disk due to lack of space, and there were no
error messages.
http://www.lotico.com/temp/LOG-95239
I have now ordered a new 4TB SSD drive to rerun the test possibly with the
full wikidata dataset,
I personally had the best experience with dedicated hardware so far (can be
in the data center), shared or dedicated virtual compute engines did not
deliver as expected. And I have not seen great benefits from data center
grade multicore cpus. But I think they will during runtime in multi user
settings (eg fuseki).
Best,
Marco
--
---
Marco Neumann
KONA
Re: Testing tdb2.xloader
Posted by Øyvind Gjesdal <oy...@gmail.com>.
I'm debugging now and think I've found some possible culprits of the
slow data ingest stage on my setup.
In the ingest data stage, I see 100% disk utilization on reads at only
2.5 MB/s, and the ps command also shows the processor spending much of its
time waiting for IO.
In the jena-base/src/main/java/org/apache/jena/atlas/io/IO.java class, both
GZIPInputStream and GZIPOutputStream are constructed with a single
argument, which leaves them with the default 512-byte internal buffer. (
https://github.com/openjdk/jdk/blob/6e1da6440b898d8a10b3db56707a75f32a489149/src/java.base/share/classes/java/util/zip/GZIPInputStream.java#L90
)
I've built a snapshot with "return new
GZIP[Output/Input]Stream(out, 8*1024);" (8 KiB is the buffer value used in
other places for GZIPOutput/Input in jena), which is currently running, but
I'm thinking of testing larger buffers to find my "lowest best
buffer". I'm now waiting for the first steps to complete, to see the
effect of the change.
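The change being tested can be sketched as below. The helper names are illustrative, not the actual IO.java code; the point is only the two-argument constructors, which replace GZIPInputStream's 512-byte default inflater buffer:

```java
import java.io.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipBuffers {
    // 8 KiB matches the buffer size used elsewhere in Jena for gzip streams.
    static final int BUF = 8 * 1024;

    // Larger internal buffer cuts the number of small reads issued
    // against the underlying (possibly spinning-disk) stream.
    static InputStream openGzipIn(InputStream in) throws IOException {
        return new GZIPInputStream(in, BUF);
    }

    static OutputStream openGzipOut(OutputStream out) throws IOException {
        return new GZIPOutputStream(out, BUF);
    }

    public static void main(String[] args) throws IOException {
        // Round-trip a small payload to show the streams behave normally.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (OutputStream gz = openGzipOut(bos)) {
            gz.write("hello xloader".getBytes("UTF-8"));
        }
        ByteArrayOutputStream plain = new ByteArrayOutputStream();
        try (InputStream gz = openGzipIn(new ByteArrayInputStream(bos.toByteArray()))) {
            gz.transferTo(plain);
        }
        System.out.println(plain.toString("UTF-8"));
    }
}
```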
What do I need to do to start from the exec_java $PKG.CmdxIngestData step
in tdb2.xloader?
Would
* stopping xloader before this step once
* taking a copy of $LOC
* then copying the copy to $LOC before each run (and commenting out
preceding steps in bin/tdb2.xloader)
be enough to have a working test setup to start at "data ingest", or do I
need to do something else?
Best regards,
Øyvind
tir. 28. des. 2021 kl. 14:21 skrev Andy Seaborne <an...@apache.org>:
> Excellent news!
>
> Updated: https://www.w3.org/wiki/LargeTripleStores
>
> Andy
>
> On 28/12/2021 10:11, Marco Neumann wrote:
> > Ok here is another successful tdb2 load. this time with the full wikidata
> > download (20211222_latest-all.nt.gz 172G ) file.
> >
> > counting 16,733,395,878 triples and a total of "103h 45m 15s" for the
> > entire load.
> >
> > I think with the right hardware this could easily be time compressed
> quite
> > a bit.
> >
> > http://www.lotico.com/temp/LOG-45497
> >
> > Best,
> > Marco
> >
> >
> > On Tue, Dec 21, 2021 at 8:59 AM Marco Neumann <ma...@gmail.com>
> > wrote:
> >
> >> Thank you Andy. found it in revisions somewhere
> >>
> >> just finished another run with truthy
> >>
> >> http://lotico.com/temp/LOG-1214
> >>
> >> will now increase RAM before running an additional load with increased
> >> thread count.
> >>
> >> Marco
> >>
> >> On Tue, Dec 21, 2021 at 8:48 AM Andy Seaborne <an...@apache.org> wrote:
> >>
> >>> gists are git repos: so the file is there ... somewhere:
> >>>
> >>>
> >>>
> https://gist.githubusercontent.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3/raw/9049cf8b559ce685b4293fca10d8b1c07cc79c43/tdb2_xloader_wikidata_truthy.log
> >>>
> >>> Andy
> >>>
> >>> On 19/12/2021 17:56, Marco Neumann wrote:
> >>>> Thank you Lorenz,
> >>>> unfortunately the tdb2_xloader_wikidata_truthy.log is now truncated in
> >>>> github
> >>>>
> >>>>
> >>>> On Sun, Dec 19, 2021 at 9:46 AM LB <conpcomplete@googlemail.com
> >>> .invalid>
> >>>> wrote:
> >>>>
> >>>>> I edited the Gist [1] and put the default stats there. Takes ~4min to
> >>>>> compute the stats.
> >>>>>
> >>>>> Findings:
> >>>>>
> >>>>> - for Wikidata we have to extend those stats with the stats for
> wdt:P31
> >>>>> property as Wikidata does use this property as their own rdf:type
> >>>>> relation. It is indeed trivial, just execute
> >>>>>
> >>>>> select ?c (count(*) as ?cnt) {?s
> >>>>> <http://www.wikidata.org/prop/direct/P31> ?c} group by ?c
> >>>>>
> >>>>> and convert it into the stats rule language (SSE) and put those rules
> >>>>> before the more generic rule
> >>>>>
> >>>>> |(<http://www.wikidata.org/prop/direct/P31> 98152611)|
> >>>>>
> >>>>> - I didn't want to touch the stats script itself, but we could for
> >>>>> example also make this type relation generic and allow for other like
> >>>>> wdt:P31 or skos:subject via a commandline option which provides any
> URI
> >>>>> as the type relation with default being rdf:type - but that's for
> sure
> >>>>> probably overkill
> >>>>>
> >>>>> - there is a bug in the stats script or file I guess, because of some
> >>>>> overflow? the count value is
> >>>>>
> >>>>> (count -1983667112))
> >>>>>
> >>>>> which indicates this. I opened a ticket:
> >>>>> https://issues.apache.org/jira/browse/JENA-2225
> >>>>>
> >>>>>
> >>>>> [1]
> >>>>>
> >>>
> https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
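The conversion described above (group-by counts turned into stats rules placed before the generic wdt:P31 rule) would give a stats file roughly along these lines. This is a sketch of the SSE stats rule format from memory, with placeholder counts; verify the exact syntax against a file generated by tdb2.tdbstats before relying on it:

```
(stats
  (meta
    (timestamp "2021-12-19")    ;; illustrative
    (count 6560468631))
  ;; per-class rules derived from the wdt:P31 group-by query would go
  ;; here, ahead of the more generic predicate rule below
  (<http://www.wikidata.org/prop/direct/P31> 98152611)
)
```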
> >>>>>
> >>>>> On 18.12.21 11:35, Marco Neumann wrote:
> >>>>>> good morning Lorenz,
> >>>>>>
> >>>>>> Maybe time to get a few query benchmark tests? :)
> >>>>>>
> >>>>>> What does tdb2.tdbstats report?
> >>>>>>
> >>>>>> Marco
> >>>>>>
> >>>>>>
> >>>>>> On Sat, Dec 18, 2021 at 8:09 AM LB <conpcomplete@googlemail.com
> >>> .invalid>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Good morning,
> >>>>>>>
> >>>>>>> loading of Wikidata truthy is done, this time I didn't forget to
> keep
> >>>>>>> logs:
> >>>>>>>
> >>>
> https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
> >>>>>>>
> >>>>>>> I'm a bit surprised that this time it was 8h faster than last time,
> >>> 31h
> >>>>>>> vs 39h. Not sure if a) there was something else on the server last
> >>> time
> >>>>>>> (at least I couldn't see any running tasks) or b) if this is a
> >>>>>>> consequence of the more parallelized Unix sort now - I set it to
> >>>>>>> --parallel=16
> >>>>>>>
> >>>>>>> I mean, the piped input stream is single threaded I guess, but
> maybe
> >>> the
> >>>>>>> sort merge step can benefit from more threads? I guess I have to
> >>> clean
> >>>>>>> up everything and run it again with the original setup with 2 Unix
> >>> sort
> >>>>>>> threads ...
> >>>>>>>
> >>>>>>>
> >>>>>>> On 16.12.21 14:48, Andy Seaborne wrote:
> >>>>>>>>
> >>>>>>>> On 16/12/2021 10:52, Andy Seaborne wrote:
> >>>>>>>> ...
> >>>>>>>>
> >>>>>>>>> I am getting a slow down during data ingestion. However, your
> >>> summary
> >>>>>>>>> figures don't show that in the ingest phase. The whole logs may
> >>> have
> >>>>>>>>> the signal in it but less pronounced.
> >>>>>>>>>
> >>>>>>>>> My working assumption is now that it is random access to the node
> >>>>>>>>> table. Your results point to it not being a CPU issue but that my
> >>>>>>>>> setup is saturating the I/O path. While the portable has a NVMe
> >>> SSD,
> >>>>>>>>> it has probably not got the same I/O bandwidth as a server class
> >>>>>>>>> machine.
> >>>>>>>>>
> >>>>>>>>> I'm not sure what to do about this other than run with a much
> >>> bigger
> >>>>>>>>> node table cache for the ingestion phase. Substituting some file
> >>>>>>>>> mapper file area for bigger cache should be a win. While I hadn't
> >>>>>>>>> noticed before, it is probably visible in logs of smaller loads
> on
> >>>>>>>>> closer inspection. Experimenting on a small dataset is a lot
> >>> easier.
> >>>>>>>> I'm more sure of this - not yet definite.
> >>>>>>>>
> >>>>>>>> The nodeToNodeId cache is 200k -- this is on the load/update path.
> >>>>>>>> Seems rather small for the task.
> >>>>>>>>
> >>>>>>>> The nodeIdToNode cache is 1e6 -- this is the one that is hit by
> >>> SPARQL
> >>>>>>>> results.
> >>>>>>>>
> >>>>>>>> 2 pieces of data will help:
> >>>>>>>>
> >>>>>>>> Experimenting with very small cache settings.
> >>>>>>>>
> >>>>>>>> Letting my slow load keep going to see if there is the same
> >>>>>>>> characteristics at the index stage. There shouldn't be if
> >>> nodeToNodeId
> >>>>>>>> is the cause; it's only an influence in the data ingestion step.
> >>>>>>>>
> >>>>>>>> Aside : Increasing nodeToNodeId could also help tdbloader=parallel
> >>> and
> >>>>>>>> maybe loader=phased. It falls into the same situation although the
> >>>>>>>> improvement there is going to be less marked. "Parallel" saturates
> >>> the
> >>>>>>>> I/O by other means as well.
> >>>>>>>>
> >>>>>>>> Andy
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >>
> >>
> >> --
> >>
> >>
> >> ---
> >> Marco Neumann
> >> KONA
> >>
> >>
> >
>
Re: Testing tdb2.xloader
Posted by Andy Seaborne <an...@apache.org>.
Excellent news!
Updated: https://www.w3.org/wiki/LargeTripleStores
Andy
Re: Testing tdb2.xloader
Posted by Marco Neumann <ma...@gmail.com>.
Ok, here is another successful TDB2 load, this time with the full wikidata
download (20211222_latest-all.nt.gz, 172G).
counting 16,733,395,878 triples and a total of "103h 45m 15s" for the
entire load.
I think with the right hardware this could easily be time compressed quite
a bit.
http://www.lotico.com/temp/LOG-45497
Best,
Marco
On Tue, Dec 21, 2021 at 8:59 AM Marco Neumann <ma...@gmail.com>
wrote:
> Thank you Andy. found it in revisions somewhere
>
> just finished another run with truthy
>
> http://lotico.com/temp/LOG-1214
>
> will now increase RAM before running an additional load with increased
> thread count.
>
> Marco
>
> On Tue, Dec 21, 2021 at 8:48 AM Andy Seaborne <an...@apache.org> wrote:
>
>> gists are git repos: so the file is there ... somewhere:
>>
>>
>> https://gist.githubusercontent.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3/raw/9049cf8b559ce685b4293fca10d8b1c07cc79c43/tdb2_xloader_wikidata_truthy.log
>>
>> Andy
>>
>> On 19/12/2021 17:56, Marco Neumann wrote:
>> > Thank you Lorenz,
>> > unfortunately the tdb2_xloader_wikidata_truthy.log is now truncated in
>> > github
>> >
>> >
>> > On Sun, Dec 19, 2021 at 9:46 AM LB <conpcomplete@googlemail.com
>> .invalid>
>> > wrote:
>> >
>> >> I edited the Gist [1] and put the default stats there. Takes ~4min to
>> >> compute the stats.
>> >>
>> >> Findings:
>> >>
>> >> - for Wikidata we have to extend those stats with the stats for wdt:P31
>> >> property as Wikidata does use this property as their own rdf:type
>> >> relation. It is indeed trivial, just execute
>> >>
>> >> select ?c (count(*) as ?cnt) {?s
>> >> <http://www.wikidata.org/prop/direct/P31> ?c} group by ?c
>> >>
>> >> and convert it into the stats rule language (SSE) and put those rules
>> >> before the more generic rule
>> >>
>> >> |(<http://www.wikidata.org/prop/direct/P31> 98152611)|
>> >>
>> >> - I didn't want to touch the stats script itself, but we could for
>> >> example also make this type relation generic and allow for other like
>> >> wdt:P31 or skos:subject via a commandline option which provides any URI
>> >> as the type relation with default being rdf:type - but that's for sure
>> >> probably overkill
>> >>
>> >> - there is a bug in the stats script or file I guess, because of of
>> some
>> >> overflow? the count value is
>> >>
>> >> (count -1983667112))
>> >>
>> >> which indicates this. I opened a ticket:
>> >> https://issues.apache.org/jira/browse/JENA-2225
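The negative count reported above is consistent with a 32-bit overflow: reinterpreting an unsigned count as a signed 32-bit integer, a count of 2,311,300,184 wraps to exactly the -1983667112 seen in the stats file. A small sketch (hypothetical helper, not Jena code):

```python
# Sketch of the suspected bug: a count larger than 2^31 - 1, stored or
# printed as a signed 32-bit integer, wraps around to a negative value.
def as_int32(n: int) -> int:
    """Reinterpret an arbitrary non-negative count as a signed 32-bit int."""
    n &= 0xFFFFFFFF
    return n - (1 << 32) if n >= (1 << 31) else n

print(as_int32(2_311_300_184))  # -1983667112, the value in the stats file
print(as_int32(98_152_611))     # 98152611, small counts are unaffected
```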
>> >>
>> >>
>> >> [1]
>> >>
>> https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
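The conversion step described above - turning the wdt:P31 group-count results into per-class stats rules placed before the generic rule - could be scripted along these lines. This is a sketch with a hypothetical helper; the ((predicate class) count) rule form is an assumption that should be checked against the Jena tdbstats documentation:

```python
# Sketch: format (class IRI, count) pairs from the wdt:P31 grouping
# query as per-class stats rules, followed by the generic wdt:P31 rule.
# The ((predicate class) count) form is an assumption to verify against
# the Jena stats rule documentation.
P31 = "http://www.wikidata.org/prop/direct/P31"

def p31_stats_rules(class_counts, p31_total):
    """class_counts: iterable of (class IRI, count); p31_total: overall count."""
    rules = [
        f"((<{P31}> <{cls}>) {cnt})"
        for cls, cnt in sorted(class_counts, key=lambda pair: -pair[1])
    ]
    rules.append(f"(<{P31}> {p31_total})")  # the generic rule goes last
    return "\n".join(rules)

# Hypothetical example values:
print(p31_stats_rules(
    [("http://www.wikidata.org/entity/Q5", 9000000)], 98152611))
```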
>> >>
>> >> On 18.12.21 11:35, Marco Neumann wrote:
>> >>> good morning Lorenz,
>> >>>
>> >>> Maybe time to get a few query benchmark tests? :)
>> >>>
>> >>> What does tdb2.tdbstats report?
>> >>>
>> >>> Marco
>> >>>
>> >>>
>> >>> On Sat, Dec 18, 2021 at 8:09 AM LB <conpcomplete@googlemail.com
>> .invalid>
>> >>> wrote:
>> >>>
>> >>>> Good morning,
>> >>>>
>> >>>> loading of Wikidata truthy is done, this time I didn't forget to keep
>> >>>> logs:
>> >>>>
>> https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
>> >>>>
>> >>>> I'm a bit surprised that this time it was 8h faster than last time,
>> 31h
>> >>>> vs 39h. Not sure if a) there was something else on the server last
>> time
>> >>>> (at least I couldn't see any running tasks) or b) if this is a
>> >>>> consequence of the more parallelized Unix sort now - I set it to
>> >>>> --parallel=16
>> >>>>
>> >>>> I mean, the piped input stream is single threaded I guess, but maybe
>> the
>> >>>> sort merge step can benefit from more threads? I guess I have to
>> clean
>> >>>> up everything and run it again with the original setup with 2 Unix
>> sort
>> >>>> threads ...
>> >>>>
>> >>>>
>> >>>> On 16.12.21 14:48, Andy Seaborne wrote:
>> >>>>>
>> >>>>> On 16/12/2021 10:52, Andy Seaborne wrote:
>> >>>>> ...
>> >>>>>
>> >>>>>> I am getting a slow down during data ingestion. However, your
>> summary
>> >>>>>> figures don't show that in the ingest phase. The whole logs may
>> have
>> >>>>>> the signal in it but less pronounced.
>> >>>>>>
>> >>>>>> My working assumption is now that it is random access to the node
>> >>>>>> table. Your results point to it not being a CPU issue but that my
>> >>>>>> setup is saturating the I/O path. While the portable has a NVMe
>> SSD,
>> >>>>>> it has probably not got the same I/O bandwidth as a server class
>> >>>>>> machine.
>> >>>>>>
>> >>>>>> I'm not sure what to do about this other than run with a much
>> bigger
>> >>>>>> node table cache for the ingestion phase. Substituting some file
>> >>>>>> mapper file area for bigger cache should be a win. While I hadn't
>> >>>>>> noticed before, it is probably visible in logs of smaller loads on
>> >>>>>> closer inspection. Experimenting on a small dataset is a lot
>> easier.
>> >>>>> I'm more sure of this - not yet definite.
>> >>>>>
>> >>>>> The nodeToNodeId cache is 200k -- this is on the load/update path.
>> >>>>> Seems rather small for the task.
>> >>>>>
>> >>>>> The nodeIdToNode cache is 1e6 -- this is the one that is hit by
>> SPARQL
>> >>>>> results.
>> >>>>>
>> >>>>> 2 pieces of data will help:
>> >>>>>
>> >>>>> Experimenting with very small cache settings.
>> >>>>>
>> >>>>> Letting my slow load keep going to see if there is the same
>> >>>>> characteristics at the index stage. There shouldn't be if
>> nodeToNodeId
>> >>>>> is the cause; it's only an influence in the data ingestion step.
>> >>>>>
>> >>>>> Aside : Increasing nodeToNodeId could also help tdbloader=parallel
>> and
>> >>>>> maybe loader=phased. It falls into the same situation although the
>> >>>>> improvement there is going to be less marked. "Parallel" saturates
>> the
>> >>>>> I/O by other means as well.
>> >>>>>
>> >>>>> Andy
>> >>>
>> >>
>> >
>> >
>>
>
>
> --
>
>
> ---
> Marco Neumann
> KONA
>
>
--
---
Marco Neumann
KONA
Re: Testing tdb2.xloader
Posted by Andy Seaborne <an...@apache.org>.
On 19/12/2021 09:46, LB wrote:
> I edited the Gist [1] and put the default stats there. Takes ~4min to
> compute the stats.
>
> Findings:
>
> - for Wikidata we have to extend those stats with the stats for wdt:P31
> property as Wikidata does use this property as their own rdf:type
> relation. It is indeed trivial, just execute
>
> select ?c (count(*) as ?cnt) {?s
> <http://www.wikidata.org/prop/direct/P31> ?c} group by ?c
Hmm, that's going to have an impact on the default fixed reorder
algorithm, which has "rdf:type" built-in.
Probably another thread here - there are places where the design of
ARQ/TDB could be improved for multi-billion triple datasets.
Several high-volume triple stores are read-only with 6 indexes for the
graph. That gives two things: no need to support transactions, and any
join can be a merge join or parallel merge join.
Having cursors into the B+Trees (iterators that can jump over sections
of the scan to support SIP (sideways information passing)) is another
area.
There's some mileage yet in single-machine stores, mostly because
hardware keeps getting better, at a cost, and persistent RAM will change
things ("in 5 years' time").
But native cluster engines already have some advantages.
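The merge-join point can be illustrated with a small sketch: when both sides of a join arrive sorted on the shared key (which six indexes make possible for any triple pattern ordering), the join is a single linear pass over the two streams. This is illustrative only, not how TDB/ARQ executes joins:

```python
# Minimal merge join over two key-sorted streams of (key, value) pairs.
# Both inputs must be sorted by key; duplicate keys are handled by
# pairing every left match with every right match for that key.
def merge_join(left, right):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # emit the cross product of the runs sharing this key
            i2 = i
            while i2 < len(left) and left[i2][0] == lk:
                j2 = j
                while j2 < len(right) and right[j2][0] == lk:
                    out.append((lk, left[i2][1], right[j2][1]))
                    j2 += 1
                i2 += 1
            i, j = i2, j2
    return out

print(merge_join([(1, 'a'), (2, 'b'), (2, 'c')], [(2, 'x'), (3, 'y')]))
```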
> and convert it into the stats rule language (SSE) and put those rules
> before the more generic rule
>
> |(<http://www.wikidata.org/prop/direct/P31> 98152611)|
>
> - I didn't want to touch the stats script itself, but we could for
> example also make this type relation generic and allow for others like
> wdt:P31 or skos:subject via a commandline option which provides any URI
> as the type relation with default being rdf:type - but that's for sure
> probably overkill
>
> - there is a bug in the stats script or file I guess, because of some
> overflow? the count value is
>
> (count -1983667112))
>
> which indicates this. I opened a ticket:
> https://issues.apache.org/jira/browse/JENA-2225
Thanks.
>
>
> [1] https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
Re: Testing tdb2.xloader
Posted by Andy Seaborne <an...@apache.org>.
https://gist.github.com/afs/c97ebc7351478bce2989b79c9195ef11
Dell XPS13 (2021 edition)
32G RAM
4 core
1T SSD disk
Jena 4.3.1
Data:
wikidata-20211208-truthy-BETA.nt.gz
14:47:09 INFO Load node table = 39976 seconds
14:47:09 INFO Load ingest data = 100007 seconds
14:47:09 INFO Build index SPO = 32927 seconds
14:47:09 INFO Build index POS = 62628 seconds
14:47:09 INFO Build index OSP = 43457 seconds
14:47:09 INFO Overall 278995 seconds
14:47:09 INFO Overall 77h 29m 55s
14:47:09 INFO Triples loaded = 6,610,055,778
14:47:09 INFO Quads loaded = 0
14:47:09 INFO Overall Rate 23,692 tuples per second
Of note:
Ingest time
POS time
Haven't had the chance to look in detail yet.
Andy
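As a quick cross-check, the phase times in the summary above add up to the reported overall figures:

```python
# Cross-checking the tdb2.xloader summary: the phase times sum to the
# reported overall time, and the throughput figure matches.
triples = 6_610_055_778
phase_seconds = {
    "node table": 39_976,
    "ingest data": 100_007,
    "index SPO": 32_927,
    "index POS": 62_628,
    "index OSP": 43_457,
}
total = sum(phase_seconds.values())
print(total)                   # 278995 seconds
print(total // 3600, "h")      # 77 h
print(round(triples / total))  # 23692 tuples per second
```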
Re: Testing tdb2.xloader
Posted by Andy Seaborne <an...@apache.org>.
Hi Lorenz,
On 18/12/2021 08:09, LB wrote:
> Good morning,
>
> loading of Wikidata truthy is done, this time I didn't forget to keep
> logs:
> https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
>
> I'm a bit surprised that this time it was 8h faster than last time, 31h
> vs 39h.
Great!
> Not sure if a) there was something else on the server last time
> (at least I couldn't see any running tasks) or b) if this is a
> consequence of the more parallelized Unix sort now - I set it to
> --parallel=16
>
> I mean, the piped input stream is single threaded I guess, but maybe the
> sort merge step can benefit from more threads?
yes - the sorting itself can be more parallel on a machine of the size
you have.
Time to add a configuration file, rather than a slew of command line
arguments. The file also then acts as a record of the setup.
I'm finding a new characteristic:
Loading on a smaller machine (32G RAM), I think the index sorting is
recombining temp files. That results in more I/O and higher peak disk
usage. While POS is always slower, it appears to be very much slower
than SPO.
The internet has not been very clear on the effect of "batch size" but
the GNU man page talks about "--batch-size=16". I get more than 16 temp
files - you probably don't at this scale.
--batch-size=128 seems better -- unlikely to be a problem with the
number of file descriptors nowadays. 16 is probably just how it always was.
On my machine: per process:
ulimit -Sn is 1024 -- ulimit current setting
ulimit -Hn is 1048576 -- ulimit max without being root.
I'll investigate when the load finishes. I'm trying not to touch the
machine to avoid breaking something. It is currently doing OSP.
> I guess I have to clean
> up everything and run it again with the original setup with 2 Unix sort
> threads ...
Andy
Re: Testing tdb2.xloader
Posted by LB <co...@googlemail.com.INVALID>.
Good morning,
loading of Wikidata truthy is done, this time I didn't forget to keep
logs:
https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
I'm a bit surprised that this time it was 8h faster than last time, 31h
vs 39h. Not sure if a) there was something else on the server last time
(at least I couldn't see any running tasks) or b) if this is a
consequence of the more parallelized Unix sort now - I set it to
--parallel=16
I mean, the piped input stream is single threaded I guess, but maybe the
sort merge step can benefit from more threads? I guess I have to clean
up everything and run it again with the original setup with 2 Unix sort
threads ...
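The shape of that pipeline can be sketched with stand-in data; the flags are GNU sort's, and the file names are illustrative rather than xloader's actual invocation:

```shell
# The decompressor is a single thread feeding the pipe, but GNU sort can
# use several threads internally (--parallel) and gzip its own temp files
# (--compress-program), which is where extra cores could help the merge.
printf 'c\nb\na\n' | gzip > /tmp/demo-triples.gz
gzip -dc /tmp/demo-triples.gz \
  | LC_ALL=C sort --parallel=16 --buffer-size=64M --compress-program=gzip \
  > /tmp/demo-triples-sorted.txt
cat /tmp/demo-triples-sorted.txt   # a, b, c on separate lines
```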
On 16.12.21 14:48, Andy Seaborne wrote:
>
>
> On 16/12/2021 10:52, Andy Seaborne wrote:
> ...
>
>> I am getting a slow down during data ingestion. However, your summary
>> figures don't show that in the ingest phase. The whole logs may have
>> the signal in it but less pronounced.
>>
>> My working assumption is now that it is random access to the node
>> table. Your results point to it not being a CPU issue but that my
>> setup is saturating the I/O path. While the portable has an NVMe SSD,
>> it has probably not got the same I/O bandwidth as a server class
>> machine.
>>
>> I'm not sure what to do about this other than run with a much bigger
>> node table cache for the ingestion phase. Substituting some
>> memory-mapped file area for a bigger cache should be a win. While I hadn't
>> noticed before, it is probably visible in logs of smaller loads on
>> closer inspection. Experimenting on a small dataset is a lot easier.
>
> I'm more sure of this - not yet definite.
>
> The nodeToNodeId cache is 200k -- this is on the load/update path.
> Seems rather small for the task.
>
> The nodeIdToNode cache is 1e6 -- this is the one that is hit by SPARQL
> results.
>
> 2 pieces of data will help:
>
> Experimenting with very small cache settings.
>
> Letting my slow load keep going to see if there are the same
> characteristics at the index stage. There shouldn't be if nodeToNodeId
> is the cause; it's only an influence in the data ingestion step.
>
> Aside : Increasing nodeToNodeId could also help tdbloader=parallel and
> maybe loader=phased. It falls into the same situation although the
> improvement there is going to be less marked. "Parallel" saturates the
> I/O by other means as well.
>
> Andy
Re: Testing tdb2.xloader
Posted by Andy Seaborne <an...@apache.org>.
On 16/12/2021 10:52, Andy Seaborne wrote:
...
> I am getting a slow down during data ingestion. However, your summary
> figures don't show that in the ingest phase. The whole logs may have the
> signal in it but less pronounced.
>
> My working assumption is now that it is random access to the node table.
> Your results point to it not being a CPU issue but that my setup is
> saturating the I/O path. While the portable has an NVMe SSD, it has
> probably not got the same I/O bandwidth as a server class machine.
>
> I'm not sure what to do about this other than run with a much bigger
> node table cache for the ingestion phase. Substituting some memory-mapped
> file area for a bigger cache should be a win. While I hadn't noticed
> before, it is probably visible in logs of smaller loads on closer
> inspection. Experimenting on a small dataset is a lot easier.
I'm more sure of this - not yet definite.
The nodeToNodeId cache is 200k -- this is on the load/update path. Seems
rather small for the task.
The nodeIdToNode cache is 1e6 -- this is the one that is hit by SPARQL
results.
2 pieces of data will help:
Experimenting with very small cache settings.
Letting my slow load keep going to see if there are the same
characteristics at the index stage. There shouldn't be if nodeToNodeId
is the cause; it's only an influence in the data ingestion step.
Aside : Increasing nodeToNodeId could also help tdbloader=parallel and
maybe loader=phased. It falls into the same situation although the
improvement there is going to be less marked. "Parallel" saturates the
I/O by other means as well.
Andy
Re: Testing tdb2.xloader
Posted by Andy Seaborne <an...@apache.org>.
On 16/12/2021 12:32, LB wrote:
> I couldn't get access to the full log as the output was too verbose for
> the screen and I forgot to pipe into a file ...
Yes - familiar ...
Maybe xloader should capture its own logging.
>
> I can confirm the triples.tmp.gz size was something around 35-40G if I
> remember correctly.
It leaves the file behind.
>
> I'm rerunning the load now to a) keep logs and b) see if increasing the number
> of threads for parallel sort does change anything (though I don't think
> so).
a useful data point.
>
> In ~40h I will provide an update and share the logs.
>
>
> Wikidata full is 143GB BZip2 compressed, I'm wondering how large TDB
> would be on disk in that case. We have ~4 x the size of the truthy dump,
> both compressed, I'd guess we get something between 1.8T and 2.3T -
> clearly, the number of nodes shouldn't increase linearly as many
> statements are still about nodes from truthy dump. On the other hand,
> Wikidata has this statement about statement stuff, i.e. we have lots of
> statement identifiers.
>
> Once I've time and enough resources on the server, I'll give it a try.
> Now I'm interested to see the final result
The indexes and the triples.tmp scale with number of triples.
I haven't parsed "all" to make sure it's parseable for a load. Given the
WD data model, most of the dubious stuff is in truthy but there is
always a chance of stray rubbish creeping in.
>
>
> On 16.12.21 11:52, Andy Seaborne wrote:
>> Awesome!
>> I'm really pleased to hear the news.
>>
>> That's better than I feared at this scale!
>>
>> How big is triples.tmp.gz? Twice that size plus the database size is the
>> peak storage space used. My estimate is about 40G, making 604G overall.
>>
>> I'd appreciate having the whole log file. Could you email it to me?
>>
>> Currently, I'm trying the 2021-12-08 truthy (from gz, not bz2) on a
>> modern portable with 4 cores and a single notional 1TB SSD. If the
>> estimate is right, it will fit. More good news.
>>
>>
>> I am getting a slow down during data ingestion. However, your summary
>> figures don't show that in the ingest phase. The whole logs may have
>> the signal in it but less pronounced.
>>
>> My working assumption is now that it is random access to the node
>> table. Your results point to it not being a CPU issue but that my
>> setup is saturating the I/O path. While the portable has an NVMe SSD,
>> it has probably not got the same I/O bandwidth as a server class machine.
>>
>> I'm not sure what to do about this other than run with a much bigger
>> node table cache for the ingestion phase. Substituting some
>> memory-mapped file area for a bigger cache should be a win. While I hadn't
>> noticed before, it is probably visible in logs of smaller loads on
>> closer inspection. Experimenting on a small dataset is a lot easier.
>>
>>
>> I'm also watching the CPU temperature. When the graphics aren't
>> active, the fans aren't even on. After a few minutes of active screen
>> the fans spin up but the temperatures are still well within the
>> limits. The machine is raised up by 1cm to give good airflow. And I
>> keep the door shut to keep the cats away.
>>
>> Andy
>>
>> Inline ...
>>
>> On 16/12/2021 08:49, LB wrote:
>>> Loading of latest WD truthy dump (6.6 billion triples) Bzip2 compressed:
>>>
>>> Server:
>>>
>>> AMD Ryzen 9 5950X (16C/32T)
>>> 128 GB DDR4 ECC RAM
>>> 2 x 3.84 TB NVMe SSD
>>
>> Nice.
>>
>>> Environment:
>>>
>>> - Ubuntu 20.04.3 LTS
>>> - OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
>>> - Jena 4.3.1
>>>
>>>
>>> Command:
>>>
>>>> tools/apache-jena-4.3.1/bin/tdb2.xloader --tmpdir /data/tmp/tdb
>>>> --loc datasets/wikidata-tdb datasets/latest-truthy.nt.bz2
>>
>> I found .gz to be slightly faster than .bz2. This may be because .gz is
>> better supported by the Java runtime or just the fact bz2 is designed
>> for best compression.
>>
>>>
>>>
>>> Log summary:
>>>
>>>> 04:14:28 INFO Load node table = 36600 seconds
>>>> 04:14:28 INFO Load ingest data = 25811 seconds
>>>> 04:14:28 INFO Build index SPO = 20688 seconds
>>>> 04:14:28 INFO Build index POS = 35466 seconds
>>>> 04:14:28 INFO Build index OSP = 25042 seconds
>>>> 04:14:28 INFO Overall 143607 seconds
>>>> 04:14:28 INFO Overall 39h 53m 27s
>>
>> Less than 2 days :-)
>>
>>>> 04:14:28 INFO Triples loaded = 6.610.055.778
>>>> 04:14:28 INFO Quads loaded = 0
>>>> 04:14:28 INFO Overall Rate 46.028 tuples per second
>>>
>>>
>>> Disk space usage according to
>>>
>>>> du -sh datasets/wikidata-tdb
>>>
>>> is
>>>
>>>> 524G datasets/wikidata-tdb
>>
>>
>>>
>>> During loading I could see ~90GB of RAM occupied (50% of total memory
>>> got to sort and it used 2 threads - is it intended to stick to 2
>>> threads with --parallel 2?)
>>
>> It is fixed at two for the sort currently.
>>
>> There may be some benefit in making this configurable but previously
>> I've found that more threads does not seem to yield much benefit
>> though you have a lot more threads! Experiment required.
>>
>>>
>>>
>>> Cheers,
>>> Lorenz
>>>
>>>
>>> On 12.12.21 13:07, Andy Seaborne wrote:
>>>> Hi, Øyvind,
>>>>
>>>> This is all very helpful feedback. Thank you.
>>>>
>>>> On 11/12/2021 21:45, Øyvind Gjesdal wrote:
>>>>> I'm trying out tdb2.xloader on an openstack vm, loading the
>>>>> wikidata truthy
>>>>> dump downloaded 2021-12-09.
>>>>
>>>> This is the 4.3.0 xloader?
>>>>
>>>> There are improvements in 4.3.1. Since that release was going out,
>>>> including using less temporary space, the development version got
>>>> merged in. It has had some testing.
>>>>
>>>> It compresses the triples.tmp and intermediate sort files in the
>>>> index stage making the peak usage much smaller.
>>>>
>>>>> The instance is a vm created on the Norwegian Research and
>>>>> Education Cloud,
>>>>> an openstack cloud provider.
>>>>>
>>>>> Instance type:
>>>>> 32 GB memory
>>>>> 4 CPU
>>>>
>>>> I'm using similar on a 7-year-old desktop machine, SATA disk.
>>>>
>>>> I haven't got a machine I can dedicate to the multi-day load. I'll
>>>> try to find a way to at least push it through building the node table.
>>>>
>>>> Loading the first 1B of truthy:
>>>>
>>>> 1B triples , 40kTPS , 06h 54m 10s
>>>>
>>>> The database is 81G and building needs an additional 11.6G for
>>>> workspace for a total of 92G (+ the data file).
>>>>
>>>> While smaller, it seems bz2 files are much slower to decompress so
>>>> I've been using gz files.
>>>>
>>>> My current best guess for 6.4B truthy is
>>>>
>>>> Temp 96G
>>>> Database 540G
>>>> Data 48G
>>>> Total: 684G -- peak disk needed
>>>>
>>>> based on scaling up 1B truthy. Personally, I would make sure there
>>>> was more space. Also - I don't know if the shape of the data is
>>>> sufficiently uniform to make scaling predictable. The time doesn't
>>>> scale so simply.
>>>>
>>>> This is the 4.3.1 version - the 4.3.0 uses a lot more disk space.
>>>>
>>>> Compression reduces the size of triples.tmp -- the related sort
>>>> temporary files which add up to the same again -- 1/6 of the size.
>>>>
>>>>> The storage used for dump + temp files is mounted as a separate 900GB
>>>>> volume and is mounted on /var/fuseki/databases
>>>>> .The type of storage is described as
>>>>>> *mass-storage-default*: Storage backed by spinning hard drives,
>>>>> available to everybody and is the default type.
>>>>> with ext4 configured. At the moment I don't have access to the faster
>>>>> volume type mass-storage-ssd. CPU and memory are not dedicated, and
>>>>> can be
>>>>> overcommitted.
>>>>
>>>> "overcommitted" may be a problem.
>>>>
>>>> While it's not "tdb2 loader parallel" it does use a continuous CPU
>>>> in several threads.
>>>>
>>>> For memory - "it's complicated".
>>>>
>>>> The java parts only need say 2G. The sort is set to "buffer 50%
>>>> --parallel=2" and the java pipes into sort, that's another thread. I
>>>> think the effective peak is 3 active threads and they'll all be at
>>>> 100% for some of the time.
>>>>
>>>> So it's going to need 50% of RAM + 2G for a Java process, +OS.
>>>>
>>>> It does not need space for memory mapped files (they aren't used at
>>>> all in the loading process and I/O is sequential).
>>>>
>>>> If that triggers overcommitment swap-out, the performance may go
>>>> down a lot.
>>>>
>>>> For disk - if that is physically remote, it should not be a problem
>>>> (famous last words). I/O is sequential and in large continuous
>>>> chunks - typical for batch processing jobs.
>>>>
>>>>> OS for the instance is a clean Rocky Linux image, with no services
>>>>> except
>>>>> jena/fuseki installed. The systemd service
>>>>
>>>> set up for fuseki is stopped.
>>>>> jena and fuseki version is 4.3.0.
>>>>>
>>>>> openjdk 11.0.13 2021-10-19 LTS
>>>>> OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
>>>>> OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode,
>>>>> sharing)
>>>>
>>>> Just FYI: Java 17 is a little faster. Some Java improvements have
>>>> improved RDF parsing speed by up to 10%. In xloader that's not
>>>> significant to the overall time.
>>>>
>>>>> I'm running from a tmux session to avoid connectivity issues and to
>>>>> capture
>>>>> the output.
>>>>
>>>> I use
>>>>
>>>> tdb2.xloader .... |& tee LOG-FILE-NAME
>>>>
>>>> to capture the logs and see them. ">&" and "tail -f" would achieve
>>>> much the same effect
>>>>
>>>>> I think the output is stored in memory and not on disk.
>>>>> On First run I tried to have the tmpdir on the root partition, to
>>>>> separate
>>>>> temp dir and data dir, but with only 19 GB free, the tmpdir soon
>>>>> was disk
>>>>> full. For the second (current run) all directories are under
>>>>> /var/fuseki/databases.
>>>>
>>>> Yes - after making that mistake myself, the new version ignores
>>>> system TMPDIR. Using --tmpdir is best but otherwise it defaults to
>>>> the data directory.
>>>>
>>>>>
>>>>> $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy
>>>>> --tmpdir
>>>>> /var/fuseki/databases/tmp latest-truthy.nt.gz
>>>>>
>>>>> The import is so far at the "ingest data" stage where it has really
>>>>> slowed
>>>>> down.
>>>>
>>>> FYI: The first line of ingest is always very slow. It is not
>>>> measuring the start point correctly.
>>>>
>>>>>
>>>>> Current output is:
>>>>>
>>>>> 20:03:43 INFO Data :: Add: 502,000,000 Data (Batch:
>>>>> 3,356 /
>>>>> Avg: 7,593)
>>>>>
>>>>> See full log so far at
>>>>> https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab
>>>>>
>>>>
>>>> The earlier first pass also slows down; that should be a fairly
>>>> constant-ish speed step once everything settles down.
>>>>
>>>>> Some notes:
>>>>>
>>>>> * There is a (time/info) lapse in the output log between the end of
>>>>> 'parse' and the start of 'index' for Terms. It is unclear to me
>>>>> what is
>>>>> happening in the 1h13 minutes between the lines.
>>>>
>>>> There is "sort" going on. "top" should show it.
>>>>
>>>> For each index there is also a very long pause for exactly the same
>>>> reason. It would be good to have something go "tick" and log a
>>>> message occasionally.
>>>>
>>>>>
>>>>> 22:33:46 INFO Terms :: Elapsed: 50,720.20 seconds
>>>>> [2021/12/10
>>>>> 22:33:46 CET]
>>>>> 22:33:52 INFO Terms :: == Parse: 50726.071 seconds :
>>>>> 6,560,468,631 triples/quads 129,331 TPS
>>>>> 23:46:13 INFO Terms :: Add: 1,000,000 Index (Batch:
>>>>> 237,755 /
>>>>> Avg: 237,755)
>>>>>
>>>>> * The ingest data step really slows down on the "ingest data
>>>>> stage": At the
>>>>> current rate, if I calculated correctly, it looks like
>>>>> PKG.CmdxIngestData
>>>>> has 10 days left before it finishes.
>>>>
>>>> Ouch.
>>>>
>>>>> * When I saw sort running in the background for the first parts of
>>>>> the job,
>>>>> I looked at the `sort` command. I noticed from some online sources
>>>>> that
>>>>> setting the environment variable LC_ALL=C improves speed for
>>>>> `sort`. Could
>>>>> this be set on the ProcessBuilder for the `sort` process? Could it
>>>>> break/change something? I see the warning from the man page for
>>>>> `sort`.
>>>>>
>>>>> *** WARNING *** The locale specified by the environment
>>>>> affects
>>>>> sort order. Set LC_ALL=C to get the traditional sort order
>>>>> that
>>>>> uses native byte values.
>>>>
>>>> It shouldn't matter but, yes, better to set it and export it in the
>>>> control script and propagate to forked processes.
>>>>
>>>> The sort is doing a binary sort except, because it is a text sort
>>>> program, the binary is turned into hex (!!). Hex is in the ASCII
>>>> subset and should be locale-safe.
>>>>
>>>> But better to set LC_ALL=C.
>>>>
>>>> Andy
>>>>
>>>>
>>>>>
>>>>> Links:
>>>>> https://access.redhat.com/solutions/445233
>>>>> https://unix.stackexchange.com/questions/579251/how-to-use-parallel-to-speed-up-sort-for-big-files-fitting-in-ram
>>>>>
>>>>> https://stackoverflow.com/questions/7074430/how-do-we-sort-faster-using-unix-sort
>>>>>
>>>>>
>>>>> Best regards,
>>>>> Øyvind
>>>>>
Re: Testing tdb2.xloader
Posted by LB <co...@googlemail.com.INVALID>.
I couldn't get access to the full log as the output was too verbose for
the screen and I forgot to pipe into a file ...
I can confirm the triples.tmp.gz size was something around 35-40G if I
remember correctly.
I'm rerunning the load now to a) keep logs and b) see if increasing the number
of threads for parallel sort does change anything (though I don't think so).
In ~40h I will provide an update and share the logs.
Wikidata full is 143GB BZip2 compressed, I'm wondering how large TDB
would be on disk in that case. We have ~4 x the size of the truthy dump,
both compressed, I'd guess we get something between 1.8T and 2.3T -
clearly, the number of nodes shouldn't increase linearly as many
statements are still about nodes from truthy dump. On the other hand,
Wikidata has this statement about statement stuff, i.e. we have lots of
statement identifiers.
Once I've time and enough resources on the server, I'll give it a try.
Now I'm interested to see the final result
On 16.12.21 11:52, Andy Seaborne wrote:
> Awesome!
> I'm really pleased to hear the news.
>
> That's better than I feared at this scale!
>
> How big is triples.tmp.gz? Twice that size plus the database size is the
> peak storage space used. My estimate is about 40G, making 604G overall.
>
> I'd appreciate having the whole log file. Could you email it to me?
>
> Currently, I'm trying the 2021-12-08 truthy (from gz, not bz2) on a
> modern portable with 4 cores and a single notional 1TB SSD. If the
> estimate is right, it will fit. More good news.
>
>
> I am getting a slow down during data ingestion. However, your summary
> figures don't show that in the ingest phase. The whole logs may have
> the signal in it but less pronounced.
>
> My working assumption is now that it is random access to the node
> table. Your results point to it not being a CPU issue but that my
> setup is saturating the I/O path. While the portable has an NVMe SSD,
> it has probably not got the same I/O bandwidth as a server class machine.
>
> I'm not sure what to do about this other than run with a much bigger
> node table cache for the ingestion phase. Substituting some
> memory-mapped file area for a bigger cache should be a win. While I hadn't
> noticed before, it is probably visible in logs of smaller loads on
> closer inspection. Experimenting on a small dataset is a lot easier.
>
>
> I'm also watching the CPU temperature. When the graphics aren't
> active, the fans aren't even on. After a few minutes of active screen
> the fans spin up but the temperatures are still well within the
> limits. The machine is raised up by 1cm to give good airflow. And I
> keep the door shut to keep the cats away.
>
> Andy
>
> Inline ...
>
> On 16/12/2021 08:49, LB wrote:
>> Loading of latest WD truthy dump (6.6 billion triples) Bzip2 compressed:
>>
>> Server:
>>
>> AMD Ryzen 9 5950X (16C/32T)
>> 128 GB DDR4 ECC RAM
>> 2 x 3.84 TB NVMe SSD
>
> Nice.
>
>> Environment:
>>
>> - Ubuntu 20.04.3 LTS
>> - OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
>> - Jena 4.3.1
>>
>>
>> Command:
>>
>>> tools/apache-jena-4.3.1/bin/tdb2.xloader --tmpdir /data/tmp/tdb
>>> --loc datasets/wikidata-tdb datasets/latest-truthy.nt.bz2
>
> I found .gz to be slightly faster than .bz2. This may be because .gz is
> better supported by the Java runtime or just the fact bz2 is designed
> for best compression.
>
>>
>>
>> Log summary:
>>
>>> 04:14:28 INFO Load node table = 36600 seconds
>>> 04:14:28 INFO Load ingest data = 25811 seconds
>>> 04:14:28 INFO Build index SPO = 20688 seconds
>>> 04:14:28 INFO Build index POS = 35466 seconds
>>> 04:14:28 INFO Build index OSP = 25042 seconds
>>> 04:14:28 INFO Overall 143607 seconds
>>> 04:14:28 INFO Overall 39h 53m 27s
>
> Less than 2 days :-)
>
>>> 04:14:28 INFO Triples loaded = 6.610.055.778
>>> 04:14:28 INFO Quads loaded = 0
>>> 04:14:28 INFO Overall Rate 46.028 tuples per second
>>
>>
>> Disk space usage according to
>>
>>> du -sh datasets/wikidata-tdb
>>
>> is
>>
>>> 524G datasets/wikidata-tdb
>
>
>>
>> During loading I could see ~90GB of RAM occupied (50% of total memory
>> got to sort and it used 2 threads - is it intended to stick to 2
>> threads with --parallel 2?)
>
> It is fixed at two for the sort currently.
>
> There may be some benefit in making this configurable but previously
> I've found that more threads does not seem to yield much benefit
> though you have a lot more threads! Experiment required.
>
>>
>>
>> Cheers,
>> Lorenz
>>
>>
>> On 12.12.21 13:07, Andy Seaborne wrote:
>>> Hi, Øyvind,
>>>
>>> This is all very helpful feedback. Thank you.
>>>
>>> On 11/12/2021 21:45, Øyvind Gjesdal wrote:
>>>> I'm trying out tdb2.xloader on an openstack vm, loading the
>>>> wikidata truthy
>>>> dump downloaded 2021-12-09.
>>>
>>> This is the 4.3.0 xloader?
>>>
>>> There are improvements in 4.3.1. Since that release was going out,
>>> including using less temporary space, the development version got
>>> merged in. It has had some testing.
>>>
>>> It compresses the triples.tmp and intermediate sort files in the
>>> index stage making the peak usage much smaller.
>>>
>>>> The instance is a vm created on the Norwegian Research and
>>>> Education Cloud,
>>>> an openstack cloud provider.
>>>>
>>>> Instance type:
>>>> 32 GB memory
>>>> 4 CPU
>>>
>>> I'm using similar on a 7-year-old desktop machine, SATA disk.
>>>
>>> I haven't got a machine I can dedicate to the multi-day load. I'll
>>> try to find a way to at least push it through building the node table.
>>>
>>> Loading the first 1B of truthy:
>>>
>>> 1B triples , 40kTPS , 06h 54m 10s
>>>
>>> The database is 81G and building needs an additional 11.6G for
>>> workspace for a total of 92G (+ the data file).
>>>
>>> While smaller, it seems bz2 files are much slower to decompress so
>>> I've been using gz files.
>>>
>>> My current best guess for 6.4B truthy is
>>>
>>> Temp 96G
>>> Database 540G
>>> Data 48G
>>> Total: 684G -- peak disk needed
>>>
>>> based on scaling up 1B truthy. Personally, I would make sure there
>>> was more space. Also - I don't know if the shape of the data is
>>> sufficiently uniform to make scaling predictable. The time doesn't
>>> scale so simply.
>>>
>>> This is the 4.3.1 version - the 4.3.0 uses a lot more disk space.
>>>
>>> Compression reduces the size of triples.tmp -- the related sort
>>> temporary files which add up to the same again -- 1/6 of the size.
>>>
>>>> The storage used for dump + temp files is mounted as a separate
>>>> 900GB
>>>> volume and is mounted on /var/fuseki/databases
>>>> .The type of storage is described as
>>>>> *mass-storage-default*: Storage backed by spinning hard drives,
>>>> available to everybody and is the default type.
>>>> with ext4 configured. At the moment I don't have access to the faster
>>>> volume type mass-storage-ssd. CPU and memory are not dedicated, and
>>>> can be
>>>> overcommitted.
>>>
>>> "overcommitted" may be a problem.
>>>
>>> While it's not "tdb2 loader parallel" it does use a continuous CPU
>>> in several threads.
>>>
>>> For memory - "it's complicated".
>>>
>>> The java parts only need say 2G. The sort is set to "buffer 50%
>>> --parallel=2" and the java pipes into sort, that's another thread. I
>>> think the effective peak is 3 active threads and they'll all be at
>>> 100% for some of the time.
>>>
>>> So it's going to need 50% of RAM + 2G for a Java process, +OS.
>>>
>>> It does not need space for memory mapped files (they aren't used at
>>> all in the loading process and I/O is sequential).
>>>
>>> If that triggers overcommitment swap-out, the performance may go
>>> down a lot.
>>>
>>> For disk - if that is physically remote, it should not be a problem
>>> (famous last words). I/O is sequential and in large continuous
>>> chunks - typical for batch processing jobs.
>>>
>>>> OS for the instance is a clean Rocky Linux image, with no services
>>>> except
>>>> jena/fuseki installed. The systemd service
>>>
>>> set up for fuseki is stopped.
>>>> jena and fuseki version is 4.3.0.
>>>>
>>>> openjdk 11.0.13 2021-10-19 LTS
>>>> OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
>>>> OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode,
>>>> sharing)
>>>
>>> Just FYI: Java 17 is a little faster. Some Java improvements have
>>> improved RDF parsing speed by up to 10%. In xloader that's not
>>> significant to the overall time.
>>>
>>>> I'm running from a tmux session to avoid connectivity issues and to
>>>> capture
>>>> the output.
>>>
>>> I use
>>>
>>> tdb2.xloader .... |& tee LOG-FILE-NAME
>>>
>>> to capture the logs and see them. ">&" and "tail -f" would achieve
>>> much the same effect
>>>
>>>> I think the output is stored in memory and not on disk.
>>>> On First run I tried to have the tmpdir on the root partition, to
>>>> separate
>>>> temp dir and data dir, but with only 19 GB free, the tmpdir soon
>>>> was disk
>>>> full. For the second (current run) all directories are under
>>>> /var/fuseki/databases.
>>>
>>> Yes - after making that mistake myself, the new version ignores
>>> system TMPDIR. Using --tmpdir is best but otherwise it defaults to
>>> the data directory.
>>>
>>>>
>>>> $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy
>>>> --tmpdir
>>>> /var/fuseki/databases/tmp latest-truthy.nt.gz
>>>>
>>>> The import is so far at the "ingest data" stage where it has really
>>>> slowed
>>>> down.
>>>
>>> FYI: The first line of ingest is always very slow. It is not
>>> measuring the start point correctly.
>>>
>>>>
>>>> Current output is:
>>>>
>>>> 20:03:43 INFO Data :: Add: 502,000,000 Data (Batch:
>>>> 3,356 /
>>>> Avg: 7,593)
>>>>
>>>> See full log so far at
>>>> https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab
>>>>
>>>
>>> The earlier first pass also slows down; that should be a fairly
>>> constant-ish speed step once everything settles down.
>>>
>>>> Some notes:
>>>>
>>>> * There is a (time/info) lapse in the output log between the end of
>>>> 'parse' and the start of 'index' for Terms. It is unclear to me
>>>> what is
>>>> happening in the 1h13 minutes between the lines.
>>>
>>> There is "sort" going on. "top" should show it.
>>>
>>> For each index there is also a very long pause for exactly the same
>>> reason. It would be good to have something go "tick" and log a
>>> message occasionally.
>>>
>>>>
>>>> 22:33:46 INFO Terms :: Elapsed: 50,720.20 seconds
>>>> [2021/12/10
>>>> 22:33:46 CET]
>>>> 22:33:52 INFO Terms :: == Parse: 50726.071 seconds :
>>>> 6,560,468,631 triples/quads 129,331 TPS
>>>> 23:46:13 INFO Terms :: Add: 1,000,000 Index (Batch:
>>>> 237,755 /
>>>> Avg: 237,755)
>>>>
>>>> * The ingest data step really slows down on the "ingest data
>>>> stage": At the
>>>> current rate, if I calculated correctly, it looks like
>>>> PKG.CmdxIngestData
>>>> has 10 days left before it finishes.
>>>
>>> Ouch.
>>>
>>>> * When I saw sort running in the background for the first parts of
>>>> the job,
>>>> I looked at the `sort` command. I noticed from some online sources
>>>> that
>>>> setting the environment variable LC_ALL=C improves speed for
>>>> `sort`. Could
>>>> this be set on the ProcessBuilder for the `sort` process? Could it
>>>> break/change something? I see the warning from the man page for
>>>> `sort`.
>>>>
>>>> *** WARNING *** The locale specified by the environment
>>>> affects
>>>> sort order. Set LC_ALL=C to get the traditional sort order
>>>> that
>>>> uses native byte values.
>>>
>>> It shouldn't matter but, yes, better to set it and export it in the
>>> control script and propagate to forked processes.
>>>
>>> The sort is doing a binary sort except, because it is a text sort
>>> program, the binary is turned into hex (!!). Hex is in the ASCII
>>> subset and should be locale-safe.
>>>
>>> But better to set LC_ALL=C.
>>>
>>> Andy
>>>
>>>
>>>>
>>>> Links:
>>>> https://access.redhat.com/solutions/445233
>>>> https://unix.stackexchange.com/questions/579251/how-to-use-parallel-to-speed-up-sort-for-big-files-fitting-in-ram
>>>>
>>>> https://stackoverflow.com/questions/7074430/how-do-we-sort-faster-using-unix-sort
>>>>
>>>>
>>>> Best regards,
>>>> Øyvind
>>>>
Re: Testing tdb2.xloader
Posted by Andy Seaborne <an...@apache.org>.
Awesome!
I'm really pleased to hear the news.
That's better than I feared at this scale!
How big is triples.tmp.gz? Twice that size plus the database size is the
peak storage space used. My estimate is about 40G, making 604G overall.
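That estimate is plain arithmetic over the figures in this thread (about 40G for triples.tmp.gz, 524G for the database):

```shell
# Peak disk = 2 x triples.tmp.gz (the file plus its sort temp files, which
# add up to about the same again) + the final database size. Units: GB.
TRIPLES_TMP_GZ=40
DATABASE=524
echo "$(( 2 * TRIPLES_TMP_GZ + DATABASE ))G peak"   # prints "604G peak"
```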
I'd appreciate having the whole log file. Could you email it to me?
Currently, I'm trying the 2021-12-08 truthy (from gz, not bz2) on a
modern portable with 4 cores and a single notional 1TB SSD. If the
estimate is right, it will fit. More good news.
I am getting a slow down during data ingestion. However, your summary
figures don't show that in the ingest phase. The whole logs may have the
signal in it but less pronounced.
My working assumption is now that it is random access to the node table.
Your results point to it not being a CPU issue but that my setup is
saturating the I/O path. While the portable has an NVMe SSD, it has
probably not got the same I/O bandwidth as a server class machine.
I'm not sure what to do about this other than run with a much bigger
node table cache for the ingestion phase. Substituting some memory-mapped
file area for a bigger cache should be a win. While I hadn't noticed
before, it is probably visible in logs of smaller loads on closer
inspection. Experimenting on a small dataset is a lot easier.
I'm also watching the CPU temperature. When the graphics aren't active,
the fans aren't even on. After a few minutes of active screen the fans
spin up but the temperatures are still well within the limits. The
machine is raised up by 1cm to give good airflow. And I keep the door
shut to keep the cats away.
Andy
Inline ...
On 16/12/2021 08:49, LB wrote:
> Loading of latest WD truthy dump (6.6 billion triples) Bzip2 compressed:
>
> Server:
>
> AMD Ryzen 9 5950X (16C/32T)
> 128 GB DDR4 ECC RAM
> 2 x 3.84 TB NVMe SSD
Nice.
> Environment:
>
> - Ubuntu 20.04.3 LTS
> - OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
> - Jena 4.3.1
>
>
> Command:
>
>> tools/apache-jena-4.3.1/bin/tdb2.xloader --tmpdir /data/tmp/tdb --loc
>> datasets/wikidata-tdb datasets/latest-truthy.nt.bz2
I found .gz to be slightly faster than .bz2. This may be because .gz is
better supported by the Java runtime or just the fact bz2 is designed
for best compression.
>
>
> Log summary:
>
>> 04:14:28 INFO Load node table = 36600 seconds
>> 04:14:28 INFO Load ingest data = 25811 seconds
>> 04:14:28 INFO Build index SPO = 20688 seconds
>> 04:14:28 INFO Build index POS = 35466 seconds
>> 04:14:28 INFO Build index OSP = 25042 seconds
>> 04:14:28 INFO Overall 143607 seconds
>> 04:14:28 INFO Overall 39h 53m 27s
Less than 2 days :-)
>> 04:14:28 INFO Triples loaded = 6.610.055.778
>> 04:14:28 INFO Quads loaded = 0
>> 04:14:28 INFO Overall Rate 46.028 tuples per second
>
>
> Disk space usage according to
>
>> du -sh datasets/wikidata-tdb
>
> is
>
>> 524G datasets/wikidata-tdb
>
> During loading I could see ~90GB of RAM occupied (50% of total memory
> went to sort, and it used 2 threads - is it intended to stick to 2 threads
> with --parallel 2?)
It is fixed at two for the sort currently.
There may be some benefit in making this configurable, but previously
I've found that more threads do not seem to yield much benefit - though
you have a lot more threads available! Experiment required.
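For anyone who wants to experiment, the external sort xloader drives
looks roughly like this (a sketch of a typical GNU sort invocation, not
the exact command xloader builds):

```shell
# External sort with a fixed thread count and a large memory buffer,
# spilling to a chosen temp directory - the shape of what xloader uses.
export LC_ALL=C                      # bytewise order, and faster
seq 100000 | shuf > unsorted.txt
sort --parallel=2 --buffer-size=50% -T /tmp unsorted.txt > sorted.txt
sort -c sorted.txt && echo "sorted OK"
```

`--parallel` is where a configurable thread count would plug in.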
>
>
> Cheers,
> Lorenz
>
>
> On 12.12.21 13:07, Andy Seaborne wrote:
>> Hi, Øyvind,
>>
>> This is all very helpful feedback. Thank you.
>>
>> On 11/12/2021 21:45, Øyvind Gjesdal wrote:
>>> I'm trying out tdb2.xloader on an openstack vm, loading the wikidata
>>> truthy
>>> dump downloaded 2021-12-09.
>>
>> This is the 4.3.0 xloader?
>>
>> There are improvements in 4.3.1. Since that release was going out,
>> including using less temporary space, the development version got
>> merged in. It has had some testing.
>>
>> It compresses the triples.tmp and intermediate sort files in the index
>> stage making the peak usage much smaller.
>>
>>> The instance is a vm created on the Norwegian Research and Education
>>> Cloud,
>>> an openstack cloud provider.
>>>
>>> Instance type:
>>> 32 GB memory
>>> 4 CPU
>>
>> I'm using something similar on a 7-year-old desktop machine, SATA disk.
>>
>> I haven't got a machine I can dedicate to the multi-day load. I'll try
>> to find a way to at least push it through building the node table.
>>
>> Loading the first 1B of truthy:
>>
>> 1B triples , 40kTPS , 06h 54m 10s
>>
>> The database is 81G and building needs an additional 11.6G of workspace
>> for a total of 92G (+ the data file).
>>
>> While smaller, it seems bz2 files are much slower to decompress, so
>> I've been using gz files.
>>
>> My current best guess for 6.4B truthy is
>>
>> Temp 96G
>> Database 540G
>> Data 48G
>> Total: 684G -- peak disk needed
>>
>> based on scaling up 1B truthy. Personally, I would make sure there was
>> more space. Also - I don't know if the shape of the data is
>> sufficiently uniform to make scaling predictable. The time doesn't
>> scale so simply.
>>
>> This is the 4.3.1 version - the 4.3.0 uses a lot more disk space.
>>
>> Compression reduces the size of triples.tmp -- the related sort
>> temporary files which add up to the same again -- 1/6 of the size.
>>
>>> The storage used for dump + temp files is mounted as a separate 900GB
>>> volume and is mounted on /var/fuseki/databases
>>> .The type of storage is described as
>>>> *mass-storage-default*: Storage backed by spinning hard drives,
>>> available to everybody and is the default type.
>>> with ext4 configured. At the moment I don't have access to the faster
>>> volume type mass-storage-ssd. CPU and memory are not dedicated, and
>>> can be
>>> overcommitted.
>>
>> "overcommitted" may be a problem.
>>
>> While it's not "tdb2 loader parallel" it does use a continuous CPU in
>> several threads.
>>
>> For memory - "it's complicated".
>>
>> The java parts only need say 2G. The sort is set to "buffer 50%
>> --parallel=2" and the java pipes into sort, that's another thread. I
>> think the effective peak is 3 active threads and they'll all be at
>> 100% for some of the time.
>>
>> So it's going to need 50% of RAM + 2G for a java process, + OS.
>>
>> It does not need space for memory-mapped files (they aren't used at
>> all in the loading process, and I/O is sequential).
>>
>> If that triggers over commitment swap out, the performance may go down
>> a lot.
>>
>> For disk - if that is physically remote, it should not be a problem
>> (famous last words). I/O is sequential and in large continuous chunks
>> - typical for batch processing jobs.
>>
>>> OS for the instance is a clean Rocky Linux image, with no services
>>> except
>>> jena/fuseki installed. The systemd service
>>
>> set up for fuseki is stopped.
>>> jena and fuseki version is 4.3.0.
>>>
>>> openjdk 11.0.13 2021-10-19 LTS
>>> OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
>>> OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode, sharing)
>>
>> Just FYI: Java 17 is a little faster. Some Java improvements have
>> improved RDF parsing speed by up to 10%; in xloader that is not
>> significant to the overall time.
>>
>>> I'm running from a tmux session to avoid connectivity issues and to
>>> capture
>>> the output.
>>
>> I use
>>
>> tdb2.xloader .... |& tee LOG-FILE-NAME
>>
>> to capture the logs and see them. ">&" plus "tail -f" would achieve
>> much the same effect.
>>
>>> I think the output is stored in memory and not on disk.
>>> On First run I tried to have the tmpdir on the root partition, to
>>> separate
>>> temp dir and data dir, but with only 19 GB free, the tmpdir soon was
>>> disk
>>> full. For the second (current run) all directories are under
>>> /var/fuseki/databases.
>>
>> Yes - after making that mistake myself, the new version ignores system
>> TMPDIR. Using --tmpdir is best but otherwise it defaults to the data
>> directory.
>>
>>>
>>> $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy
>>> --tmpdir
>>> /var/fuseki/databases/tmp latest-truthy.nt.gz
>>>
>>> The import is so far at the "ingest data" stage where it has really
>>> slowed
>>> down.
>>
>> FYI: The first line of ingest is always very slow. It is not measuring
>> the start point correctly.
>>
>>>
>>> Current output is:
>>>
>>> 20:03:43 INFO Data :: Add: 502,000,000 Data (Batch: 3,356 /
>>> Avg: 7,593)
>>>
>>> See full log so far at
>>> https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab
>>
>> The earlier first pass also slows down; it should become a fairly
>> constant-speed step once everything settles down.
>>
>>> Some notes:
>>>
>>> * There is a (time/info) lapse in the output log between the end of
>>> 'parse' and the start of 'index' for Terms. It is unclear to me what is
>>> happening in the 1h13 minutes between the lines.
>>
>> There is "sort" going on. "top" should show it.
>>
>> For each index there is also a very long pause for exactly the same
>> reason. It would be good to have something go "tick" and log a
>> message occasionally.
>>
>>>
>>> 22:33:46 INFO Terms :: Elapsed: 50,720.20 seconds
>>> [2021/12/10
>>> 22:33:46 CET]
>>> 22:33:52 INFO Terms :: == Parse: 50726.071 seconds :
>>> 6,560,468,631 triples/quads 129,331 TPS
>>> 23:46:13 INFO Terms :: Add: 1,000,000 Index (Batch: 237,755 /
>>> Avg: 237,755)
>>>
>>> * The ingest data step really slows down on the "ingest data stage":
>>> At the
>>> current rate, if I calculated correctly, it looks like
>>> PKG.CmdxIngestData
>>> has 10 days left before it finishes.
>>
>> Ouch.
>>
>>> * When I saw sort running in the background for the first parts of
>>> the job,
>>> I looked at the `sort` command. I noticed from some online sources that
>>> setting the environment variable LC_ALL=C improves speed for `sort`.
>>> Could
>>> this be set on the ProcessBuilder for the `sort` process? Could it
>>> break/change something? I see the warning from the man page for `sort`.
>>>
>>> *** WARNING *** The locale specified by the environment affects
>>> sort order. Set LC_ALL=C to get the traditional sort order that
>>> uses native byte values.
>>
>> It shouldn't matter but, yes, better to set it and export it in the
>> control script and propagate to forked processes.
>>
>> The sort is doing a binary sort, except that because it is a text sort
>> program, the binary is turned into hex (!!). Hex is in the ASCII subset
>> and should be locale-safe.
>>
>> But better to set LC_ALL=C.
>>
>> Andy
>>
>>
>>>
>>> Links:
>>> https://access.redhat.com/solutions/445233
>>> https://unix.stackexchange.com/questions/579251/how-to-use-parallel-to-speed-up-sort-for-big-files-fitting-in-ram
>>>
>>> https://stackoverflow.com/questions/7074430/how-do-we-sort-faster-using-unix-sort
>>>
>>>
>>> Best regards,
>>> Øyvind
>>>
Re: Testing tdb2.xloader
Posted by Andy Seaborne <an...@apache.org>.
On 16/12/2021 10:08, Marco Neumann wrote:
> thank you Lorenz, I am running this test myself now again with a larger
> disk. You may want to consider running a full load of wikidata as well. The
> timing info and disk space you have should be sufficient.
Full Wikidata (WD).
I've tried to gather a summary as of 2021-12.
The video [2] is most up-to-date at the time of writing.
WD is about 16B triples currently.
It's growing at about 1B per 3 months [1].
The query service is 6 active machines (11 in total)
58 queries per server per second average (no figures for peak)
12 wikidata updates/330 triples per second (no figures for peak)
It seems this mixed workload is causing the most pain.
Wikidata will want at least a 5-year strategy, and planning for 10 years
isn't unreasonable. The service is 6 years old.
That's 36B and 56B as a baseline *minimum*.
That's without new streams of data.
They can't design for a minimum. "plan for success"
--> towards 100B triples.
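The baseline figures above follow from simple growth arithmetic, using
the 16B current size and the 1B-per-3-months rate quoted from [1]:

```shell
# Project Wikidata's size from the figures above:
# ~16B triples today, growing ~1B per 3 months (= 4B/year).
now=16; per_year=4
echo "5 years:  $(( now + 5  * per_year ))B"   # 36B baseline minimum
echo "10 years: $(( now + 10 * per_year ))B"   # 56B baseline minimum
```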
Andy
[1]
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/ScalingStrategy
[2] Scaling the Wikidata Query Service
https://www.youtube.com/watch?v=oV4qelj9fxM
Re: Testing tdb2.xloader
Posted by Marco Neumann <ma...@gmail.com>.
thank you Lorenz, I am running this test myself now again with a larger
disk. You may want to consider running a full load of wikidata as well. The
timing info and disk space you have should be sufficient.
Did we figure out a place to post the parser messages?
Marco
--
---
Marco Neumann
KONA
Re: Testing tdb2.xloader
Posted by LB <co...@googlemail.com.INVALID>.
Sure
> wikidata-tdb/Data-0001:
> total 524G
> -rw-r--r-- 1 24 Dez 15 05:41 GOSP.bpt
> -rw-r--r-- 1 8,0M Dez 14 12:21 GOSP.dat
> -rw-r--r-- 1 8,0M Dez 14 12:21 GOSP.idn
> -rw-r--r-- 1 24 Dez 15 05:41 GPOS.bpt
> -rw-r--r-- 1 8,0M Dez 14 12:21 GPOS.dat
> -rw-r--r-- 1 8,0M Dez 14 12:21 GPOS.idn
> -rw-r--r-- 1 24 Dez 15 05:41 GPU.bpt
> -rw-r--r-- 1 8,0M Dez 14 12:21 GPU.dat
> -rw-r--r-- 1 8,0M Dez 14 12:21 GPU.idn
> -rw-r--r-- 1 24 Dez 15 05:41 GSPO.bpt
> -rw-r--r-- 1 8,0M Dez 14 12:21 GSPO.dat
> -rw-r--r-- 1 8,0M Dez 14 12:21 GSPO.idn
> -rw-r--r-- 1 0 Dez 15 05:41 journal.jrnl
> -rw-r--r-- 1 24 Dez 15 05:41 nodes.bpt
> -rw-r--r-- 1 36G Dez 15 05:41 nodes.dat
> -rw-r--r-- 1 16 Dez 15 05:41 nodes-data.bdf
> -rw-r--r-- 1 44G Dez 15 05:41 nodes-data.obj
> -rw-r--r-- 1 312M Dez 15 05:41 nodes.idn
> -rw-r--r-- 1 24 Dez 15 05:41 OSP.bpt
> -rw-r--r-- 1 148G Dez 16 04:14 OSP.dat
> -rw-r--r-- 1 24 Dez 15 05:41 OSPG.bpt
> -rw-r--r-- 1 8,0M Dez 14 12:21 OSPG.dat
> -rw-r--r-- 1 8,0M Dez 14 12:21 OSPG.idn
> -rw-r--r-- 1 528M Dez 16 04:14 OSP.idn
> -rw-r--r-- 1 24 Dez 15 05:41 POS.bpt
> -rw-r--r-- 1 148G Dez 15 21:17 POS.dat
> -rw-r--r-- 1 24 Dez 15 05:41 POSG.bpt
> -rw-r--r-- 1 8,0M Dez 14 12:21 POSG.dat
> -rw-r--r-- 1 8,0M Dez 14 12:21 POSG.idn
> -rw-r--r-- 1 528M Dez 15 21:17 POS.idn
> -rw-r--r-- 1 24 Dez 15 05:41 prefixes.bpt
> -rw-r--r-- 1 8,0M Dez 14 12:21 prefixes.dat
> -rw-r--r-- 1 16 Dez 15 05:41 prefixes-data.bdf
> -rw-r--r-- 1 0 Dez 14 12:21 prefixes-data.obj
> -rw-r--r-- 1 8,0M Dez 14 12:21 prefixes.idn
> -rw-r--r-- 1 24 Dez 15 05:41 SPO.bpt
> -rw-r--r-- 1 148G Dez 15 11:25 SPO.dat
> -rw-r--r-- 1 24 Dez 15 05:41 SPOG.bpt
> -rw-r--r-- 1 8,0M Dez 14 12:21 SPOG.dat
> -rw-r--r-- 1 8,0M Dez 14 12:21 SPOG.idn
> -rw-r--r-- 1 528M Dez 15 11:25 SPO.idn
> -rw-r--r-- 1 8 Dez 15 21:17 tdb.lock
Re: Testing tdb2.xloader
Posted by Marco Neumann <ma...@gmail.com>.
Thank you Lorenz, can you please post a directory list for Data-0001 with
file sizes.
On Thu, Dec 16, 2021 at 8:49 AM LB <co...@googlemail.com.invalid>
wrote:
> Loading of latest WD truthy dump (6.6 billion triples) Bzip2 compressed:
>
> Server:
>
> AMD Ryzen 9 5950X (16C/32T)
> 128 GB DDR4 ECC RAM
> 2 x 3.84 TB NVMe SSD
>
>
> Environment:
>
> - Ubuntu 20.04.3 LTS
> - OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
> - Jena 4.3.1
>
>
> Command:
>
> > tools/apache-jena-4.3.1/bin/tdb2.xloader --tmpdir /data/tmp/tdb --loc
> > datasets/wikidata-tdb datasets/latest-truthy.nt.bz2
>
>
> Log summary:
>
> > 04:14:28 INFO Load node table = 36600 seconds
> > 04:14:28 INFO Load ingest data = 25811 seconds
> > 04:14:28 INFO Build index SPO = 20688 seconds
> > 04:14:28 INFO Build index POS = 35466 seconds
> > 04:14:28 INFO Build index OSP = 25042 seconds
> > 04:14:28 INFO Overall 143607 seconds
> > 04:14:28 INFO Overall 39h 53m 27s
> > 04:14:28 INFO Triples loaded = 6.610.055.778
> > 04:14:28 INFO Quads loaded = 0
> > 04:14:28 INFO Overall Rate 46.028 tuples per second
>
>
> Disk space usage according to
>
> > du -sh datasets/wikidata-tdb
>
> is
>
> > 524G datasets/wikidata-tdb
>
> During loading I could see ~90GB of RAM occupied (50% of total memory
> went to sort, and it used 2 threads - is it intended to stick to 2
> threads with --parallel 2?)
>
>
> Cheers,
> Lorenz
--
---
Marco Neumann
KONA
Re: Testing tdb2.xloader
Posted by Andy Seaborne <an...@apache.org>.
I've updated:
https://www.w3.org/wiki/LargeTripleStores#Apache_Jena_TDB_.286.6B.29
for Lorenz's first run.
Andy
On 16/12/2021 08:49, LB wrote:
> 39h 53m 27s
> 04:14:28 INFO Triples loaded = 6.610.055.778
> 04:14:28 INFO Quads loaded = 0
> 04:14:28 INFO Overall Rate 46.028 tuples per second
Re: Testing tdb2.xloader
Posted by LB <co...@googlemail.com.INVALID>.
Loading of latest WD truthy dump (6.6 billion triples) Bzip2 compressed:
Server:
AMD Ryzen 9 5950X (16C/32T)
128 GB DDR4 ECC RAM
2 x 3.84 TB NVMe SSD
Environment:
- Ubuntu 20.04.3 LTS
- OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
- Jena 4.3.1
Command:
> tools/apache-jena-4.3.1/bin/tdb2.xloader --tmpdir /data/tmp/tdb --loc
> datasets/wikidata-tdb datasets/latest-truthy.nt.bz2
Log summary:
> 04:14:28 INFO Load node table = 36600 seconds
> 04:14:28 INFO Load ingest data = 25811 seconds
> 04:14:28 INFO Build index SPO = 20688 seconds
> 04:14:28 INFO Build index POS = 35466 seconds
> 04:14:28 INFO Build index OSP = 25042 seconds
> 04:14:28 INFO Overall 143607 seconds
> 04:14:28 INFO Overall 39h 53m 27s
> 04:14:28 INFO Triples loaded = 6.610.055.778
> 04:14:28 INFO Quads loaded = 0
> 04:14:28 INFO Overall Rate 46.028 tuples per second
Disk space usage according to
> du -sh datasets/wikidata-tdb
is
> 524G datasets/wikidata-tdb
During loading I could see ~90GB of RAM occupied (50% of total memory
went to sort, and it used 2 threads - is it intended to stick to 2
threads with --parallel 2?)
Cheers,
Lorenz
Re: Testing tdb2.xloader
Posted by Andy Seaborne <an...@apache.org>.
Hi, Øyvind,
This is all very helpful feedback. Thank you.
On 11/12/2021 21:45, Øyvind Gjesdal wrote:
> I'm trying out tdb2.xloader on an openstack vm, loading the wikidata truthy
> dump downloaded 2021-12-09.
This is the 4.3.0 xloader?
There are improvements in 4.3.1, including using less temporary space:
since that release was going out, the development version got merged in.
It has had some testing.
It compresses the triples.tmp and intermediate sort files in the index
stage making the peak usage much smaller.
> The instance is a vm created on the Norwegian Research and Education Cloud,
> an openstack cloud provider.
>
> Instance type:
> 32 GB memory
> 4 CPU
I'm using a similar spec on a 7-year-old desktop machine, SATA disk.
I haven't got a machine I can dedicate to the multi-day load. I'll try
to find a way to at least push it through building the node table.
Loading the first 1B of truthy:
1B triples , 40kTPS , 06h 54m 10s
The database is 81G and building needs an additional 11.6G of workspace
for a total of 92G (+ the data file).
While smaller, it seems bz2 files are much slower to decompress, so I've
been using gz files.
My current best guess for 6.4B truthy is
Temp 96G
Database 540G
Data 48G
Total: 684G -- peak disk needed
based on scaling up 1B truthy. Personally, I would make sure there was
more space. Also - I don't know if the shape of the data is sufficiently
uniform to make scaling predictable. The time doesn't scale so simply.
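As a sanity check, the naive linear scale-up from the 1B-triple measurements can be sketched as follows (a rough back-of-envelope only, assuming space grows linearly with triple count; the figures above round up, and the temp estimate also covers the sort files):

```java
public class ScaleEstimate {
    public static void main(String[] args) {
        // Measured for the first 1B triples of truthy (from above).
        double dbGbPer1B  = 81.0;   // database size
        double tmpGbPer1B = 11.6;   // peak extra workspace
        double billions   = 6.4;    // full truthy dump, in billions of triples

        // Naive linear scaling; real usage depends on the shape of the data.
        System.out.printf("Database ~%.0fG, workspace ~%.0fG%n",
                dbGbPer1B * billions, tmpGbPer1B * billions);
    }
}
```

This comes out at roughly 518G database and 74G workspace; the guess above adds headroom on both.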
This is the 4.3.1 version - the 4.3.0 uses a lot more disk space.
Compression reduces triples.tmp -- and the related sort temporary
files, which add up to the same again -- to 1/6 of the size.
> The storage used for dump + temp files is mounted as a separate 900GB
> volume and is mounted on /var/fuseki/databases
> .The type of storage is described as
>> *mass-storage-default*: Storage backed by spinning hard drives,
> available to everybody and is the default type.
> with ext4 configured. At the moment I don't have access to the faster
> volume type mass-storage-ssd. CPU and memory are not dedicated, and can be
> overcommitted.
"overcommitted" may be a problem.
While it's not "tdb2 loader parallel" it does use a continuous CPU in
several threads.
For memory - "it's complicated".
The java parts only need, say, 2G. The sort is set to "buffer 50%
--parallel=2", and the java process pipes into sort, which is another
thread. I think the effective peak is 3 active threads, and they'll all
be at 100% for some of the time.
So it's going to need 50% of RAM + 2G for a java process, + OS.
It does not need space for memory-mapped files (they aren't used at all
in the loading process, and I/O is sequential).
If that triggers overcommitment swap-out, performance may go down a
lot.
For disk - if that is physically remote, it should not be a problem
(famous last words). I/O is sequential and in large continuous chunks -
typical for batch processing jobs.
> OS for the instance is a clean Rocky Linux image, with no services except
> jena/fuseki installed. The systemd service
set up for fuseki is stopped.
> jena and fuseki version is 4.3.0.
>
> openjdk 11.0.13 2021-10-19 LTS
> OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
> OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode, sharing)
Just FYI: Java 17 is a little faster. Some java improvements have
improved RDF parsing speed by up to 10%. In xloader that is not
significant to the overall time.
> I'm running from a tmux session to avoid connectivity issues and to capture
> the output.
I use
tdb2.xloader .... |& tee LOG-FILE-NAME
to capture the logs and see them. ">&" and "tail -f" would achieve much
the same effect.
> I think the output is stored in memory and not on disk.
> On First run I tried to have the tmpdir on the root partition, to separate
> temp dir and data dir, but with only 19 GB free, the tmpdir soon was disk
> full. For the second (current run) all directories are under
> /var/fuseki/databases.
Yes - after making that mistake myself, the new version ignores system
TMPDIR. Using --tmpdir is best but otherwise it defaults to the data
directory.
>
> $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy --tmpdir
> /var/fuseki/databases/tmp latest-truthy.nt.gz
>
> The import is so far at the "ingest data" stage where it has really slowed
> down.
FYI: The first line of ingest is always very slow. It is not measuring
the start point correctly.
>
> Current output is:
>
> 20:03:43 INFO Data :: Add: 502,000,000 Data (Batch: 3,356 /
> Avg: 7,593)
>
> See full log so far at
> https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab
The earlier first pass also slows down, and that should be a fairly
constant-ish speed step once everything settles down.
> Some notes:
>
> * There is a (time/info) lapse in the output log between the end of
> 'parse' and the start of 'index' for Terms. It is unclear to me what is
> happening in the 1h13 minutes between the lines.
There is "sort" going on. "top" should show it.
For each index there is also a very long pause for exactly the same
reason. It would be good to have something go "tick" and log a
message occasionally.
>
> 22:33:46 INFO Terms :: Elapsed: 50,720.20 seconds [2021/12/10
> 22:33:46 CET]
> 22:33:52 INFO Terms :: == Parse: 50726.071 seconds :
> 6,560,468,631 triples/quads 129,331 TPS
> 23:46:13 INFO Terms :: Add: 1,000,000 Index (Batch: 237,755 /
> Avg: 237,755)
>
> * The ingest data step really slows down on the "ingest data stage": At the
> current rate, if I calculated correctly, it looks like PKG.CmdxIngestData
> has 10 days left before it finishes.
Ouch.
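The "10 days left" figure is roughly consistent with the quoted log lines; a back-of-envelope check using the parsed total and the average rate:

```java
public class IngestEta {
    public static void main(String[] args) {
        long totalTriples = 6_560_468_631L; // parsed, from the Terms stage
        long ingested     =   502_000_000L; // from the last Data log line
        double avgTps     = 7_593;          // running average rate
        double secondsLeft = (totalTriples - ingested) / avgTps;
        System.out.printf("~%.1f days remaining%n", secondsLeft / 86_400);
    }
}
```

This gives about 9 days. Using the instantaneous batch rate (3,356) instead gives roughly 21 days, so 10 days was optimistic if the slowdown continued.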
> * When I saw sort running in the background for the first parts of the job,
> I looked at the `sort` command. I noticed from some online sources that
> setting the environment variable LC_ALL=C improves speed for `sort`. Could
> this be set on the ProcessBuilder for the `sort` process? Could it
> break/change something? I see the warning from the man page for `sort`.
>
> *** WARNING *** The locale specified by the environment affects
> sort order. Set LC_ALL=C to get the traditional sort order that
> uses native byte values.
It shouldn't matter but, yes, better to set it and export it in the
control script and propagate to forked processes.
The sort is doing a binary sort, except that because it is a text sort
program, the binary is turned into hex (!!). Hex is in the ASCII subset
and should be locale safe.
But better to set LC_ALL=C.
Andy
>
> Links:
> https://access.redhat.com/solutions/445233
> https://unix.stackexchange.com/questions/579251/how-to-use-parallel-to-speed-up-sort-for-big-files-fitting-in-ram
> https://stackoverflow.com/questions/7074430/how-do-we-sort-faster-using-unix-sort
>
> Best regards,
> Øyvind
>
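For reference, setting LC_ALL=C on the forked sort, as suggested in this thread, can be done through ProcessBuilder's environment map. A minimal sketch (hypothetical illustration, not the actual xloader source; it pipes three lines through the system `sort`):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class SortLocale {
    public static void main(String[] args) throws Exception {
        ProcessBuilder pb = new ProcessBuilder("sort");
        // Force byte-value ordering regardless of the parent's locale.
        pb.environment().put("LC_ALL", "C");
        Process p = pb.start();
        try (Writer w = new OutputStreamWriter(p.getOutputStream(),
                                               StandardCharsets.US_ASCII)) {
            w.write("b\na\nB\n");
        }
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream(),
                                      StandardCharsets.US_ASCII))) {
            // With LC_ALL=C, "B" (0x42) sorts before "a" (0x61).
            r.lines().forEach(System.out::println);
        }
        p.waitFor();
    }
}
```

With LC_ALL=C the output is byte-ordered (B, a, b), while many locales collate case-insensitively and would give a, B, b. Closing stdin is what lets `sort` emit its output.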