Posted to dev@jena.apache.org by Øyvind Gjesdal <oy...@gmail.com> on 2021/12/11 21:45:08 UTC

Testing tdb2.xloader

I'm trying out tdb2.xloader on an OpenStack VM, loading the Wikidata truthy
dump downloaded 2021-12-09.

The instance is a VM created on the Norwegian Research and Education Cloud,
an OpenStack cloud provider.

Instance type:
32 GB memory
4 CPU

The storage used for the dump + temp files is a separate 900 GB volume,
formatted with ext4 and mounted on /var/fuseki/databases. The storage type
is described as:
>  *mass-storage-default*: Storage backed by spinning hard drives,
>  available to everybody and is the default type.
At the moment I don't have access to the faster volume type
mass-storage-ssd. CPU and memory are not dedicated, and can be
overcommitted.

The OS for the instance is a clean Rocky Linux image, with no services except
Jena/Fuseki installed. The systemd service set up for Fuseki is stopped. The
Jena and Fuseki version is 4.3.0.

openjdk 11.0.13 2021-10-19 LTS
OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode, sharing)

I'm running from a tmux session to avoid connectivity issues and to capture
the output. I think the output is stored in memory and not on disk.
On the first run I tried to put the tmpdir on the root partition, to separate
the temp dir and the data dir, but with only 19 GB free the tmpdir soon
filled the disk. For the second (current) run all directories are under
/var/fuseki/databases.

 $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy --tmpdir
/var/fuseki/databases/tmp latest-truthy.nt.gz

The import is currently at the "ingest data" stage, where it has really
slowed down.

Current output is:

20:03:43 INFO  Data            :: Add: 502,000,000 Data (Batch: 3,356 /
Avg: 7,593)

See full log so far at
https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab

Some notes:

* There is a gap (in time and log output) between the end of 'parse' and the
start of 'index' for Terms. It is unclear to me what is happening in the
1 hour 13 minutes between these lines:

22:33:46 INFO  Terms           ::   Elapsed: 50,720.20 seconds [2021/12/10
22:33:46 CET]
22:33:52 INFO  Terms           :: == Parse: 50726.071 seconds :
6,560,468,631 triples/quads 129,331 TPS
23:46:13 INFO  Terms           :: Add: 1,000,000 Index (Batch: 237,755 /
Avg: 237,755)

* The "ingest data" step really slows down: at the current rate, if I
calculated correctly, it looks like PKG.CmdxIngestData has about 10 days left
before it finishes.

* When I saw sort running in the background during the first parts of the
job, I looked at the `sort` command. I noticed from some online sources that
setting the environment variable LC_ALL=C improves the speed of `sort`. Could
this be set on the ProcessBuilder for the `sort` process (see the sketch
after the links below)? Could it break/change something? I see this warning
in the man page for `sort`:

       *** WARNING *** The locale specified by the environment affects
       sort order.  Set LC_ALL=C to get the traditional sort order that
       uses native byte values.

Links:
https://access.redhat.com/solutions/445233
https://unix.stackexchange.com/questions/579251/how-to-use-parallel-to-speed-up-sort-for-big-files-fitting-in-ram
https://stackoverflow.com/questions/7074430/how-do-we-sort-faster-using-unix-sort
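
As an illustration of the question, here is a minimal Java sketch, with a
made-up sort command line and file names (it is not the actual xloader
code), of setting LC_ALL=C on a ProcessBuilder:

import java.util.List;

public class SortLocaleSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical sort invocation; xloader builds its own command line.
        ProcessBuilder pb = new ProcessBuilder(List.of(
                "sort", "--parallel=2", "--buffer-size=50%", "-o", "out.sorted", "in.txt"));
        // Force byte-value ordering instead of locale collation (the speedup in question).
        pb.environment().put("LC_ALL", "C");
        pb.inheritIO();
        int rc = pb.start().waitFor();
        if (rc != 0) {
            throw new RuntimeException("sort exited with status " + rc);
        }
    }
}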

Best regards,
Øyvind

Re: Testing tdb2.xloader

Posted by Marco Neumann <ma...@gmail.com>.
The more tests we have on different machines the better. :)

Personally I'd say, if you have a choice, go for PCIe 4.0 NVMe SSDs and
stay away from SATA < III SSDs. Also, for these tests SSD RAID isn't necessary.

These components have become extremely affordable in recent years and, in my
opinion, really should be part of a fast pipeline, in particular for
tdb2.tdbloader in parallel mode.

But as Andy has emphasized, he has designed the tdb2.xloader process to
be spinning-disk friendly, so SSDs are not a prerequisite for xloader.


On Tue, Dec 14, 2021 at 10:38 AM Øyvind Gjesdal <oy...@gmail.com> wrote:

> Hi Marco,
>
> Very useful to compare with your log on the different runs. Still working
> with configuration to see if I can get the ingest data stage to be usable
> for hdd. It looks like I get close to the performance of your run on the
> earlier stages, while ingest data is still very much too slow. Having to
> use SSD may be necessary, for a real world large import to complete?  I'lI
> request some ssd storage as well, and hope there's a quota for me :)
>
> Maybe I could also test different distros, to see if some of the default OS
> settings affect the import.
>
> Best regards,
> Øyvind
>
> søn. 12. des. 2021 kl. 10:21 skrev Marco Neumann <marco.neumann@gmail.com
> >:
>
> > Øyvind, looks like the above was the wrong log from a prior sharding
> > experiment.
> >
> > This is the correct log file for the truthy dataset.
> >
> > http://www.lotico.com/temp/LOG-98085
> >
> >
> >
> > On Sat, Dec 11, 2021 at 10:02 PM Marco Neumann <ma...@gmail.com>
> > wrote:
> >
> > > Thank you Øyvind for sharing, great to see more tests in the wild.
> > >
> > > I did the test with a 1TB SSD / RAID1 / 64GB / ubuntu and the truthy
> > > dataset and quickly ran out of disk space. It finished the job but did
> > not
> > > write any of the indexes to disk due to lack of space. no error
> messages.
> > >
> > > http://www.lotico.com/temp/LOG-95239
> > >
> > > I have now ordered a new 4TB SSD drive to rerun the test possibly with
> > the
> > > full wikidata dataset,
> > >
> > > I personally had the best experience with dedicated hardware so far
> (can
> > > be in the data center), shared or dedicated virtual compute engines did
> > not
> > > deliver as expected. And I have not seen great benefits from data
> center
> > > grade multicore cpus. But I think they will during runtime in multi
> user
> > > settings (eg fuseki).
> > >
> > > Best,
> > > Marco
> > >
> > > On Sat, Dec 11, 2021 at 9:45 PM Øyvind Gjesdal <oy...@gmail.com>
> > wrote:
> > >
> > >> I'm trying out tdb2.xloader on an openstack vm, loading the wikidata
> > >> truthy
> > >> dump downloaded 2021-12-09.
> > >>
> > >> The instance is a vm created on the Norwegian Research and Education
> > >> Cloud,
> > >> an openstack cloud provider.
> > >>
> > >> Instance type:
> > >> 32 GB memory
> > >> 4 CPU
> > >>
> > >> The storage used for dump + temp files  is mounted as a separate
> 900GB
> > >> volume and is mounted on /var/fuseki/databases
> > >> .The type of storage is described as
> > >> >  *mass-storage-default*: Storage backed by spinning hard drives,
> > >> available to everybody and is the default type.
> > >> with ext4 configured. At the moment I don't have access to the faster
> > >> volume type mass-storage-ssd. CPU and memory are not dedicated, and
> can
> > be
> > >> overcommitted.
> > >>
> > >> OS for the instance is a clean Rocky Linux image, with no services
> > except
> > >> jena/fuseki installed. The systemd service  set up for fuseki is
> > stopped.
> > >> jena and fuseki version is 4.3.0.
> > >>
> > >> openjdk 11.0.13 2021-10-19 LTS
> > >> OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
> > >> OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode,
> sharing)
> > >>
> > >> I'm running from a tmux session to avoid connectivity issues and to
> > >> capture
> > >> the output. I think the output is stored in memory and not on disk.
> > >> On First run I tried to have the tmpdir on the root partition, to
> > separate
> > >> temp dir and data dir, but with only 19 GB free, the tmpdir soon was
> > disk
> > >> full. For the second (current run) all directories are under
> > >> /var/fuseki/databases.
> > >>
> > >>  $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy
> > >> --tmpdir
> > >> /var/fuseki/databases/tmp latest-truthy.nt.gz
> > >>
> > >> The import is so far at the "ingest data" stage where it has really
> > slowed
> > >> down.
> > >>
> > >> Current output is:
> > >>
> > >> 20:03:43 INFO  Data            :: Add: 502,000,000 Data (Batch: 3,356
> /
> > >> Avg: 7,593)
> > >>
> > >> See full log so far at
> > >>
> https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab
> > >>
> > >> Some notes:
> > >>
> > >> * There is a (time/info) lapse in the output log between the  end of
> > >> 'parse' and the start of 'index' for Terms.  It is unclear to me what
> is
> > >> happening in the 1h13 minutes between the lines.
> > >>
> > >> 22:33:46 INFO  Terms           ::   Elapsed: 50,720.20 seconds
> > [2021/12/10
> > >> 22:33:46 CET]
> > >> 22:33:52 INFO  Terms           :: == Parse: 50726.071 seconds :
> > >> 6,560,468,631 triples/quads 129,331 TPS
> > >> 23:46:13 INFO  Terms           :: Add: 1,000,000 Index (Batch:
> 237,755 /
> > >> Avg: 237,755)
> > >>
> > >> * The ingest data step really slows down on the "ingest data stage":
> At
> > >> the
> > >> current rate, if I calculated correctly, it looks like
> > PKG.CmdxIngestData
> > >> has 10 days left before it finishes.
> > >>
> > >> * When I saw sort running in the background for the first parts of the
> > >> job,
> > >> I looked at the `sort` command. I noticed from some online sources
> that
> > >> setting the environment variable LC_ALL=C improves speed for `sort`.
> > Could
> > >> this be set on the ProcessBuilder for the `sort` process? Could it
> > >> break/change something? I see the warning from the man page for
> `sort`.
> > >>
> > >>        *** WARNING *** The locale specified by the environment affects
> > >>        sort order.  Set LC_ALL=C to get the traditional sort order
> that
> > >>        uses native byte values.
> > >>
> > >> Links:
> > >> https://access.redhat.com/solutions/445233
> > >>
> > >>
> >
> https://unix.stackexchange.com/questions/579251/how-to-use-parallel-to-speed-up-sort-for-big-files-fitting-in-ram
> > >>
> > >>
> >
> https://stackoverflow.com/questions/7074430/how-do-we-sort-faster-using-unix-sort
> > >>
> > >> Best regards,
> > >> Øyvind
> > >>
> > >
> > >
> > > --
> > >
> > >
> > > ---
> > > Marco Neumann
> > > KONA
> > >
> > >
> >
> > --
> >
> >
> > ---
> > Marco Neumann
> > KONA
> >
>


-- 


---
Marco Neumann
KONA

Re: Testing tdb2.xloader

Posted by Andy Seaborne <an...@apache.org>.

On 14/12/2021 10:38, Øyvind Gjesdal wrote:
> Hi Marco,
> 
> Very useful to compare with your log on the different runs. Still working
> with configuration to see if I can get the ingest data stage to be usable
> for hdd. It looks like I get close to the performance of your run on the
> earlier stages, while ingest data is still very much too slow. Having to
> use SSD may be necessary, for a real world large import to complete?  I'lI
> request some ssd storage as well, and hope there's a quota for me :)

The access patterns should (tm!) be spinning-disk friendly. There is no 
random IO updating B+trees directly.

All the B+Trees are written "bottom-up" by writing blocks of the right
layout directly to disk, not via the B+Tree runtime code, which would be
top-down via "add record" style access.
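
As a toy illustration of the bottom-up idea (an editorial sketch, not the
actual TDB2 code): already-sorted keys are packed into leaf blocks in
order, and each parent level is built from the first key of every finished
block, so every write is an append:

import java.util.ArrayList;
import java.util.List;

class BottomUpBTreeSketch {
    static final int KEYS_PER_BLOCK = 4;   // tiny, for illustration only

    // Pack a sorted key list into fixed-size "blocks", appended in order.
    static List<List<Long>> packBlocks(List<Long> sortedKeys) {
        List<List<Long>> blocks = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        for (Long key : sortedKeys) {
            current.add(key);
            if (current.size() == KEYS_PER_BLOCK) {   // block full: "write" it out
                blocks.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) blocks.add(current);
        return blocks;
    }

    public static void main(String[] args) {
        List<Long> keys = new ArrayList<>();
        for (long i = 0; i < 20; i++) keys.add(i);    // stand-in for the externally sorted stream

        List<List<Long>> level = packBlocks(keys);    // leaf level
        while (level.size() > 1) {                    // build parents from each block's first key
            List<Long> separators = new ArrayList<>();
            for (List<Long> block : level) separators.add(block.get(0));
            level = packBlocks(separators);
        }
        System.out.println("root block: " + level.get(0));
    }
}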

     Andy

> Maybe I could also test different distros, to see if some of the default OS
> settings affect the import.
> 
> Best regards,
> Øyvind
> 
> søn. 12. des. 2021 kl. 10:21 skrev Marco Neumann <ma...@gmail.com>:
> 
>> Øyvind, looks like the above was the wrong log from a prior sharding
>> experiment.
>>
>> This is the correct log file for the truthy dataset.
>>
>> http://www.lotico.com/temp/LOG-98085
>>
>>
>>
>> On Sat, Dec 11, 2021 at 10:02 PM Marco Neumann <ma...@gmail.com>
>> wrote:
>>
>>> Thank you Øyvind for sharing, great to see more tests in the wild.
>>>
>>> I did the test with a 1TB SSD / RAID1 / 64GB / ubuntu and the truthy
>>> dataset and quickly ran out of disk space. It finished the job but did
>> not
>>> write any of the indexes to disk due to lack of space. no error messages.
>>>
>>> http://www.lotico.com/temp/LOG-95239
>>>
>>> I have now ordered a new 4TB SSD drive to rerun the test possibly with
>> the
>>> full wikidata dataset,
>>>
>>> I personally had the best experience with dedicated hardware so far (can
>>> be in the data center), shared or dedicated virtual compute engines did
>> not
>>> deliver as expected. And I have not seen great benefits from data center
>>> grade multicore cpus. But I think they will during runtime in multi user
>>> settings (eg fuseki).
>>>
>>> Best,
>>> Marco
>>>
>>> On Sat, Dec 11, 2021 at 9:45 PM Øyvind Gjesdal <oy...@gmail.com>
>> wrote:
>>>
>>>> I'm trying out tdb2.xloader on an openstack vm, loading the wikidata
>>>> truthy
>>>> dump downloaded 2021-12-09.
>>>>
>>>> The instance is a vm created on the Norwegian Research and Education
>>>> Cloud,
>>>> an openstack cloud provider.
>>>>
>>>> Instance type:
>>>> 32 GB memory
>>>> 4 CPU
>>>>
>>>> The storage used for dump + temp files  is mounted as a separate  900GB
>>>> volume and is mounted on /var/fuseki/databases
>>>> .The type of storage is described as
>>>>>   *mass-storage-default*: Storage backed by spinning hard drives,
>>>> available to everybody and is the default type.
>>>> with ext4 configured. At the moment I don't have access to the faster
>>>> volume type mass-storage-ssd. CPU and memory are not dedicated, and can
>> be
>>>> overcommitted.
>>>>
>>>> OS for the instance is a clean Rocky Linux image, with no services
>> except
>>>> jena/fuseki installed. The systemd service  set up for fuseki is
>> stopped.
>>>> jena and fuseki version is 4.3.0.
>>>>
>>>> openjdk 11.0.13 2021-10-19 LTS
>>>> OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
>>>> OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode, sharing)
>>>>
>>>> I'm running from a tmux session to avoid connectivity issues and to
>>>> capture
>>>> the output. I think the output is stored in memory and not on disk.
>>>> On First run I tried to have the tmpdir on the root partition, to
>> separate
>>>> temp dir and data dir, but with only 19 GB free, the tmpdir soon was
>> disk
>>>> full. For the second (current run) all directories are under
>>>> /var/fuseki/databases.
>>>>
>>>>   $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy
>>>> --tmpdir
>>>> /var/fuseki/databases/tmp latest-truthy.nt.gz
>>>>
>>>> The import is so far at the "ingest data" stage where it has really
>> slowed
>>>> down.
>>>>
>>>> Current output is:
>>>>
>>>> 20:03:43 INFO  Data            :: Add: 502,000,000 Data (Batch: 3,356 /
>>>> Avg: 7,593)
>>>>
>>>> See full log so far at
>>>> https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab
>>>>
>>>> Some notes:
>>>>
>>>> * There is a (time/info) lapse in the output log between the  end of
>>>> 'parse' and the start of 'index' for Terms.  It is unclear to me what is
>>>> happening in the 1h13 minutes between the lines.
>>>>
>>>> 22:33:46 INFO  Terms           ::   Elapsed: 50,720.20 seconds
>> [2021/12/10
>>>> 22:33:46 CET]
>>>> 22:33:52 INFO  Terms           :: == Parse: 50726.071 seconds :
>>>> 6,560,468,631 triples/quads 129,331 TPS
>>>> 23:46:13 INFO  Terms           :: Add: 1,000,000 Index (Batch: 237,755 /
>>>> Avg: 237,755)
>>>>
>>>> * The ingest data step really slows down on the "ingest data stage": At
>>>> the
>>>> current rate, if I calculated correctly, it looks like
>> PKG.CmdxIngestData
>>>> has 10 days left before it finishes.
>>>>
>>>> * When I saw sort running in the background for the first parts of the
>>>> job,
>>>> I looked at the `sort` command. I noticed from some online sources that
>>>> setting the environment variable LC_ALL=C improves speed for `sort`.
>> Could
>>>> this be set on the ProcessBuilder for the `sort` process? Could it
>>>> break/change something? I see the warning from the man page for `sort`.
>>>>
>>>>         *** WARNING *** The locale specified by the environment affects
>>>>         sort order.  Set LC_ALL=C to get the traditional sort order that
>>>>         uses native byte values.
>>>>
>>>> Links:
>>>> https://access.redhat.com/solutions/445233
>>>>
>>>>
>> https://unix.stackexchange.com/questions/579251/how-to-use-parallel-to-speed-up-sort-for-big-files-fitting-in-ram
>>>>
>>>>
>> https://stackoverflow.com/questions/7074430/how-do-we-sort-faster-using-unix-sort
>>>>
>>>> Best regards,
>>>> Øyvind
>>>>
>>>
>>>
>>> --
>>>
>>>
>>> ---
>>> Marco Neumann
>>> KONA
>>>
>>>
>>
>> --
>>
>>
>> ---
>> Marco Neumann
>> KONA
>>
> 

Re: Testing tdb2.xloader

Posted by Øyvind Gjesdal <oy...@gmail.com>.
Hi Marco,

Very useful to compare with your log from the different runs. I'm still
working with the configuration to see if I can get the ingest data stage to
be usable on HDD. It looks like I get close to the performance of your run
in the earlier stages, while ingest data is still very much too slow. Having
to use SSD may be necessary for a real-world large import to complete? I'll
request some SSD storage as well, and hope there's a quota for me :)

Maybe I could also test different distros, to see if some of the default OS
settings affect the import.

Best regards,
Øyvind

søn. 12. des. 2021 kl. 10:21 skrev Marco Neumann <ma...@gmail.com>:

> Øyvind, looks like the above was the wrong log from a prior sharding
> experiment.
>
> This is the correct log file for the truthy dataset.
>
> http://www.lotico.com/temp/LOG-98085
>
>
>
> On Sat, Dec 11, 2021 at 10:02 PM Marco Neumann <ma...@gmail.com>
> wrote:
>
> > Thank you Øyvind for sharing, great to see more tests in the wild.
> >
> > I did the test with a 1TB SSD / RAID1 / 64GB / ubuntu and the truthy
> > dataset and quickly ran out of disk space. It finished the job but did
> not
> > write any of the indexes to disk due to lack of space. no error messages.
> >
> > http://www.lotico.com/temp/LOG-95239
> >
> > I have now ordered a new 4TB SSD drive to rerun the test possibly with
> the
> > full wikidata dataset,
> >
> > I personally had the best experience with dedicated hardware so far (can
> > be in the data center), shared or dedicated virtual compute engines did
> not
> > deliver as expected. And I have not seen great benefits from data center
> > grade multicore cpus. But I think they will during runtime in multi user
> > settings (eg fuseki).
> >
> > Best,
> > Marco
> >
> > On Sat, Dec 11, 2021 at 9:45 PM Øyvind Gjesdal <oy...@gmail.com>
> wrote:
> >
> >> I'm trying out tdb2.xloader on an openstack vm, loading the wikidata
> >> truthy
> >> dump downloaded 2021-12-09.
> >>
> >> The instance is a vm created on the Norwegian Research and Education
> >> Cloud,
> >> an openstack cloud provider.
> >>
> >> Instance type:
> >> 32 GB memory
> >> 4 CPU
> >>
> >> The storage used for dump + temp files  is mounted as a separate  900GB
> >> volume and is mounted on /var/fuseki/databases
> >> .The type of storage is described as
> >> >  *mass-storage-default*: Storage backed by spinning hard drives,
> >> available to everybody and is the default type.
> >> with ext4 configured. At the moment I don't have access to the faster
> >> volume type mass-storage-ssd. CPU and memory are not dedicated, and can
> be
> >> overcommitted.
> >>
> >> OS for the instance is a clean Rocky Linux image, with no services
> except
> >> jena/fuseki installed. The systemd service  set up for fuseki is
> stopped.
> >> jena and fuseki version is 4.3.0.
> >>
> >> openjdk 11.0.13 2021-10-19 LTS
> >> OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
> >> OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode, sharing)
> >>
> >> I'm running from a tmux session to avoid connectivity issues and to
> >> capture
> >> the output. I think the output is stored in memory and not on disk.
> >> On First run I tried to have the tmpdir on the root partition, to
> separate
> >> temp dir and data dir, but with only 19 GB free, the tmpdir soon was
> disk
> >> full. For the second (current run) all directories are under
> >> /var/fuseki/databases.
> >>
> >>  $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy
> >> --tmpdir
> >> /var/fuseki/databases/tmp latest-truthy.nt.gz
> >>
> >> The import is so far at the "ingest data" stage where it has really
> slowed
> >> down.
> >>
> >> Current output is:
> >>
> >> 20:03:43 INFO  Data            :: Add: 502,000,000 Data (Batch: 3,356 /
> >> Avg: 7,593)
> >>
> >> See full log so far at
> >> https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab
> >>
> >> Some notes:
> >>
> >> * There is a (time/info) lapse in the output log between the  end of
> >> 'parse' and the start of 'index' for Terms.  It is unclear to me what is
> >> happening in the 1h13 minutes between the lines.
> >>
> >> 22:33:46 INFO  Terms           ::   Elapsed: 50,720.20 seconds
> [2021/12/10
> >> 22:33:46 CET]
> >> 22:33:52 INFO  Terms           :: == Parse: 50726.071 seconds :
> >> 6,560,468,631 triples/quads 129,331 TPS
> >> 23:46:13 INFO  Terms           :: Add: 1,000,000 Index (Batch: 237,755 /
> >> Avg: 237,755)
> >>
> >> * The ingest data step really slows down on the "ingest data stage": At
> >> the
> >> current rate, if I calculated correctly, it looks like
> PKG.CmdxIngestData
> >> has 10 days left before it finishes.
> >>
> >> * When I saw sort running in the background for the first parts of the
> >> job,
> >> I looked at the `sort` command. I noticed from some online sources that
> >> setting the environment variable LC_ALL=C improves speed for `sort`.
> Could
> >> this be set on the ProcessBuilder for the `sort` process? Could it
> >> break/change something? I see the warning from the man page for `sort`.
> >>
> >>        *** WARNING *** The locale specified by the environment affects
> >>        sort order.  Set LC_ALL=C to get the traditional sort order that
> >>        uses native byte values.
> >>
> >> Links:
> >> https://access.redhat.com/solutions/445233
> >>
> >>
> https://unix.stackexchange.com/questions/579251/how-to-use-parallel-to-speed-up-sort-for-big-files-fitting-in-ram
> >>
> >>
> https://stackoverflow.com/questions/7074430/how-do-we-sort-faster-using-unix-sort
> >>
> >> Best regards,
> >> Øyvind
> >>
> >
> >
> > --
> >
> >
> > ---
> > Marco Neumann
> > KONA
> >
> >
>
> --
>
>
> ---
> Marco Neumann
> KONA
>

Re: Testing tdb2.xloader

Posted by Marco Neumann <ma...@gmail.com>.
Øyvind, looks like the above was the wrong log from a prior sharding
experiment.

This is the correct log file for the truthy dataset.

http://www.lotico.com/temp/LOG-98085



On Sat, Dec 11, 2021 at 10:02 PM Marco Neumann <ma...@gmail.com>
wrote:

> Thank you Øyvind for sharing, great to see more tests in the wild.
>
> I did the test with a 1TB SSD / RAID1 / 64GB / ubuntu and the truthy
> dataset and quickly ran out of disk space. It finished the job but did not
> write any of the indexes to disk due to lack of space. no error messages.
>
> http://www.lotico.com/temp/LOG-95239
>
> I have now ordered a new 4TB SSD drive to rerun the test possibly with the
> full wikidata dataset,
>
> I personally had the best experience with dedicated hardware so far (can
> be in the data center), shared or dedicated virtual compute engines did not
> deliver as expected. And I have not seen great benefits from data center
> grade multicore cpus. But I think they will during runtime in multi user
> settings (eg fuseki).
>
> Best,
> Marco
>
> On Sat, Dec 11, 2021 at 9:45 PM Øyvind Gjesdal <oy...@gmail.com> wrote:
>
>> I'm trying out tdb2.xloader on an openstack vm, loading the wikidata
>> truthy
>> dump downloaded 2021-12-09.
>>
>> The instance is a vm created on the Norwegian Research and Education
>> Cloud,
>> an openstack cloud provider.
>>
>> Instance type:
>> 32 GB memory
>> 4 CPU
>>
>> The storage used for dump + temp files  is mounted as a separate  900GB
>> volume and is mounted on /var/fuseki/databases
>> .The type of storage is described as
>> >  *mass-storage-default*: Storage backed by spinning hard drives,
>> available to everybody and is the default type.
>> with ext4 configured. At the moment I don't have access to the faster
>> volume type mass-storage-ssd. CPU and memory are not dedicated, and can be
>> overcommitted.
>>
>> OS for the instance is a clean Rocky Linux image, with no services except
>> jena/fuseki installed. The systemd service  set up for fuseki is stopped.
>> jena and fuseki version is 4.3.0.
>>
>> openjdk 11.0.13 2021-10-19 LTS
>> OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
>> OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode, sharing)
>>
>> I'm running from a tmux session to avoid connectivity issues and to
>> capture
>> the output. I think the output is stored in memory and not on disk.
>> On First run I tried to have the tmpdir on the root partition, to separate
>> temp dir and data dir, but with only 19 GB free, the tmpdir soon was disk
>> full. For the second (current run) all directories are under
>> /var/fuseki/databases.
>>
>>  $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy
>> --tmpdir
>> /var/fuseki/databases/tmp latest-truthy.nt.gz
>>
>> The import is so far at the "ingest data" stage where it has really slowed
>> down.
>>
>> Current output is:
>>
>> 20:03:43 INFO  Data            :: Add: 502,000,000 Data (Batch: 3,356 /
>> Avg: 7,593)
>>
>> See full log so far at
>> https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab
>>
>> Some notes:
>>
>> * There is a (time/info) lapse in the output log between the  end of
>> 'parse' and the start of 'index' for Terms.  It is unclear to me what is
>> happening in the 1h13 minutes between the lines.
>>
>> 22:33:46 INFO  Terms           ::   Elapsed: 50,720.20 seconds [2021/12/10
>> 22:33:46 CET]
>> 22:33:52 INFO  Terms           :: == Parse: 50726.071 seconds :
>> 6,560,468,631 triples/quads 129,331 TPS
>> 23:46:13 INFO  Terms           :: Add: 1,000,000 Index (Batch: 237,755 /
>> Avg: 237,755)
>>
>> * The ingest data step really slows down on the "ingest data stage": At
>> the
>> current rate, if I calculated correctly, it looks like PKG.CmdxIngestData
>> has 10 days left before it finishes.
>>
>> * When I saw sort running in the background for the first parts of the
>> job,
>> I looked at the `sort` command. I noticed from some online sources that
>> setting the environment variable LC_ALL=C improves speed for `sort`. Could
>> this be set on the ProcessBuilder for the `sort` process? Could it
>> break/change something? I see the warning from the man page for `sort`.
>>
>>        *** WARNING *** The locale specified by the environment affects
>>        sort order.  Set LC_ALL=C to get the traditional sort order that
>>        uses native byte values.
>>
>> Links:
>> https://access.redhat.com/solutions/445233
>>
>> https://unix.stackexchange.com/questions/579251/how-to-use-parallel-to-speed-up-sort-for-big-files-fitting-in-ram
>>
>> https://stackoverflow.com/questions/7074430/how-do-we-sort-faster-using-unix-sort
>>
>> Best regards,
>> Øyvind
>>
>
>
> --
>
>
> ---
> Marco Neumann
> KONA
>
>

-- 


---
Marco Neumann
KONA

log4j implications [Was: Testing tdb2.xloader]

Posted by Andy Seaborne <an...@apache.org>.
4.3.1 will contain the fixed log4j 2.15.0. No special mitigations necessary.

Jena uses log4j2 via the slf4j adapter from Apache Logging 
(log4j-slf4j-impl). 2.15.0 should be compatible in Jena usage with 
2.14.* for Jena 4.x.

From the download, replace log4j-(api|core|log4j-slf4j-impl) with the
2.15.0 versions in lib/.

User applications are already deciding which logging to use - log4j2 is 
an optional dependency of Jena so it isn't pulled in by a Jena 
POM/Gradle entry. The application developer adds their choice of slf4j 
provider [*]

Fuseki however shades dependencies into a combined jar.

We could say "set the system property '-Dlog4j2.formatMsgNoLookups=true'",
but it isn't always easy to change scripts etc. for at least two reasons:

(1) when teams running the production deployment are not the 
person/people producing the deployment package.

(2) Some of our users are data people, not programmer-developers. This 
exploit is live now - making the fix easy seems like a good thing.

A new release (even if 4.3.0 only lasted a day!) is a routine upgrade 
procedure.

And it's easier to say "upgrade to 4.3.1" than to explain the steps needed to
secure 4.3.0, even if we achieve 4.4.0 at the earlier point of January
with the reworked UI using up-to-date JS dependencies.

     Andy

[*] The patch server in RDF Delta doesn't use log4j at all - it uses 
java.util.logging (JUL). Client side, Delta does use log4j.


On 12/12/2021 13:45, Marco Neumann wrote:
> Does 4.3.1 already contain the mitigation for the Log4j2 vulnerability?

Re: Testing tdb2.xloader

Posted by Marco Neumann <ma...@gmail.com>.
Does 4.3.1 already contain the mitigation for the Log4j2 vulnerability?

On Sun, Dec 12, 2021 at 1:24 PM Marco Neumann <ma...@gmail.com>
wrote:

> As Andy mentioned, I will give the 4.3.1 xloader a try with the new 4TB
> SSD drive and an old laptop.
>
> I also have a contact who has just set up a new datacenter in Ireland. I
> may be able to run a few tests on much bigger machines as well. Otherwise I
> am very happy with the iron in Finland.as long as they are dedicated
> machines.
>
> On Sun, Dec 12, 2021 at 12:44 PM Andy Seaborne <an...@apache.org> wrote:
>
>>
>>
>> On 11/12/2021 22:02, Marco Neumann wrote:
>> > Thank you Øyvind for sharing, great to see more tests in the wild.
>> >
>> > I did the test with a 1TB SSD / RAID1 / 64GB / ubuntu and the truthy
>> > dataset and quickly ran out of disk space. It finished the job but did
>> not
>> > write any of the indexes to disk due to lack of space. no error
>> messages.
>>
>> The 4.3.1 xloader should hopefully address the space issue.
>>
>>      Andy
>>
>> >
>> > http://www.lotico.com/temp/LOG-95239
>> >
>> > I have now ordered a new 4TB SSD drive to rerun the test possibly with
>> the
>> > full wikidata dataset,
>> >
>> > I personally had the best experience with dedicated hardware so far
>> (can be
>> > in the data center), shared or dedicated virtual compute engines did not
>> > deliver as expected. And I have not seen great benefits from data center
>> > grade multicore cpus. But I think they will during runtime in multi user
>> > settings (eg fuseki).
>> >
>> > Best,
>> > Marco
>> >
>> > On Sat, Dec 11, 2021 at 9:45 PM Øyvind Gjesdal <oy...@gmail.com>
>> wrote:
>> >
>> >> I'm trying out tdb2.xloader on an openstack vm, loading the wikidata
>> truthy
>> >> dump downloaded 2021-12-09.
>> >>
>> >> The instance is a vm created on the Norwegian Research and Education
>> Cloud,
>> >> an openstack cloud provider.
>> >>
>> >> Instance type:
>> >> 32 GB memory
>> >> 4 CPU
>> >>
>> >> The storage used for dump + temp files  is mounted as a separate  900GB
>> >> volume and is mounted on /var/fuseki/databases
>> >> .The type of storage is described as
>> >>>   *mass-storage-default*: Storage backed by spinning hard drives,
>> >> available to everybody and is the default type.
>> >> with ext4 configured. At the moment I don't have access to the faster
>> >> volume type mass-storage-ssd. CPU and memory are not dedicated, and
>> can be
>> >> overcommitted.
>> >>
>> >> OS for the instance is a clean Rocky Linux image, with no services
>> except
>> >> jena/fuseki installed. The systemd service  set up for fuseki is
>> stopped.
>> >> jena and fuseki version is 4.3.0.
>> >>
>> >> openjdk 11.0.13 2021-10-19 LTS
>> >> OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
>> >> OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode,
>> sharing)
>> >>
>> >> I'm running from a tmux session to avoid connectivity issues and to
>> capture
>> >> the output. I think the output is stored in memory and not on disk.
>> >> On First run I tried to have the tmpdir on the root partition, to
>> separate
>> >> temp dir and data dir, but with only 19 GB free, the tmpdir soon was
>> disk
>> >> full. For the second (current run) all directories are under
>> >> /var/fuseki/databases.
>> >>
>> >>   $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy
>> --tmpdir
>> >> /var/fuseki/databases/tmp latest-truthy.nt.gz
>> >>
>> >> The import is so far at the "ingest data" stage where it has really
>> slowed
>> >> down.
>> >>
>> >> Current output is:
>> >>
>> >> 20:03:43 INFO  Data            :: Add: 502,000,000 Data (Batch: 3,356 /
>> >> Avg: 7,593)
>> >>
>> >> See full log so far at
>> >>
>> https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab
>> >>
>> >> Some notes:
>> >>
>> >> * There is a (time/info) lapse in the output log between the  end of
>> >> 'parse' and the start of 'index' for Terms.  It is unclear to me what
>> is
>> >> happening in the 1h13 minutes between the lines.
>> >>
>> >> 22:33:46 INFO  Terms           ::   Elapsed: 50,720.20 seconds
>> [2021/12/10
>> >> 22:33:46 CET]
>> >> 22:33:52 INFO  Terms           :: == Parse: 50726.071 seconds :
>> >> 6,560,468,631 triples/quads 129,331 TPS
>> >> 23:46:13 INFO  Terms           :: Add: 1,000,000 Index (Batch: 237,755
>> /
>> >> Avg: 237,755)
>> >>
>> >> * The ingest data step really slows down on the "ingest data stage":
>> At the
>> >> current rate, if I calculated correctly, it looks like
>> PKG.CmdxIngestData
>> >> has 10 days left before it finishes.
>> >>
>> >> * When I saw sort running in the background for the first parts of the
>> job,
>> >> I looked at the `sort` command. I noticed from some online sources that
>> >> setting the environment variable LC_ALL=C improves speed for `sort`.
>> Could
>> >> this be set on the ProcessBuilder for the `sort` process? Could it
>> >> break/change something? I see the warning from the man page for `sort`.
>> >>
>> >>         *** WARNING *** The locale specified by the environment affects
>> >>         sort order.  Set LC_ALL=C to get the traditional sort order
>> that
>> >>         uses native byte values.
>> >>
>> >> Links:
>> >> https://access.redhat.com/solutions/445233
>> >>
>> >>
>> https://unix.stackexchange.com/questions/579251/how-to-use-parallel-to-speed-up-sort-for-big-files-fitting-in-ram
>> >>
>> >>
>> https://stackoverflow.com/questions/7074430/how-do-we-sort-faster-using-unix-sort
>> >>
>> >> Best regards,
>> >> Øyvind
>> >>
>> >
>> >
>>
>
>
> --
>
>
> ---
> Marco Neumann
> KONA
>
>

-- 


---
Marco Neumann
KONA

Re: Testing tdb2.xloader

Posted by Marco Neumann <ma...@gmail.com>.
As Andy mentioned, I will give the 4.3.1 xloader a try with the new 4TB SSD
drive and an old laptop.

I also have a contact who has just set up a new datacenter in Ireland. I
may be able to run a few tests on much bigger machines as well. Otherwise I
am very happy with the iron in Finland, as long as they are dedicated
machines.

On Sun, Dec 12, 2021 at 12:44 PM Andy Seaborne <an...@apache.org> wrote:

>
>
> On 11/12/2021 22:02, Marco Neumann wrote:
> > Thank you Øyvind for sharing, great to see more tests in the wild.
> >
> > I did the test with a 1TB SSD / RAID1 / 64GB / ubuntu and the truthy
> > dataset and quickly ran out of disk space. It finished the job but did
> not
> > write any of the indexes to disk due to lack of space. no error messages.
>
> The 4.3.1 xloader should hopefully address the space issue.
>
>      Andy
>
> >
> > http://www.lotico.com/temp/LOG-95239
> >
> > I have now ordered a new 4TB SSD drive to rerun the test possibly with
> the
> > full wikidata dataset,
> >
> > I personally had the best experience with dedicated hardware so far (can
> be
> > in the data center), shared or dedicated virtual compute engines did not
> > deliver as expected. And I have not seen great benefits from data center
> > grade multicore cpus. But I think they will during runtime in multi user
> > settings (eg fuseki).
> >
> > Best,
> > Marco
> >
> > On Sat, Dec 11, 2021 at 9:45 PM Øyvind Gjesdal <oy...@gmail.com>
> wrote:
> >
> >> I'm trying out tdb2.xloader on an openstack vm, loading the wikidata
> truthy
> >> dump downloaded 2021-12-09.
> >>
> >> The instance is a vm created on the Norwegian Research and Education
> Cloud,
> >> an openstack cloud provider.
> >>
> >> Instance type:
> >> 32 GB memory
> >> 4 CPU
> >>
> >> The storage used for dump + temp files  is mounted as a separate  900GB
> >> volume and is mounted on /var/fuseki/databases
> >> .The type of storage is described as
> >>>   *mass-storage-default*: Storage backed by spinning hard drives,
> >> available to everybody and is the default type.
> >> with ext4 configured. At the moment I don't have access to the faster
> >> volume type mass-storage-ssd. CPU and memory are not dedicated, and can
> be
> >> overcommitted.
> >>
> >> OS for the instance is a clean Rocky Linux image, with no services
> except
> >> jena/fuseki installed. The systemd service  set up for fuseki is
> stopped.
> >> jena and fuseki version is 4.3.0.
> >>
> >> openjdk 11.0.13 2021-10-19 LTS
> >> OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
> >> OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode, sharing)
> >>
> >> I'm running from a tmux session to avoid connectivity issues and to
> capture
> >> the output. I think the output is stored in memory and not on disk.
> >> On First run I tried to have the tmpdir on the root partition, to
> separate
> >> temp dir and data dir, but with only 19 GB free, the tmpdir soon was
> disk
> >> full. For the second (current run) all directories are under
> >> /var/fuseki/databases.
> >>
> >>   $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy
> --tmpdir
> >> /var/fuseki/databases/tmp latest-truthy.nt.gz
> >>
> >> The import is so far at the "ingest data" stage where it has really
> slowed
> >> down.
> >>
> >> Current output is:
> >>
> >> 20:03:43 INFO  Data            :: Add: 502,000,000 Data (Batch: 3,356 /
> >> Avg: 7,593)
> >>
> >> See full log so far at
> >> https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab
> >>
> >> Some notes:
> >>
> >> * There is a (time/info) lapse in the output log between the  end of
> >> 'parse' and the start of 'index' for Terms.  It is unclear to me what is
> >> happening in the 1h13 minutes between the lines.
> >>
> >> 22:33:46 INFO  Terms           ::   Elapsed: 50,720.20 seconds
> [2021/12/10
> >> 22:33:46 CET]
> >> 22:33:52 INFO  Terms           :: == Parse: 50726.071 seconds :
> >> 6,560,468,631 triples/quads 129,331 TPS
> >> 23:46:13 INFO  Terms           :: Add: 1,000,000 Index (Batch: 237,755 /
> >> Avg: 237,755)
> >>
> >> * The ingest data step really slows down on the "ingest data stage": At
> the
> >> current rate, if I calculated correctly, it looks like
> PKG.CmdxIngestData
> >> has 10 days left before it finishes.
> >>
> >> * When I saw sort running in the background for the first parts of the
> job,
> >> I looked at the `sort` command. I noticed from some online sources that
> >> setting the environment variable LC_ALL=C improves speed for `sort`.
> Could
> >> this be set on the ProcessBuilder for the `sort` process? Could it
> >> break/change something? I see the warning from the man page for `sort`.
> >>
> >>         *** WARNING *** The locale specified by the environment affects
> >>         sort order.  Set LC_ALL=C to get the traditional sort order that
> >>         uses native byte values.
> >>
> >> Links:
> >> https://access.redhat.com/solutions/445233
> >>
> >>
> https://unix.stackexchange.com/questions/579251/how-to-use-parallel-to-speed-up-sort-for-big-files-fitting-in-ram
> >>
> >>
> https://stackoverflow.com/questions/7074430/how-do-we-sort-faster-using-unix-sort
> >>
> >> Best regards,
> >> Øyvind
> >>
> >
> >
>


-- 


---
Marco Neumann
KONA

Re: Testing tdb2.xloader

Posted by Andy Seaborne <an...@apache.org>.

On 11/12/2021 22:02, Marco Neumann wrote:
> Thank you Øyvind for sharing, great to see more tests in the wild.
> 
> I did the test with a 1TB SSD / RAID1 / 64GB / ubuntu and the truthy
> dataset and quickly ran out of disk space. It finished the job but did not
> write any of the indexes to disk due to lack of space. no error messages.

The 4.3.1 xloader should hopefully address the space issue.

     Andy

> 
> http://www.lotico.com/temp/LOG-95239
> 
> I have now ordered a new 4TB SSD drive to rerun the test possibly with the
> full wikidata dataset,
> 
> I personally had the best experience with dedicated hardware so far (can be
> in the data center), shared or dedicated virtual compute engines did not
> deliver as expected. And I have not seen great benefits from data center
> grade multicore cpus. But I think they will during runtime in multi user
> settings (eg fuseki).
> 
> Best,
> Marco
> 
> On Sat, Dec 11, 2021 at 9:45 PM Øyvind Gjesdal <oy...@gmail.com> wrote:
> 
>> I'm trying out tdb2.xloader on an openstack vm, loading the wikidata truthy
>> dump downloaded 2021-12-09.
>>
>> The instance is a vm created on the Norwegian Research and Education Cloud,
>> an openstack cloud provider.
>>
>> Instance type:
>> 32 GB memory
>> 4 CPU
>>
>> The storage used for dump + temp files  is mounted as a separate  900GB
>> volume and is mounted on /var/fuseki/databases
>> .The type of storage is described as
>>>   *mass-storage-default*: Storage backed by spinning hard drives,
>> available to everybody and is the default type.
>> with ext4 configured. At the moment I don't have access to the faster
>> volume type mass-storage-ssd. CPU and memory are not dedicated, and can be
>> overcommitted.
>>
>> OS for the instance is a clean Rocky Linux image, with no services except
>> jena/fuseki installed. The systemd service  set up for fuseki is stopped.
>> jena and fuseki version is 4.3.0.
>>
>> openjdk 11.0.13 2021-10-19 LTS
>> OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
>> OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode, sharing)
>>
>> I'm running from a tmux session to avoid connectivity issues and to capture
>> the output. I think the output is stored in memory and not on disk.
>> On First run I tried to have the tmpdir on the root partition, to separate
>> temp dir and data dir, but with only 19 GB free, the tmpdir soon was disk
>> full. For the second (current run) all directories are under
>> /var/fuseki/databases.
>>
>>   $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy --tmpdir
>> /var/fuseki/databases/tmp latest-truthy.nt.gz
>>
>> The import is so far at the "ingest data" stage where it has really slowed
>> down.
>>
>> Current output is:
>>
>> 20:03:43 INFO  Data            :: Add: 502,000,000 Data (Batch: 3,356 /
>> Avg: 7,593)
>>
>> See full log so far at
>> https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab
>>
>> Some notes:
>>
>> * There is a (time/info) lapse in the output log between the  end of
>> 'parse' and the start of 'index' for Terms.  It is unclear to me what is
>> happening in the 1h13 minutes between the lines.
>>
>> 22:33:46 INFO  Terms           ::   Elapsed: 50,720.20 seconds [2021/12/10
>> 22:33:46 CET]
>> 22:33:52 INFO  Terms           :: == Parse: 50726.071 seconds :
>> 6,560,468,631 triples/quads 129,331 TPS
>> 23:46:13 INFO  Terms           :: Add: 1,000,000 Index (Batch: 237,755 /
>> Avg: 237,755)
>>
>> * The ingest data step really slows down on the "ingest data stage": At the
>> current rate, if I calculated correctly, it looks like PKG.CmdxIngestData
>> has 10 days left before it finishes.
>>
>> * When I saw sort running in the background for the first parts of the job,
>> I looked at the `sort` command. I noticed from some online sources that
>> setting the environment variable LC_ALL=C improves speed for `sort`. Could
>> this be set on the ProcessBuilder for the `sort` process? Could it
>> break/change something? I see the warning from the man page for `sort`.
>>
>>         *** WARNING *** The locale specified by the environment affects
>>         sort order.  Set LC_ALL=C to get the traditional sort order that
>>         uses native byte values.
>>
>> Links:
>> https://access.redhat.com/solutions/445233
>>
>> https://unix.stackexchange.com/questions/579251/how-to-use-parallel-to-speed-up-sort-for-big-files-fitting-in-ram
>>
>> https://stackoverflow.com/questions/7074430/how-do-we-sort-faster-using-unix-sort
>>
>> Best regards,
>> Øyvind
>>
> 
> 

Re: Testing tdb2.xloader

Posted by Marco Neumann <ma...@gmail.com>.
Thank you Øyvind for sharing, great to see more tests in the wild.

I did the test with a 1TB SSD / RAID1 / 64GB / Ubuntu and the truthy
dataset and quickly ran out of disk space. It finished the job but did not
write any of the indexes to disk due to lack of space; no error messages.

http://www.lotico.com/temp/LOG-95239

I have now ordered a new 4TB SSD drive to rerun the test, possibly with the
full Wikidata dataset.

I personally have had the best experience with dedicated hardware so far (it
can be in the data center); shared or dedicated virtual compute engines did
not deliver as expected. And I have not seen great benefits from
data-center-grade multicore CPUs. But I think they will help at runtime in
multi-user settings (e.g. Fuseki).

Best,
Marco

On Sat, Dec 11, 2021 at 9:45 PM Øyvind Gjesdal <oy...@gmail.com> wrote:

> I'm trying out tdb2.xloader on an openstack vm, loading the wikidata truthy
> dump downloaded 2021-12-09.
>
> The instance is a vm created on the Norwegian Research and Education Cloud,
> an openstack cloud provider.
>
> Instance type:
> 32 GB memory
> 4 CPU
>
> The storage used for dump + temp files  is mounted as a separate  900GB
> volume and is mounted on /var/fuseki/databases
> .The type of storage is described as
> >  *mass-storage-default*: Storage backed by spinning hard drives,
> available to everybody and is the default type.
> with ext4 configured. At the moment I don't have access to the faster
> volume type mass-storage-ssd. CPU and memory are not dedicated, and can be
> overcommitted.
>
> OS for the instance is a clean Rocky Linux image, with no services except
> jena/fuseki installed. The systemd service  set up for fuseki is stopped.
> jena and fuseki version is 4.3.0.
>
> openjdk 11.0.13 2021-10-19 LTS
> OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
> OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode, sharing)
>
> I'm running from a tmux session to avoid connectivity issues and to capture
> the output. I think the output is stored in memory and not on disk.
> On First run I tried to have the tmpdir on the root partition, to separate
> temp dir and data dir, but with only 19 GB free, the tmpdir soon was disk
> full. For the second (current run) all directories are under
> /var/fuseki/databases.
>
>  $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy --tmpdir
> /var/fuseki/databases/tmp latest-truthy.nt.gz
>
> The import is so far at the "ingest data" stage where it has really slowed
> down.
>
> Current output is:
>
> 20:03:43 INFO  Data            :: Add: 502,000,000 Data (Batch: 3,356 /
> Avg: 7,593)
>
> See full log so far at
> https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab
>
> Some notes:
>
> * There is a (time/info) lapse in the output log between the  end of
> 'parse' and the start of 'index' for Terms.  It is unclear to me what is
> happening in the 1h13 minutes between the lines.
>
> 22:33:46 INFO  Terms           ::   Elapsed: 50,720.20 seconds [2021/12/10
> 22:33:46 CET]
> 22:33:52 INFO  Terms           :: == Parse: 50726.071 seconds :
> 6,560,468,631 triples/quads 129,331 TPS
> 23:46:13 INFO  Terms           :: Add: 1,000,000 Index (Batch: 237,755 /
> Avg: 237,755)
>
> * The ingest data step really slows down on the "ingest data stage": At the
> current rate, if I calculated correctly, it looks like PKG.CmdxIngestData
> has 10 days left before it finishes.
>
> * When I saw sort running in the background for the first parts of the job,
> I looked at the `sort` command. I noticed from some online sources that
> setting the environment variable LC_ALL=C improves speed for `sort`. Could
> this be set on the ProcessBuilder for the `sort` process? Could it
> break/change something? I see the warning from the man page for `sort`.
>
>        *** WARNING *** The locale specified by the environment affects
>        sort order.  Set LC_ALL=C to get the traditional sort order that
>        uses native byte values.
>
> Links:
> https://access.redhat.com/solutions/445233
>
> https://unix.stackexchange.com/questions/579251/how-to-use-parallel-to-speed-up-sort-for-big-files-fitting-in-ram
>
> https://stackoverflow.com/questions/7074430/how-do-we-sort-faster-using-unix-sort
>
> Best regards,
> Øyvind
>


-- 


---
Marco Neumann
KONA

Re: Testing tdb2.xloader

Posted by Øyvind Gjesdal <oy...@gmail.com>.
I'm debugging now and think I've found some possible culprits of the
slow data ingest stage on my setup.

In the ingest data stage I see 100% disk utilization on reads, at only
2.5 MB/s, and the ps command also shows my processor spending time waiting
for IO. In the jena-base/src/main/java/org/apache/jena/atlas/io/IO.java
class, both GZIPInputStream and GZIPOutputStream are constructed with a
single argument, which by default uses a 512-byte buffer. (
https://github.com/openjdk/jdk/blob/6e1da6440b898d8a10b3db56707a75f32a489149/src/java.base/share/classes/java/util/zip/GZIPInputStream.java#L90
)

I've built a snapshot now with "return new
GZIP[Output/Input]Stream(out, 8*1024);" (which is the buffer value used
elsewhere in Jena for GZIPOutput/Input), and it is currently running, but
I'm thinking of maybe testing larger buffers to see what my "lowest best
buffer" is. I'm now waiting for the first steps to complete, to see the
effect of changing it.
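
For reference, a minimal sketch of the idea, with an assumed file name and
buffer sizes (this is not the patched IO.java): wrap the file in a
BufferedInputStream and give GZIPInputStream a larger inflater buffer than
the 512-byte default.

import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.zip.GZIPInputStream;

public class GzipBufferSketch {
    // Larger buffers mean fewer, bigger reads against a slow spinning disk.
    static InputStream openGzip(String path) throws IOException {
        InputStream file = new BufferedInputStream(new FileInputStream(path), 128 * 1024);
        return new GZIPInputStream(file, 8 * 1024);   // 8 KiB instead of the 512-byte default
    }

    public static void main(String[] args) throws IOException {
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(openGzip("latest-truthy.nt.gz")))) {
            System.out.println("first line: " + reader.readLine());
        }
    }
}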

What do I need to do to start from the exec_java $PKG.CmdxIngestData step
in tdb2.xloader?

Would
* stopping xloader before this step once
* taking a copy of $LOC
* then copying the copy to $LOC before each run (and commenting out
preceding steps in bin/tdb2.xloader)

be enough to have a working test setup to start at "data ingest", or do I
need to do something else?

Best regards,
Øyvind



tir. 28. des. 2021 kl. 14:21 skrev Andy Seaborne <an...@apache.org>:

> Excellent news!
>
> Updated: https://www.w3.org/wiki/LargeTripleStores
>
>      Andy
>
> On 28/12/2021 10:11, Marco Neumann wrote:
> > Ok here is another successful tdb2 load. this time with the full wikidata
> > download (20211222_latest-all.nt.gz 172G ) file.
> >
> > counting 16,733,395,878 triples and a total of "103h 45m 15s" for the
> > entire load.
> >
> > I think with the right hardware this could easily be time compressed
> quite
> > a bit.
> >
> > http://www.lotico.com/temp/LOG-45497
> >
> > Best,
> > Marco
> >
> >
> > On Tue, Dec 21, 2021 at 8:59 AM Marco Neumann <ma...@gmail.com>
> > wrote:
> >
> >> Thank you Andy. found it in revisions somewhere
> >>
> >> just finished another run with truthy
> >>
> >> http://lotico.com/temp/LOG-1214
> >>
> >> will now increase RAM before running an additional load with increased
> >> thread count.
> >>
> >> Marco
> >>
> >> On Tue, Dec 21, 2021 at 8:48 AM Andy Seaborne <an...@apache.org> wrote:
> >>
> >>> gists are git repos: so the file is there ... somewhere:
> >>>
> >>>
> >>>
> https://gist.githubusercontent.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3/raw/9049cf8b559ce685b4293fca10d8b1c07cc79c43/tdb2_xloader_wikidata_truthy.log
> >>>
> >>>       Andy
> >>>
> >>> On 19/12/2021 17:56, Marco Neumann wrote:
> >>>> Thank you Lorenz,
> >>>> unfortunately the tdb2_xloader_wikidata_truthy.log is now truncated in
> >>>> github
> >>>>
> >>>>
> >>>> On Sun, Dec 19, 2021 at 9:46 AM LB <conpcomplete@googlemail.com
> >>> .invalid>
> >>>> wrote:
> >>>>
> >>>>> I edited the Gist [1] and put the default stats there. Takes ~4min to
> >>>>> compute the stats.
> >>>>>
> >>>>> Findings:
> >>>>>
> >>>>> - for Wikidata we have to extend those stats with the stats for
> wdt:P31
> >>>>> property as Wikidata does use this property as their own rdf:type
> >>>>> relation. It is indeed trivial, just execute
> >>>>>
> >>>>> select ?c (count(*) as ?cnt) {?s
> >>>>> <http://www.wikidata.org/prop/direct/P31> ?c} group by ?c
> >>>>>
> >>>>> and convert it into the stats rule language (SSE) and put those rules
> >>>>> before the more generic rule
> >>>>>
> >>>>> |(<http://www.wikidata.org/prop/direct/P31> 98152611)|
> >>>>>
> >>>>> - I didn't want to touch the stats script itself, but we could for
> >>>>> example also make this type relation generic and allow for other like
> >>>>> wdt:P31 or skos:subject via a commandline option which provides any
> URI
> >>>>> as the type relation with default being rdf:type - but that's for
> sure
> >>>>> probably overkill
> >>>>>
> >>>>> - there is a bug in the stats script or file I guess, because of of
> >>> some
> >>>>> overflow? the count value is
> >>>>>
> >>>>> (count -1983667112))
> >>>>>
> >>>>> which indicates this.  I opened a ticket:
> >>>>> https://issues.apache.org/jira/browse/JENA-2225
> >>>>>
> >>>>>
> >>>>> [1]
> >>>>>
> >>>
> https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
> >>>>>
> >>>>> On 18.12.21 11:35, Marco Neumann wrote:
> >>>>>> good morning Lorenz,
> >>>>>>
> >>>>>> Maybe time to get a few query bencharms tests? :)
> >>>>>>
> >>>>>> What does tdb2.tdbstats report?
> >>>>>>
> >>>>>> Marco
> >>>>>>
> >>>>>>
> >>>>>> On Sat, Dec 18, 2021 at 8:09 AM LB <conpcomplete@googlemail.com
> >>> .invalid>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Good morning,
> >>>>>>>
> >>>>>>> loading of Wikidata truthy is done, this time I didn't forget to
> keep
> >>>>>>> logs:
> >>>>>>>
> >>>
> https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
> >>>>>>>
> >>>>>>> I'm a bit surprised that this time it was 8h faster than last time,
> >>> 31h
> >>>>>>> vs 39h. Not sure if a) there was something else on the server last
> >>> time
> >>>>>>> (at least I couldn't see any running tasks) or b) if this is a
> >>>>>>> consequence of the more parallelized Unix sort now - I set it to
> >>>>>>> --parallel=16
> >>>>>>>
> >>>>>>> I mean, the piped input stream is single threaded I guess, but
> maybe
> >>> the
> >>>>>>> sort merge step can benefit from more threads? I guess I have to
> >>> clean
> >>>>>>> up everything and run it again with the original setup with 2 Unix
> >>> sort
> >>>>>>> threads ...
> >>>>>>>
> >>>>>>>
> >>>>>>> On 16.12.21 14:48, Andy Seaborne wrote:
> >>>>>>>>
> >>>>>>>> On 16/12/2021 10:52, Andy Seaborne wrote:
> >>>>>>>> ...
> >>>>>>>>
> >>>>>>>>> I am getting a slow down during data ingestion. However, your
> >>> summary
> >>>>>>>>> figures don't show that in the ingest phase. The whole logs may
> >>> have
> >>>>>>>>> the signal in it but less pronounced.
> >>>>>>>>>
> >>>>>>>>> My working assumption is now that it is random access to the node
> >>>>>>>>> table. Your results point to it not being a CPU issue but that my
> >>>>>>>>> setup is saturating the I/O path. While the portable has a NVMe
> >>> SSD,
> >>>>>>>>> it has probably not got the same I/O bandwidth as a server class
> >>>>>>>>> machine.
> >>>>>>>>>
> >>>>>>>>> I'm not sure what to do about this other than run with a much
> >>> bigger
> >>>>>>>>> node table cache for the ingestion phase. Substituting some file
> >>>>>>>>> mapper file area for bigger cache should be a win. While I hadn't
> >>>>>>>>> noticed before, it is probably visible in logs of smaller loads
> on
> >>>>>>>>> closer inspection. Experimenting on a small dataset is a lot
> >>> easier.
> >>>>>>>> I'm more sure of this - not yet definite.
> >>>>>>>>
> >>>>>>>> The nodeToNodeId cache is 200k -- this is on the load/update path.
> >>>>>>>> Seems rather small for the task.
> >>>>>>>>
> >>>>>>>> The nodeIdToNode cache is 1e6 -- this is the one that is hit by
> >>> SPARQL
> >>>>>>>> results.
> >>>>>>>>
> >>>>>>>> 2 pieces of data will help:
> >>>>>>>>
> >>>>>>>> Experimenting with very small cache settings.
> >>>>>>>>
> >>>>>>>> Letting my slow load keep going to see if there is the same
> >>>>>>>> characteristics at the index stage. There shouldn't be if
> >>> nodeToNodeId
> >>>>>>>> is the cause; it's only an influence in the data ingestion step.
> >>>>>>>>
> >>>>>>>> Aside : Increasing nodeToNodeId could also help tdbloader=parallel
> >>> and
> >>>>>>>> maybe loader=phased. It falls into the same situation although the
> >>>>>>>> improvement there is going to be less marked. "Parallel" saturates
> >>> the
> >>>>>>>> I/O by other means as well.
> >>>>>>>>
> >>>>>>>>        Andy
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >>
> >>
> >> --
> >>
> >>
> >> ---
> >> Marco Neumann
> >> KONA
> >>
> >>
> >
>

Re: Testing tdb2.xloader

Posted by Andy Seaborne <an...@apache.org>.
Excellent news!

Updated: https://www.w3.org/wiki/LargeTripleStores

     Andy

On 28/12/2021 10:11, Marco Neumann wrote:
> Ok here is another successful tdb2 load. this time with the full wikidata
> download (20211222_latest-all.nt.gz 172G ) file.
> 
> counting 16,733,395,878 triples and a total of "103h 45m 15s" for the
> entire load.
> 
> I think with the right hardware this could easily be time compressed quite
> a bit.
> 
> http://www.lotico.com/temp/LOG-45497
> 
> Best,
> Marco
> 
> 
> On Tue, Dec 21, 2021 at 8:59 AM Marco Neumann <ma...@gmail.com>
> wrote:
> 
>> Thank you Andy. found it in revisions somewhere
>>
>> just finished another run with truthy
>>
>> http://lotico.com/temp/LOG-1214
>>
>> will now increase RAM before running an additional load with increased
>> thread count.
>>
>> Marco
>>
>> On Tue, Dec 21, 2021 at 8:48 AM Andy Seaborne <an...@apache.org> wrote:
>>
>>> gists are git repos: so the file is there ... somewhere:
>>>
>>>
>>> https://gist.githubusercontent.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3/raw/9049cf8b559ce685b4293fca10d8b1c07cc79c43/tdb2_xloader_wikidata_truthy.log
>>>
>>>       Andy
>>>
>>> On 19/12/2021 17:56, Marco Neumann wrote:
>>>> Thank you Lorenz,
>>>> unfortunately the tdb2_xloader_wikidata_truthy.log is now truncated in
>>>> github
>>>>
>>>>
>>>> On Sun, Dec 19, 2021 at 9:46 AM LB <conpcomplete@googlemail.com
>>> .invalid>
>>>> wrote:
>>>>
>>>>> I edited the Gist [1] and put the default stats there. Takes ~4min to
>>>>> compute the stats.
>>>>>
>>>>> Findings:
>>>>>
>>>>> - for Wikidata we have to extend those stats with the stats for wdt:P31
>>>>> property as Wikidata does use this property as their own rdf:type
>>>>> relation. It is indeed trivial, just execute
>>>>>
>>>>> select ?c (count(*) as ?cnt) {?s
>>>>> <http://www.wikidata.org/prop/direct/P31> ?c} group by ?c
>>>>>
>>>>> and convert it into the stats rule language (SSE) and put those rules
>>>>> before the more generic rule
>>>>>
>>>>> |(<http://www.wikidata.org/prop/direct/P31> 98152611)|
>>>>>
>>>>> - I didn't want to touch the stats script itself, but we could for
>>>>> example also make this type relation generic and allow for other like
>>>>> wdt:P31 or skos:subject via a commandline option which provides any URI
>>>>> as the type relation with default being rdf:type - but that's for sure
>>>>> probably overkill
>>>>>
>>>>> - there is a bug in the stats script or file I guess, because of of
>>> some
>>>>> overflow? the count value is
>>>>>
>>>>> (count -1983667112))
>>>>>
>>>>> which indicates this.  I opened a ticket:
>>>>> https://issues.apache.org/jira/browse/JENA-2225
>>>>>
>>>>>
>>>>> [1]
>>>>>
>>> https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
>>>>>
>>>>> On 18.12.21 11:35, Marco Neumann wrote:
>>>>>> good morning Lorenz,
>>>>>>
>>>>>> Maybe time to get a few query bencharms tests? :)
>>>>>>
>>>>>> What does tdb2.tdbstats report?
>>>>>>
>>>>>> Marco
>>>>>>
>>>>>>
>>>>>> On Sat, Dec 18, 2021 at 8:09 AM LB <conpcomplete@googlemail.com
>>> .invalid>
>>>>>> wrote:
>>>>>>
>>>>>>> Good morning,
>>>>>>>
>>>>>>> loading of Wikidata truthy is done, this time I didn't forget to keep
>>>>>>> logs:
>>>>>>>
>>> https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
>>>>>>>
>>>>>>> I'm a bit surprised that this time it was 8h faster than last time,
>>> 31h
>>>>>>> vs 39h. Not sure if a) there was something else on the server last
>>> time
>>>>>>> (at least I couldn't see any running tasks) or b) if this is a
>>>>>>> consequence of the more parallelized Unix sort now - I set it to
>>>>>>> --parallel=16
>>>>>>>
>>>>>>> I mean, the piped input stream is single threaded I guess, but maybe
>>> the
>>>>>>> sort merge step can benefit from more threads? I guess I have to
>>> clean
>>>>>>> up everything and run it again with the original setup with 2 Unix
>>> sort
>>>>>>> threads ...
>>>>>>>
>>>>>>>
>>>>>>> On 16.12.21 14:48, Andy Seaborne wrote:
>>>>>>>>
>>>>>>>> On 16/12/2021 10:52, Andy Seaborne wrote:
>>>>>>>> ...
>>>>>>>>
>>>>>>>>> I am getting a slow down during data ingestion. However, your
>>> summary
>>>>>>>>> figures don't show that in the ingest phase. The whole logs may
>>> have
>>>>>>>>> the signal in it but less pronounced.
>>>>>>>>>
>>>>>>>>> My working assumption is now that it is random access to the node
>>>>>>>>> table. Your results point to it not being a CPU issue but that my
>>>>>>>>> setup is saturating the I/O path. While the portable has a NVMe
>>> SSD,
>>>>>>>>> it has probably not got the same I/O bandwidth as a server class
>>>>>>>>> machine.
>>>>>>>>>
>>>>>>>>> I'm not sure what to do about this other than run with a much
>>> bigger
>>>>>>>>> node table cache for the ingestion phase. Substituting some file
>>>>>>>>> mapper file area for bigger cache should be a win. While I hadn't
>>>>>>>>> noticed before, it is probably visible in logs of smaller loads on
>>>>>>>>> closer inspection. Experimenting on a small dataset is a lot
>>> easier.
>>>>>>>> I'm more sure of this - not yet definite.
>>>>>>>>
>>>>>>>> The nodeToNodeId cache is 200k -- this is on the load/update path.
>>>>>>>> Seems rather small for the task.
>>>>>>>>
>>>>>>>> The nodeIdToNode cache is 1e6 -- this is the one that is hit by
>>> SPARQL
>>>>>>>> results.
>>>>>>>>
>>>>>>>> 2 pieces of data will help:
>>>>>>>>
>>>>>>>> Experimenting with very small cache settings.
>>>>>>>>
>>>>>>>> Letting my slow load keep going to see if there is the same
>>>>>>>> characteristics at the index stage. There shouldn't be if
>>> nodeToNodeId
>>>>>>>> is the cause; it's only an influence in the data ingestion step.
>>>>>>>>
>>>>>>>> Aside : Increasing nodeToNodeId could also help tdbloader=parallel
>>> and
>>>>>>>> maybe loader=phased. It falls into the same situation although the
>>>>>>>> improvement there is going to be less marked. "Parallel" saturates
>>> the
>>>>>>>> I/O by other means as well.
>>>>>>>>
>>>>>>>>        Andy
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>>
>>
>> ---
>> Marco Neumann
>> KONA
>>
>>
> 

Re: Testing tdb2.xloader

Posted by Marco Neumann <ma...@gmail.com>.
Ok, here is another successful TDB2 load, this time with the full Wikidata
download (20211222_latest-all.nt.gz, 172G).

It counted 16,733,395,878 triples and took a total of "103h 45m 15s" for the
entire load.

I think with the right hardware this time could easily be compressed quite
a bit.

http://www.lotico.com/temp/LOG-45497

Best,
Marco


On Tue, Dec 21, 2021 at 8:59 AM Marco Neumann <ma...@gmail.com>
wrote:

> Thank you Andy. found it in revisions somewhere
>
> just finished another run with truthy
>
> http://lotico.com/temp/LOG-1214
>
> will now increase RAM before running an additional load with increased
> thread count.
>
> Marco
>
> On Tue, Dec 21, 2021 at 8:48 AM Andy Seaborne <an...@apache.org> wrote:
>
>> gists are git repos: so the file is there ... somewhere:
>>
>>
>> https://gist.githubusercontent.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3/raw/9049cf8b559ce685b4293fca10d8b1c07cc79c43/tdb2_xloader_wikidata_truthy.log
>>
>>      Andy
>>
>> On 19/12/2021 17:56, Marco Neumann wrote:
>> > Thank you Lorenz,
>> > unfortunately the tdb2_xloader_wikidata_truthy.log is now truncated in
>> > github
>> >
>> >
>> > On Sun, Dec 19, 2021 at 9:46 AM LB <conpcomplete@googlemail.com
>> .invalid>
>> > wrote:
>> >
>> >> I edited the Gist [1] and put the default stats there. Takes ~4min to
>> >> compute the stats.
>> >>
>> >> Findings:
>> >>
>> >> - for Wikidata we have to extend those stats with the stats for wdt:P31
>> >> property as Wikidata does use this property as their own rdf:type
>> >> relation. It is indeed trivial, just execute
>> >>
>> >> select ?c (count(*) as ?cnt) {?s
>> >> <http://www.wikidata.org/prop/direct/P31> ?c} group by ?c
>> >>
>> >> and convert it into the stats rule language (SSE) and put those rules
>> >> before the more generic rule
>> >>
>> >> |(<http://www.wikidata.org/prop/direct/P31> 98152611)|
>> >>
>> >> - I didn't want to touch the stats script itself, but we could for
>> >> example also make this type relation generic and allow for other like
>> >> wdt:P31 or skos:subject via a commandline option which provides any URI
>> >> as the type relation with default being rdf:type - but that's for sure
>> >> probably overkill
>> >>
>> >> - there is a bug in the stats script or file I guess, because of of
>> some
>> >> overflow? the count value is
>> >>
>> >> (count -1983667112))
>> >>
>> >> which indicates this.  I opened a ticket:
>> >> https://issues.apache.org/jira/browse/JENA-2225
>> >>
>> >>
>> >> [1]
>> >>
>> https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
>> >>
>> >> On 18.12.21 11:35, Marco Neumann wrote:
>> >>> good morning Lorenz,
>> >>>
>> >>> Maybe time to get a few query bencharms tests? :)
>> >>>
>> >>> What does tdb2.tdbstats report?
>> >>>
>> >>> Marco
>> >>>
>> >>>
>> >>> On Sat, Dec 18, 2021 at 8:09 AM LB <conpcomplete@googlemail.com
>> .invalid>
>> >>> wrote:
>> >>>
>> >>>> Good morning,
>> >>>>
>> >>>> loading of Wikidata truthy is done, this time I didn't forget to keep
>> >>>> logs:
>> >>>>
>> https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
>> >>>>
>> >>>> I'm a bit surprised that this time it was 8h faster than last time,
>> 31h
>> >>>> vs 39h. Not sure if a) there was something else on the server last
>> time
>> >>>> (at least I couldn't see any running tasks) or b) if this is a
>> >>>> consequence of the more parallelized Unix sort now - I set it to
>> >>>> --parallel=16
>> >>>>
>> >>>> I mean, the piped input stream is single threaded I guess, but maybe
>> the
>> >>>> sort merge step can benefit from more threads? I guess I have to
>> clean
>> >>>> up everything and run it again with the original setup with 2 Unix
>> sort
>> >>>> threads ...
>> >>>>
>> >>>>
>> >>>> On 16.12.21 14:48, Andy Seaborne wrote:
>> >>>>>
>> >>>>> On 16/12/2021 10:52, Andy Seaborne wrote:
>> >>>>> ...
>> >>>>>
>> >>>>>> I am getting a slow down during data ingestion. However, your
>> summary
>> >>>>>> figures don't show that in the ingest phase. The whole logs may
>> have
>> >>>>>> the signal in it but less pronounced.
>> >>>>>>
>> >>>>>> My working assumption is now that it is random access to the node
>> >>>>>> table. Your results point to it not being a CPU issue but that my
>> >>>>>> setup is saturating the I/O path. While the portable has a NVMe
>> SSD,
>> >>>>>> it has probably not got the same I/O bandwidth as a server class
>> >>>>>> machine.
>> >>>>>>
>> >>>>>> I'm not sure what to do about this other than run with a much
>> bigger
>> >>>>>> node table cache for the ingestion phase. Substituting some file
>> >>>>>> mapper file area for bigger cache should be a win. While I hadn't
>> >>>>>> noticed before, it is probably visible in logs of smaller loads on
>> >>>>>> closer inspection. Experimenting on a small dataset is a lot
>> easier.
>> >>>>> I'm more sure of this - not yet definite.
>> >>>>>
>> >>>>> The nodeToNodeId cache is 200k -- this is on the load/update path.
>> >>>>> Seems rather small for the task.
>> >>>>>
>> >>>>> The nodeIdToNode cache is 1e6 -- this is the one that is hit by
>> SPARQL
>> >>>>> results.
>> >>>>>
>> >>>>> 2 pieces of data will help:
>> >>>>>
>> >>>>> Experimenting with very small cache settings.
>> >>>>>
>> >>>>> Letting my slow load keep going to see if there is the same
>> >>>>> characteristics at the index stage. There shouldn't be if
>> nodeToNodeId
>> >>>>> is the cause; it's only an influence in the data ingestion step.
>> >>>>>
>> >>>>> Aside : Increasing nodeToNodeId could also help tdbloader=parallel
>> and
>> >>>>> maybe loader=phased. It falls into the same situation although the
>> >>>>> improvement there is going to be less marked. "Parallel" saturates
>> the
>> >>>>> I/O by other means as well.
>> >>>>>
>> >>>>>       Andy
>> >>>
>> >>
>> >
>> >
>>
>
>
> --
>
>
> ---
> Marco Neumann
> KONA
>
>

-- 


---
Marco Neumann
KONA

Re: Testing tdb2.xloader

Posted by Marco Neumann <ma...@gmail.com>.
Thank you Andy, found it in the revisions somewhere.

just finished another run with truthy

http://lotico.com/temp/LOG-1214

will now increase RAM before running an additional load with increased
thread count.

Marco

On Tue, Dec 21, 2021 at 8:48 AM Andy Seaborne <an...@apache.org> wrote:

> gists are git repos: so the file is there ... somewhere:
>
>
> https://gist.githubusercontent.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3/raw/9049cf8b559ce685b4293fca10d8b1c07cc79c43/tdb2_xloader_wikidata_truthy.log
>
>      Andy
>
> On 19/12/2021 17:56, Marco Neumann wrote:
> > Thank you Lorenz,
> > unfortunately the tdb2_xloader_wikidata_truthy.log is now truncated in
> > github
> >
> >
> > On Sun, Dec 19, 2021 at 9:46 AM LB <co...@googlemail.com.invalid>
> > wrote:
> >
> >> I edited the Gist [1] and put the default stats there. Takes ~4min to
> >> compute the stats.
> >>
> >> Findings:
> >>
> >> - for Wikidata we have to extend those stats with the stats for wdt:P31
> >> property as Wikidata does use this property as their own rdf:type
> >> relation. It is indeed trivial, just execute
> >>
> >> select ?c (count(*) as ?cnt) {?s
> >> <http://www.wikidata.org/prop/direct/P31> ?c} group by ?c
> >>
> >> and convert it into the stats rule language (SSE) and put those rules
> >> before the more generic rule
> >>
> >> |(<http://www.wikidata.org/prop/direct/P31> 98152611)|
> >>
> >> - I didn't want to touch the stats script itself, but we could for
> >> example also make this type relation generic and allow for other like
> >> wdt:P31 or skos:subject via a commandline option which provides any URI
> >> as the type relation with default being rdf:type - but that's for sure
> >> probably overkill
> >>
> >> - there is a bug in the stats script or file I guess, because of of some
> >> overflow? the count value is
> >>
> >> (count -1983667112))
> >>
> >> which indicates this.  I opened a ticket:
> >> https://issues.apache.org/jira/browse/JENA-2225
> >>
> >>
> >> [1]
> >> https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
> >>
> >> On 18.12.21 11:35, Marco Neumann wrote:
> >>> good morning Lorenz,
> >>>
> >>> Maybe time to get a few query bencharms tests? :)
> >>>
> >>> What does tdb2.tdbstats report?
> >>>
> >>> Marco
> >>>
> >>>
> >>> On Sat, Dec 18, 2021 at 8:09 AM LB <conpcomplete@googlemail.com
> .invalid>
> >>> wrote:
> >>>
> >>>> Good morning,
> >>>>
> >>>> loading of Wikidata truthy is done, this time I didn't forget to keep
> >>>> logs:
> >>>>
> https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
> >>>>
> >>>> I'm a bit surprised that this time it was 8h faster than last time,
> 31h
> >>>> vs 39h. Not sure if a) there was something else on the server last
> time
> >>>> (at least I couldn't see any running tasks) or b) if this is a
> >>>> consequence of the more parallelized Unix sort now - I set it to
> >>>> --parallel=16
> >>>>
> >>>> I mean, the piped input stream is single threaded I guess, but maybe
> the
> >>>> sort merge step can benefit from more threads? I guess I have to clean
> >>>> up everything and run it again with the original setup with 2 Unix
> sort
> >>>> threads ...
> >>>>
> >>>>
> >>>> On 16.12.21 14:48, Andy Seaborne wrote:
> >>>>>
> >>>>> On 16/12/2021 10:52, Andy Seaborne wrote:
> >>>>> ...
> >>>>>
> >>>>>> I am getting a slow down during data ingestion. However, your
> summary
> >>>>>> figures don't show that in the ingest phase. The whole logs may have
> >>>>>> the signal in it but less pronounced.
> >>>>>>
> >>>>>> My working assumption is now that it is random access to the node
> >>>>>> table. Your results point to it not being a CPU issue but that my
> >>>>>> setup is saturating the I/O path. While the portable has a NVMe SSD,
> >>>>>> it has probably not got the same I/O bandwidth as a server class
> >>>>>> machine.
> >>>>>>
> >>>>>> I'm not sure what to do about this other than run with a much bigger
> >>>>>> node table cache for the ingestion phase. Substituting some file
> >>>>>> mapper file area for bigger cache should be a win. While I hadn't
> >>>>>> noticed before, it is probably visible in logs of smaller loads on
> >>>>>> closer inspection. Experimenting on a small dataset is a lot easier.
> >>>>> I'm more sure of this - not yet definite.
> >>>>>
> >>>>> The nodeToNodeId cache is 200k -- this is on the load/update path.
> >>>>> Seems rather small for the task.
> >>>>>
> >>>>> The nodeIdToNode cache is 1e6 -- this is the one that is hit by
> SPARQL
> >>>>> results.
> >>>>>
> >>>>> 2 pieces of data will help:
> >>>>>
> >>>>> Experimenting with very small cache settings.
> >>>>>
> >>>>> Letting my slow load keep going to see if there is the same
> >>>>> characteristics at the index stage. There shouldn't be if
> nodeToNodeId
> >>>>> is the cause; it's only an influence in the data ingestion step.
> >>>>>
> >>>>> Aside : Increasing nodeToNodeId could also help tdbloader=parallel
> and
> >>>>> maybe loader=phased. It falls into the same situation although the
> >>>>> improvement there is going to be less marked. "Parallel" saturates
> the
> >>>>> I/O by other means as well.
> >>>>>
> >>>>>       Andy
> >>>
> >>
> >
> >
>


-- 


---
Marco Neumann
KONA

Re: Testing tdb2.xloader

Posted by Andy Seaborne <an...@apache.org>.
gists are git repos: so the file is there ... somewhere:

https://gist.githubusercontent.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3/raw/9049cf8b559ce685b4293fca10d8b1c07cc79c43/tdb2_xloader_wikidata_truthy.log

     Andy

On 19/12/2021 17:56, Marco Neumann wrote:
> Thank you Lorenz,
> unfortunately the tdb2_xloader_wikidata_truthy.log is now truncated in
> github
> 
> 
> On Sun, Dec 19, 2021 at 9:46 AM LB <co...@googlemail.com.invalid>
> wrote:
> 
>> I edited the Gist [1] and put the default stats there. Takes ~4min to
>> compute the stats.
>>
>> Findings:
>>
>> - for Wikidata we have to extend those stats with the stats for wdt:P31
>> property as Wikidata does use this property as their own rdf:type
>> relation. It is indeed trivial, just execute
>>
>> select ?c (count(*) as ?cnt) {?s
>> <http://www.wikidata.org/prop/direct/P31> ?c} group by ?c
>>
>> and convert it into the stats rule language (SSE) and put those rules
>> before the more generic rule
>>
>> |(<http://www.wikidata.org/prop/direct/P31> 98152611)|
>>
>> - I didn't want to touch the stats script itself, but we could for
>> example also make this type relation generic and allow for other like
>> wdt:P31 or skos:subject via a commandline option which provides any URI
>> as the type relation with default being rdf:type - but that's for sure
>> probably overkill
>>
>> - there is a bug in the stats script or file I guess, because of of some
>> overflow? the count value is
>>
>> (count -1983667112))
>>
>> which indicates this.  I opened a ticket:
>> https://issues.apache.org/jira/browse/JENA-2225
>>
>>
>> [1]
>> https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
>>
>> On 18.12.21 11:35, Marco Neumann wrote:
>>> good morning Lorenz,
>>>
>>> Maybe time to get a few query bencharms tests? :)
>>>
>>> What does tdb2.tdbstats report?
>>>
>>> Marco
>>>
>>>
>>> On Sat, Dec 18, 2021 at 8:09 AM LB <co...@googlemail.com.invalid>
>>> wrote:
>>>
>>>> Good morning,
>>>>
>>>> loading of Wikidata truthy is done, this time I didn't forget to keep
>>>> logs:
>>>> https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
>>>>
>>>> I'm a bit surprised that this time it was 8h faster than last time, 31h
>>>> vs 39h. Not sure if a) there was something else on the server last time
>>>> (at least I couldn't see any running tasks) or b) if this is a
>>>> consequence of the more parallelized Unix sort now - I set it to
>>>> --parallel=16
>>>>
>>>> I mean, the piped input stream is single threaded I guess, but maybe the
>>>> sort merge step can benefit from more threads? I guess I have to clean
>>>> up everything and run it again with the original setup with 2 Unix sort
>>>> threads ...
>>>>
>>>>
>>>> On 16.12.21 14:48, Andy Seaborne wrote:
>>>>>
>>>>> On 16/12/2021 10:52, Andy Seaborne wrote:
>>>>> ...
>>>>>
>>>>>> I am getting a slow down during data ingestion. However, your summary
>>>>>> figures don't show that in the ingest phase. The whole logs may have
>>>>>> the signal in it but less pronounced.
>>>>>>
>>>>>> My working assumption is now that it is random access to the node
>>>>>> table. Your results point to it not being a CPU issue but that my
>>>>>> setup is saturating the I/O path. While the portable has a NVMe SSD,
>>>>>> it has probably not got the same I/O bandwidth as a server class
>>>>>> machine.
>>>>>>
>>>>>> I'm not sure what to do about this other than run with a much bigger
>>>>>> node table cache for the ingestion phase. Substituting some file
>>>>>> mapper file area for bigger cache should be a win. While I hadn't
>>>>>> noticed before, it is probably visible in logs of smaller loads on
>>>>>> closer inspection. Experimenting on a small dataset is a lot easier.
>>>>> I'm more sure of this - not yet definite.
>>>>>
>>>>> The nodeToNodeId cache is 200k -- this is on the load/update path.
>>>>> Seems rather small for the task.
>>>>>
>>>>> The nodeIdToNode cache is 1e6 -- this is the one that is hit by SPARQL
>>>>> results.
>>>>>
>>>>> 2 pieces of data will help:
>>>>>
>>>>> Experimenting with very small cache settings.
>>>>>
>>>>> Letting my slow load keep going to see if there is the same
>>>>> characteristics at the index stage. There shouldn't be if nodeToNodeId
>>>>> is the cause; it's only an influence in the data ingestion step.
>>>>>
>>>>> Aside : Increasing nodeToNodeId could also help tdbloader=parallel and
>>>>> maybe loader=phased. It falls into the same situation although the
>>>>> improvement there is going to be less marked. "Parallel" saturates the
>>>>> I/O by other means as well.
>>>>>
>>>>>       Andy
>>>
>>
> 
> 

Re: Testing tdb2.xloader

Posted by Marco Neumann <ma...@gmail.com>.
Thank you Lorenz,
unfortunately the tdb2_xloader_wikidata_truthy.log is now truncated on
GitHub


On Sun, Dec 19, 2021 at 9:46 AM LB <co...@googlemail.com.invalid>
wrote:

> I edited the Gist [1] and put the default stats there. Takes ~4min to
> compute the stats.
>
> Findings:
>
> - for Wikidata we have to extend those stats with the stats for wdt:P31
> property as Wikidata does use this property as their own rdf:type
> relation. It is indeed trivial, just execute
>
> select ?c (count(*) as ?cnt) {?s
> <http://www.wikidata.org/prop/direct/P31> ?c} group by ?c
>
> and convert it into the stats rule language (SSE) and put those rules
> before the more generic rule
>
> |(<http://www.wikidata.org/prop/direct/P31> 98152611)|
>
> - I didn't want to touch the stats script itself, but we could for
> example also make this type relation generic and allow for other like
> wdt:P31 or skos:subject via a commandline option which provides any URI
> as the type relation with default being rdf:type - but that's for sure
> probably overkill
>
> - there is a bug in the stats script or file I guess, because of of some
> overflow? the count value is
>
> (count -1983667112))
>
> which indicates this.  I opened a ticket:
> https://issues.apache.org/jira/browse/JENA-2225
>
>
> [1]
> https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
>
> On 18.12.21 11:35, Marco Neumann wrote:
> > good morning Lorenz,
> >
> > Maybe time to get a few query bencharms tests? :)
> >
> > What does tdb2.tdbstats report?
> >
> > Marco
> >
> >
> > On Sat, Dec 18, 2021 at 8:09 AM LB <co...@googlemail.com.invalid>
> > wrote:
> >
> >> Good morning,
> >>
> >> loading of Wikidata truthy is done, this time I didn't forget to keep
> >> logs:
> >> https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
> >>
> >> I'm a bit surprised that this time it was 8h faster than last time, 31h
> >> vs 39h. Not sure if a) there was something else on the server last time
> >> (at least I couldn't see any running tasks) or b) if this is a
> >> consequence of the more parallelized Unix sort now - I set it to
> >> --parallel=16
> >>
> >> I mean, the piped input stream is single threaded I guess, but maybe the
> >> sort merge step can benefit from more threads? I guess I have to clean
> >> up everything and run it again with the original setup with 2 Unix sort
> >> threads ...
> >>
> >>
> >> On 16.12.21 14:48, Andy Seaborne wrote:
> >>>
> >>> On 16/12/2021 10:52, Andy Seaborne wrote:
> >>> ...
> >>>
> >>>> I am getting a slow down during data ingestion. However, your summary
> >>>> figures don't show that in the ingest phase. The whole logs may have
> >>>> the signal in it but less pronounced.
> >>>>
> >>>> My working assumption is now that it is random access to the node
> >>>> table. Your results point to it not being a CPU issue but that my
> >>>> setup is saturating the I/O path. While the portable has a NVMe SSD,
> >>>> it has probably not got the same I/O bandwidth as a server class
> >>>> machine.
> >>>>
> >>>> I'm not sure what to do about this other than run with a much bigger
> >>>> node table cache for the ingestion phase. Substituting some file
> >>>> mapper file area for bigger cache should be a win. While I hadn't
> >>>> noticed before, it is probably visible in logs of smaller loads on
> >>>> closer inspection. Experimenting on a small dataset is a lot easier.
> >>> I'm more sure of this - not yet definite.
> >>>
> >>> The nodeToNodeId cache is 200k -- this is on the load/update path.
> >>> Seems rather small for the task.
> >>>
> >>> The nodeIdToNode cache is 1e6 -- this is the one that is hit by SPARQL
> >>> results.
> >>>
> >>> 2 pieces of data will help:
> >>>
> >>> Experimenting with very small cache settings.
> >>>
> >>> Letting my slow load keep going to see if there is the same
> >>> characteristics at the index stage. There shouldn't be if nodeToNodeId
> >>> is the cause; it's only an influence in the data ingestion step.
> >>>
> >>> Aside : Increasing nodeToNodeId could also help tdbloader=parallel and
> >>> maybe loader=phased. It falls into the same situation although the
> >>> improvement there is going to be less marked. "Parallel" saturates the
> >>> I/O by other means as well.
> >>>
> >>>      Andy
> >
>


-- 


---
Marco Neumann
KONA

Re: Testing tdb2.xloader

Posted by Andy Seaborne <an...@apache.org>.
On 19/12/2021 09:46, LB wrote:
> I edited the Gist [1] and put the default stats there. Takes ~4min to 
> compute the stats.
> 
> Findings:
> 
> - for Wikidata we have to extend those stats with the stats for wdt:P31 
> property as Wikidata does use this property as their own rdf:type 
> relation. It is indeed trivial, just execute
> 
> select ?c (count(*) as ?cnt) {?s 
> <http://www.wikidata.org/prop/direct/P31> ?c} group by ?c

Hmm, that's going to have an impact on the default fixed reorder 
algorithm which has "rdf:type" built-in.

Probably another thread here - there are places where the design of 
ARQ/TDB could be improved for multi-billion triple datasets.

Several high volume triple stores are read-only with 6 indexes for the 
graph. That gives two things: no need to support transactions, and any 
join can be a merge join or parallel merge join.

Having cursors into the B+Trees (iterators that can jump over sections 
of the scan to support SIP (sideways information passing)) is another area.

There's some mileage yet in single machine stores, mostly because 
hardware keeps getting better, at a cost, and persistent RAM will change 
things ("in 5 yars time").

But native cluster engines already have some advantages.

> and convert it into the stats rule language (SSE) and put those rules 
> before the more generic rule
> 
> |(<http://www.wikidata.org/prop/direct/P31> 98152611)|
> 
> - I didn't want to touch the stats script itself, but we could for 
> example also make this type relation generic and allow for other like 
> wdt:P31 or skos:subject via a commandline option which provides any URI 
> as the type relation with default being rdf:type - but that's for sure 
> probably overkill
> 
> - there is a bug in the stats script or file I guess, because of of some 
> overflow? the count value is
> 
> (count -1983667112))
> 
> which indicates this.  I opened a ticket: 
> https://issues.apache.org/jira/browse/JENA-2225

Thanks.

> 
> 
> [1] https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
> 
> On 18.12.21 11:35, Marco Neumann wrote:
>> good morning Lorenz,
>>
>> Maybe time to get a few query bencharms tests? :)
>>
>> What does tdb2.tdbstats report?
>>
>> Marco
>>
>>
>> On Sat, Dec 18, 2021 at 8:09 AM LB <co...@googlemail.com.invalid>
>> wrote:
>>
>>> Good morning,
>>>
>>> loading of Wikidata truthy is done, this time I didn't forget to keep
>>> logs:
>>> https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
>>>
>>> I'm a bit surprised that this time it was 8h faster than last time, 31h
>>> vs 39h. Not sure if a) there was something else on the server last time
>>> (at least I couldn't see any running tasks) or b) if this is a
>>> consequence of the more parallelized Unix sort now - I set it to
>>> --parallel=16
>>>
>>> I mean, the piped input stream is single threaded I guess, but maybe the
>>> sort merge step can benefit from more threads? I guess I have to clean
>>> up everything and run it again with the original setup with 2 Unix sort
>>> threads ...
>>>
>>>
>>> On 16.12.21 14:48, Andy Seaborne wrote:
>>>>
>>>> On 16/12/2021 10:52, Andy Seaborne wrote:
>>>> ...
>>>>
>>>>> I am getting a slow down during data ingestion. However, your summary
>>>>> figures don't show that in the ingest phase. The whole logs may have
>>>>> the signal in it but less pronounced.
>>>>>
>>>>> My working assumption is now that it is random access to the node
>>>>> table. Your results point to it not being a CPU issue but that my
>>>>> setup is saturating the I/O path. While the portable has a NVMe SSD,
>>>>> it has probably not got the same I/O bandwidth as a server class
>>>>> machine.
>>>>>
>>>>> I'm not sure what to do about this other than run with a much bigger
>>>>> node table cache for the ingestion phase. Substituting some file
>>>>> mapper file area for bigger cache should be a win. While I hadn't
>>>>> noticed before, it is probably visible in logs of smaller loads on
>>>>> closer inspection. Experimenting on a small dataset is a lot easier.
>>>> I'm more sure of this - not yet definite.
>>>>
>>>> The nodeToNodeId cache is 200k -- this is on the load/update path.
>>>> Seems rather small for the task.
>>>>
>>>> The nodeIdToNode cache is 1e6 -- this is the one that is hit by SPARQL
>>>> results.
>>>>
>>>> 2 pieces of data will help:
>>>>
>>>> Experimenting with very small cache settings.
>>>>
>>>> Letting my slow load keep going to see if there is the same
>>>> characteristics at the index stage. There shouldn't be if nodeToNodeId
>>>> is the cause; it's only an influence in the data ingestion step.
>>>>
>>>> Aside : Increasing nodeToNodeId could also help tdbloader=parallel and
>>>> maybe loader=phased. It falls into the same situation although the
>>>> improvement there is going to be less marked. "Parallel" saturates the
>>>> I/O by other means as well.
>>>>
>>>>      Andy
>>
> 

Re: Testing tdb2.xloader

Posted by LB <co...@googlemail.com.INVALID>.
I edited the Gist [1] and put the default stats there. Takes ~4min to 
compute the stats.

Findings:

- for Wikidata we have to extend those stats with the stats for the wdt:P31 
property, as Wikidata uses this property as its own rdf:type 
relation. It is indeed trivial, just execute

select ?c (count(*) as ?cnt) {?s 
<http://www.wikidata.org/prop/direct/P31> ?c} group by ?c

and convert it into the stats rule language (SSE) and put those rules 
before the more generic rule (a sketch of the full procedure follows below)

(<http://www.wikidata.org/prop/direct/P31> 98152611)

- I didn't want to touch the stats script itself, but we could for 
example also make this type relation generic and allow for others like 
wdt:P31 or skos:subject via a command-line option which provides any URI 
as the type relation, with the default being rdf:type - but that's 
probably overkill

- there is a bug in the stats script or file I guess, because of some 
overflow? The count value is

(count -1983667112))

which indicates this.  I opened a ticket: 
https://issues.apache.org/jira/browse/JENA-2225


[1] https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
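
For concreteness, a minimal sketch of the full procedure as a shell session.
Assumptions worth checking against the TDB2 optimizer docs: that tdb2.tdbstats
takes --loc like the other tdb2.* tools and writes the SSE to stdout, and that
the edited file is picked up as stats.opt inside the database's Data-0001
directory:

  # regenerate the default statistics for the loaded database
  tools/apache-jena-4.3.1/bin/tdb2.tdbstats --loc datasets/wikidata-tdb > stats.opt

  # hand-edit stats.opt: add one rule per class from the wdt:P31 query above,
  # placed *before* the generic (<http://www.wikidata.org/prop/direct/P31> 98152611) rule

  # make the edited file visible to the optimizer
  cp stats.opt datasets/wikidata-tdb/Data-0001/stats.opt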

On 18.12.21 11:35, Marco Neumann wrote:
> good morning Lorenz,
>
> Maybe time to get a few query bencharms tests? :)
>
> What does tdb2.tdbstats report?
>
> Marco
>
>
> On Sat, Dec 18, 2021 at 8:09 AM LB <co...@googlemail.com.invalid>
> wrote:
>
>> Good morning,
>>
>> loading of Wikidata truthy is done, this time I didn't forget to keep
>> logs:
>> https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
>>
>> I'm a bit surprised that this time it was 8h faster than last time, 31h
>> vs 39h. Not sure if a) there was something else on the server last time
>> (at least I couldn't see any running tasks) or b) if this is a
>> consequence of the more parallelized Unix sort now - I set it to
>> --parallel=16
>>
>> I mean, the piped input stream is single threaded I guess, but maybe the
>> sort merge step can benefit from more threads? I guess I have to clean
>> up everything and run it again with the original setup with 2 Unix sort
>> threads ...
>>
>>
>> On 16.12.21 14:48, Andy Seaborne wrote:
>>>
>>> On 16/12/2021 10:52, Andy Seaborne wrote:
>>> ...
>>>
>>>> I am getting a slow down during data ingestion. However, your summary
>>>> figures don't show that in the ingest phase. The whole logs may have
>>>> the signal in it but less pronounced.
>>>>
>>>> My working assumption is now that it is random access to the node
>>>> table. Your results point to it not being a CPU issue but that my
>>>> setup is saturating the I/O path. While the portable has a NVMe SSD,
>>>> it has probably not got the same I/O bandwidth as a server class
>>>> machine.
>>>>
>>>> I'm not sure what to do about this other than run with a much bigger
>>>> node table cache for the ingestion phase. Substituting some file
>>>> mapper file area for bigger cache should be a win. While I hadn't
>>>> noticed before, it is probably visible in logs of smaller loads on
>>>> closer inspection. Experimenting on a small dataset is a lot easier.
>>> I'm more sure of this - not yet definite.
>>>
>>> The nodeToNodeId cache is 200k -- this is on the load/update path.
>>> Seems rather small for the task.
>>>
>>> The nodeIdToNode cache is 1e6 -- this is the one that is hit by SPARQL
>>> results.
>>>
>>> 2 pieces of data will help:
>>>
>>> Experimenting with very small cache settings.
>>>
>>> Letting my slow load keep going to see if there is the same
>>> characteristics at the index stage. There shouldn't be if nodeToNodeId
>>> is the cause; it's only an influence in the data ingestion step.
>>>
>>> Aside : Increasing nodeToNodeId could also help tdbloader=parallel and
>>> maybe loader=phased. It falls into the same situation although the
>>> improvement there is going to be less marked. "Parallel" saturates the
>>> I/O by other means as well.
>>>
>>>      Andy
>

Re: Testing tdb2.xloader

Posted by Marco Neumann <ma...@gmail.com>.
good morning Lorenz,

Maybe time to get a few query benchmark tests? :)

What does tdb2.tdbstats report?

Marco


On Sat, Dec 18, 2021 at 8:09 AM LB <co...@googlemail.com.invalid>
wrote:

> Good morning,
>
> loading of Wikidata truthy is done, this time I didn't forget to keep
> logs:
> https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
>
> I'm a bit surprised that this time it was 8h faster than last time, 31h
> vs 39h. Not sure if a) there was something else on the server last time
> (at least I couldn't see any running tasks) or b) if this is a
> consequence of the more parallelized Unix sort now - I set it to
> --parallel=16
>
> I mean, the piped input stream is single threaded I guess, but maybe the
> sort merge step can benefit from more threads? I guess I have to clean
> up everything and run it again with the original setup with 2 Unix sort
> threads ...
>
>
> On 16.12.21 14:48, Andy Seaborne wrote:
> >
> >
> > On 16/12/2021 10:52, Andy Seaborne wrote:
> > ...
> >
> >> I am getting a slow down during data ingestion. However, your summary
> >> figures don't show that in the ingest phase. The whole logs may have
> >> the signal in it but less pronounced.
> >>
> >> My working assumption is now that it is random access to the node
> >> table. Your results point to it not being a CPU issue but that my
> >> setup is saturating the I/O path. While the portable has a NVMe SSD,
> >> it has probably not got the same I/O bandwidth as a server class
> >> machine.
> >>
> >> I'm not sure what to do about this other than run with a much bigger
> >> node table cache for the ingestion phase. Substituting some file
> >> mapper file area for bigger cache should be a win. While I hadn't
> >> noticed before, it is probably visible in logs of smaller loads on
> >> closer inspection. Experimenting on a small dataset is a lot easier.
> >
> > I'm more sure of this - not yet definite.
> >
> > The nodeToNodeId cache is 200k -- this is on the load/update path.
> > Seems rather small for the task.
> >
> > The nodeIdToNode cache is 1e6 -- this is the one that is hit by SPARQL
> > results.
> >
> > 2 pieces of data will help:
> >
> > Experimenting with very small cache settings.
> >
> > Letting my slow load keep going to see if there is the same
> > characteristics at the index stage. There shouldn't be if nodeToNodeId
> > is the cause; it's only an influence in the data ingestion step.
> >
> > Aside : Increasing nodeToNodeId could also help tdbloader=parallel and
> > maybe loader=phased. It falls into the same situation although the
> > improvement there is going to be less marked. "Parallel" saturates the
> > I/O by other means as well.
> >
> >     Andy
>


-- 


---
Marco Neumann
KONA

Re: Testing tdb2.xloader

Posted by Andy Seaborne <an...@apache.org>.
https://gist.github.com/afs/c97ebc7351478bce2989b79c9195ef11

Dell XPS13 (2021 edition)
32G RAM
4 core
1T SSD disk
Jena 4.3.1

Data:
wikidata-20211208-truthy-BETA.nt.gz

14:47:09 INFO  Load node table  = 39976 seconds
14:47:09 INFO  Load ingest data = 100007 seconds
14:47:09 INFO  Build index SPO  = 32927 seconds
14:47:09 INFO  Build index POS  = 62628 seconds
14:47:09 INFO  Build index OSP  = 43457 seconds
14:47:09 INFO  Overall          278995 seconds
14:47:09 INFO  Overall          77h 29m 55s
14:47:09 INFO  Triples loaded   = 6,610,055,778
14:47:09 INFO  Quads loaded     = 0
14:47:09 INFO  Overall Rate     23,692 tuples per second
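
For a rough sense of scale, the same phase times converted to hours (seconds
divided by 3600):

   Load node table  ~ 11.1h
   Load ingest data ~ 27.8h
   Build index SPO  ~  9.1h
   Build index POS  ~ 17.4h
   Build index OSP  ~ 12.1h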

Of note:

   Ingest time
   POS time

Haven't had the chance to look in detail yet.

     Andy

Re: Testing tdb2.xloader

Posted by Andy Seaborne <an...@apache.org>.
Hi Lorenz,

On 18/12/2021 08:09, LB wrote:
> Good morning,
> 
> loading of Wikidata truthy is done, this time I didn't forget to keep 
> logs: 
> https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
> 
> I'm a bit surprised that this time it was 8h faster than last time, 31h 
> vs 39h.

Great!

> Not sure if a) there was something else on the server last time 
> (at least I couldn't see any running tasks) or b) if this is a 
> consequence of the more parallelized Unix sort now - I set it to 
> --parallel=16
> 
> I mean, the piped input stream is single threaded I guess, but maybe the 
> sort merge step can benefit from more threads?

Yes - the sorting itself can be more parallel on a machine the size of 
the one you have.

Time to add a configuration file, rather than a slew of command line 
arguments. The file also then acts as a record of the setup.


I'm finding a new characteristic:

When loading on a smaller machine (32G RAM), I think the index sorting is 
recombining temp files. That results in more I/O and higher peak disk 
usage. While POS is always slower, it appears to be very much slower 
than SPO.

The internet has not been very clear on the effect of "batch size" but 
the GNU man page talks about "--batch-size=16".  I get more than 16 temp 
files - you probably don't at this scale.

--batch-size=128 seems better -- unlikely to be a problem with the 
number of file descriptors nowadays. 16 is probably just how it always was.

On my machine: per process:

ulimit -Sn is 1024     -- ulimit current setting
ulimit -Hn is 1048576  -- ulimit max without being root.
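
For reference, a standalone sketch of the same knobs with GNU coreutils sort
(the input/output file names and the -T directory here are placeholders, not
what xloader uses internally):

  ulimit -Sn          # current soft limit on open file descriptors
  ulimit -Hn          # hard limit reachable without being root
  LC_ALL=C sort --parallel=2 --buffer-size=50% --batch-size=128 \
      -T /data/tmp/tdb -o triples-sorted.tmp triples.tmp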

I'll investigate when the load finishes. I'm trying not to touch the 
machine to avoid breaking something. It is currently doing OSP.

> I guess I have to clean 
> up everything and run it again with the original setup with 2 Unix sort 
> threads ...

     Andy


Re: Testing tdb2.xloader

Posted by LB <co...@googlemail.com.INVALID>.
Good morning,

loading of Wikidata truthy is done, this time I didn't forget to keep 
logs: 
https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3

I'm a bit surprised that this time it was 8h faster than last time, 31h 
vs 39h. Not sure if a) there was something else on the server last time 
(at least I couldn't see any running tasks) or b) if this is a 
consequence of the more parallelized Unix sort now - I set it to 
--parallel=16

I mean, the piped input stream is single threaded I guess, but maybe the 
sort merge step can benefit from more threads? I guess I have to clean 
up everything and run it again with the original setup with 2 Unix sort 
threads ...


On 16.12.21 14:48, Andy Seaborne wrote:
>
>
> On 16/12/2021 10:52, Andy Seaborne wrote:
> ...
>
>> I am getting a slow down during data ingestion. However, your summary 
>> figures don't show that in the ingest phase. The whole logs may have 
>> the signal in it but less pronounced.
>>
>> My working assumption is now that it is random access to the node 
>> table. Your results point to it not being a CPU issue but that my 
>> setup is saturating the I/O path. While the portable has a NVMe SSD, 
>> it has probably not got the same I/O bandwidth as a server class 
>> machine.
>>
>> I'm not sure what to do about this other than run with a much bigger 
>> node table cache for the ingestion phase. Substituting some file 
>> mapper file area for bigger cache should be a win. While I hadn't 
>> noticed before, it is probably visible in logs of smaller loads on 
>> closer inspection. Experimenting on a small dataset is a lot easier.
>
> I'm more sure of this - not yet definite.
>
> The nodeToNodeId cache is 200k -- this is on the load/update path. 
> Seems rather small for the task.
>
> The nodeIdToNode cache is 1e6 -- this is the one that is hit by SPARQL 
> results.
>
> 2 pieces of data will help:
>
> Experimenting with very small cache settings.
>
> Letting my slow load keep going to see if there is the same 
> characteristics at the index stage. There shouldn't be if nodeToNodeId 
> is the cause; it's only an influence in the data ingestion step.
>
> Aside : Increasing nodeToNodeId could also help tdbloader=parallel and 
> maybe loader=phased. It falls into the same situation although the 
> improvement there is going to be less marked. "Parallel" saturates the 
> I/O by other means as well.
>
>     Andy

Re: Testing tdb2.xloader

Posted by Andy Seaborne <an...@apache.org>.

On 16/12/2021 10:52, Andy Seaborne wrote:
...

> I am getting a slow down during data ingestion. However, your summary 
> figures don't show that in the ingest phase. The whole logs may have the 
> signal in it but less pronounced.
> 
> My working assumption is now that it is random access to the node table. 
> Your results point to it not being a CPU issue but that my setup is 
> saturating the I/O path. While the portable has a NVMe SSD, it has 
> probably not got the same I/O bandwidth as a server class machine.
> 
> I'm not sure what to do about this other than run with a much bigger 
> node table cache for the ingestion phase. Substituting some file mapper 
> file area for bigger cache should be a win. While I hadn't noticed 
> before, it is probably visible in logs of smaller loads on closer 
> inspection. Experimenting on a small dataset is a lot easier.

I'm more sure of this - not yet definite.

The nodeToNodeId cache is 200k -- this is on the load/update path. Seems 
rather small for the task.

The nodeIdToNode cache is 1e6 -- this is the one that is hit by SPARQL 
results.

2 pieces of data will help:

Experimenting with very small cache settings.

Letting my slow load keep going to see if there is the same 
characteristics at the index stage. There shouldn't be if nodeToNodeId 
is the cause; it's only an influence in the data ingestion step.

Aside : Increasing nodeToNodeId could also help tdbloader=parallel and 
maybe loader=phased. It falls into the same situation although the 
improvement there is going to be less marked. "Parallel" saturates the 
I/O by other means as well.

     Andy

Re: Testing tdb2.xloader

Posted by Andy Seaborne <an...@apache.org>.
On 16/12/2021 12:32, LB wrote:
> I couldn't get access to the full log as the output was too verbose for 
> the screen and I forgot to pipe into a file ...

Yes - familiar ...

Maybe xloader should capture its logging.

> 
> I can confirm the triples.tmp.gz size was something around 35-40G if I 
> remember correctly.

It leaves the file behind.

> 
> I rerun the load now to a) keep logs and b) see if increasing the number 
> of threads for parallel sort does change anything (though I don't think 
> so).

a useful data point.

> 
> In ~40h I will provide an update and share the logs.
> 
> 
> Wikidata full is 143GB BZip2 compressed, I'm wondering how large TDB 
> would be on disk in that case. We have ~4 x the size of the truthy dump, 
> both compressed, I'd guess we get something between 1.8T and 2.3T - 
> clearly, the number of nodes shouldn't increase linearly as many 
> statements are still about nodes from truthy dump. On the other hand, 
> Wikidata has this statement about statement stuff, i.e. we have lots of 
> statement identifiers.
> 
> Once I've time and enough resources on the server, I'll give it a try. 
> Now I'm interested to see the final result

The indexes and the triples.tmp scale with number of triples.

I haven't parsed "all" to make sure it's parseable for a load. Given the 
WD data model, most of the dubious stuff is in truthy but there is 
always a chance of stray rubbish creeping in.

> 
> 
> On 16.12.21 11:52, Andy Seaborne wrote:
>> Awesome!
>> I'm really pleased to hear the news.
>>
>> That's better than I feared at this scale!
>>
>> How big is triples.tmp.gz? 2* that size, and the database size is the 
>> peak storage space used. My estimate is about 40G making 604G overall.
>>
>> I'd appreciate having the whole log file. Could you email it to me?
>>
>> Currently, I'm trying the 2021-12-08 truthy (from gz, not bz2) on a 
>> modern portable with 4 cores and a single notional 1TB SSD. If the 
>> estimate is right, it will fit. More good news.
>>
>>
>> I am getting a slow down during data ingestion. However, your summary 
>> figures don't show that in the ingest phase. The whole logs may have 
>> the signal in it but less pronounced.
>>
>> My working assumption is now that it is random access to the node 
>> table. Your results point to it not being a CPU issue but that my 
>> setup is saturating the I/O path. While the portable has a NVMe SSD, 
>> it has probably not got the same I/O bandwidth as a server class machine.
>>
>> I'm not sure what to do about this other than run with a much bigger 
>> node table cache for the ingestion phase. Substituting some file 
>> mapper file area for bigger cache should be a win. While I hadn't 
>> noticed before, it is probably visible in logs of smaller loads on 
>> closer inspection. Experimenting on a small dataset is a lot easier.
>>
>>
>> I'm also watching the CPU temperature. When the graphics aren't 
>> active, the fans aren't even on. After a few minutes of active screen 
>> the fans spin up but the temperatures are still well within the 
>> limits. The machine is raised up by 1cm to give good airflow. And I 
>> keep the door shut to keep the cats away.
>>
>>     Andy
>>
>> Inline ...
>>
>> On 16/12/2021 08:49, LB wrote:
>>> Loading of latest WD truthy dump (6.6 billion triples) Bzip2 compressed:
>>>
>>> Server:
>>>
>>> AMD Ryzen 9 5950X  (16C/32T)
>>> 128 GB DDR4 ECC RAM
>>> 2 x 3.84 TB NVMe SSD
>>
>> Nice.
>>
>>> Environment:
>>>
>>> - Ubuntu 20.04.3 LTS
>>> - OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
>>> - Jena 4.3.1
>>>
>>>
>>> Command:
>>>
>>>> tools/apache-jena-4.3.1/bin/tdb2.xloader --tmpdir /data/tmp/tdb 
>>>> --loc datasets/wikidata-tdb datasets/latest-truthy.nt.bz2
>>
>> I found .gz to be slightly faster than .bz2. This maybe because .gz is 
>> better supported by the Java runtime or just the fact bz2 is designed 
>> for best compression.
>>
>>>
>>>
>>> Log summary:
>>>
>>>> 04:14:28 INFO  Load node table  = 36600 seconds
>>>> 04:14:28 INFO  Load ingest data = 25811 seconds
>>>> 04:14:28 INFO  Build index SPO  = 20688 seconds
>>>> 04:14:28 INFO  Build index POS  = 35466 seconds
>>>> 04:14:28 INFO  Build index OSP  = 25042 seconds
>>>> 04:14:28 INFO  Overall          143607 seconds
>>>> 04:14:28 INFO  Overall          39h 53m 27s
>>
>> Less than 2 days :-)
>>
>>>> 04:14:28 INFO  Triples loaded   = 6.610.055.778
>>>> 04:14:28 INFO  Quads loaded     = 0
>>>> 04:14:28 INFO  Overall Rate     46.028 tuples per second
>>>
>>>
>>> Disk space usage according to
>>>
>>>> du -sh datasets/wikidata-tdb
>>>
>>>   is
>>>
>>>> 524G    datasets/wikidata-tdb
>>
>>
>>>
>>> During loading I could see ~90GB of RAM occupied (50% of total memory 
>>> got to sort and it used 2 threads - is it intended to stick to 2 
>>> threads with --parallel 2?)
>>
>> It is fixed at two for the sort currently.
>>
>> There may be some benefit in making this configurable but previously 
>> I've found that more threads does not seem to yield much benefit 
>> though you have a lot more threads! Experiment required.
>>
>>>
>>>
>>> Cheers,
>>> Lorenz
>>>
>>>
>>> On 12.12.21 13:07, Andy Seaborne wrote:
>>>> Hi, Øyvind,
>>>>
>>>> This is all very helpful feedback. Thank you.
>>>>
>>>> On 11/12/2021 21:45, Øyvind Gjesdal wrote:
>>>>> I'm trying out tdb2.xloader on an openstack vm, loading the 
>>>>> wikidata truthy
>>>>> dump downloaded 2021-12-09.
>>>>
>>>> This is the 4.3.0 xloader?
>>>>
>>>> There are improvements in 4.3.1. Since that release was going out, 
>>>> including using less temporary space, the development version got 
>>>> merged in. It has had some testing.
>>>>
>>>> It compresses the triples.tmp and intermediate sort files in the 
>>>> index stage making the peak usage much smaller.
>>>>
>>>>> The instance is a vm created on the Norwegian Research and 
>>>>> Education Cloud,
>>>>> an openstack cloud provider.
>>>>>
>>>>> Instance type:
>>>>> 32 GB memory
>>>>> 4 CPU
>>>>
>>>> I using similar on a 7 year old desktop machine, SATA disk.
>>>>
>>>> I haven't got a machine I can dedicate to the multi-day load. I'll 
>>>> try to find a way to at least push it though building the node table.
>>>>
>>>> Loading the first 1B of truthy:
>>>>
>>>> 1B triples , 40kTPS , 06h 54m 10s
>>>>
>>>> The database is 81G and building needs an addition 11.6G for 
>>>> workspace for a total of 92G (+ the data file).
>>>>
>>>> While smaller, its seems bz2 files are much slower to decompress so 
>>>> I've been using gz files.
>>>>
>>>> My current best guess for 6.4B truthy is
>>>>
>>>> Temp        96G
>>>> Database   540G
>>>> Data        48G
>>>> Total:     684G  -- peak disk needed
>>>>
>>>> based on scaling up 1B truthy. Personally, I would make sure there 
>>>> was more space. Also - I don't know if the shape of the data is 
>>>> sufficiently uniform to make scaling predictable.  The time doesn't 
>>>> scale so simply.
>>>>
>>>> This is the 4.3.1 version - the 4.3.0 uses a lot more disk space.
>>>>
>>>> Compression reduces the size of triples.tmp -- the related sort 
>>>> temporary files which add up to the same again -- 1/6 of the size.
>>>>
>>>>> The storage used for dump + temp files  is mounted as a separate 900GB
>>>>> volume and is mounted on /var/fuseki/databases
>>>>> .The type of storage is described as
>>>>>>   *mass-storage-default*: Storage backed by spinning hard drives,
>>>>> available to everybody and is the default type.
>>>>> with ext4 configured. At the moment I don't have access to the faster
>>>>> volume type mass-storage-ssd. CPU and memory are not dedicated, and 
>>>>> can be
>>>>> overcommitted.
>>>>
>>>> "overcommitted" may be a problem.
>>>>
>>>> While it's not "tdb2 loader parallel" it does use a continuous CPU 
>>>> in several threads.
>>>>
>>>> For memory - "it's complicated".
>>>>
>>>> The java parts only need say 2G. The sort is set to "buffer 50% 
>>>> --parallel=2" and the java pipes into sort, that's another thread. I 
>>>> think the effective peak is 3 active threads and they'll all be at 
>>>> 100% for some of the time.
>>>>
>>>> So it's going to need 50% of RAM + 2G for a java proces, +OS.
>>>>
>>>> It does not need space for memory mapped files (they aren't used at 
>>>> all in the loading process and I/O is sequential.
>>>>
>>>> If that triggers over commitment swap out, the performance may go 
>>>> down a lot.
>>>>
>>>> For disk - if that is physically remote, it should not a problem 
>>>> (famous last words). I/O is sequential and in large continuous 
>>>> chunks - typical for batch processing jobs.
>>>>
>>>>> OS for the instance is a clean Rocky Linux image, with no services 
>>>>> except
>>>>> jena/fuseki installed. The systemd service 
>>>>
>>>>  set up for fuseki is stopped.
>>>>> jena and fuseki version is 4.3.0.
>>>>>
>>>>> openjdk 11.0.13 2021-10-19 LTS
>>>>> OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
>>>>> OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode, 
>>>>> sharing)
>>>>
>>>> Just FYI: Java17 is a little faster. Some java improvements have 
>>>> improved RDF parsing speed by up to 10%. in xloader that not 
>>>> significant to the overall time.
>>>>
>>>>> I'm running from a tmux session to avoid connectivity issues and to 
>>>>> capture
>>>>> the output. 
>>>>
>>>> I use
>>>>
>>>> tdb2.xloader .... |& tee LOG-FILE-NAME
>>>>
>>>> to capture the logs and see them. ">&" and "tail -f" would achieve 
>>>> much the same effect
>>>>
>>>>> I think the output is stored in memory and not on disk.
>>>>> On First run I tried to have the tmpdir on the root partition, to 
>>>>> separate
>>>>> temp dir and data dir, but with only 19 GB free, the tmpdir soon 
>>>>> was disk
>>>>> full. For the second (current run) all directories are under
>>>>> /var/fuseki/databases.
>>>>
>>>> Yes - after making that mistake myself, the new version ignores 
>>>> system TMPDIR.  Using --tmpdir is best but otherwise it defaults to 
>>>> the data directory.
>>>>
>>>>>
>>>>>   $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy 
>>>>> --tmpdir
>>>>> /var/fuseki/databases/tmp latest-truthy.nt.gz
>>>>>
>>>>> The import is so far at the "ingest data" stage where it has really 
>>>>> slowed
>>>>> down.
>>>>
>>>> FYI: The first line of ingest is always very slow. It is not 
>>>> measuring the start point correctly.
>>>>
>>>>>
>>>>> Current output is:
>>>>>
>>>>> 20:03:43 INFO  Data            :: Add: 502,000,000 Data (Batch: 
>>>>> 3,356 /
>>>>> Avg: 7,593)
>>>>>
>>>>> See full log so far at
>>>>> https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab 
>>>>>
>>>>
>>>> The earlier first pass also slows down and that should be fairly 
>>>> constant-ish speed step once everything settles down.
>>>>
>>>>> Some notes:
>>>>>
>>>>> * There is a (time/info) lapse in the output log between the end of
>>>>> 'parse' and the start of 'index' for Terms.  It is unclear to me 
>>>>> what is
>>>>> happening in the 1h13 minutes between the lines.
>>>>
>>>> There is "sort" going on. "top" should show it.
>>>>
>>>> For each index there is also a very long pause for exactly the same 
>>>> reason.  It would be good to have some something go "tick" and log a 
>>>> message occasionally.
>>>>
>>>>>
>>>>> 22:33:46 INFO  Terms           ::   Elapsed: 50,720.20 seconds 
>>>>> [2021/12/10
>>>>> 22:33:46 CET]
>>>>> 22:33:52 INFO  Terms           :: == Parse: 50726.071 seconds :
>>>>> 6,560,468,631 triples/quads 129,331 TPS
>>>>> 23:46:13 INFO  Terms           :: Add: 1,000,000 Index (Batch: 
>>>>> 237,755 /
>>>>> Avg: 237,755)
>>>>>
>>>>> * The ingest data step really slows down on the "ingest data 
>>>>> stage": At the
>>>>> current rate, if I calculated correctly, it looks like 
>>>>> PKG.CmdxIngestData
>>>>> has 10 days left before it finishes.
>>>>
>>>> Ouch.
>>>>
>>>>> * When I saw sort running in the background for the first parts of 
>>>>> the job,
>>>>> I looked at the `sort` command. I noticed from some online sources 
>>>>> that
>>>>> setting the environment variable LC_ALL=C improves speed for 
>>>>> `sort`. Could
>>>>> this be set on the ProcessBuilder for the `sort` process? Could it
>>>>> break/change something? I see the warning from the man page for 
>>>>> `sort`.
>>>>>
>>>>>         *** WARNING *** The locale specified by the environment 
>>>>> affects
>>>>>         sort order.  Set LC_ALL=C to get the traditional sort order 
>>>>> that
>>>>>         uses native byte values.
>>>>
>>>> It shouldn't matter but, yes, better to set it and export it in the 
>>>> control script and propagate to forked processes.
>>>>
>>>> The sort is doing a binary sort except because it a text sort 
>>>> program, the binary is turned into hex (!!). hex is in the ASCII 
>>>> subset and shoule be locale safe.
>>>>
>>>> But better to set LC_ALL=C.
>>>>
>>>>     Andy
>>>>
>>>>
>>>>>
>>>>> Links:
>>>>> https://access.redhat.com/solutions/445233
>>>>> https://unix.stackexchange.com/questions/579251/how-to-use-parallel-to-speed-up-sort-for-big-files-fitting-in-ram 
>>>>>
>>>>> https://stackoverflow.com/questions/7074430/how-do-we-sort-faster-using-unix-sort 
>>>>>
>>>>>
>>>>> Best regards,
>>>>> Øyvind
>>>>>

Re: Testing tdb2.xloader

Posted by LB <co...@googlemail.com.INVALID>.
I couldn't get at the full log: the output was too verbose for the screen 
and I forgot to pipe it into a file ...

I can confirm the triples.tmp.gz size was something around 35-40G if I 
remember correctly.

I'm rerunning the load now to a) keep the logs and b) see whether increasing 
the number of threads for the parallel sort changes anything (though I don't 
think it will).

In ~40h I will provide an update and share the logs.


The full Wikidata dump is 143GB BZip2 compressed, and I'm wondering how 
large TDB would be on disk in that case. It is ~4x the size of the truthy 
dump (both compressed), so I'd guess we get something between 1.8T and 
2.3T - clearly, the number of nodes shouldn't increase linearly, as many 
statements are still about nodes already in the truthy dump. On the other 
hand, Wikidata has its statement-about-statement modelling, i.e. we get 
lots of statement identifiers.

Once I have time and enough resources on the server, I'll give it a try. 
For now I'm interested to see the final result.
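
As a very rough back-of-envelope - assuming on-disk size scales more or 
less with the compressed dump size, which it won't exactly - 524G for 
truthy times ~4 for the dump size ratio comes out around 2.1T, i.e. inside 
that 1.8T-2.3T range.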


On 16.12.21 11:52, Andy Seaborne wrote:
> Awesome!
> I'm really pleased to hear the news.
>
> That's better than I feared at this scale!
>
> How big is triples.tmp.gz? 2* that size, and the database size is the 
> peak storage space used. My estimate is about 40G making 604G overall.
>
> I'd appreciate having the whole log file. Could you email it to me?
>
> Currently, I'm trying the 2021-12-08 truthy (from gz, not bz2) on a 
> modern portable with 4 cores and a single notional 1TB SSD. If the 
> estimate is right, it will fit. More good news.
>
>
> I am getting a slow down during data ingestion. However, your summary 
> figures don't show that in the ingest phase. The whole logs may have 
> the signal in it but less pronounced.
>
> My working assumption is now that it is random access to the node 
> table. Your results point to it not being a CPU issue but that my 
> setup is saturating the I/O path. While the portable has a NVMe SSD, 
> it has probably not got the same I/O bandwidth as a server class machine.
>
> I'm not sure what to do about this other than run with a much bigger 
> node table cache for the ingestion phase. Substituting some file 
> mapper file area for bigger cache should be a win. While I hadn't 
> noticed before, it is probably visible in logs of smaller loads on 
> closer inspection. Experimenting on a small dataset is a lot easier.
>
>
> I'm also watching the CPU temperature. When the graphics aren't 
> active, the fans aren't even on. After a few minutes of active screen 
> the fans spin up but the temperatures are still well within the 
> limits. The machine is raised up by 1cm to give good airflow. And I 
> keep the door shut to keep the cats away.
>
>     Andy
>
> Inline ...
>
> On 16/12/2021 08:49, LB wrote:
>> Loading of latest WD truthy dump (6.6 billion triples) Bzip2 compressed:
>>
>> Server:
>>
>> AMD Ryzen 9 5950X  (16C/32T)
>> 128 GB DDR4 ECC RAM
>> 2 x 3.84 TB NVMe SSD
>
> Nice.
>
>> Environment:
>>
>> - Ubuntu 20.04.3 LTS
>> - OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
>> - Jena 4.3.1
>>
>>
>> Command:
>>
>>> tools/apache-jena-4.3.1/bin/tdb2.xloader --tmpdir /data/tmp/tdb 
>>> --loc datasets/wikidata-tdb datasets/latest-truthy.nt.bz2
>
> I found .gz to be slightly faster than .bz2. This maybe because .gz is 
> better supported by the Java runtime or just the fact bz2 is designed 
> for best compression.
>
>>
>>
>> Log summary:
>>
>>> 04:14:28 INFO  Load node table  = 36600 seconds
>>> 04:14:28 INFO  Load ingest data = 25811 seconds
>>> 04:14:28 INFO  Build index SPO  = 20688 seconds
>>> 04:14:28 INFO  Build index POS  = 35466 seconds
>>> 04:14:28 INFO  Build index OSP  = 25042 seconds
>>> 04:14:28 INFO  Overall          143607 seconds
>>> 04:14:28 INFO  Overall          39h 53m 27s
>
> Less than 2 days :-)
>
>>> 04:14:28 INFO  Triples loaded   = 6.610.055.778
>>> 04:14:28 INFO  Quads loaded     = 0
>>> 04:14:28 INFO  Overall Rate     46.028 tuples per second
>>
>>
>> Disk space usage according to
>>
>>> du -sh datasets/wikidata-tdb
>>
>>   is
>>
>>> 524G    datasets/wikidata-tdb
>
>
>>
>> During loading I could see ~90GB of RAM occupied (50% of total memory 
>> got to sort and it used 2 threads - is it intended to stick to 2 
>> threads with --parallel 2?)
>
> It is fixed at two for the sort currently.
>
> There may be some benefit in making this configurable but previously 
> I've found that more threads does not seem to yield much benefit 
> though you have a lot more threads! Experiment required.
>
>>
>>
>> Cheers,
>> Lorenz
>>
>>
>> On 12.12.21 13:07, Andy Seaborne wrote:
>>> Hi, Øyvind,
>>>
>>> This is all very helpful feedback. Thank you.
>>>
>>> On 11/12/2021 21:45, Øyvind Gjesdal wrote:
>>>> I'm trying out tdb2.xloader on an openstack vm, loading the 
>>>> wikidata truthy
>>>> dump downloaded 2021-12-09.
>>>
>>> This is the 4.3.0 xloader?
>>>
>>> There are improvements in 4.3.1. Since that release was going out, 
>>> including using less temporary space, the development version got 
>>> merged in. It has had some testing.
>>>
>>> It compresses the triples.tmp and intermediate sort files in the 
>>> index stage making the peak usage much smaller.
>>>
>>>> The instance is a vm created on the Norwegian Research and 
>>>> Education Cloud,
>>>> an openstack cloud provider.
>>>>
>>>> Instance type:
>>>> 32 GB memory
>>>> 4 CPU
>>>
>>> I using similar on a 7 year old desktop machine, SATA disk.
>>>
>>> I haven't got a machine I can dedicate to the multi-day load. I'll 
>>> try to find a way to at least push it though building the node table.
>>>
>>> Loading the first 1B of truthy:
>>>
>>> 1B triples , 40kTPS , 06h 54m 10s
>>>
>>> The database is 81G and building needs an addition 11.6G for 
>>> workspace for a total of 92G (+ the data file).
>>>
>>> While smaller, its seems bz2 files are much slower to decompress so 
>>> I've been using gz files.
>>>
>>> My current best guess for 6.4B truthy is
>>>
>>> Temp        96G
>>> Database   540G
>>> Data        48G
>>> Total:     684G  -- peak disk needed
>>>
>>> based on scaling up 1B truthy. Personally, I would make sure there 
>>> was more space. Also - I don't know if the shape of the data is 
>>> sufficiently uniform to make scaling predictable.  The time doesn't 
>>> scale so simply.
>>>
>>> This is the 4.3.1 version - the 4.3.0 uses a lot more disk space.
>>>
>>> Compression reduces the size of triples.tmp -- the related sort 
>>> temporary files which add up to the same again -- 1/6 of the size.
>>>
>>>> The storage used for dump + temp files  is mounted as a separate  
>>>> 900GB
>>>> volume and is mounted on /var/fuseki/databases
>>>> .The type of storage is described as
>>>>>   *mass-storage-default*: Storage backed by spinning hard drives,
>>>> available to everybody and is the default type.
>>>> with ext4 configured. At the moment I don't have access to the faster
>>>> volume type mass-storage-ssd. CPU and memory are not dedicated, and 
>>>> can be
>>>> overcommitted.
>>>
>>> "overcommitted" may be a problem.
>>>
>>> While it's not "tdb2 loader parallel" it does use a continuous CPU 
>>> in several threads.
>>>
>>> For memory - "it's complicated".
>>>
>>> The java parts only need say 2G. The sort is set to "buffer 50% 
>>> --parallel=2" and the java pipes into sort, that's another thread. I 
>>> think the effective peak is 3 active threads and they'll all be at 
>>> 100% for some of the time.
>>>
>>> So it's going to need 50% of RAM + 2G for a java proces, +OS.
>>>
>>> It does not need space for memory mapped files (they aren't used at 
>>> all in the loading process and I/O is sequential.
>>>
>>> If that triggers over commitment swap out, the performance may go 
>>> down a lot.
>>>
>>> For disk - if that is physically remote, it should not a problem 
>>> (famous last words). I/O is sequential and in large continuous 
>>> chunks - typical for batch processing jobs.
>>>
>>>> OS for the instance is a clean Rocky Linux image, with no services 
>>>> except
>>>> jena/fuseki installed. The systemd service 
>>>
>>>  set up for fuseki is stopped.
>>>> jena and fuseki version is 4.3.0.
>>>>
>>>> openjdk 11.0.13 2021-10-19 LTS
>>>> OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
>>>> OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode, 
>>>> sharing)
>>>
>>> Just FYI: Java17 is a little faster. Some java improvements have 
>>> improved RDF parsing speed by up to 10%. in xloader that not 
>>> significant to the overall time.
>>>
>>>> I'm running from a tmux session to avoid connectivity issues and to 
>>>> capture
>>>> the output. 
>>>
>>> I use
>>>
>>> tdb2.xloader .... |& tee LOG-FILE-NAME
>>>
>>> to capture the logs and see them. ">&" and "tail -f" would achieve 
>>> much the same effect
>>>
>>>> I think the output is stored in memory and not on disk.
>>>> On First run I tried to have the tmpdir on the root partition, to 
>>>> separate
>>>> temp dir and data dir, but with only 19 GB free, the tmpdir soon 
>>>> was disk
>>>> full. For the second (current run) all directories are under
>>>> /var/fuseki/databases.
>>>
>>> Yes - after making that mistake myself, the new version ignores 
>>> system TMPDIR.  Using --tmpdir is best but otherwise it defaults to 
>>> the data directory.
>>>
>>>>
>>>>   $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy 
>>>> --tmpdir
>>>> /var/fuseki/databases/tmp latest-truthy.nt.gz
>>>>
>>>> The import is so far at the "ingest data" stage where it has really 
>>>> slowed
>>>> down.
>>>
>>> FYI: The first line of ingest is always very slow. It is not 
>>> measuring the start point correctly.
>>>
>>>>
>>>> Current output is:
>>>>
>>>> 20:03:43 INFO  Data            :: Add: 502,000,000 Data (Batch: 
>>>> 3,356 /
>>>> Avg: 7,593)
>>>>
>>>> See full log so far at
>>>> https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab 
>>>>
>>>
>>> The earlier first pass also slows down and that should be fairly 
>>> constant-ish speed step once everything settles down.
>>>
>>>> Some notes:
>>>>
>>>> * There is a (time/info) lapse in the output log between the end of
>>>> 'parse' and the start of 'index' for Terms.  It is unclear to me 
>>>> what is
>>>> happening in the 1h13 minutes between the lines.
>>>
>>> There is "sort" going on. "top" should show it.
>>>
>>> For each index there is also a very long pause for exactly the same 
>>> reason.  It would be good to have some something go "tick" and log a 
>>> message occasionally.
>>>
>>>>
>>>> 22:33:46 INFO  Terms           ::   Elapsed: 50,720.20 seconds 
>>>> [2021/12/10
>>>> 22:33:46 CET]
>>>> 22:33:52 INFO  Terms           :: == Parse: 50726.071 seconds :
>>>> 6,560,468,631 triples/quads 129,331 TPS
>>>> 23:46:13 INFO  Terms           :: Add: 1,000,000 Index (Batch: 
>>>> 237,755 /
>>>> Avg: 237,755)
>>>>
>>>> * The ingest data step really slows down on the "ingest data 
>>>> stage": At the
>>>> current rate, if I calculated correctly, it looks like 
>>>> PKG.CmdxIngestData
>>>> has 10 days left before it finishes.
>>>
>>> Ouch.
>>>
>>>> * When I saw sort running in the background for the first parts of 
>>>> the job,
>>>> I looked at the `sort` command. I noticed from some online sources 
>>>> that
>>>> setting the environment variable LC_ALL=C improves speed for 
>>>> `sort`. Could
>>>> this be set on the ProcessBuilder for the `sort` process? Could it
>>>> break/change something? I see the warning from the man page for 
>>>> `sort`.
>>>>
>>>>         *** WARNING *** The locale specified by the environment 
>>>> affects
>>>>         sort order.  Set LC_ALL=C to get the traditional sort order 
>>>> that
>>>>         uses native byte values.
>>>
>>> It shouldn't matter but, yes, better to set it and export it in the 
>>> control script and propagate to forked processes.
>>>
>>> The sort is doing a binary sort except because it a text sort 
>>> program, the binary is turned into hex (!!). hex is in the ASCII 
>>> subset and shoule be locale safe.
>>>
>>> But better to set LC_ALL=C.
>>>
>>>     Andy
>>>
>>>
>>>>
>>>> Links:
>>>> https://access.redhat.com/solutions/445233
>>>> https://unix.stackexchange.com/questions/579251/how-to-use-parallel-to-speed-up-sort-for-big-files-fitting-in-ram 
>>>>
>>>> https://stackoverflow.com/questions/7074430/how-do-we-sort-faster-using-unix-sort 
>>>>
>>>>
>>>> Best regards,
>>>> Øyvind
>>>>

Re: Testing tdb2.xloader

Posted by Andy Seaborne <an...@apache.org>.
Awesome!
I'm really pleased to hear the news.

That's better than I feared at this scale!

How big is triples.tmp.gz? Twice that size plus the database size is the 
peak storage space used. My estimate is about 40G, making 604G overall.
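
Spelling that arithmetic out: ~40G for triples.tmp.gz, about the same 
again for the sort temporaries, plus the 524G database - 40 + 40 + 524 = 
604G at peak.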

I'd appreciate having the whole log file. Could you email it to me?

Currently, I'm trying the 2021-12-08 truthy (from gz, not bz2) on a 
modern portable with 4 cores and a single notional 1TB SSD. If the 
estimate is right, it will fit. More good news.


I am getting a slowdown during data ingestion. However, your summary 
figures don't show that in the ingest phase. The full logs may have the 
signal in them, but less pronounced.
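
One way to look for it, assuming the run was captured with the 
"|& tee LOG-FILE-NAME" trick from earlier in the thread, is to pull out 
the progress lines and watch the rates:

    grep 'Batch:' LOG-FILE-NAME

A slowdown in the ingest phase then shows up as steadily falling "Batch:" 
figures on the "Data" lines.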

My working assumption is now that it is random access to the node table. 
Your results point to it not being a CPU issue, but rather that my setup is 
saturating the I/O path. While the portable has an NVMe SSD, it probably 
does not have the same I/O bandwidth as a server-class machine.

I'm not sure what to do about this other than run with a much bigger 
node table cache for the ingestion phase. Trading some memory-mapped file 
space for a bigger cache should be a win. While I hadn't noticed it 
before, it is probably visible in the logs of smaller loads on closer 
inspection. Experimenting on a small dataset is a lot easier.


I'm also watching the CPU temperature. When the graphics aren't active, 
the fans aren't even on. After a few minutes of active screen the fans 
spin up but the temperatures are still well within the limits. The 
machine is raised up by 1cm to give good airflow. And I keep the door 
shut to keep the cats away.

     Andy

Inline ...

On 16/12/2021 08:49, LB wrote:
> Loading of latest WD truthy dump (6.6 billion triples) Bzip2 compressed:
> 
> Server:
> 
> AMD Ryzen 9 5950X  (16C/32T)
> 128 GB DDR4 ECC RAM
> 2 x 3.84 TB NVMe SSD

Nice.

> Environment:
> 
> - Ubuntu 20.04.3 LTS
> - OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
> - Jena 4.3.1
> 
> 
> Command:
> 
>> tools/apache-jena-4.3.1/bin/tdb2.xloader --tmpdir /data/tmp/tdb --loc 
>> datasets/wikidata-tdb datasets/latest-truthy.nt.bz2

I found .gz to be slightly faster than .bz2. This maybe because .gz is 
better supported by the Java runtime or just the fact bz2 is designed 
for best compression.

> 
> 
> Log summary:
> 
>> 04:14:28 INFO  Load node table  = 36600 seconds
>> 04:14:28 INFO  Load ingest data = 25811 seconds
>> 04:14:28 INFO  Build index SPO  = 20688 seconds
>> 04:14:28 INFO  Build index POS  = 35466 seconds
>> 04:14:28 INFO  Build index OSP  = 25042 seconds
>> 04:14:28 INFO  Overall          143607 seconds
>> 04:14:28 INFO  Overall          39h 53m 27s

Less than 2 days :-)

>> 04:14:28 INFO  Triples loaded   = 6.610.055.778
>> 04:14:28 INFO  Quads loaded     = 0
>> 04:14:28 INFO  Overall Rate     46.028 tuples per second
> 
> 
> Disk space usage according to
> 
>> du -sh datasets/wikidata-tdb
> 
>   is
> 
>> 524G    datasets/wikidata-tdb


> 
> During loading I could see ~90GB of RAM occupied (50% of total memory 
> got to sort and it used 2 threads - is it intended to stick to 2 threads 
> with --parallel 2?)

It is fixed at two for the sort currently.

There may be some benefit in making this configurable but previously 
I've found that more threads does not seem to yield much benefit though 
you have a lot more threads! Experiment required.

> 
> 
> Cheers,
> Lorenz
> 
> 
> On 12.12.21 13:07, Andy Seaborne wrote:
>> Hi, Øyvind,
>>
>> This is all very helpful feedback. Thank you.
>>
>> On 11/12/2021 21:45, Øyvind Gjesdal wrote:
>>> I'm trying out tdb2.xloader on an openstack vm, loading the wikidata 
>>> truthy
>>> dump downloaded 2021-12-09.
>>
>> This is the 4.3.0 xloader?
>>
>> There are improvements in 4.3.1. Since that release was going out, 
>> including using less temporary space, the development version got 
>> merged in. It has had some testing.
>>
>> It compresses the triples.tmp and intermediate sort files in the index 
>> stage making the peak usage much smaller.
>>
>>> The instance is a vm created on the Norwegian Research and Education 
>>> Cloud,
>>> an openstack cloud provider.
>>>
>>> Instance type:
>>> 32 GB memory
>>> 4 CPU
>>
>> I using similar on a 7 year old desktop machine, SATA disk.
>>
>> I haven't got a machine I can dedicate to the multi-day load. I'll try 
>> to find a way to at least push it though building the node table.
>>
>> Loading the first 1B of truthy:
>>
>> 1B triples , 40kTPS , 06h 54m 10s
>>
>> The database is 81G and building needs an addition 11.6G for workspace 
>> for a total of 92G (+ the data file).
>>
>> While smaller, its seems bz2 files are much slower to decompress so 
>> I've been using gz files.
>>
>> My current best guess for 6.4B truthy is
>>
>> Temp        96G
>> Database   540G
>> Data        48G
>> Total:     684G  -- peak disk needed
>>
>> based on scaling up 1B truthy. Personally, I would make sure there was 
>> more space. Also - I don't know if the shape of the data is 
>> sufficiently uniform to make scaling predictable.  The time doesn't 
>> scale so simply.
>>
>> This is the 4.3.1 version - the 4.3.0 uses a lot more disk space.
>>
>> Compression reduces the size of triples.tmp -- the related sort 
>> temporary files which add up to the same again -- 1/6 of the size.
>>
>>> The storage used for dump + temp files  is mounted as a separate  900GB
>>> volume and is mounted on /var/fuseki/databases
>>> .The type of storage is described as
>>>>   *mass-storage-default*: Storage backed by spinning hard drives,
>>> available to everybody and is the default type.
>>> with ext4 configured. At the moment I don't have access to the faster
>>> volume type mass-storage-ssd. CPU and memory are not dedicated, and 
>>> can be
>>> overcommitted.
>>
>> "overcommitted" may be a problem.
>>
>> While it's not "tdb2 loader parallel" it does use a continuous CPU in 
>> several threads.
>>
>> For memory - "it's complicated".
>>
>> The java parts only need say 2G. The sort is set to "buffer 50% 
>> --parallel=2" and the java pipes into sort, that's another thread. I 
>> think the effective peak is 3 active threads and they'll all be at 
>> 100% for some of the time.
>>
>> So it's going to need 50% of RAM + 2G for a java proces, +OS.
>>
>> It does not need space for memory mapped files (they aren't used at 
>> all in the loading process and I/O is sequential.
>>
>> If that triggers over commitment swap out, the performance may go down 
>> a lot.
>>
>> For disk - if that is physically remote, it should not a problem 
>> (famous last words). I/O is sequential and in large continuous chunks 
>> - typical for batch processing jobs.
>>
>>> OS for the instance is a clean Rocky Linux image, with no services 
>>> except
>>> jena/fuseki installed. The systemd service 
>>
>>  set up for fuseki is stopped.
>>> jena and fuseki version is 4.3.0.
>>>
>>> openjdk 11.0.13 2021-10-19 LTS
>>> OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
>>> OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode, sharing)
>>
>> Just FYI: Java17 is a little faster. Some java improvements have 
>> improved RDF parsing speed by up to 10%. in xloader that not 
>> significant to the overall time.
>>
>>> I'm running from a tmux session to avoid connectivity issues and to 
>>> capture
>>> the output. 
>>
>> I use
>>
>> tdb2.xloader .... |& tee LOG-FILE-NAME
>>
>> to capture the logs and see them. ">&" and "tail -f" would achieve 
>> much the same effect
>>
>>> I think the output is stored in memory and not on disk.
>>> On First run I tried to have the tmpdir on the root partition, to 
>>> separate
>>> temp dir and data dir, but with only 19 GB free, the tmpdir soon was 
>>> disk
>>> full. For the second (current run) all directories are under
>>> /var/fuseki/databases.
>>
>> Yes - after making that mistake myself, the new version ignores system 
>> TMPDIR.  Using --tmpdir is best but otherwise it defaults to the data 
>> directory.
>>
>>>
>>>   $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy 
>>> --tmpdir
>>> /var/fuseki/databases/tmp latest-truthy.nt.gz
>>>
>>> The import is so far at the "ingest data" stage where it has really 
>>> slowed
>>> down.
>>
>> FYI: The first line of ingest is always very slow. It is not measuring 
>> the start point correctly.
>>
>>>
>>> Current output is:
>>>
>>> 20:03:43 INFO  Data            :: Add: 502,000,000 Data (Batch: 3,356 /
>>> Avg: 7,593)
>>>
>>> See full log so far at
>>> https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab
>>
>> The earlier first pass also slows down and that should be fairly 
>> constant-ish speed step once everything settles down.
>>
>>> Some notes:
>>>
>>> * There is a (time/info) lapse in the output log between the end of
>>> 'parse' and the start of 'index' for Terms.  It is unclear to me what is
>>> happening in the 1h13 minutes between the lines.
>>
>> There is "sort" going on. "top" should show it.
>>
>> For each index there is also a very long pause for exactly the same 
>> reason.  It would be good to have some something go "tick" and log a 
>> message occasionally.
>>
>>>
>>> 22:33:46 INFO  Terms           ::   Elapsed: 50,720.20 seconds 
>>> [2021/12/10
>>> 22:33:46 CET]
>>> 22:33:52 INFO  Terms           :: == Parse: 50726.071 seconds :
>>> 6,560,468,631 triples/quads 129,331 TPS
>>> 23:46:13 INFO  Terms           :: Add: 1,000,000 Index (Batch: 237,755 /
>>> Avg: 237,755)
>>>
>>> * The ingest data step really slows down on the "ingest data stage": 
>>> At the
>>> current rate, if I calculated correctly, it looks like 
>>> PKG.CmdxIngestData
>>> has 10 days left before it finishes.
>>
>> Ouch.
>>
>>> * When I saw sort running in the background for the first parts of 
>>> the job,
>>> I looked at the `sort` command. I noticed from some online sources that
>>> setting the environment variable LC_ALL=C improves speed for `sort`. 
>>> Could
>>> this be set on the ProcessBuilder for the `sort` process? Could it
>>> break/change something? I see the warning from the man page for `sort`.
>>>
>>>         *** WARNING *** The locale specified by the environment affects
>>>         sort order.  Set LC_ALL=C to get the traditional sort order that
>>>         uses native byte values.
>>
>> It shouldn't matter but, yes, better to set it and export it in the 
>> control script and propagate to forked processes.
>>
>> The sort is doing a binary sort except because it a text sort program, 
>> the binary is turned into hex (!!). hex is in the ASCII subset and 
>> shoule be locale safe.
>>
>> But better to set LC_ALL=C.
>>
>>     Andy
>>
>>
>>>
>>> Links:
>>> https://access.redhat.com/solutions/445233
>>> https://unix.stackexchange.com/questions/579251/how-to-use-parallel-to-speed-up-sort-for-big-files-fitting-in-ram 
>>>
>>> https://stackoverflow.com/questions/7074430/how-do-we-sort-faster-using-unix-sort 
>>>
>>>
>>> Best regards,
>>> Øyvind
>>>

Re: Testing tdb2.xloader

Posted by Andy Seaborne <an...@apache.org>.

On 16/12/2021 10:08, Marco Neumann wrote:
> thank you Lorenz, I am running this test myself now again with a larger
> disk. You may want to consider running a full load of wikidata as well. The
> timing info and disk space you have should be sufficient.

Full Wikidata (WD).

I've tried to gather a summary as of 2021-12.
The video [2] is most up-to-date at the time of writing.

WD is about 16B triples currently.
It's growing at about 1B per 3 months [1].

The query service is 6 active machines (11 in total):
    58 queries per server per second on average (no figures for peak)
    12 Wikidata updates / 330 triples per second (no figures for peak)

It seems this mixed workload is causing the most pain.

Wikidata will want at least a 5-year strategy, and planning for 10 years 
isn't unreasonable. The service is 6 years old.

That's 36B and 56B as a baseline *minimum*.
That's without new streams of data.
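
(At ~1B per quarter, i.e. about 4B a year: 16B + 5 x 4B = 36B over five 
years, and 16B + 10 x 4B = 56B over ten.)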

They can't design for a minimum. "plan for success"
    --> towards 100B triples.

     Andy

[1] 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/ScalingStrategy

[2] Scaling the Wikidata Query Service
https://www.youtube.com/watch?v=oV4qelj9fxM

Re: Testing tdb2.xloader

Posted by Marco Neumann <ma...@gmail.com>.
Thank you Lorenz, I am now running this test myself again with a larger
disk. You may want to consider running a full load of Wikidata as well. The
timing info and disk space you have should be sufficient.

Did we figure out a place to post the parser messages?

Marco


On Thu, Dec 16, 2021 at 10:01 AM LB <co...@googlemail.com.invalid>
wrote:

> Sure
>
> > wikidata-tdb/Data-0001:
> > total 524G
> > -rw-r--r-- 1   24 Dez 15 05:41 GOSP.bpt
> > -rw-r--r-- 1 8,0M Dez 14 12:21 GOSP.dat
> > -rw-r--r-- 1 8,0M Dez 14 12:21 GOSP.idn
> > -rw-r--r-- 1   24 Dez 15 05:41 GPOS.bpt
> > -rw-r--r-- 1 8,0M Dez 14 12:21 GPOS.dat
> > -rw-r--r-- 1 8,0M Dez 14 12:21 GPOS.idn
> > -rw-r--r-- 1   24 Dez 15 05:41 GPU.bpt
> > -rw-r--r-- 1 8,0M Dez 14 12:21 GPU.dat
> > -rw-r--r-- 1 8,0M Dez 14 12:21 GPU.idn
> > -rw-r--r-- 1   24 Dez 15 05:41 GSPO.bpt
> > -rw-r--r-- 1 8,0M Dez 14 12:21 GSPO.dat
> > -rw-r--r-- 1 8,0M Dez 14 12:21 GSPO.idn
> > -rw-r--r-- 1    0 Dez 15 05:41 journal.jrnl
> > -rw-r--r-- 1   24 Dez 15 05:41 nodes.bpt
> > -rw-r--r-- 1  36G Dez 15 05:41 nodes.dat
> > -rw-r--r-- 1   16 Dez 15 05:41 nodes-data.bdf
> > -rw-r--r-- 1  44G Dez 15 05:41 nodes-data.obj
> > -rw-r--r-- 1 312M Dez 15 05:41 nodes.idn
> > -rw-r--r-- 1   24 Dez 15 05:41 OSP.bpt
> > -rw-r--r-- 1 148G Dez 16 04:14 OSP.dat
> > -rw-r--r-- 1   24 Dez 15 05:41 OSPG.bpt
> > -rw-r--r-- 1 8,0M Dez 14 12:21 OSPG.dat
> > -rw-r--r-- 1 8,0M Dez 14 12:21 OSPG.idn
> > -rw-r--r-- 1 528M Dez 16 04:14 OSP.idn
> > -rw-r--r-- 1   24 Dez 15 05:41 POS.bpt
> > -rw-r--r-- 1 148G Dez 15 21:17 POS.dat
> > -rw-r--r-- 1   24 Dez 15 05:41 POSG.bpt
> > -rw-r--r-- 1 8,0M Dez 14 12:21 POSG.dat
> > -rw-r--r-- 1 8,0M Dez 14 12:21 POSG.idn
> > -rw-r--r-- 1 528M Dez 15 21:17 POS.idn
> > -rw-r--r-- 1   24 Dez 15 05:41 prefixes.bpt
> > -rw-r--r-- 1 8,0M Dez 14 12:21 prefixes.dat
> > -rw-r--r-- 1   16 Dez 15 05:41 prefixes-data.bdf
> > -rw-r--r-- 1    0 Dez 14 12:21 prefixes-data.obj
> > -rw-r--r-- 1 8,0M Dez 14 12:21 prefixes.idn
> > -rw-r--r-- 1   24 Dez 15 05:41 SPO.bpt
> > -rw-r--r-- 1 148G Dez 15 11:25 SPO.dat
> > -rw-r--r-- 1   24 Dez 15 05:41 SPOG.bpt
> > -rw-r--r-- 1 8,0M Dez 14 12:21 SPOG.dat
> > -rw-r--r-- 1 8,0M Dez 14 12:21 SPOG.idn
> > -rw-r--r-- 1 528M Dez 15 11:25 SPO.idn
> > -rw-r--r-- 1    8 Dez 15 21:17 tdb.lock
>
> On 16.12.21 10:27, Marco Neumann wrote:
> > Thank you Lorenz, can you please post a directory list for Data-0001 with
> > file sizes.
> >
> >
> > On Thu, Dec 16, 2021 at 8:49 AM LB <co...@googlemail.com.invalid>
> > wrote:
> >
> >> Loading of latest WD truthy dump (6.6 billion triples) Bzip2 compressed:
> >>
> >> Server:
> >>
> >> AMD Ryzen 9 5950X  (16C/32T)
> >> 128 GB DDR4 ECC RAM
> >> 2 x 3.84 TB NVMe SSD
> >>
> >>
> >> Environment:
> >>
> >> - Ubuntu 20.04.3 LTS
> >> - OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
> >> - Jena 4.3.1
> >>
> >>
> >> Command:
> >>
> >>> tools/apache-jena-4.3.1/bin/tdb2.xloader --tmpdir /data/tmp/tdb --loc
> >>> datasets/wikidata-tdb datasets/latest-truthy.nt.bz2
> >>
> >> Log summary:
> >>
> >>> 04:14:28 INFO  Load node table  = 36600 seconds
> >>> 04:14:28 INFO  Load ingest data = 25811 seconds
> >>> 04:14:28 INFO  Build index SPO  = 20688 seconds
> >>> 04:14:28 INFO  Build index POS  = 35466 seconds
> >>> 04:14:28 INFO  Build index OSP  = 25042 seconds
> >>> 04:14:28 INFO  Overall          143607 seconds
> >>> 04:14:28 INFO  Overall          39h 53m 27s
> >>> 04:14:28 INFO  Triples loaded   = 6.610.055.778
> >>> 04:14:28 INFO  Quads loaded     = 0
> >>> 04:14:28 INFO  Overall Rate     46.028 tuples per second
> >>
> >> Disk space usage according to
> >>
> >>> du -sh datasets/wikidata-tdb
> >>    is
> >>
> >>> 524G    datasets/wikidata-tdb
> >> During loading I could see ~90GB of RAM occupied (50% of total memory
> >> got to sort and it used 2 threads - is it intended to stick to 2 threads
> >> with --parallel 2?)
> >>
> >>
> >> Cheers,
> >> Lorenz
> >>
> >>
> >> On 12.12.21 13:07, Andy Seaborne wrote:
> >>> Hi, Øyvind,
> >>>
> >>> This is all very helpful feedback. Thank you.
> >>>
> >>> On 11/12/2021 21:45, Øyvind Gjesdal wrote:
> >>>> I'm trying out tdb2.xloader on an openstack vm, loading the wikidata
> >>>> truthy
> >>>> dump downloaded 2021-12-09.
> >>> This is the 4.3.0 xloader?
> >>>
> >>> There are improvements in 4.3.1. Since that release was going out,
> >>> including using less temporary space, the development version got
> >>> merged in. It has had some testing.
> >>>
> >>> It compresses the triples.tmp and intermediate sort files in the index
> >>> stage making the peak usage much smaller.
> >>>
> >>>> The instance is a vm created on the Norwegian Research and Education
> >>>> Cloud,
> >>>> an openstack cloud provider.
> >>>>
> >>>> Instance type:
> >>>> 32 GB memory
> >>>> 4 CPU
> >>> I using similar on a 7 year old desktop machine, SATA disk.
> >>>
> >>> I haven't got a machine I can dedicate to the multi-day load. I'll try
> >>> to find a way to at least push it though building the node table.
> >>>
> >>> Loading the first 1B of truthy:
> >>>
> >>> 1B triples , 40kTPS , 06h 54m 10s
> >>>
> >>> The database is 81G and building needs an addition 11.6G for workspace
> >>> for a total of 92G (+ the data file).
> >>>
> >>> While smaller, its seems bz2 files are much slower to decompress so
> >>> I've been using gz files.
> >>>
> >>> My current best guess for 6.4B truthy is
> >>>
> >>> Temp        96G
> >>> Database   540G
> >>> Data        48G
> >>> Total:     684G  -- peak disk needed
> >>>
> >>> based on scaling up 1B truthy. Personally, I would make sure there was
> >>> more space. Also - I don't know if the shape of the data is
> >>> sufficiently uniform to make scaling predictable.  The time doesn't
> >>> scale so simply.
> >>>
> >>> This is the 4.3.1 version - the 4.3.0 uses a lot more disk space.
> >>>
> >>> Compression reduces the size of triples.tmp -- the related sort
> >>> temporary files which add up to the same again -- 1/6 of the size.
> >>>
> >>>> The storage used for dump + temp files  is mounted as a separate
> 900GB
> >>>> volume and is mounted on /var/fuseki/databases
> >>>> .The type of storage is described as
> >>>>>    *mass-storage-default*: Storage backed by spinning hard drives,
> >>>> available to everybody and is the default type.
> >>>> with ext4 configured. At the moment I don't have access to the faster
> >>>> volume type mass-storage-ssd. CPU and memory are not dedicated, and
> >>>> can be
> >>>> overcommitted.
> >>> "overcommitted" may be a problem.
> >>>
> >>> While it's not "tdb2 loader parallel" it does use a continuous CPU in
> >>> several threads.
> >>>
> >>> For memory - "it's complicated".
> >>>
> >>> The java parts only need say 2G. The sort is set to "buffer 50%
> >>> --parallel=2" and the java pipes into sort, that's another thread. I
> >>> think the effective peak is 3 active threads and they'll all be at
> >>> 100% for some of the time.
> >>>
> >>> So it's going to need 50% of RAM + 2G for a java proces, +OS.
> >>>
> >>> It does not need space for memory mapped files (they aren't used at
> >>> all in the loading process and I/O is sequential.
> >>>
> >>> If that triggers over commitment swap out, the performance may go down
> >>> a lot.
> >>>
> >>> For disk - if that is physically remote, it should not a problem
> >>> (famous last words). I/O is sequential and in large continuous chunks
> >>> - typical for batch processing jobs.
> >>>
> >>>> OS for the instance is a clean Rocky Linux image, with no services
> >>>> except
> >>>> jena/fuseki installed. The systemd service
> >>>   set up for fuseki is stopped.
> >>>> jena and fuseki version is 4.3.0.
> >>>>
> >>>> openjdk 11.0.13 2021-10-19 LTS
> >>>> OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
> >>>> OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode,
> sharing)
> >>> Just FYI: Java17 is a little faster. Some java improvements have
> >>> improved RDF parsing speed by up to 10%. in xloader that not
> >>> significant to the overall time.
> >>>
> >>>> I'm running from a tmux session to avoid connectivity issues and to
> >>>> capture
> >>>> the output.
> >>> I use
> >>>
> >>> tdb2.xloader .... |& tee LOG-FILE-NAME
> >>>
> >>> to capture the logs and see them. ">&" and "tail -f" would achieve
> >>> much the same effect
> >>>
> >>>> I think the output is stored in memory and not on disk.
> >>>> On First run I tried to have the tmpdir on the root partition, to
> >>>> separate
> >>>> temp dir and data dir, but with only 19 GB free, the tmpdir soon was
> >>>> disk
> >>>> full. For the second (current run) all directories are under
> >>>> /var/fuseki/databases.
> >>> Yes - after making that mistake myself, the new version ignores system
> >>> TMPDIR.  Using --tmpdir is best but otherwise it defaults to the data
> >>> directory.
> >>>
> >>>>    $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy
> >>>> --tmpdir
> >>>> /var/fuseki/databases/tmp latest-truthy.nt.gz
> >>>>
> >>>> The import is so far at the "ingest data" stage where it has really
> >>>> slowed
> >>>> down.
> >>> FYI: The first line of ingest is always very slow. It is not measuring
> >>> the start point correctly.
> >>>
> >>>> Current output is:
> >>>>
> >>>> 20:03:43 INFO  Data            :: Add: 502,000,000 Data (Batch: 3,356
> /
> >>>> Avg: 7,593)
> >>>>
> >>>> See full log so far at
> >>>>
> https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab
> >>> The earlier first pass also slows down and that should be fairly
> >>> constant-ish speed step once everything settles down.
> >>>
> >>>> Some notes:
> >>>>
> >>>> * There is a (time/info) lapse in the output log between the end of
> >>>> 'parse' and the start of 'index' for Terms.  It is unclear to me what
> is
> >>>> happening in the 1h13 minutes between the lines.
> >>> There is "sort" going on. "top" should show it.
> >>>
> >>> For each index there is also a very long pause for exactly the same
> >>> reason.  It would be good to have some something go "tick" and log a
> >>> message occasionally.
> >>>
> >>>> 22:33:46 INFO  Terms           ::   Elapsed: 50,720.20 seconds
> >>>> [2021/12/10
> >>>> 22:33:46 CET]
> >>>> 22:33:52 INFO  Terms           :: == Parse: 50726.071 seconds :
> >>>> 6,560,468,631 triples/quads 129,331 TPS
> >>>> 23:46:13 INFO  Terms           :: Add: 1,000,000 Index (Batch:
> 237,755 /
> >>>> Avg: 237,755)
> >>>>
> >>>> * The ingest data step really slows down on the "ingest data stage":
> >>>> At the
> >>>> current rate, if I calculated correctly, it looks like
> >>>> PKG.CmdxIngestData
> >>>> has 10 days left before it finishes.
> >>> Ouch.
> >>>
> >>>> * When I saw sort running in the background for the first parts of
> >>>> the job,
> >>>> I looked at the `sort` command. I noticed from some online sources
> that
> >>>> setting the environment variable LC_ALL=C improves speed for `sort`.
> >>>> Could
> >>>> this be set on the ProcessBuilder for the `sort` process? Could it
> >>>> break/change something? I see the warning from the man page for
> `sort`.
> >>>>
> >>>>          *** WARNING *** The locale specified by the environment
> affects
> >>>>          sort order.  Set LC_ALL=C to get the traditional sort order
> that
> >>>>          uses native byte values.
> >>> It shouldn't matter but, yes, better to set it and export it in the
> >>> control script and propagate to forked processes.
> >>>
> >>> The sort is doing a binary sort except because it a text sort program,
> >>> the binary is turned into hex (!!). hex is in the ASCII subset and
> >>> shoule be locale safe.
> >>>
> >>> But better to set LC_ALL=C.
> >>>
> >>>      Andy
> >>>
> >>>
> >>>> Links:
> >>>> https://access.redhat.com/solutions/445233
> >>>>
> >>
> https://unix.stackexchange.com/questions/579251/how-to-use-parallel-to-speed-up-sort-for-big-files-fitting-in-ram
> >>>>
> >>
> https://stackoverflow.com/questions/7074430/how-do-we-sort-faster-using-unix-sort
> >>>>
> >>>> Best regards,
> >>>> Øyvind
> >>>>
> >
>


-- 


---
Marco Neumann
KONA

Re: Testing tdb2.xloader

Posted by LB <co...@googlemail.com.INVALID>.
Sure

> wikidata-tdb/Data-0001:
> total 524G
> -rw-r--r-- 1   24 Dez 15 05:41 GOSP.bpt
> -rw-r--r-- 1 8,0M Dez 14 12:21 GOSP.dat
> -rw-r--r-- 1 8,0M Dez 14 12:21 GOSP.idn
> -rw-r--r-- 1   24 Dez 15 05:41 GPOS.bpt
> -rw-r--r-- 1 8,0M Dez 14 12:21 GPOS.dat
> -rw-r--r-- 1 8,0M Dez 14 12:21 GPOS.idn
> -rw-r--r-- 1   24 Dez 15 05:41 GPU.bpt
> -rw-r--r-- 1 8,0M Dez 14 12:21 GPU.dat
> -rw-r--r-- 1 8,0M Dez 14 12:21 GPU.idn
> -rw-r--r-- 1   24 Dez 15 05:41 GSPO.bpt
> -rw-r--r-- 1 8,0M Dez 14 12:21 GSPO.dat
> -rw-r--r-- 1 8,0M Dez 14 12:21 GSPO.idn
> -rw-r--r-- 1    0 Dez 15 05:41 journal.jrnl
> -rw-r--r-- 1   24 Dez 15 05:41 nodes.bpt
> -rw-r--r-- 1  36G Dez 15 05:41 nodes.dat
> -rw-r--r-- 1   16 Dez 15 05:41 nodes-data.bdf
> -rw-r--r-- 1  44G Dez 15 05:41 nodes-data.obj
> -rw-r--r-- 1 312M Dez 15 05:41 nodes.idn
> -rw-r--r-- 1   24 Dez 15 05:41 OSP.bpt
> -rw-r--r-- 1 148G Dez 16 04:14 OSP.dat
> -rw-r--r-- 1   24 Dez 15 05:41 OSPG.bpt
> -rw-r--r-- 1 8,0M Dez 14 12:21 OSPG.dat
> -rw-r--r-- 1 8,0M Dez 14 12:21 OSPG.idn
> -rw-r--r-- 1 528M Dez 16 04:14 OSP.idn
> -rw-r--r-- 1   24 Dez 15 05:41 POS.bpt
> -rw-r--r-- 1 148G Dez 15 21:17 POS.dat
> -rw-r--r-- 1   24 Dez 15 05:41 POSG.bpt
> -rw-r--r-- 1 8,0M Dez 14 12:21 POSG.dat
> -rw-r--r-- 1 8,0M Dez 14 12:21 POSG.idn
> -rw-r--r-- 1 528M Dez 15 21:17 POS.idn
> -rw-r--r-- 1   24 Dez 15 05:41 prefixes.bpt
> -rw-r--r-- 1 8,0M Dez 14 12:21 prefixes.dat
> -rw-r--r-- 1   16 Dez 15 05:41 prefixes-data.bdf
> -rw-r--r-- 1    0 Dez 14 12:21 prefixes-data.obj
> -rw-r--r-- 1 8,0M Dez 14 12:21 prefixes.idn
> -rw-r--r-- 1   24 Dez 15 05:41 SPO.bpt
> -rw-r--r-- 1 148G Dez 15 11:25 SPO.dat
> -rw-r--r-- 1   24 Dez 15 05:41 SPOG.bpt
> -rw-r--r-- 1 8,0M Dez 14 12:21 SPOG.dat
> -rw-r--r-- 1 8,0M Dez 14 12:21 SPOG.idn
> -rw-r--r-- 1 528M Dez 15 11:25 SPO.idn
> -rw-r--r-- 1    8 Dez 15 21:17 tdb.lock
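
(For anyone who wants the same view of their own load: something along the 
lines of

    ls -lh wikidata-tdb/Data-0001

gives a listing of this shape - adjust the path to your --loc - and 
`du -sh` on the Data-0001 directory gives the total.)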

On 16.12.21 10:27, Marco Neumann wrote:
> Thank you Lorenz, can you please post a directory list for Data-0001 with
> file sizes.
>
>
> On Thu, Dec 16, 2021 at 8:49 AM LB <co...@googlemail.com.invalid>
> wrote:
>
>> Loading of latest WD truthy dump (6.6 billion triples) Bzip2 compressed:
>>
>> Server:
>>
>> AMD Ryzen 9 5950X  (16C/32T)
>> 128 GB DDR4 ECC RAM
>> 2 x 3.84 TB NVMe SSD
>>
>>
>> Environment:
>>
>> - Ubuntu 20.04.3 LTS
>> - OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
>> - Jena 4.3.1
>>
>>
>> Command:
>>
>>> tools/apache-jena-4.3.1/bin/tdb2.xloader --tmpdir /data/tmp/tdb --loc
>>> datasets/wikidata-tdb datasets/latest-truthy.nt.bz2
>>
>> Log summary:
>>
>>> 04:14:28 INFO  Load node table  = 36600 seconds
>>> 04:14:28 INFO  Load ingest data = 25811 seconds
>>> 04:14:28 INFO  Build index SPO  = 20688 seconds
>>> 04:14:28 INFO  Build index POS  = 35466 seconds
>>> 04:14:28 INFO  Build index OSP  = 25042 seconds
>>> 04:14:28 INFO  Overall          143607 seconds
>>> 04:14:28 INFO  Overall          39h 53m 27s
>>> 04:14:28 INFO  Triples loaded   = 6.610.055.778
>>> 04:14:28 INFO  Quads loaded     = 0
>>> 04:14:28 INFO  Overall Rate     46.028 tuples per second
>>
>> Disk space usage according to
>>
>>> du -sh datasets/wikidata-tdb
>>    is
>>
>>> 524G    datasets/wikidata-tdb
>> During loading I could see ~90GB of RAM occupied (50% of total memory
>> got to sort and it used 2 threads - is it intended to stick to 2 threads
>> with --parallel 2?)
>>
>>
>> Cheers,
>> Lorenz
>>
>>
>> On 12.12.21 13:07, Andy Seaborne wrote:
>>> Hi, Øyvind,
>>>
>>> This is all very helpful feedback. Thank you.
>>>
>>> On 11/12/2021 21:45, Øyvind Gjesdal wrote:
>>>> I'm trying out tdb2.xloader on an openstack vm, loading the wikidata
>>>> truthy
>>>> dump downloaded 2021-12-09.
>>> This is the 4.3.0 xloader?
>>>
>>> There are improvements in 4.3.1. Since that release was going out,
>>> including using less temporary space, the development version got
>>> merged in. It has had some testing.
>>>
>>> It compresses the triples.tmp and intermediate sort files in the index
>>> stage making the peak usage much smaller.
>>>
>>>> The instance is a vm created on the Norwegian Research and Education
>>>> Cloud,
>>>> an openstack cloud provider.
>>>>
>>>> Instance type:
>>>> 32 GB memory
>>>> 4 CPU
>>> I using similar on a 7 year old desktop machine, SATA disk.
>>>
>>> I haven't got a machine I can dedicate to the multi-day load. I'll try
>>> to find a way to at least push it though building the node table.
>>>
>>> Loading the first 1B of truthy:
>>>
>>> 1B triples , 40kTPS , 06h 54m 10s
>>>
>>> The database is 81G and building needs an addition 11.6G for workspace
>>> for a total of 92G (+ the data file).
>>>
>>> While smaller, its seems bz2 files are much slower to decompress so
>>> I've been using gz files.
>>>
>>> My current best guess for 6.4B truthy is
>>>
>>> Temp        96G
>>> Database   540G
>>> Data        48G
>>> Total:     684G  -- peak disk needed
>>>
>>> based on scaling up 1B truthy. Personally, I would make sure there was
>>> more space. Also - I don't know if the shape of the data is
>>> sufficiently uniform to make scaling predictable.  The time doesn't
>>> scale so simply.
>>>
>>> This is the 4.3.1 version - the 4.3.0 uses a lot more disk space.
>>>
>>> Compression reduces the size of triples.tmp -- the related sort
>>> temporary files which add up to the same again -- 1/6 of the size.
>>>
>>>> The storage used for dump + temp files  is mounted as a separate  900GB
>>>> volume and is mounted on /var/fuseki/databases
>>>> .The type of storage is described as
>>>>>    *mass-storage-default*: Storage backed by spinning hard drives,
>>>> available to everybody and is the default type.
>>>> with ext4 configured. At the moment I don't have access to the faster
>>>> volume type mass-storage-ssd. CPU and memory are not dedicated, and
>>>> can be
>>>> overcommitted.
>>> "overcommitted" may be a problem.
>>>
>>> While it's not "tdb2 loader parallel" it does use a continuous CPU in
>>> several threads.
>>>
>>> For memory - "it's complicated".
>>>
>>> The java parts only need say 2G. The sort is set to "buffer 50%
>>> --parallel=2" and the java pipes into sort, that's another thread. I
>>> think the effective peak is 3 active threads and they'll all be at
>>> 100% for some of the time.
>>>
>>> So it's going to need 50% of RAM + 2G for a java proces, +OS.
>>>
>>> It does not need space for memory mapped files (they aren't used at
>>> all in the loading process and I/O is sequential.
>>>
>>> If that triggers over commitment swap out, the performance may go down
>>> a lot.
>>>
>>> For disk - if that is physically remote, it should not a problem
>>> (famous last words). I/O is sequential and in large continuous chunks
>>> - typical for batch processing jobs.
>>>
>>>> OS for the instance is a clean Rocky Linux image, with no services
>>>> except
>>>> jena/fuseki installed. The systemd service
>>>   set up for fuseki is stopped.
>>>> jena and fuseki version is 4.3.0.
>>>>
>>>> openjdk 11.0.13 2021-10-19 LTS
>>>> OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
>>>> OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode, sharing)
>>> Just FYI: Java17 is a little faster. Some java improvements have
>>> improved RDF parsing speed by up to 10%. in xloader that not
>>> significant to the overall time.
>>>
>>>> I'm running from a tmux session to avoid connectivity issues and to
>>>> capture
>>>> the output.
>>> I use
>>>
>>> tdb2.xloader .... |& tee LOG-FILE-NAME
>>>
>>> to capture the logs and see them. ">&" and "tail -f" would achieve
>>> much the same effect
>>>
>>>> I think the output is stored in memory and not on disk.
>>>> On First run I tried to have the tmpdir on the root partition, to
>>>> separate
>>>> temp dir and data dir, but with only 19 GB free, the tmpdir soon was
>>>> disk
>>>> full. For the second (current run) all directories are under
>>>> /var/fuseki/databases.
>>> Yes - after making that mistake myself, the new version ignores system
>>> TMPDIR.  Using --tmpdir is best but otherwise it defaults to the data
>>> directory.
>>>
>>>>    $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy
>>>> --tmpdir
>>>> /var/fuseki/databases/tmp latest-truthy.nt.gz
>>>>
>>>> The import is so far at the "ingest data" stage where it has really
>>>> slowed
>>>> down.
>>> FYI: The first line of ingest is always very slow. It is not measuring
>>> the start point correctly.
>>>
>>>> Current output is:
>>>>
>>>> 20:03:43 INFO  Data            :: Add: 502,000,000 Data (Batch: 3,356 /
>>>> Avg: 7,593)
>>>>
>>>> See full log so far at
>>>> https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab
>>> The earlier first pass also slows down and that should be fairly
>>> constant-ish speed step once everything settles down.
>>>
>>>> Some notes:
>>>>
>>>> * There is a (time/info) lapse in the output log between the end of
>>>> 'parse' and the start of 'index' for Terms.  It is unclear to me what is
>>>> happening in the 1h13 minutes between the lines.
>>> There is "sort" going on. "top" should show it.
>>>
>>> For each index there is also a very long pause for exactly the same
>>> reason.  It would be good to have some something go "tick" and log a
>>> message occasionally.
>>>
>>>> 22:33:46 INFO  Terms           ::   Elapsed: 50,720.20 seconds
>>>> [2021/12/10
>>>> 22:33:46 CET]
>>>> 22:33:52 INFO  Terms           :: == Parse: 50726.071 seconds :
>>>> 6,560,468,631 triples/quads 129,331 TPS
>>>> 23:46:13 INFO  Terms           :: Add: 1,000,000 Index (Batch: 237,755 /
>>>> Avg: 237,755)
>>>>
>>>> * The ingest data step really slows down on the "ingest data stage":
>>>> At the
>>>> current rate, if I calculated correctly, it looks like
>>>> PKG.CmdxIngestData
>>>> has 10 days left before it finishes.
>>> Ouch.
>>>
>>>> * When I saw sort running in the background for the first parts of
>>>> the job,
>>>> I looked at the `sort` command. I noticed from some online sources that
>>>> setting the environment variable LC_ALL=C improves speed for `sort`.
>>>> Could
>>>> this be set on the ProcessBuilder for the `sort` process? Could it
>>>> break/change something? I see the warning from the man page for `sort`.
>>>>
>>>>          *** WARNING *** The locale specified by the environment affects
>>>>          sort order.  Set LC_ALL=C to get the traditional sort order that
>>>>          uses native byte values.
>>> It shouldn't matter but, yes, better to set it and export it in the
>>> control script and propagate to forked processes.
>>>
>>> The sort is doing a binary sort except because it a text sort program,
>>> the binary is turned into hex (!!). hex is in the ASCII subset and
>>> shoule be locale safe.
>>>
>>> But better to set LC_ALL=C.
>>>
>>>      Andy
>>>
>>>
>>>> Links:
>>>> https://access.redhat.com/solutions/445233
>>>>
>> https://unix.stackexchange.com/questions/579251/how-to-use-parallel-to-speed-up-sort-for-big-files-fitting-in-ram
>>>>
>> https://stackoverflow.com/questions/7074430/how-do-we-sort-faster-using-unix-sort
>>>>
>>>> Best regards,
>>>> Øyvind
>>>>
>

Re: Testing tdb2.xloader

Posted by Marco Neumann <ma...@gmail.com>.
Thank you Lorenz, can you please post a directory listing for Data-0001 with
file sizes?


On Thu, Dec 16, 2021 at 8:49 AM LB <co...@googlemail.com.invalid>
wrote:

> Loading of latest WD truthy dump (6.6 billion triples) Bzip2 compressed:
>
> Server:
>
> AMD Ryzen 9 5950X  (16C/32T)
> 128 GB DDR4 ECC RAM
> 2 x 3.84 TB NVMe SSD
>
>
> Environment:
>
> - Ubuntu 20.04.3 LTS
> - OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
> - Jena 4.3.1
>
>
> Command:
>
> > tools/apache-jena-4.3.1/bin/tdb2.xloader --tmpdir /data/tmp/tdb --loc
> > datasets/wikidata-tdb datasets/latest-truthy.nt.bz2
>
>
> Log summary:
>
> > 04:14:28 INFO  Load node table  = 36600 seconds
> > 04:14:28 INFO  Load ingest data = 25811 seconds
> > 04:14:28 INFO  Build index SPO  = 20688 seconds
> > 04:14:28 INFO  Build index POS  = 35466 seconds
> > 04:14:28 INFO  Build index OSP  = 25042 seconds
> > 04:14:28 INFO  Overall          143607 seconds
> > 04:14:28 INFO  Overall          39h 53m 27s
> > 04:14:28 INFO  Triples loaded   = 6.610.055.778
> > 04:14:28 INFO  Quads loaded     = 0
> > 04:14:28 INFO  Overall Rate     46.028 tuples per second
>
>
> Disk space usage according to
>
> > du -sh datasets/wikidata-tdb
>
>   is
>
> > 524G    datasets/wikidata-tdb
>
> During loading I could see ~90GB of RAM occupied (50% of total memory
> got to sort and it used 2 threads - is it intended to stick to 2 threads
> with --parallel 2?)
>
>
> Cheers,
> Lorenz
>
>
> On 12.12.21 13:07, Andy Seaborne wrote:
> > Hi, Øyvind,
> >
> > This is all very helpful feedback. Thank you.
> >
> > On 11/12/2021 21:45, Øyvind Gjesdal wrote:
> >> I'm trying out tdb2.xloader on an openstack vm, loading the wikidata
> >> truthy
> >> dump downloaded 2021-12-09.
> >
> > This is the 4.3.0 xloader?
> >
> > There are improvements in 4.3.1. Since that release was going out,
> > including using less temporary space, the development version got
> > merged in. It has had some testing.
> >
> > It compresses the triples.tmp and intermediate sort files in the index
> > stage making the peak usage much smaller.
> >
> >> The instance is a vm created on the Norwegian Research and Education
> >> Cloud,
> >> an openstack cloud provider.
> >>
> >> Instance type:
> >> 32 GB memory
> >> 4 CPU
> >
> > I using similar on a 7 year old desktop machine, SATA disk.
> >
> > I haven't got a machine I can dedicate to the multi-day load. I'll try
> > to find a way to at least push it though building the node table.
> >
> > Loading the first 1B of truthy:
> >
> > 1B triples , 40kTPS , 06h 54m 10s
> >
> > The database is 81G and building needs an addition 11.6G for workspace
> > for a total of 92G (+ the data file).
> >
> > While smaller, its seems bz2 files are much slower to decompress so
> > I've been using gz files.
> >
> > My current best guess for 6.4B truthy is
> >
> > Temp        96G
> > Database   540G
> > Data        48G
> > Total:     684G  -- peak disk needed
> >
> > based on scaling up 1B truthy. Personally, I would make sure there was
> > more space. Also - I don't know if the shape of the data is
> > sufficiently uniform to make scaling predictable.  The time doesn't
> > scale so simply.
> >
> > This is the 4.3.1 version - the 4.3.0 uses a lot more disk space.
> >
> > Compression reduces the size of triples.tmp -- the related sort
> > temporary files which add up to the same again -- 1/6 of the size.
> >
> >> The storage used for dump + temp files  is mounted as a separate  900GB
> >> volume and is mounted on /var/fuseki/databases
> >> .The type of storage is described as
> >>>   *mass-storage-default*: Storage backed by spinning hard drives,
> >> available to everybody and is the default type.
> >> with ext4 configured. At the moment I don't have access to the faster
> >> volume type mass-storage-ssd. CPU and memory are not dedicated, and
> >> can be
> >> overcommitted.
> >
> > "overcommitted" may be a problem.
> >
> > While it's not "tdb2 loader parallel" it does use a continuous CPU in
> > several threads.
> >
> > For memory - "it's complicated".
> >
> > The java parts only need say 2G. The sort is set to "buffer 50%
> > --parallel=2" and the java pipes into sort, that's another thread. I
> > think the effective peak is 3 active threads and they'll all be at
> > 100% for some of the time.
> >
> > So it's going to need 50% of RAM + 2G for a java proces, +OS.
> >
> > It does not need space for memory mapped files (they aren't used at
> > all in the loading process and I/O is sequential.
> >
> > If that triggers over commitment swap out, the performance may go down
> > a lot.
> >
> > For disk - if that is physically remote, it should not a problem
> > (famous last words). I/O is sequential and in large continuous chunks
> > - typical for batch processing jobs.
> >
> >> OS for the instance is a clean Rocky Linux image, with no services
> >> except
> >> jena/fuseki installed. The systemd service
> >
> >  set up for fuseki is stopped.
> >> jena and fuseki version is 4.3.0.
> >>
> >> openjdk 11.0.13 2021-10-19 LTS
> >> OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
> >> OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode, sharing)
> >
> > Just FYI: Java17 is a little faster. Some java improvements have
> > improved RDF parsing speed by up to 10%. in xloader that not
> > significant to the overall time.
> >
> >> I'm running from a tmux session to avoid connectivity issues and to
> >> capture
> >> the output.
> >
> > I use
> >
> > tdb2.xloader .... |& tee LOG-FILE-NAME
> >
> > to capture the logs and see them. ">&" and "tail -f" would achieve
> > much the same effect
> >
> >> I think the output is stored in memory and not on disk.
> >> On First run I tried to have the tmpdir on the root partition, to
> >> separate
> >> temp dir and data dir, but with only 19 GB free, the tmpdir soon was
> >> disk
> >> full. For the second (current run) all directories are under
> >> /var/fuseki/databases.
> >
> > Yes - after making that mistake myself, the new version ignores system
> > TMPDIR.  Using --tmpdir is best but otherwise it defaults to the data
> > directory.
> >
> >>
> >>   $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy
> >> --tmpdir
> >> /var/fuseki/databases/tmp latest-truthy.nt.gz
> >>
> >> The import is so far at the "ingest data" stage where it has really
> >> slowed
> >> down.
> >
> > FYI: The first line of ingest is always very slow. It is not measuring
> > the start point correctly.
> >
> >>
> >> Current output is:
> >>
> >> 20:03:43 INFO  Data            :: Add: 502,000,000 Data (Batch: 3,356 /
> >> Avg: 7,593)
> >>
> >> See full log so far at
> >> https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab
> >
> > The earlier first pass also slows down and that should be fairly
> > constant-ish speed step once everything settles down.
> >
> >> Some notes:
> >>
> >> * There is a (time/info) lapse in the output log between the end of
> >> 'parse' and the start of 'index' for Terms.  It is unclear to me what is
> >> happening in the 1h13 minutes between the lines.
> >
> > There is "sort" going on. "top" should show it.
> >
> > For each index there is also a very long pause for exactly the same
> > reason.  It would be good to have some something go "tick" and log a
> > message occasionally.
> >
> >>
> >> 22:33:46 INFO  Terms           ::   Elapsed: 50,720.20 seconds
> >> [2021/12/10
> >> 22:33:46 CET]
> >> 22:33:52 INFO  Terms           :: == Parse: 50726.071 seconds :
> >> 6,560,468,631 triples/quads 129,331 TPS
> >> 23:46:13 INFO  Terms           :: Add: 1,000,000 Index (Batch: 237,755 /
> >> Avg: 237,755)
> >>
> >> * The ingest data step really slows down on the "ingest data stage":
> >> At the
> >> current rate, if I calculated correctly, it looks like
> >> PKG.CmdxIngestData
> >> has 10 days left before it finishes.
> >
> > Ouch.
> >
> >> * When I saw sort running in the background for the first parts of
> >> the job,
> >> I looked at the `sort` command. I noticed from some online sources that
> >> setting the environment variable LC_ALL=C improves speed for `sort`.
> >> Could
> >> this be set on the ProcessBuilder for the `sort` process? Could it
> >> break/change something? I see the warning from the man page for `sort`.
> >>
> >>         *** WARNING *** The locale specified by the environment affects
> >>         sort order.  Set LC_ALL=C to get the traditional sort order that
> >>         uses native byte values.
> >
> > It shouldn't matter but, yes, better to set it and export it in the
> > control script and propagate to forked processes.
> >
> > The sort is doing a binary sort, except that because it is a text
> > sort program, the binary is turned into hex (!!). Hex is in the ASCII
> > subset and should be locale-safe.
> >
> > But better to set LC_ALL=C.
> >
> >     Andy
> >
> >
> >>
> >> Links:
> >> https://access.redhat.com/solutions/445233
> >>
> https://unix.stackexchange.com/questions/579251/how-to-use-parallel-to-speed-up-sort-for-big-files-fitting-in-ram
> >>
> >>
> https://stackoverflow.com/questions/7074430/how-do-we-sort-faster-using-unix-sort
> >>
> >>
> >> Best regards,
> >> Øyvind
> >>
>


-- 


---
Marco Neumann
KONA

Re: Testing tdb2.xloader

Posted by Andy Seaborne <an...@apache.org>.
I've updated:

https://www.w3.org/wiki/LargeTripleStores#Apache_Jena_TDB_.286.6B.29

for Lorenz's first run.

     Andy

On 16/12/2021 08:49, LB wrote:
> 39h 53m 27s
> 04:14:28 INFO  Triples loaded   = 6.610.055.778
> 04:14:28 INFO  Quads loaded     = 0
> 04:14:28 INFO  Overall Rate     46.028 tuples per second

Re: Testing tdb2.xloader

Posted by LB <co...@googlemail.com.INVALID>.
Loading of the latest WD truthy dump (6.6 billion triples), Bzip2 compressed:

Server:

AMD Ryzen 9 5950X  (16C/32T)
128 GB DDR4 ECC RAM
2 x 3.84 TB NVMe SSD


Environment:

- Ubuntu 20.04.3 LTS
- OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
- Jena 4.3.1


Command:

> tools/apache-jena-4.3.1/bin/tdb2.xloader --tmpdir /data/tmp/tdb --loc 
> datasets/wikidata-tdb datasets/latest-truthy.nt.bz2


Log summary:

> 04:14:28 INFO  Load node table  = 36600 seconds
> 04:14:28 INFO  Load ingest data = 25811 seconds
> 04:14:28 INFO  Build index SPO  = 20688 seconds
> 04:14:28 INFO  Build index POS  = 35466 seconds
> 04:14:28 INFO  Build index OSP  = 25042 seconds
> 04:14:28 INFO  Overall          143607 seconds
> 04:14:28 INFO  Overall          39h 53m 27s
> 04:14:28 INFO  Triples loaded   = 6.610.055.778
> 04:14:28 INFO  Quads loaded     = 0
> 04:14:28 INFO  Overall Rate     46.028 tuples per second


Disk space usage according to

> du -sh datasets/wikidata-tdb

  is

> 524G    datasets/wikidata-tdb

During loading I could see ~90GB of RAM occupied (50% of total memory 
went to sort, and it used 2 threads - is it intended to stick to 2 
threads with --parallel 2?)


Cheers,
Lorenz


On 12.12.21 13:07, Andy Seaborne wrote:
> Hi, Øyvind,
>
> This is all very helpful feedback. Thank you.
>
> On 11/12/2021 21:45, Øyvind Gjesdal wrote:
>> I'm trying out tdb2.xloader on an openstack vm, loading the wikidata 
>> truthy
>> dump downloaded 2021-12-09.
>
> This is the 4.3.0 xloader?
>
> There are improvements in 4.3.1. Since that release was going out, the 
> development version, which among other things uses less temporary 
> space, got merged in. It has had some testing.
>
> It compresses the triples.tmp and intermediate sort files in the index 
> stage making the peak usage much smaller.
>
>> The instance is a vm created on the Norwegian Research and Education 
>> Cloud,
>> an openstack cloud provider.
>>
>> Instance type:
>> 32 GB memory
>> 4 CPU
>
> I'm using something similar on a 7-year-old desktop machine, SATA disk.
>
> I haven't got a machine I can dedicate to the multi-day load. I'll try 
> to find a way to at least push it through building the node table.
>
> Loading the first 1B of truthy:
>
> 1B triples , 40kTPS , 06h 54m 10s
>
> The database is 81G and building needs an additional 11.6G for workspace 
> for a total of 92G (+ the data file).
>
> While smaller, it seems bz2 files are much slower to decompress, so 
> I've been using gz files.
>
> My current best guess for 6.4B truthy is
>
> Temp        96G
> Database   540G
> Data        48G
> Total:     684G  -- peak disk needed
>
> based on scaling up 1B truthy. Personally, I would make sure there was 
> more space. Also - I don't know if the shape of the data is 
> sufficiently uniform to make scaling predictable.  The time doesn't 
> scale so simply.
>
> This is the 4.3.1 version - the 4.3.0 uses a lot more disk space.
>
> Compression reduces the size of triples.tmp -- and the related sort 
> temporary files, which add up to the same again -- to 1/6 of the size.
>
>> The storage used for dump + temp files  is mounted as a separate  900GB
>> volume and is mounted on /var/fuseki/databases
>> .The type of storage is described as
>>>   *mass-storage-default*: Storage backed by spinning hard drives,
>> available to everybody and is the default type.
>> with ext4 configured. At the moment I don't have access to the faster
>> volume type mass-storage-ssd. CPU and memory are not dedicated, and 
>> can be
>> overcommitted.
>
> "overcommitted" may be a problem.
>
> While it's not "tdb2 loader parallel", it does use CPU continuously in 
> several threads.
>
> For memory - "it's complicated".
>
> The Java parts only need, say, 2G. The sort is set to "buffer 50% 
> --parallel=2" and the Java process pipes into sort, so that's another 
> thread. I think the effective peak is 3 active threads and they'll all 
> be at 100% for some of the time.
>
> So it's going to need 50% of RAM + 2G for a Java process, + OS.
>
> It does not need space for memory-mapped files (they aren't used at 
> all in the loading process, and I/O is sequential).
>
> If that triggers overcommitment swap-out, the performance may go down 
> a lot.
>
> For disk - if that is physically remote, it should not be a problem 
> (famous last words). I/O is sequential and in large continuous chunks 
> - typical for batch processing jobs.
>
>> OS for the instance is a clean Rocky Linux image, with no services 
>> except
>> jena/fuseki installed. The systemd service 
>
>  set up for fuseki is stopped.
>> jena and fuseki version is 4.3.0.
>>
>> openjdk 11.0.13 2021-10-19 LTS
>> OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
>> OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode, sharing)
>
> Just FYI: Java 17 is a little faster. Some Java improvements have 
> improved RDF parsing speed by up to 10%; in xloader that's not 
> significant to the overall time.
>
>> I'm running from a tmux session to avoid connectivity issues and to 
>> capture
>> the output. 
>
> I use
>
> tdb2.xloader .... |& tee LOG-FILE-NAME
>
> to capture the logs and see them. ">&" and "tail -f" would achieve 
> much the same effect
>
>> I think the output is stored in memory and not on disk.
>> On First run I tried to have the tmpdir on the root partition, to 
>> separate
>> temp dir and data dir, but with only 19 GB free, the tmpdir soon was 
>> disk
>> full. For the second (current run) all directories are under
>> /var/fuseki/databases.
>
> Yes - after making that mistake myself, the new version ignores system 
> TMPDIR.  Using --tmpdir is best but otherwise it defaults to the data 
> directory.
>
>>
>>   $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy 
>> --tmpdir
>> /var/fuseki/databases/tmp latest-truthy.nt.gz
>>
>> The import is so far at the "ingest data" stage where it has really 
>> slowed
>> down.
>
> FYI: The first line of ingest is always very slow. It is not measuring 
> the start point correctly.
>
>>
>> Current output is:
>>
>> 20:03:43 INFO  Data            :: Add: 502,000,000 Data (Batch: 3,356 /
>> Avg: 7,593)
>>
>> See full log so far at
>> https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab
>
> The earlier first pass also slows down and then it should run at a 
> fairly constant-ish speed once everything settles down.
>
>> Some notes:
>>
>> * There is a (time/info) lapse in the output log between the end of
>> 'parse' and the start of 'index' for Terms.  It is unclear to me what is
>> happening in the 1h13 minutes between the lines.
>
> There is "sort" going on. "top" should show it.
>
> For each index there is also a very long pause for exactly the same 
> reason.  It would be good to have something go "tick" and log a 
> message occasionally.
>
>>
>> 22:33:46 INFO  Terms           ::   Elapsed: 50,720.20 seconds 
>> [2021/12/10
>> 22:33:46 CET]
>> 22:33:52 INFO  Terms           :: == Parse: 50726.071 seconds :
>> 6,560,468,631 triples/quads 129,331 TPS
>> 23:46:13 INFO  Terms           :: Add: 1,000,000 Index (Batch: 237,755 /
>> Avg: 237,755)
>>
>> * The ingest data step really slows down on the "ingest data stage": 
>> At the
>> current rate, if I calculated correctly, it looks like 
>> PKG.CmdxIngestData
>> has 10 days left before it finishes.
>
> Ouch.
>
>> * When I saw sort running in the background for the first parts of 
>> the job,
>> I looked at the `sort` command. I noticed from some online sources that
>> setting the environment variable LC_ALL=C improves speed for `sort`. 
>> Could
>> this be set on the ProcessBuilder for the `sort` process? Could it
>> break/change something? I see the warning from the man page for `sort`.
>>
>>         *** WARNING *** The locale specified by the environment affects
>>         sort order.  Set LC_ALL=C to get the traditional sort order that
>>         uses native byte values.
>
> It shouldn't matter but, yes, better to set it and export it in the 
> control script and propagate to forked processes.
>
> The sort is doing a binary sort, except that because it is a text sort 
> program, the binary is turned into hex (!!). Hex is in the ASCII subset 
> and should be locale-safe.
>
> But better to set LC_ALL=C.
>
>     Andy
>
>
>>
>> Links:
>> https://access.redhat.com/solutions/445233
>> https://unix.stackexchange.com/questions/579251/how-to-use-parallel-to-speed-up-sort-for-big-files-fitting-in-ram 
>>
>> https://stackoverflow.com/questions/7074430/how-do-we-sort-faster-using-unix-sort 
>>
>>
>> Best regards,
>> Øyvind
>>

Re: Testing tdb2.xloader

Posted by Andy Seaborne <an...@apache.org>.
Hi, Øyvind,

This is all very helpful feedback. Thank you.

On 11/12/2021 21:45, Øyvind Gjesdal wrote:
> I'm trying out tdb2.xloader on an openstack vm, loading the wikidata truthy
> dump downloaded 2021-12-09.

This is the 4.3.0 xloader?

There are improvements in 4.3.1. Since that release was going out, the 
development version, which among other things uses less temporary space, 
got merged in. It has had some testing.

It compresses the triples.tmp and intermediate sort files in the index 
stage making the peak usage much smaller.

> The instance is a vm created on the Norwegian Research and Education Cloud,
> an openstack cloud provider.
> 
> Instance type:
> 32 GB memory
> 4 CPU

I'm using something similar on a 7-year-old desktop machine, SATA disk.

I haven't got a machine I can dedicate to the multi-day load. I'll try 
to find a way to at least push it through building the node table.

Loading the first 1B of truthy:

1B triples , 40kTPS , 06h 54m 10s

The database is 81G and building needs an additional 11.6G for workspace 
for a total of 92G (+ the data file).

While smaller, it seems bz2 files are much slower to decompress, so I've 
been using gz files.

My current best guess for 6.4B truthy is

Temp        96G
Database   540G
Data        48G
Total:     684G  -- peak disk needed

based on scaling up 1B truthy. Personally, I would make sure there was 
more space. Also - I don't know if the shape of the data is sufficiently 
uniform to make scaling predictable.  The time doesn't scale so simply.

This is the 4.3.1 version - the 4.3.0 uses a lot more disk space.

Compression reduces the size of triples.tmp -- and the related sort 
temporary files, which add up to the same again -- to 1/6 of the size.

> The storage used for dump + temp files  is mounted as a separate  900GB
> volume and is mounted on /var/fuseki/databases
> .The type of storage is described as
>>   *mass-storage-default*: Storage backed by spinning hard drives,
> available to everybody and is the default type.
> with ext4 configured. At the moment I don't have access to the faster
> volume type mass-storage-ssd. CPU and memory are not dedicated, and can be
> overcommitted.

"overcommitted" may be a problem.

While it's not "tdb2 loader parallel", it does use CPU continuously in 
several threads.

For memory - "it's complicated".

The Java parts only need, say, 2G. The sort is set to "buffer 50% 
--parallel=2" and the Java process pipes into sort, so that's another 
thread. I think the effective peak is 3 active threads and they'll all 
be at 100% for some of the time.

So it's going to need 50% of RAM + 2G for a Java process, + OS.

It does not need space for memory-mapped files (they aren't used at all 
in the loading process, and I/O is sequential).

If that triggers overcommitment swap-out, the performance may go down a 
lot.

For disk - if that is physically remote, it should not be a problem (famous 
last words). I/O is sequential and in large continuous chunks - typical 
for batch processing jobs.

> OS for the instance is a clean Rocky Linux image, with no services except
> jena/fuseki installed. The systemd service 

  set up for fuseki is stopped.
> jena and fuseki version is 4.3.0.
> 
> openjdk 11.0.13 2021-10-19 LTS
> OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
> OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode, sharing)

Just FYI: Java 17 is a little faster. Some Java improvements have 
improved RDF parsing speed by up to 10%; in xloader that's not significant 
to the overall time.

> I'm running from a tmux session to avoid connectivity issues and to capture
> the output. 

I use

tdb2.xloader .... |& tee LOG-FILE-NAME

to capture the logs and see them. ">&" and "tail -f" would achieve much 
the same effect

> I think the output is stored in memory and not on disk.
> On First run I tried to have the tmpdir on the root partition, to separate
> temp dir and data dir, but with only 19 GB free, the tmpdir soon was disk
> full. For the second (current run) all directories are under
> /var/fuseki/databases.

Yes - after making that mistake myself, the new version ignores system 
TMPDIR.  Using --tmpdir is best but otherwise it defaults to the data 
directory.

> 
>   $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy --tmpdir
> /var/fuseki/databases/tmp latest-truthy.nt.gz
> 
> The import is so far at the "ingest data" stage where it has really slowed
> down.

FYI: The first line of ingest is always very slow. It is not measuring 
the start point correctly.

> 
> Current output is:
> 
> 20:03:43 INFO  Data            :: Add: 502,000,000 Data (Batch: 3,356 /
> Avg: 7,593)
> 
> See full log so far at
> https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab

The earlier first pass also slows down and then it should run at a 
fairly constant-ish speed once everything settles down.

> Some notes:
> 
> * There is a (time/info) lapse in the output log between the  end of
> 'parse' and the start of 'index' for Terms.  It is unclear to me what is
> happening in the 1h13 minutes between the lines.

There is "sort" going on. "top" should show it.

For each index there is also a very long pause for exactly the same 
reason.  It would be good to have something go "tick" and log a 
message occasionally.
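
One possible shape for that, purely as a sketch (this is not existing 
xloader code; the class name, method and log format are made up): a 
scheduled task that prints a line once a minute while waiting for the 
external phase, for example wrapped around the wait on the forked sort.

    // Sketch only, not xloader code: emit a "tick" once a minute while a
    // long external phase (e.g. waiting on the forked sort) is running,
    // so the log never goes silent for an hour.
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class PhaseTicker {
        public static void runWithTicker(String phase, Runnable longRunningWait) {
            ScheduledExecutorService ticker = Executors.newSingleThreadScheduledExecutor();
            long start = System.nanoTime();
            ticker.scheduleAtFixedRate(() -> {
                long secs = TimeUnit.NANOSECONDS.toSeconds(System.nanoTime() - start);
                System.out.printf("INFO  %-15s :: ... still running (%,d seconds)%n", phase, secs);
            }, 60, 60, TimeUnit.SECONDS);
            try {
                longRunningWait.run();   // e.g. wrap the wait on the sort process here
            } finally {
                ticker.shutdownNow();
            }
        }
    }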

> 
> 22:33:46 INFO  Terms           ::   Elapsed: 50,720.20 seconds [2021/12/10
> 22:33:46 CET]
> 22:33:52 INFO  Terms           :: == Parse: 50726.071 seconds :
> 6,560,468,631 triples/quads 129,331 TPS
> 23:46:13 INFO  Terms           :: Add: 1,000,000 Index (Batch: 237,755 /
> Avg: 237,755)
> 
> * The ingest data step really slows down on the "ingest data stage": At the
> current rate, if I calculated correctly, it looks like PKG.CmdxIngestData
> has 10 days left before it finishes.

Ouch.

> * When I saw sort running in the background for the first parts of the job,
> I looked at the `sort` command. I noticed from some online sources that
> setting the environment variable LC_ALL=C improves speed for `sort`. Could
> this be set on the ProcessBuilder for the `sort` process? Could it
> break/change something? I see the warning from the man page for `sort`.
> 
>         *** WARNING *** The locale specified by the environment affects
>         sort order.  Set LC_ALL=C to get the traditional sort order that
>         uses native byte values.

It shouldn't matter but, yes, better to set it and export it in the 
control script and propagate to forked processes.

The sort is doing a binary sort, except that because it is a text sort 
program, the binary is turned into hex (!!). Hex is in the ASCII subset 
and should be locale-safe.
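
A tiny illustration of why the hex form is safe to sort as text under 
byte-value ordering (plain Java, not xloader code): fixed-width lowercase 
hex compares the same way as the underlying numeric values.

    // Illustration only: fixed-width lowercase hex encodings of binary ids
    // compare the same way as the underlying numeric values when sorted as
    // plain bytes (LC_ALL=C).
    public class HexOrder {
        public static void main(String[] args) {
            String a = String.format("%016x", 255L);   // "00000000000000ff"
            String b = String.format("%016x", 4096L);  // "0000000000001000"
            System.out.println(a.compareTo(b) < 0);    // true -- same order as 255 < 4096
        }
    }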

But better to set LC_ALL=C.
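
For reference, a minimal sketch of how that could look on the forked 
process (not the actual xloader source; the class name is made up and 
the settings are written in their GNU sort spellings):

    // Sketch only -- not the actual xloader code. Shows how LC_ALL=C
    // (plus the buffer/parallel settings discussed above) could be passed
    // to the forked sort(1) process via ProcessBuilder.
    import java.io.File;
    import java.util.List;

    public class SortLauncher {
        public static Process startSort(File tmpDir) throws Exception {
            ProcessBuilder pb = new ProcessBuilder(List.of(
                    "sort",
                    "--buffer-size=50%",
                    "--parallel=2",
                    "--temporary-directory=" + tmpDir.getAbsolutePath()));
            pb.environment().put("LC_ALL", "C"); // byte-value ordering, locale-independent
            pb.redirectErrorStream(true);
            return pb.start();                   // caller writes the records to the process stdin
        }
    }

Whether the real ProcessBuilder setup looks like this I can't say; the 
point is just that environment().put("LC_ALL", "C") before start() is 
where it would go.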

     Andy


> 
> Links:
> https://access.redhat.com/solutions/445233
> https://unix.stackexchange.com/questions/579251/how-to-use-parallel-to-speed-up-sort-for-big-files-fitting-in-ram
> https://stackoverflow.com/questions/7074430/how-do-we-sort-faster-using-unix-sort
> 
> Best regards,
> Øyvind
>