Posted to users@jena.apache.org by Laura Morales <la...@mail.com> on 2017/12/14 20:09:21 UTC

Re: Report on loading wikidata (errata)

ERRATA:

> I don't know why then. Maybe SSD is making all the difference. Try to load it (or "latest-all") on a comparable machine using a single SATA disk instead of SSD.

s/SATA/HDD



----------------------------

> I loaded 2.2B on a 16G machine which wasn't even server class (i.e. its
> I/O path to SSD isn't very quick).

I don't know why then. Maybe SSD is making all the difference. Try to load it (or "latest-all") on a comparable machine using a single SATA disk instead of SSD. Around 100-150M triples my computer slows down significantly, and it keeps slowing down from there. All I know is that it's either because of too little RAM, or because the disk can't keep up.
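
One way to tell the two apart while the load is running would be something like this (a rough sketch, assuming the sysstat package provides iostat):

$ iostat -x 5    # per-device utilisation and await times; a saturated HDD sits near 100% util
$ vmstat 5       # the si/so columns show whether the machine is swapping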

> If RAM really is at 1G, even on your small 8G server, that suggests your
> setup is configured in the OS to restrict the RAM for mapping. RAM per
> process should be > real RAM (remember memory-mapped files are used), or
> the VM is set up in some odd way. Or 32-bit Java.

Yeah, sorry, I was looking at shared memory. Right now resident memory is ~3.5GB and virtual ~5.5GB. The process started at 150K triples per second; now, after 250M triples processed, it is at 50K triples/second and still slowing down (processing batches of 25K). I don't know what to say; I think the conclusion is simply that tdbloader (any version) just doesn't work with large graphs on HDDs. So the only solution is to use an SSD, find a way to split the graph into smaller stores, or simply give up.
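
For what it's worth, the loader's memory mapping can be inspected directly while it runs; a rough sketch (the pid placeholder is whatever the loader's java process is):

$ pmap -x <pid of the loader's java process> | tail -n 1    # the "total kB" line shows virtual vs resident size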

$ java -version
openjdk version "1.8.0_151"
OpenJDK Runtime Environment (build 1.8.0_151-8u151-b12-1~deb9u1-b12)
OpenJDK 64-Bit Server VM (build 25.151-b12, mixed mode)

$ ulimit -a
-t: cpu time (seconds) unlimited
-f: file size (blocks) unlimited
-d: data seg size (kbytes) unlimited
-s: stack size (kbytes) 8192
-c: core file size (blocks) 0
-m: resident set size (kbytes) unlimited
-u: processes 31370
-n: file descriptors 1024
-l: locked-in-memory size (kbytes) unlimited
-v: address space (kbytes) unlimited
-x: file locks unlimited
-i: pending signals 31370
-q: bytes in POSIX msg queues 819200
-e: max nice 0
-r: max rt priority 95
-N 15: unlimited

Re: Report on loading wikidata

Posted by Laura Morales <la...@mail.com>.
> The loaders work on empty databases.

Yes, my test is on a new, empty dataset. The command that I use is `tdbloader2 --loc wikidata wikidata.ttl`
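
For comparison I could also try the plain tdbloader on the same empty directory; if I read the Jena wrapper scripts right they honour a JVM_ARGS environment variable, so the heap could be raised explicitly (the value here is only illustrative):

$ JVM_ARGS="-Xmx4G" tdbloader --loc wikidata wikidata.ttl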

> If you are splitting files, and doing partial loads, things are rather different.

No, I'm using the whole file. I'd only consider splitting it if there were a way to use "FROM <wikidata>" as an alias for "FROM <wd-store1> FROM <wd-store2> FROM <wd-store3> ..."
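
(As a toy example of what I mean, with made-up file names: several --data files on the arq command line, like several FROM clauses, are merged into one default graph for the query, but there is no single name that stands for the whole collection.)

$ arq --data wd-part1.ttl --data wd-part2.ttl --data wd-part3.ttl --query count.rq
$ cat count.rq
SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }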

> Maybe swappiness is set to keep a %-age of RAM free.

My swappiness is set to 10.
Disk read speed: 2-3MB/s | Disk write speed: 40-50MB/s (slowing down over time). I think what Dick said is correct; that is, as the index and stored data grow, the disk can't keep up. I think a single HDD just doesn't cut it. Perhaps an SSD can do it; I don't know, because I don't have one. Maybe I should try with many hard disks... one to host the 200GB source, one to handle data-triples.tmp, one for node2id.dat, one for nodes.dat, and so forth...
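
Something like this is what I have in mind, assuming /mnt/disk1 and /mnt/disk2 were separate drives (the paths are made up): keep the source dump and the growing database on different spindles, so the dump reads don't compete with the node table and index writes.

$ mkdir -p /mnt/disk2/wikidata-tdb
$ tdbloader2 --loc /mnt/disk2/wikidata-tdb /mnt/disk1/dumps/wikidata.ttl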

Re: Report on loading wikidata (errata)

Posted by Andy Seaborne <an...@apache.org>.
 >> (processing batches of 25K)

The loaders work on empty databases.

tdbloader will load into an existing one, but it does not do anything
special and you'll get RAM contention.

If you are splitting files, and doing partial loads, things are rather 
different.

 >> Right now resident memory is ~3.5GB and virtual ~5.5GB

Maybe swappiness is set to keep a %-age of RAM free.
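
(Worth checking directly; the knob is standard on Linux:)

$ sysctl vm.swappiness              # 60 is the usual default
$ sudo sysctl -w vm.swappiness=10   # lower it for the duration of the load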

     Andy


Re: Report on loading wikidata (errata)

Posted by dandh988 <da...@gmail.com>.
Your IO doesn't know whether it's coming or going! 
You're reading from a 250GB file whilst writing to two .tmp files and the id-to-node files. Then you're reading data-triples.tmp to sort it, which writes to temporary files whilst chewing RAM (because it's too big to sort in memory), writes out the sorted file, and then reads that whilst writing the index files. Repeat three times.
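
That middle step is essentially an external merge sort. As a rough sketch of the same pattern with GNU sort (the paths are invented), the spill files can at least be pointed at a different disk from the input:

$ sort -T /mnt/otherdisk/tmp -S 2G -o data-triples.sorted data-triples.tmp
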
Your HDD heads can only be in one place at a time, and I suspect you've only got a maximum of 128MB cache on the drive. The queues on the drive will go through the roof, and if the OS decides to page it'll be properly screwed!
An SSD, by analogy, can service deep queues because it can be in more than one place at a time.
Stick the 250GB file on a USB drive to get that read load off the internal IO as a start.
The loader works on HDDs; you just need to be a little smart about the limits of the hardware you're using, and laptops are not known for their IO chipsets. Even my Dell M3800, which is supposed to be a workstation-grade laptop, has one drive and an external SATA connection to help out.


Dick