Posted to users@jena.apache.org by Rick Moynihan <ri...@swirrl.com> on 2013/12/12 12:03:16 UTC

Fuseki v1.0.0 data import problems

Hi all,

I have a script which dumps 2 modestly sized n-triples files into fuseki
via curl and a HTTP PUT.

e.g. the script does the following 2 actions:

curl -X PUT --data-binary @data/file-1.nt -H 'Content-Type: text/plain' '
http://localhost:3030/linkeddev-test/data?graph=http://foo-bar.org/graph1'

curl -X PUT --data-binary @data/file-2.nt -H 'Content-Type: text/plain' '
http://localhost:3030/linkeddev-test/data?graph=http://foo-bar.org/graph2'

File 1 is 162 MB
File 2 is 223 MB

Sometimes this imports fine, other times the import takes minutes, Fuseki
consumes 380% CPU and I have to kill it after a few minutes.

At least once the import finished after a few minutes, but Fuseki continued
to consume 380% CPU for about 20 minutes afterwards (despite there being no
load on it at all).  A short while later it crashed with an
OutOfMemoryException.

I'm using TDB for storage.

I'm a little concerned by the non-deterministic nature of the issue, but
it seems to occur frequently... in fact it seems to have problems more often
than not.

Any help or suggestions much appreciated.

R.

Re: Fuseki v1.0.0 data import problems

Posted by Andy Seaborne <an...@apache.org>.
On 14/12/13 22:18, Rick Moynihan wrote:
> On Thu, Dec 12, 2013 at 2:22 PM, Andy Seaborne <an...@apache.org> wrote:
>
>> Hi Rick,
>>
>> On 12/12/13 11:03, Rick Moynihan wrote:
>>
>>> Hi all,
>>>
>>> I have a script which dumps 2 modestly sized n-triples files into fuseki
>>> via curl and a HTTP PUT.
>>>
>>> e.g. the script does the following 2 actions:
>>>
>>> curl -X PUT --data-binary @data/file-1.nt -H 'Content-Type: text/plain' '
>>> http://localhost:3030/linkeddev-test/data?graph=http://foo-bar.org/graph1
>>
>>
>>   curl -X PUT --data-binary @data/file-2.nt -H 'Content-Type: text/plain' '
>>> http://localhost:3030/linkeddev-test/data?graph=http://foo-bar.org/graph2
>>> '
>>>
>>
>> And it does them one after the other, never in parallel?
>
>
> Yes they're sequential, never parallel.  Is parallel update an issue?

No, it shouldn't be - there can be only one actively running write
transaction, mediated by an internal lock.

>
>
>>
>>
>>> File 1 is 162 MB
>>> File 2 is 223 MB
>>>
>>
>> so about 1.6 and 2.2 million triples?
>
>
> 740,000 and 1.6 million.
>
>>
>>
>>   Sometimes this imports fine, other times the import takes minutes, Fuseki
>>> consumes 380% CPU and I have to kill it after a few minutes.
>>>
>>
>> When it's fine, how long does it take?
>>
>>
> Approximately 2m 40s for both datasets.
>
>
>> It might be GC pressure and it's GC'ing very hard but not making significant
>> progress - this can show as very high CPU, nothing happening and then
>> OOME. How much heap have you given the java process?
>>
>> The other thing to look at is memory-mapped files. TDB uses mmapped files
>> which are not part of the Java heap.  Don't give Fuseki all of the RAM
>> for the heap - leave as much for the OS to use for file system cache as
>> possible (but Fuseki still needs a decent heap to manage transactions).
>>
>
> Thanks for the advice.  Raising the heap from 1.2 GB to 4 GB seems to have
> made the problem disappear.

OK then - it looks like it was close to out-of-memory and GCs were being
scheduled very frequently.

>
>>
>> I assume it's a 64-bit machine, but which OS? (Even amongst Linuxes,
>> handling of mmap varies for reasons I don't understand.)
>>
>
> It's a Mac.
>
>
>
> R.
>


Re: Fuseki v1.0.0 data import problems

Posted by Rick Moynihan <ri...@swirrl.com>.
On Thu, Dec 12, 2013 at 2:22 PM, Andy Seaborne <an...@apache.org> wrote:

> Hi Rick,
>
> On 12/12/13 11:03, Rick Moynihan wrote:
>
>> Hi all,
>>
>> I have a script which dumps 2 modestly sized n-triples files into fuseki
>> via curl and a HTTP PUT.
>>
>> e.g. the script does the following 2 actions:
>>
>> curl -X PUT --data-binary @data/file-1.nt -H 'Content-Type: text/plain' '
>> http://localhost:3030/linkeddev-test/data?graph=http://foo-bar.org/graph1
>> '
>>
>>
> Unrelated, aesthetically better:
> Content-Type: application/n-triples
>
> (Fuseki/RIOT ignores text/plain and uses the file extension - text/plain
> is wrong so often that it's unreliable.)



Good point.


>
>
>  curl -X PUT --data-binary @data/file-2.nt -H 'Content-Type: text/plain' '
>> http://localhost:3030/linkeddev-test/data?graph=http://foo-bar.org/graph2
>> '
>>
>
> And it does them one after the other, never in parallel?


Yes they're sequential, never parallel.  Is parallel update an issue?


>
>
>> File 1 is 162 MB
>> File 2 is 223 MB
>>
>
> so about 1.6 and 2.2 million triples?


740,000 and 1.6 million.
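
For reference, N-Triples is one triple per line, so - assuming the files
contain no blank or comment lines - a plain line count gives the triple count:

  wc -l data/file-1.nt data/file-2.nt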

>
>
>  Sometimes this imports fine, other times the import takes minutes, Fuseki
>> consumes 380% CPU and I have to kill it after a few minutes.
>>
>
> When it's fine, how long does it take?
>
>
Approximately 2m 40s for both datasets.


> It might be GC pressure and it's GC'ing very hard but not making significant
> progress - this can show as very high CPU, nothing happening and then
> OOME. How much heap have you given the java process?
>
> The other thing to look at is memory-mapped files. TDB uses mmapped files
> which are not part of the Java heap.  Don't give Fuseki all of the RAM
> for the heap - leave as much for the OS to use for file system cache as
> possible (but Fuseki still needs a decent heap to manage transactions).
>

Thanks for the advice.  Raising the heap from 1.2 GB to 4 GB seems to have
made the problem disappear.

>
> I assume it's a 64-bit machine, but which OS? (Even amongst Linuxes,
> handling of mmap varies for reasons I don't understand.)
>

It's a Mac.



R.

Re: Fuseki v1.0.0 data import problems

Posted by Andy Seaborne <an...@apache.org>.
Hi Rick,

On 12/12/13 11:03, Rick Moynihan wrote:
> Hi all,
>
> I have a script which dumps 2 modestly sized n-triples files into fuseki
> via curl and a HTTP PUT.
>
> e.g. the script does the following 2 actions:
>
> curl -X PUT --data-binary @data/file-1.nt -H 'Content-Type: text/plain' '
> http://localhost:3030/linkeddev-test/data?graph=http://foo-bar.org/graph1'
>

Unrelated, aesthetically better:
Content-Type: application/n-triples

(Fuseki/RIOT ignores text/plain and uses the file extension - text/plain
is wrong so often that it's unreliable.)
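
For example, the first PUT above would then be (only the header changes):

  curl -X PUT --data-binary @data/file-1.nt \
       -H 'Content-Type: application/n-triples' \
       'http://localhost:3030/linkeddev-test/data?graph=http://foo-bar.org/graph1'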

> curl -X PUT --data-binary @data/file-2.nt -H 'Content-Type: text/plain' '
> http://localhost:3030/linkeddev-test/data?graph=http://foo-bar.org/graph2'

And it does them one after the other, never in parallel?

>
> File 1 is 162 MB
> File 2 is 223 MB

so about 1.6 and 2.2 million triples?

> Sometimes this imports fine, other times the import takes minutes, Fuseki
> consumes 380% CPU and I have to kill it after a few minutes.

When it's fine, how long does it take?

> At least once the import finished after a few minutes, but Fuseki continued
> to consume 380% CPU for about 20 minutes afterwards (despite there being no
> load on it at all).  A short while later it crashed with an
> OutOfMemoryException.
 >
> I'm using TDB for storage.
>
> I'm a little concerned by the non-deterministic nature of the issue, but
> it seems to occur frequently... in fact it seems to have problems more often
> than not.
>
> Any help or suggestions much appreciated.
>
> R.

It might be GC pressure and it's GC'ing very hard but not making significant
progress - this can show as very high CPU, nothing happening and then
OOME. How much heap have you given the java process?

The other thing to look at is memory-mapped files. TDB uses mmapped files
which are not part of the Java heap.  Don't give Fuseki all of the RAM
for the heap - leave as much for the OS to use for file system cache as
possible (but Fuseki still needs a decent heap to manage transactions).
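
For example, a minimal sketch of one way to do that, assuming the stock
fuseki-server script (which picks up JVM_ARGS) and a TDB database at
/path/to/DB - the path and heap size here are illustrative:

  JVM_ARGS="-Xmx4G" ./fuseki-server --update --loc=/path/to/DB /linkeddev-test

Adding -verbose:gc to JVM_ARGS will also show whether the process really is
spending its time in GC.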

I assume it's a 64-bit machine, but which OS? (Even amongst Linuxes,
handling of mmap varies for reasons I don't understand.)

(Linux-specific question:)

What does top show for the process in terms of real and virtual memory?
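
For illustration (Linux procps flags; assumes the Fuseki JVM is the only
matching java process):

  # resident (RSS) and virtual (VSZ) size, in KB
  ps -o pid,rss,vsz,args -C java

  # or watch it live
  top -p $(pgrep -f fuseki-server)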

	Andy