Posted to users@jena.apache.org by Paul Gearon <ge...@ieee.org> on 2012/11/02 22:24:06 UTC

large load errors

This is probably pushing Jena beyond its design limits, but I thought I'd report on it anyway.

I needed to test some things with large data sets, so I tried to load the
data from http://basekb.com/

Extracting the tar.gz file produces a directory called baseKB filled with 1024 gzipped N-Triples (.nt.gz) files.

On my first attempt, I grabbed a fresh copy of Fuseki 0.2.5 and started it with TDB storage. I didn't want to load 1024 files individually from the control panel, so I used zcat to dump everything into one file and tried loading that from the GUI. This failed in short order with RIOT complaining about memory:

13:24:31 WARN  Fuseki               :: [1] RC = 500 : Java heap space
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:2694)
at java.lang.String.<init>(String.java:234)
at java.lang.StringBuilder.toString(StringBuilder.java:405)
at org.openjena.riot.tokens.TokenizerText.readIRI(TokenizerText.java:476)
...etc...

I'm wondering whether RIOT really needed to run out of memory there.

Anyway, I went back to the individual files. That meant using a non-GUI approach. I wasn't sure which media type to use for N-Triples, but N-Triples is compatible with Turtle, so I used text/turtle.

I threw away the DB directory and started again. This time I tried to load
the files with the following bash:

for i in *.nt.gz; do
  echo "Loading $i"
  zcat "$i" | curl -X POST -H "Content-Type: text/turtle" \
      --upload-file - "http://localhost:3030/dataset/data?default"
done
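
In hindsight, a variant of that loop which stops at the first failing upload would have made the errors below easier to spot. A sketch (same endpoint, untested):

for i in *.nt.gz; do
  echo "Loading $i"
  # -f makes curl exit non-zero on an HTTP error, so the loop can stop there
  zcat "$i" | curl -sS -f -X POST -H "Content-Type: text/turtle" \
      --upload-file - "http://localhost:3030/dataset/data?default" || break
done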

This started reasonably well. A number of warnings showed up on the server
side, due to bad language tags and invalid IRIs, but it kept going.
However, on the 20th file I started seeing these:
Loading triples0000.nt.gz
Loading triples0001.nt.gz
Loading triples0002.nt.gz
Loading triples0003.nt.gz
Loading triples0004.nt.gz
Loading triples0005.nt.gz
Loading triples0006.nt.gz
Loading triples0007.nt.gz
Loading triples0008.nt.gz
Loading triples0009.nt.gz
Loading triples0010.nt.gz
Loading triples0011.nt.gz
Loading triples0012.nt.gz
Loading triples0013.nt.gz
Loading triples0014.nt.gz
Loading triples0015.nt.gz
Loading triples0016.nt.gz
Loading triples0017.nt.gz
Loading triples0018.nt.gz
Loading triples0019.nt.gz
Error 500: GC overhead limit exceeded


Fuseki - version 0.2.5 (Build date: 2012-10-20T17:03:29+0100)
Loading triples0020.nt.gz
Error 500: GC overhead limit exceeded


Fuseki - version 0.2.5 (Build date: 2012-10-20T17:03:29+0100)
Loading triples0021.nt.gz
Error 500: GC overhead limit exceeded


Fuseki - version 0.2.5 (Build date: 2012-10-20T17:03:29+0100)

This kept going until triples0042.nt.gz, where it hung for hours.

Meanwhile, on the server, I was still seeing parser warnings, but also
messages like:
17:01:26 WARN  SPARQL_REST$HttpActionREST :: Transaction still active in
endWriter - no commit or abort seen (forced abort)
17:01:26 WARN  Fuseki               :: [33] RC = 500 : GC overhead limit
exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded

When I finally killed it (with Ctrl-C), I got several stack traces in the stdout log. They appeared to indicate a bad state, so I've saved them and put them up at: http://pastebin.com/yar5Pq85

While OOM is very hard to deal with, I'm still surprised to see it hit this
way, so I thought you might be interested to see it.

Regards,
Paul Gearon

Re: large load errors

Posted by Andy Seaborne <an...@apache.org>.
Paul,

The default heap size is quite small (1.2G) because it has to work on 
32-bit systems as well.  Set JVM_ARGS if you are using the fuseki-server 
script, or run the server directly with java -jar and pass the heap size 
yourself.
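
For example, something along these lines (heap size, database location and dataset name are only placeholders):

JVM_ARGS="-Xmx4G" ./fuseki-server --update --loc=DB /dataset

# or run the jar directly and set the heap yourself
java -Xmx4G -jar fuseki-server.jar --update --loc=DB /dataset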

(Hmm - the "fuseki" script does

JAVA_OPTIONS+=("-Dlog4j.configuration=log4j.properties" "-Xmx1200M")

so it is not checking if -Xmx is already set.
)
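
A possible guard, as a sketch (reusing the variable from that snippet; untested):

if [[ "${JAVA_OPTIONS[*]}" != *-Xmx* ]]; then
  # only apply the default heap if the caller did not set one
  JAVA_OPTIONS+=("-Xmx1200M")
fi
JAVA_OPTIONS+=("-Dlog4j.configuration=log4j.properties")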

But the bulk loader, with its file-manipulation (and very 
non-transactional!) tricks, is significantly faster.

	Andy

On 02/11/12 23:45, Rob Vesse wrote:
> In the meantime you might want to try using tdbloader/tdbloader2
> (http://jena.apache.org/documentation/tdb/commands.html#tdbloader2) to
> create the TDB dataset offline instead
>
>
> You can then start up a Fuseki server and connect to the TDB dataset you
> created
>
> Rob


Re: large load errors

Posted by Rob Vesse <rv...@yarcdata.com>.
In the meantime you might want to try using tdbloader/tdbloader2
(http://jena.apache.org/documentation/tdb/commands.html#tdbloader2) to
create the TDB dataset offline instead.


You can then start up a Fuseki server and connect to the TDB dataset you
created.
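
For example, roughly (paths are placeholders, and if tdbloader2 won't take the gzipped files directly you can zcat them first):

tdbloader2 --loc /path/to/DB baseKB/*.nt.gz

# then point Fuseki at the prebuilt database
fuseki-server --loc=/path/to/DB /dataset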

Rob



Re: large load errors

Posted by Stephen Allen <sa...@apache.org>.
Hi Paul,

Thanks for the report.  This is a known issue in Fuseki (see JENA-309
[1]).  I have plans to work on this soon.  I'm also a little surprised
that your second attempt, after breaking the data into chunks, failed;
I'll take a look at that.

I am also working on a related issue (JENA-330 [2]) that will
eliminate limits on SPARQL Update queries.  I hope to have that
checked into the trunk soon.

-Stephen

[1] https://issues.apache.org/jira/browse/JENA-309
[2] https://issues.apache.org/jira/browse/JENA-330




RE: large load errors

Posted by John Fereira <ja...@cornell.edu>.
That doesn't sound so much like a Jena issue as simply not allocating enough memory for the JVM running the application. You can increase the amount of memory available by adding the -XmxNNNNM option when starting up Fuseki (where NNNN is the amount of memory to use, in megabytes). I start it up on my laptop (which doesn't have a huge amount of memory) using a bash script that looks something like this:

#!/bin/sh
port=3030
java -cp ./fuseki-server.jar:lib:lib/sdb-1.3.4.jar:lib/mysql-connector-java-5.1.16-bin.jar:lib/arq-2.8.8.jar \
  -Xmx1024M org.apache.jena.fuseki.FusekiCmd \
  --desc fuseki.ttl --port=$port /ds > fuseki.log 2>&1 &

Note that I'm telling the JVM to use 1024M of memory. To load very large datasets you may need a machine with a lot of memory, and you can then increase the memory allocation as necessary.
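
To double-check that the setting took effect, something like this (assuming a JDK is on the PATH) shows the arguments the running JVM actually received:

jps -lvm | grep -i fuseki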
