
Posted to users@jena.apache.org by Timothy Lebo <le...@rpi.edu> on 2014/03/28 15:41:01 UTC

OutOfMemoryError with tdbquery

Jena,

I have a TDB with 4.2 billion triples that I created with tdbloader.
It’s taken from the 2012 Billion Triples Challenge.
I assert three triples for each URL they retrieved (“context”),
e.g. for the URL http://www.hyphen.info/rdf/30.xml:

<http://www.hyphen.info/rdf/30.xml> <http://purl.org/twc/vocab/between-the-edges/root> <http://www.hyphen.info> .
<http://www.hyphen.info> <http://purl.org/twc/vocab/between-the-edges/pld> <http://hyphen.info> .
<http://hyphen.info> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/twc/vocab/between-the-edges/PayLevelDomain> .


When I submit the following query with tdbquery:

select ?url where{?url <http://purl.org/twc/vocab/between-the-edges/root> <http://dbpedia.org>.}

The following Exception is thrown.

I’m assuming that Jena is trying to build up all of the results before reporting them.
Is there a way to just get “the stream” to avoid the memory issue?

Thanks,
Tim Lebo

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
	at com.hp.hpl.jena.tdb.base.record.RecordFactory.create(RecordFactory.java:87)
	at com.hp.hpl.jena.tdb.base.record.RecordFactory.buildFrom(RecordFactory.java:122)
	at com.hp.hpl.jena.tdb.base.buffer.RecordBuffer._get(RecordBuffer.java:107)
	at com.hp.hpl.jena.tdb.base.buffer.RecordBuffer.get(RecordBuffer.java:53)
	at com.hp.hpl.jena.tdb.base.recordbuffer.RecordRangeIterator.hasNext(RecordRangeIterator.java:130)
	at org.openjena.atlas.iterator.Iter$4.hasNext(Iter.java:295)
	at com.hp.hpl.jena.tdb.sys.DatasetControlMRSW$IteratorCheckNotConcurrent.hasNext(DatasetControlMRSW.java:119)
	at org.openjena.atlas.iterator.Iter$4.hasNext(Iter.java:295)
	at org.openjena.atlas.iterator.Iter$3.hasNext(Iter.java:181)
	at org.openjena.atlas.iterator.Iter.hasNext(Iter.java:825)
	at org.openjena.atlas.iterator.RepeatApplyIterator.hasNext(RepeatApplyIterator.java:58)
	at org.openjena.atlas.iterator.Iter$4.hasNext(Iter.java:295)
	at com.hp.hpl.jena.sparql.engine.iterator.QueryIterPlainWrapper.hasNextBinding(QueryIterPlainWrapper.java:54)
	at com.hp.hpl.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:108)
	at com.hp.hpl.jena.sparql.engine.iterator.QueryIterConvert.hasNextBinding(QueryIterConvert.java:59)
	at com.hp.hpl.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:108)
	at com.hp.hpl.jena.sparql.engine.iterator.QueryIteratorWrapper.hasNextBinding(QueryIteratorWrapper.java:40)
	at com.hp.hpl.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:108)
	at com.hp.hpl.jena.sparql.engine.iterator.QueryIteratorWrapper.hasNextBinding(QueryIteratorWrapper.java:40)
	at com.hp.hpl.jena.sparql.engine.iterator.QueryIteratorBase.hasNext(QueryIteratorBase.java:108)
	at com.hp.hpl.jena.sparql.engine.ResultSetStream.hasNext(ResultSetStream.java:72)
	at com.hp.hpl.jena.sparql.resultset.ResultSetMem.<init>(ResultSetMem.java:95)
	at com.hp.hpl.jena.sparql.resultset.TextOutput.write(TextOutput.java:147)
	at com.hp.hpl.jena.sparql.resultset.TextOutput.write(TextOutput.java:130)
	at com.hp.hpl.jena.sparql.resultset.TextOutput.write(TextOutput.java:118)
	at com.hp.hpl.jena.sparql.resultset.TextOutput.format(TextOutput.java:65)
	at com.hp.hpl.jena.query.ResultSetFormatter.out(ResultSetFormatter.java:135)
	at com.hp.hpl.jena.sparql.util.QueryExecUtils.outputResultSet(QueryExecUtils.java:157)
	at com.hp.hpl.jena.sparql.util.QueryExecUtils.doSelectQuery(QueryExecUtils.java:199)
	at com.hp.hpl.jena.sparql.util.QueryExecUtils.executeQuery(QueryExecUtils.java:75)
	at arq.query.queryExec(query.java:186)
	at arq.query.exec(query.java:145)

Re: OutOfMemoryError with tdbquery

Posted by Andy Seaborne <an...@apache.org>.

On 28/03/14 15:18, Timothy Lebo wrote:
> Great. Thanks, Andy!
>
> I moved from CSV to the text output because I was fighting a phantom newline that was messing up the downstream processing.
> The phantom won that round, but there’s more days ahead.

Raw newlines - a wonder feature of CSV.

Try TSV - it escapes the newline.

	Andy

Re: OutOfMemoryError with tdbquery

Posted by Timothy Lebo <le...@rpi.edu>.

Great. Thanks, Andy!

I moved from CSV to the text output because I was fighting a phantom newline that was messing up the downstream processing.
The phantom won that round, but there’s more days ahead.

Regards,
Tim


Re: OutOfMemoryError with tdbquery

Posted by Andy Seaborne <an...@apache.org>.

> 	at com.hp.hpl.jena.sparql.resultset.TextOutput.format(TextOutput.java:65)
> 	at com.hp.hpl.jena.query.ResultSetFormatter.out(ResultSetFormatter.java:135)
> 	at com.hp.hpl.jena.sparql.util.QueryExecUtils.outputResultSet(QueryExecUtils.java:157)
> 	at com.hp.hpl.jena.sparql.util.QueryExecUtils.doSelectQuery(QueryExecUtils.java:199)
> 	at com.hp.hpl.jena.sparql.util.QueryExecUtils.executeQuery(QueryExecUtils.java:75)

Looks like you are trying to output as formatted text.

The text format aligns column widths, so it needs to scan the entire
result set to find column widths, then go back and actually write stuff.

It takes a copy of the whole results to do that.

You can use a streaming format like JSON, TSV, CSV (the last two can be 
thought of as unformatted text).
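
To make the difference concrete, here is a minimal sketch in plain JDK Java. It is not Jena's code: the class and method names (ResultWriters, streamTsv, writeAligned) are hypothetical, and rows are modelled as String[] rather than query solutions. It only illustrates why an aligned table has to hold every row before writing the first one, while a TSV-style writer can emit each row as it arrives.

```java
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only; this is not Jena's TextOutput or TSV writer.
class ResultWriters {

    // Streaming: each row is written as soon as it is read and is not
    // retained afterwards, so this method's memory use stays flat.
    static void streamTsv(Iterator<String[]> rows, PrintWriter out) {
        while (rows.hasNext()) {
            out.print(String.join("\t", rows.next()));
            out.print('\n');
        }
    }

    // Aligned text: the widest value per column is only known after the
    // last row has been seen, so every row is copied into memory first.
    static void writeAligned(Iterator<String[]> rows, PrintWriter out) {
        List<String[]> copy = new ArrayList<>();
        List<Integer> widths = new ArrayList<>();
        while (rows.hasNext()) {                  // pass 1: buffer and measure
            String[] row = rows.next();
            copy.add(row);
            for (int i = 0; i < row.length; i++) {
                if (i == widths.size()) widths.add(0);
                widths.set(i, Math.max(widths.get(i), row[i].length()));
            }
        }
        for (String[] row : copy) {               // pass 2: write padded cells
            out.print('|');
            for (int i = 0; i < row.length; i++) {
                out.print(' ');
                out.print(row[i]);
                for (int p = row[i].length(); p < widths.get(i); p++) out.print(' ');
                out.print(" |");
            }
            out.print('\n');
        }
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[] {"a", "bb"}, new String[] {"ccc", "d"});
        PrintWriter out = new PrintWriter(System.out);
        streamTsv(rows.iterator(), out);
        writeAligned(rows.iterator(), out);
        out.flush();
    }
}
```

With a result set of millions of rows, the list named copy in writeAligned is the part that grows with the result size, which matches the ResultSetMem frame in the stack trace above.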

	Andy


Re: OutOfMemoryError with tdbquery

Posted by Timothy Lebo <le...@rpi.edu>.

Thanks, David.

Bumping the heap from 1 GB to 4 GB handled it, producing:

38 MB of gzipped dbpedia URLs, 
8 MB of gzipped freebase URLs, and 
7 MB of gzipped reference.data.gov.uk URLs.
(the only three “big” domains)

I’ll put the streaming question on hold until I run out of memory :-)

Regards,
Tim


RE: OutOfMemoryError with tdbquery

Posted by David Jordan <Da...@sas.com>.

The first question to answer is how much memory you have allocated to the Java heap. You can control this. The default JVM heap size will very likely be too small.
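
As a quick way to check what a JVM was actually given, a small sketch like the following prints the heap ceiling. The class name HeapCheck is hypothetical; Runtime.maxMemory() is the standard JDK call.

```java
// Hypothetical helper: reports the heap ceiling of the running JVM.
class HeapCheck {

    // Maximum heap the JVM will attempt to use, in MiB.
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

Running it as "java -Xmx4g HeapCheck" should report a value close to 4096 MiB, since -Xmx is the standard flag for the maximum heap size. How that flag reaches tdbquery depends on the wrapper script; the Jena command-line scripts are believed to read a JVM_ARGS environment variable, but that is an assumption to verify against the installed version.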
