Posted to users@jena.apache.org by Michael Brunnbauer <br...@netestate.de> on 2012/07/24 13:13:59 UTC
Re: tdbdump Exception
Hello Andy,
On Thu, Jun 14, 2012 at 01:12:25PM +0100, Andy Seaborne wrote:
> > I guess it would be a good idea to look at the end of the dump and check the
> > corresponding named graph for bad datetimes?
>
> Yes - my best guess at the moment is that a dateTime can get in (they
> are encoded into 56 bits, not recorded using the lexical form) but there
> was a problem on the recreation of the lexical form. Whether the
> encoding or decoding is wrong, I can't tell.
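Andy's note that a dateTime is packed into 56 bits (rather than stored by lexical form) can be sketched as follows. The field layout below is purely hypothetical, for illustration only; TDB's actual DateTimeNode encoding differs in detail. The point it shows is the failure mode under discussion: the decoder must mirror the encoder bit for bit, and the timezone field is the easiest place for the two to disagree.

```java
// Hypothetical 56-bit dateTime packing (NOT TDB's real layout):
// year:13 | month:4 | day:5 | hour:5 | minute:6 | second:6 | tz:11 | spare:6
public class DateTimePack {
    static long pack(int year, int month, int day,
                     int hour, int minute, int second, int tzMinutes) {
        long v = 0;
        v = (v << 13) | (year & 0x1FFFL);
        v = (v << 4)  | (month & 0xFL);
        v = (v << 5)  | (day & 0x1FL);
        v = (v << 5)  | (hour & 0x1FL);
        v = (v << 6)  | (minute & 0x3FL);
        v = (v << 6)  | (second & 0x3FL);
        // Timezone offset in minutes, shifted to be non-negative
        // (-14:00..+14:00 maps to 0..1680, which fits in 11 bits).
        // The decoder must undo exactly this shift; a mismatch here
        // is the kind of bug being discussed.
        v = (v << 11) | ((tzMinutes + 840) & 0x7FFL);
        return v << 6; // spare bits
    }

    static int unpackTz(long v) {
        return (int) ((v >> 6) & 0x7FF) - 840; // must mirror pack()
    }

    public static void main(String[] args) {
        // Newfoundland-style offset: -03:30 is -210 minutes.
        long v = pack(2012, 7, 24, 13, 13, 59, -210);
        System.out.println(unpackTz(v)); // -210
    }
}
```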
I was not able to find the named graph causing the problem so I recreated the
TDB with tdbloader2 from apache-jena-2.7.2 and tried tdbdump from
apache-jena-2.7.2 immediately after that. The result is that I seem to run
into the same problem:
Exception in thread "main" org.openjena.atlas.AtlasException: formatInt: overflow
at org.openjena.atlas.lib.NumberUtils.formatUnsignedInt(NumberUtils.java:115)
at org.openjena.atlas.lib.NumberUtils.formatInt(NumberUtils.java:87)
at org.openjena.atlas.lib.NumberUtils.formatInt(NumberUtils.java:60)
at com.hp.hpl.jena.tdb.store.DateTimeNode.unpack(DateTimeNode.java:255)
at com.hp.hpl.jena.tdb.store.DateTimeNode.unpackDateTime(DateTimeNode.java:180)
at com.hp.hpl.jena.tdb.store.NodeId.extract(NodeId.java:313)
at com.hp.hpl.jena.tdb.nodetable.NodeTableInline.getNodeForNodeId(NodeTableInline.java:64)
at com.hp.hpl.jena.tdb.lib.TupleLib.quad(TupleLib.java:163)
at com.hp.hpl.jena.tdb.lib.TupleLib.quad(TupleLib.java:155)
at com.hp.hpl.jena.tdb.lib.TupleLib.access$100(TupleLib.java:45)
at com.hp.hpl.jena.tdb.lib.TupleLib$4.convert(TupleLib.java:89)
at com.hp.hpl.jena.tdb.lib.TupleLib$4.convert(TupleLib.java:85)
at org.openjena.atlas.iterator.Iter$4.next(Iter.java:301)
at org.openjena.atlas.iterator.IteratorCons.next(IteratorCons.java:94)
at org.openjena.atlas.iterator.Iter.sendToSink(Iter.java:560)
at org.openjena.riot.out.NQuadsWriter.write(NQuadsWriter.java:45)
at org.openjena.riot.out.NQuadsWriter.write(NQuadsWriter.java:37)
at org.openjena.riot.RiotWriter.writeNQuads(RiotWriter.java:41)
at tdb.tdbdump.exec(tdbdump.java:49)
at arq.cmdline.CmdMain.mainMethod(CmdMain.java:101)
at arq.cmdline.CmdMain.mainRun(CmdMain.java:63)
at arq.cmdline.CmdMain.mainRun(CmdMain.java:50)
at tdb.tdbdump.main(tdbdump.java:31)
This seems to be a serious issue.
BTW: Here is some output from tdbloader2 for this TDB which shows that
the tdbloader2 data phase runtime gets quite non-linear for very big datasets.
I called tdbloader2 with JVM_ARGS="-Xmx32768M -server" and it did not seem to
run into memory problems.
12:39:17 -- TDB Bulk Loader Start
12:39:17 Data phase
...
INFO Add: 100,000,000 Data (Batch: 68,027 / Avg: 57,649)
...
INFO Add: 500,000,000 Data (Batch: 55,309 / Avg: 41,446)
...
INFO Add: 1,000,000,000 Data (Batch: 27,901 / Avg: 24,119)
...
INFO Add: 1,100,000,000 Data (Batch: 335 / Avg: 6,308)
...
INFO Add: 1,138,800,000 Data (Batch: 256 / Avg: 5,038)
...
INFO Total: 1,138,845,529 tuples : 227,654.44 seconds : 5,002.52 tuples/sec [2012/07/22 03:53:36 CEST]
...
20:24:24 -- TDB Bulk Loader Finish
20:24:24 -- 373477 seconds
Regards,
Michael Brunnbauer
--
++ Michael Brunnbauer
++ netEstate GmbH
++ Geisenhausener Straße 11a
++ 81379 München
++ Tel +49 89 32 19 77 80
++ Fax +49 89 32 19 77 89
++ E-Mail brunni@netestate.de
++ http://www.netestate.de/
++
++ Sitz: München, HRB Nr.142452 (Handelsregister B München)
++ USt-IdNr. DE221033342
++ Geschäftsführer: Michael Brunnbauer, Franz Brunnbauer
++ Prokurist: Dipl. Kfm. (Univ.) Markus Hendel
Re: tdbloader2 performance for 1B+ triples
Posted by Andy Seaborne <an...@apache.org>.
On 10/08/12 18:40, Michael Brunnbauer wrote:
> It's an N-Quads file with 4.9 million named graphs
OK - that's a difference - may be related. I'll take a look although I
need to create some test data first.
If it is this, then a smaller heap and more manageable size of data is
something I can do.
> - data we crawled from the web
> for foaf-search.net with the file URL as graph. Many FOAF profiles with blank
> nodes, but also other data that uses foaf:name. There are also other N-Quads
> files with DBpedia quads, but these should be processed much later.
>
>> Many long literals? (might explain why the default setting was not enough)
>
> I don't know. I can check with a SPARQL query on an older TDB version
> of the data if you want.
It would be useful to know, because the node table cache growing might
(guess) be squeezing the rest of the system. It's count-based, so large
literals mean the cache takes more bytes.
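The count-based cache behaviour Andy describes can be sketched with a minimal count-bounded LRU. This class is an illustration of the general technique, not Jena's actual NodeTableCache: it evicts on entry count, so its byte footprint grows with the size of the literals stored in it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A count-bounded LRU cache: eviction happens when the ENTRY COUNT
// exceeds a limit, regardless of how many bytes each entry holds.
// With long literals as values, 100k entries of 10KB each is ~1GB of
// heap even though the cache "size" looks modest.
public class CountBoundedCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public CountBoundedCache(int maxEntries) {
        super(16, 0.75f, true); // access order => LRU behaviour
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries; // count-based, not byte-based
    }
}
```

A byte-bounded variant would track the size of each value and evict until a byte budget is met, which is what would keep large literals from inflating the heap.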
Andy
>
> Regards,
>
> Michael Brunnbauer
>
Re: tdbloader2 performance for 1B+ triples
Posted by Michael Brunnbauer <br...@netestate.de>.
Hello Andy,
On Fri, Aug 10, 2012 at 01:57:39PM +0100, Andy Seaborne wrote:
> Previously, with 32G heap:
No, with a 2048M heap.
> What is the data like? The data shape should only affect the building
> of the node table.
It's an N-Quads file with 4.9 million named graphs - data we crawled from the web
for foaf-search.net with the file URL as graph. Many FOAF profiles with blank
nodes, but also other data that uses foaf:name. There are also other N-Quads
files with DBpedia quads, but these should be processed much later.
> Many long literals? (might explain why the default setting was not enough)
I don't know. I can check with a SPARQL query on an older TDB version
of the data if you want.
Regards,
Michael Brunnbauer
Re: tdbloader2 performance for 1B+ triples
Posted by Andy Seaborne <an...@apache.org>.
On 10/08/12 06:34, Michael Brunnbauer wrote:
>
> Hello Andy,
>
> [tdbloader2]
>
> On Thu, Aug 09, 2012 at 06:53:59PM +0200, Michael Brunnbauer wrote:
>> INFO Add: 55,550,000 Data (Batch: 52 / Avg: 8,785)
>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>> Any idea what a good value for -Xmx for 1B+ triples would be ?
>> I will try with 16384 now.
>
> -Xmx16384M throws the memory error after 478 million triples:
>
> INFO Add: 478,600,000 Data (Batch: 247 / Avg: 13,627)
> Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
478 million / 16G heap
This is bizarre.
Previously, with 32G heap:
> INFO Add: 55,500,000 Data (Batch: 98 / Avg: 10,335)
> INFO Elapsed: 5,369.59 seconds [2012/08/09 17:45:44 CEST]
> INFO Add: 55,550,000 Data (Batch: 52 / Avg: 8,785)
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
which is 55 million, a lot less than when you decreased the heap size.
This morning, I have loaded (this is the end of the data phase,
reformatted):
11:46:38 INFO loader :: Add: 747,400,000 Data (Batch: 187,969 / Avg: 131,924)
11:46:43 INFO loader :: Total: 747,436,151 tuples : 5,669.75 seconds : 131,828.81 tuples/sec [2012/08/10 11:46:43 UTC]
with no change to tdbloader2 other than fixing the classpath setting bug,
so it's -Xmx1200M.
The machine is a 34G machine on Amazon - I even forgot to halt the large
dataset it is hosting, but it's not public yet and only the odd developer
is testing against it.
What is the data like? The data shape should only affect the building
of the node table.
Many long literals? (might explain why the default setting was not enough)
but that does not explain why decreasing the heap size means it gets
further.
Unrelated:
I have noticed the parameters to sort(1) could be a lot better ...
e.g.
--buffer-size=50% --parallel=3
I'll try that out but you're crashing out in the data phase before index
creation.
Andy
>
> Regards,
>
> Michael Brunnbauer
>
Re: tdbdump Exception
Posted by Michael Brunnbauer <br...@netestate.de>.
Hello Andy,
[tdbloader2]
On Thu, Aug 09, 2012 at 06:53:59PM +0200, Michael Brunnbauer wrote:
> INFO Add: 55,550,000 Data (Batch: 52 / Avg: 8,785)
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> Any idea what a good value for -Xmx for 1B+ triples would be ?
> I will try with 16384 now.
-Xmx16384M throws the memory error after 478 million triples:
INFO Add: 478,600,000 Data (Batch: 247 / Avg: 13,627)
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
Regards,
Michael Brunnbauer
Re: tdbdump Exception
Posted by Michael Brunnbauer <br...@netestate.de>.
Hello Andy,
[tdbloader2 performance for 1B+ triples]
On Mon, Jul 30, 2012 at 05:06:55PM +0100, Andy Seaborne wrote:
> >>How big are the node* files (node2id.dat, .idn, nodes.dat) in the
> >>resulting database in this case?
> >
> >node2id.dat 9470738432 bytes
>
> 9,470,738,432 => 9G
>
> >node2id.idn 50331648 bytes
>
> 50,331,648 => 50M
>
> Much less than RAM size.
>
> >nodes.dat 20182577027 bytes
>
> This file is written sequentially and isn't read during loading so
> should not be an issue.
>
> In 64 bit mode, the B+Tree node2id is a memory mapped file and the OS
> takes care of paging+caching the data.
>
> I think that use of
>
> JVM_ARGS="-Xmx32768M -server"
>
> is in fact making things worse: the heap grows to 32G, reducing the
> space available to the OS for mmap files. So it is squeezing out the OS
> managed mmap files and the result is that there is little real RAM
> devoted to caching the node table.
>
> 2G heap should be enough IIRC (caveat long literals).
The -Xmx32768M is not there without reason. I've had out-of-memory errors with
much higher values and earlier Jena versions. I tried JVM_ARGS="-Xmx2048M"
with tdbloader2 from apache-jena-2.7.3 and the error came after 55 million triples:
INFO Add: 55,300,000 Data (Batch: 281 / Avg: 13,794)
INFO Add: 55,350,000 Data (Batch: 227 / Avg: 13,088)
INFO Add: 55,400,000 Data (Batch: 192 / Avg: 12,342)
INFO Add: 55,450,000 Data (Batch: 134 / Avg: 11,406)
INFO Add: 55,500,000 Data (Batch: 98 / Avg: 10,335)
INFO Elapsed: 5,369.59 seconds [2012/08/09 17:45:44 CEST]
INFO Add: 55,550,000 Data (Batch: 52 / Avg: 8,785)
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2882)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
at java.lang.StringBuilder.append(StringBuilder.java:119)
at com.hp.hpl.jena.tdb.lib.NodeLib.hash(NodeLib.java:160)
at com.hp.hpl.jena.tdb.lib.NodeLib.setHash(NodeLib.java:116)
at com.hp.hpl.jena.tdb.nodetable.NodeTableNative.accessIndex(NodeTableNative.java:124)
at com.hp.hpl.jena.tdb.nodetable.NodeTableNative._idForNode(NodeTableNative.java:117)
at com.hp.hpl.jena.tdb.nodetable.NodeTableNative.getAllocateNodeId(NodeTableNative.java:83)
at com.hp.hpl.jena.tdb.nodetable.NodeTableCache._idForNode(NodeTableCache.java:123)
at com.hp.hpl.jena.tdb.nodetable.NodeTableCache.getAllocateNodeId(NodeTableCache.java:83)
at com.hp.hpl.jena.tdb.nodetable.NodeTableWrapper.getAllocateNodeId(NodeTableWrapper.java:43)
at com.hp.hpl.jena.tdb.nodetable.NodeTableInline.getAllocateNodeId(NodeTableInline.java:51)
at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder$NodeTableBuilder.send(CmdNodeTableBuilder.java:223)
at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder$NodeTableBuilder.send(CmdNodeTableBuilder.java:190)
at org.openjena.riot.lang.LangNTuple.runParser(LangNTuple.java:71)
at org.openjena.riot.lang.LangBase.parse(LangBase.java:43)
at org.openjena.riot.RiotLoader.readQuads(RiotLoader.java:206)
at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.exec(CmdNodeTableBuilder.java:168)
at arq.cmdline.CmdMain.mainMethod(CmdMain.java:101)
at arq.cmdline.CmdMain.mainRun(CmdMain.java:63)
at arq.cmdline.CmdMain.mainRun(CmdMain.java:50)
at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.main(CmdNodeTableBuilder.java:79)
Any idea what a good value for -Xmx for 1B+ triples would be?
I will try with 16384 now.
Regards,
Michael Brunnbauer
Re: tdbdump Exception
Posted by Andy Seaborne <an...@apache.org>.
On 30/07/12 13:33, Michael Brunnbauer wrote:
>
> Hello Andy,
>
> On Sun, Jul 29, 2012 at 05:22:57PM +0100, Andy Seaborne wrote:
>> How big are the node* files (node2id.dat, .idn, nodes.dat) in the
>> resulting database in this case?
>
> node2id.dat 9470738432 bytes
9,470,738,432 => 9G
> node2id.idn 50331648 bytes
50,331,648 => 50M
Much less than RAM size.
> nodes.dat 20182577027 bytes
This file is written sequentially and isn't read during loading so
should not be an issue.
In 64 bit mode, the B+Tree node2id is a memory mapped file and the OS
takes care of paging+caching the data.
I think that use of
JVM_ARGS="-Xmx32768M -server"
is in fact making things worse: the heap grows to 32G, reducing the
space available to the OS for mmap files. So it is squeezing out the OS
managed mmap files and the result is that there is little real RAM
devoted to caching the node table.
2G heap should be enough IIRC (caveat long literals).
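The heap-versus-mmap point can be seen with plain java.nio: a memory-mapped buffer is backed by the OS page cache, not the Java heap, so it does not count against -Xmx, and an oversized heap only shrinks the RAM the OS has left for caching files like node2id.dat. A minimal sketch of the mechanism:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

// Memory-mapped file I/O of the kind TDB uses for node2id.dat in
// 64-bit mode. The mapped region lives in the OS page cache, NOT the
// Java heap, so it is invisible to -Xmx.
public class MmapDemo {
    static long roundTrip(long value) {
        try {
            Path p = Files.createTempFile("mmap-demo", ".dat");
            try (RandomAccessFile raf = new RandomAccessFile(p.toFile(), "rw");
                 FileChannel ch = raf.getChannel()) {
                MappedByteBuffer buf =
                    ch.map(FileChannel.MapMode.READ_WRITE, 0, 8);
                buf.putLong(0, value);   // write through the mapping
                return buf.getLong(0);   // read it back
            } finally {
                Files.deleteIfExists(p);
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(0x1234L)); // 4660
    }
}
```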
Andy
>
> Regards,
>
> Michael Brunnbauer
>
Re: tdbdump Exception
Posted by Michael Brunnbauer <br...@netestate.de>.
Hello Andy,
On Sun, Jul 29, 2012 at 05:22:57PM +0100, Andy Seaborne wrote:
> How big are the node* files (node2id.dat, .idn, nodes.dat) in the
> resulting database in this case?
node2id.dat 9470738432 bytes
node2id.idn 50331648 bytes
nodes.dat 20182577027 bytes
Regards,
Michael Brunnbauer
Re: tdbdump Exception
Posted by Andy Seaborne <an...@apache.org>.
On 24/07/12 12:24, Michael Brunnbauer wrote:
>
> Hello Andy,
>
> On Tue, Jul 24, 2012 at 01:13:59PM +0200, Michael Brunnbauer wrote:
>> BTW: Here is some output from tdbloader2 for this TDB which shows that
>> the tdbloader2 data phase runtime gets quite non-linear for very big datasets.
>> I called tdbloader2 with JVM_ARGS="-Xmx32768M -server" and it did not seem to
>> run into memory problems.
>
> I should be more specific here: Whenever I watched it after 10^9 quads it was
> doing disk IO (I think mostly writes, probably to node2id.dat and nodes.dat).
> Would it be possible to generate node2id.dat and nodes.dat without random
> access?
(see also tdbloader4)
Yes - it looks like the node file, part of which is a B+Tree of hash
(128 bits) to NodeId. This is used to see if the node has already been
encountered. There is a cache; maybe this needs to be greatly increased
in size, or a more explicit in-memory structure fronting the node table
is needed for bulk loading. At query time, this isn't such an important lookup.
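The lookup path described here can be sketched roughly as follows. MD5 stands in for the 128-bit hash and a HashMap stands in for the on-disk B+Tree, so this only illustrates the shape of the getAllocateNodeId path, not TDB's actual code; the point is that every new node costs a tree probe, which becomes random disk I/O once the tree outgrows RAM.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Sketch of node -> NodeId lookup during bulk load: hash the RDF term
// (128 bits), check an in-memory cache, then fall back to the on-disk
// B+Tree (node2id.dat), allocating a new id if the node is unseen.
public class NodeIdLookup {
    private final Map<String, Long> cache = new HashMap<>(); // in-memory tier
    private final Map<String, Long> btree = new HashMap<>(); // stands in for node2id.dat
    private long nextId = 0;

    static String hash128(String term) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(term.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : d) sb.append(String.format("%02x", b));
            return sb.toString(); // 128 bits as 32 hex chars
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }

    long getAllocateNodeId(String term) {
        String h = hash128(term);
        Long id = cache.get(h);
        if (id != null) return id;   // cache hit: cheap
        id = btree.get(h);           // miss: B+Tree probe (disk I/O in TDB)
        if (id == null) {
            id = nextId++;           // unseen node: allocate and record
            btree.put(h, id);
        }
        cache.put(h, id);
        return id;
    }
}
```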
How big are the node* files (node2id.dat, .idn, nodes.dat) in the
resulting database in this case?
Andy
> Regards,
>
> Michael Brunnbauer
>
Re: tdbdump Exception
Posted by Michael Brunnbauer <br...@netestate.de>.
Hello Andy,
On Tue, Jul 24, 2012 at 01:13:59PM +0200, Michael Brunnbauer wrote:
> BTW: Here is some output from tdbloader2 for this TDB which shows that
> the tdbloader2 data phase runtime gets quite non-linear for very big datasets.
> I called tdbloader2 with JVM_ARGS="-Xmx32768M -server" and it did not seem to
> run into memory problems.
I should be more specific here: Whenever I watched it after 10^9 quads it was
doing disk IO (I think mostly writes, probably to node2id.dat and nodes.dat).
Would it be possible to generate node2id.dat and nodes.dat without random
access?
Regards,
Michael Brunnbauer
Re: tdbdump Exception
Posted by Andy Seaborne <an...@apache.org>.
On 07/08/12 11:30, Michael Brunnbauer wrote:
>
> Hello Andy,
>
> On Sun, Jul 29, 2012 at 09:28:47PM +0100, Andy Seaborne wrote:
>>> (A) problem found : timezones with non-zero minutes.
>>> Recorded as JENA-287
>> Fixed - data on disk is not affected. It was a bug in reconstructing
>> the date time.
>
> Thank you again!
>
> Will there be a new Jena release soon ? Is it safe to replace
> lib/jena-tdb-0.9.2.jar in the Jena 2.7.2 distribution with a development
> snapshot (for the tools in bin/) ?
There is a release later this week. The build is done and approved, the
release manager will push it to maven and the mirrors soon.
This does not include the stats.opt fix. That is in the next round of
snapshots - make sure it's dated today or later e.g.
jena-fuseki-0.2.5-20120807.122151-4-distribution.zip
The best way of working is to take a consistent build (apache-jena or
jena-fuseki), e.g. all one SNAPSHOT, in case there are any cross-module
changes (unusual but possible).
If you want to stick to formal releases, then it would be best to
upgrade to Jena 2.7.3 and then add in jena-tdb-0.9.4-SNAPSHOT -- or use
jena-fuseki-SNAPSHOT.jar, which can be used as a single jar of all of Jena
and its dependencies like Xerces and slf4j. That's how I work on a newly
built remote machine with the minimum of setup - copy in Fuseki and use
e.g.
java -cp jena-fuseki.jar tdb.tdbquery ....
Andy
>
> Regards,
>
> Michael Brunnbauer
>
Re: tdbdump Exception
Posted by Michael Brunnbauer <br...@netestate.de>.
Hello Andy,
On Sun, Jul 29, 2012 at 09:28:47PM +0100, Andy Seaborne wrote:
> >(A) problem found : timezones with non-zero minutes.
> >Recorded as JENA-287
> Fixed - data on disk is not affected. It was a bug in reconstructing
> the date time.
Thank you again!
Will there be a new Jena release soon? Is it safe to replace
lib/jena-tdb-0.9.2.jar in the Jena 2.7.2 distribution with a development
snapshot (for the tools in bin/)?
Regards,
Michael Brunnbauer
Re: tdbdump Exception
Posted by Andy Seaborne <an...@apache.org>.
On 29/07/12 17:34, Andy Seaborne wrote:
> On 29/07/12 17:05, Andy Seaborne wrote:
>> I've put some debugging in so that the term being unpacked is printed
>> out.
>>
>> It looks like it is the timezone.
>>
>> Andy
>
> (A) problem found : timezones with non-zero minutes.
>
> Recorded as JENA-287
Fixed - data on disk is not affected. It was a bug in reconstructing
the date time.
It affects negative offsets with non-zero minutes.
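Newfoundland's UTC-03:30 is the classic negative offset with non-zero minutes. The standard javax.xml.datatype API reports timezone offsets in minutes, which makes this class of value easy to construct when checking a round trip (a generic sketch, not the JENA-287 test itself):

```java
import javax.xml.datatype.DatatypeConfigurationException;
import javax.xml.datatype.DatatypeFactory;

// An xsd:dateTime with a negative timezone offset whose minutes part
// is non-zero (Newfoundland, UTC-03:30) -- the shape of value that
// JENA-287 mis-reconstructed. getTimezone() returns minutes.
public class TzDemo {
    static int tzMinutes(String lexicalDateTime) {
        try {
            return DatatypeFactory.newInstance()
                    .newXMLGregorianCalendar(lexicalDateTime)
                    .getTimezone(); // offset in minutes from UTC
        } catch (DatatypeConfigurationException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(tzMinutes("2012-07-24T12:00:00-03:30")); // -210
    }
}
```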
Andy
No, TDB is not table-driven here, but according to the web there are
apparently 4 such timezones:
HNT = NST (Heure Normale de Terre-Neuve == Newfoundland Standard Time)
HAT = NDT (Heure Avancée de Terre-Neuve == Newfoundland Daylight Time)
HLV = VET (Hora Legal de Venezuela == Venezuelan Standard Time)
MART (Marquesas Time)
http://www.timeanddate.com/library/abbreviations/timezones/
Re: tdbdump Exception
Posted by Andy Seaborne <an...@apache.org>.
On 29/07/12 17:05, Andy Seaborne wrote:
> I've put some debugging in so that the term being unpacked is printed out.
>
> It looks like it is the timezone.
>
> Andy
(A) problem found : timezones with non-zero minutes.
Recorded as JENA-287
Andy
Re: tdbdump Exception
Posted by Andy Seaborne <an...@apache.org>.
I've put some debugging in so that the term being unpacked is printed out.
It looks like it is the timezone.
Andy