Posted to users@jena.apache.org by Michael Brunnbauer <br...@netestate.de> on 2012/07/24 13:13:59 UTC

Re: tdbdump Exception

Hello Andy,

On Thu, Jun 14, 2012 at 01:12:25PM +0100, Andy Seaborne wrote:
> >I guess it would be a good idea to look at the end of the dump and check 
> >the
> >corresponding named graph for bad datetimes ?
> 
> Yes - my best guess at the moment is that a dateTime can get in (they 
> are encoded into 56 bits, not recorded using the lexical form) but there 
> was a problem on the recreation of the lexical form.  Whether the 
> encoding or decoding is wrong, I can't tell.
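
For illustration of what value encoding means here: the dateTime fields get
packed into bit ranges of a single word, and the lexical form is rebuilt on
the way out. The sketch below shows only the general idea - the class name and
field widths are invented and are not TDB's actual 56-bit layout:

    // Illustrative only: pack dateTime fields into bit ranges of a long and
    // rebuild the lexical form. Field widths are made up for this sketch.
    public class PackedDateTime {

        static long pack(int year, int month, int day,
                         int hour, int min, int sec, int tzQuarterHours) {
            long v = 0;
            v |= (long) (tzQuarterHours + 56);   // bias -14:00..+14:00 into 0..112
            v |= (long) sec   << 7;
            v |= (long) min   << 13;
            v |= (long) hour  << 19;
            v |= (long) day   << 24;
            v |= (long) month << 29;
            v |= (long) year  << 33;
            return v;
        }

        static String unpack(long v) {
            int tzQ   = (int) (v & 0x7F) - 56;   // back to signed quarter-hours
            int sec   = (int) (v >>> 7)  & 0x3F;
            int min   = (int) (v >>> 13) & 0x3F;
            int hour  = (int) (v >>> 19) & 0x1F;
            int day   = (int) (v >>> 24) & 0x1F;
            int month = (int) (v >>> 29) & 0x0F;
            int year  = (int) (v >>> 33);
            int tzMin = Math.abs(tzQ) * 15;
            return String.format("%04d-%02d-%02dT%02d:%02d:%02d%s%02d:%02d",
                    year, month, day, hour, min, sec,
                    tzQ < 0 ? "-" : "+", tzMin / 60, tzMin % 60);
        }

        public static void main(String[] args) {
            long v = pack(2012, 7, 24, 13, 13, 59, 8);   // +02:00 = 8 quarter-hours
            System.out.println(unpack(v));               // 2012-07-24T13:13:59+02:00
        }
    }

If any field comes back out of its expected range when decoding - for example a
negative value where an unsigned one is assumed - formatting it can fail in the
way the stack trace below suggests.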

I was not able to find the named graph causing the problem so I recreated the
TDB with tdbloader2 from apache-jena-2.7.2 and tried tdbdump from
apache-jena-2.7.2 immediately after that. The result is that I seem to run
into the same problem:

Exception in thread "main" org.openjena.atlas.AtlasException: formatInt: overflow
	at org.openjena.atlas.lib.NumberUtils.formatUnsignedInt(NumberUtils.java:115)
	at org.openjena.atlas.lib.NumberUtils.formatInt(NumberUtils.java:87)
	at org.openjena.atlas.lib.NumberUtils.formatInt(NumberUtils.java:60)
	at com.hp.hpl.jena.tdb.store.DateTimeNode.unpack(DateTimeNode.java:255)
	at com.hp.hpl.jena.tdb.store.DateTimeNode.unpackDateTime(DateTimeNode.java:180)
	at com.hp.hpl.jena.tdb.store.NodeId.extract(NodeId.java:313)
	at com.hp.hpl.jena.tdb.nodetable.NodeTableInline.getNodeForNodeId(NodeTableInline.java:64)
	at com.hp.hpl.jena.tdb.lib.TupleLib.quad(TupleLib.java:163)
	at com.hp.hpl.jena.tdb.lib.TupleLib.quad(TupleLib.java:155)
	at com.hp.hpl.jena.tdb.lib.TupleLib.access$100(TupleLib.java:45)
	at com.hp.hpl.jena.tdb.lib.TupleLib$4.convert(TupleLib.java:89)
	at com.hp.hpl.jena.tdb.lib.TupleLib$4.convert(TupleLib.java:85)
	at org.openjena.atlas.iterator.Iter$4.next(Iter.java:301)
	at org.openjena.atlas.iterator.IteratorCons.next(IteratorCons.java:94)
	at org.openjena.atlas.iterator.Iter.sendToSink(Iter.java:560)
	at org.openjena.riot.out.NQuadsWriter.write(NQuadsWriter.java:45)
	at org.openjena.riot.out.NQuadsWriter.write(NQuadsWriter.java:37)
	at org.openjena.riot.RiotWriter.writeNQuads(RiotWriter.java:41)
	at tdb.tdbdump.exec(tdbdump.java:49)
	at arq.cmdline.CmdMain.mainMethod(CmdMain.java:101)
	at arq.cmdline.CmdMain.mainRun(CmdMain.java:63)
	at arq.cmdline.CmdMain.mainRun(CmdMain.java:50)
	at tdb.tdbdump.main(tdbdump.java:31)

This seems to be a serious issue.

BTW: Here is some output from tdbloader2 for this TDB which shows that 
the tdbloader2 data phase runtime gets quite non-linear for very big datasets.
I called tdbloader2 with JVM_ARGS="-Xmx32768M -server" and it did not seem to
run into memory problems.

 12:39:17 -- TDB Bulk Loader Start
 12:39:17 Data phase
...
INFO  Add: 100,000,000 Data (Batch: 68,027 / Avg: 57,649)
...
INFO  Add: 500,000,000 Data (Batch: 55,309 / Avg: 41,446)
...
INFO  Add: 1,000,000,000 Data (Batch: 27,901 / Avg: 24,119)
...
INFO  Add: 1,100,000,000 Data (Batch: 335 / Avg: 6,308)
...
INFO  Add: 1,138,800,000 Data (Batch: 256 / Avg: 5,038)
...
INFO  Total: 1,138,845,529 tuples : 227,654.44 seconds : 5,002.52 tuples/sec [2012/07/22 03:53:36 CEST]
...
 20:24:24 -- TDB Bulk Loader Finish
 20:24:24 -- 373477 seconds

Regards,

Michael Brunnbauer

-- 
++  Michael Brunnbauer
++  netEstate GmbH
++  Geisenhausener Straße 11a
++  81379 München
++  Tel +49 89 32 19 77 80
++  Fax +49 89 32 19 77 89 
++  E-Mail brunni@netestate.de
++  http://www.netestate.de/
++
++  Sitz: München, HRB Nr.142452 (Handelsregister B München)
++  USt-IdNr. DE221033342
++  Geschäftsführer: Michael Brunnbauer, Franz Brunnbauer
++  Prokurist: Dipl. Kfm. (Univ.) Markus Hendel

Re: tdbloader2 performance for 1B+ triples

Posted by Andy Seaborne <an...@apache.org>.
On 10/08/12 18:40, Michael Brunnbauer wrote:

> It's an nquads file with 4.9 million named graphs

OK - that's a difference - may be related.  I'll take a look although I 
need to create some test data first.

If it is this, then a smaller heap and more manageable size of data is 
something I can do.


> - data we crawled from the web
> for foaf-search.net with the file url as graph. Many foaf profiles with blank
> nodes but also other data that uses foaf:name. There are also other nquad
> files with DBpedia quads but these should be processed much later.
>
>> Many long literals? (might explain why the default setting was not enough)
>
> I don't know. I can check with a SPARQL query on an older TDB version
> of the data if you want.

It would be useful to know, because the growing node table cache might 
(this is a guess) be squeezing the rest of the system.  It's count-based, so 
large literals mean the cache takes more bytes.
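
To make "count based" concrete, here is a minimal sketch (not the actual TDB
cache class) of a cache bounded by entry count: its heap use is the entry
limit multiplied by the size of whatever is cached, so the same limit costs
far more memory when the cached terms are long literals.

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Sketch of a count-bounded (not byte-bounded) LRU cache.
    public class CountBoundedCache<K, V> extends LinkedHashMap<K, V> {
        private final int maxEntries;

        public CountBoundedCache(int maxEntries) {
            super(16, 0.75f, true);          // access-order, i.e. LRU
            this.maxEntries = maxEntries;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > maxEntries;      // evict by entry count, not by bytes
        }
    }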

	Andy

>
> Regards,
>
> Michael Brunnbauer
>


Re: tdbloader2 performance for 1B+ triples

Posted by Michael Brunnbauer <br...@netestate.de>.
Hello Andy,

On Fri, Aug 10, 2012 at 01:57:39PM +0100, Andy Seaborne wrote:
> Previously, with 32G heap:

No, with a 2048M heap.

> What is the data like?  The data shape should only affect the building 
> of the node table.

It's an nquads file with 4.9 million named graphs - data we crawled from the web
for foaf-search.net, with the file URL as the graph name. Many FOAF profiles with 
blank nodes, but also other data that uses foaf:name. There are also other nquad 
files with DBpedia quads, but these should be processed much later.

> Many long literals? (might explain why the default setting was not enough)

I don't know. I can check with a SPARQL query on an older TDB version
of the data if you want.

Regards,

Michael Brunnbauer

-- 
++  Michael Brunnbauer
++  netEstate GmbH
++  Geisenhausener Straße 11a
++  81379 München
++  Tel +49 89 32 19 77 80
++  Fax +49 89 32 19 77 89 
++  E-Mail brunni@netestate.de
++  http://www.netestate.de/
++
++  Sitz: München, HRB Nr.142452 (Handelsregister B München)
++  USt-IdNr. DE221033342
++  Geschäftsführer: Michael Brunnbauer, Franz Brunnbauer
++  Prokurist: Dipl. Kfm. (Univ.) Markus Hendel

Re: tdbloader2 performance for 1B+ triples

Posted by Andy Seaborne <an...@apache.org>.
On 10/08/12 06:34, Michael Brunnbauer wrote:
>
> Hello Andy,
>
> [tdbloader2]
>
> On Thu, Aug 09, 2012 at 06:53:59PM +0200, Michael Brunnbauer wrote:
>> INFO  Add: 55,550,000 Data (Batch: 52 / Avg: 8,785)
>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>> Any idea what a good value for -Xmx for 1B+ triples would be ?
>> I will try with 16384 now.
>
> -Xmx16384M throws the memory error after 478 mio triples:
>
> INFO  Add: 478,600,000 Data (Batch: 247 / Avg: 13,627)
> Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded

478 million / 16G heap

This is bizarre.

Previously, with 32G heap:
> INFO  Add: 55,500,000 Data (Batch: 98 / Avg: 10,335)
> INFO    Elapsed: 5,369.59 seconds [2012/08/09 17:45:44 CEST]
> INFO  Add: 55,550,000 Data (Batch: 52 / Avg: 8,785)
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

which is 55 million - a lot less than the 478 million you reached after decreasing the heap size.

This morning, I have loaded (this is the end of the data phase, 
reformatted):

11:46:38 INFO  loader               ::
    Add: 747,400,000 Data (Batch: 187,969 / Avg: 131,924)
11:46:43 INFO  loader               ::
    Total: 747,436,151 tuples : 5,669.75 seconds :
    131,828.81 tuples/sec [2012/08/10 11:46:43 UTC]

with no change to tdbloader2 other than fixing the classpath setting bug, so 
it's -Xmx1200M.

The machine is a 34G machine in Amazon - I even forgot to halt the large 
dataset it is hosting, but it's not public yet and only the odd developer 
is testing against it.


What is the data like?  The data shape should only affect the building 
of the node table.

Many long literals? (might explain why the default setting was not enough)

but that does not explain why decreasing the heap size means it gets 
further.

Unrelated:

I have noticed the parameters to sort(1) could be a lot better ...

e.g.
--buffer-size=50%  --parallel=3

I'll try that out but you're crashing out in the data phase before index 
creation.

	Andy

>
> Regards,
>
> Michael Brunnbauer
>


Re: tdbdump Exception

Posted by Michael Brunnbauer <br...@netestate.de>.
Hello Andy,

[tdbloader2]

On Thu, Aug 09, 2012 at 06:53:59PM +0200, Michael Brunnbauer wrote:
> INFO  Add: 55,550,000 Data (Batch: 52 / Avg: 8,785)
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> Any idea what a good value for -Xmx for 1B+ triples would be ?
> I will try with 16384 now.

-Xmx16384M throws the memory error after 478 million triples:

INFO  Add: 478,600,000 Data (Batch: 247 / Avg: 13,627)
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded

Regards,

Michael Brunnbauer

-- 
++  Michael Brunnbauer
++  netEstate GmbH
++  Geisenhausener Straße 11a
++  81379 München
++  Tel +49 89 32 19 77 80
++  Fax +49 89 32 19 77 89 
++  E-Mail brunni@netestate.de
++  http://www.netestate.de/
++
++  Sitz: München, HRB Nr.142452 (Handelsregister B München)
++  USt-IdNr. DE221033342
++  Geschäftsführer: Michael Brunnbauer, Franz Brunnbauer
++  Prokurist: Dipl. Kfm. (Univ.) Markus Hendel

Re: tdbdump Exception

Posted by Michael Brunnbauer <br...@netestate.de>.
Hello Andy,

[tdbloader2 performance for 1B+ triples]

On Mon, Jul 30, 2012 at 05:06:55PM +0100, Andy Seaborne wrote:
> >>How big are the node* files (node2id.dat, .idn, nodes.dat) in the
> >>resulting database in this case?
> >
> >node2id.dat 9470738432 bytes
> 
> 9,470,738,432 => 9G
> 
> >node2id.idn 50331648 bytes
> 
> 50,331,648 => 50M
> 
> Much less than RAM size.
> 
> >nodes.dat 20182577027 bytes
> 
> This file is written sequentially and isn't read during loading so 
> should not be an issue.
> 
> In 64 bit mode, the B+Tree node2id is a memory mapped file and the OS 
> takes care of paging+caching the data.
> 
> I think that use of
> 
>   JVM_ARGS="-Xmx32768M -server"
> 
> is in fact making things worse: the heap grows to 32G, reducing the 
> space available to the OS for mmap files.  So it is squeezing out the OS 
> managed mmap files and the result is that there is little real RAM 
> devoted to caching the node table.
> 
> 2G heap should be enough IIRC (caveat long literals).

The -Xmx32768M is not there without reason. I've had out of memory errors with 
much higher values and earlier Jena versions. I tried JVM_ARGS="-Xmx2048M"
with tdbloader2 from apache-jena-2.7.3 and the error came after 55 million triples:

INFO  Add: 55,300,000 Data (Batch: 281 / Avg: 13,794)
INFO  Add: 55,350,000 Data (Batch: 227 / Avg: 13,088)
INFO  Add: 55,400,000 Data (Batch: 192 / Avg: 12,342)
INFO  Add: 55,450,000 Data (Batch: 134 / Avg: 11,406)
INFO  Add: 55,500,000 Data (Batch: 98 / Avg: 10,335)
INFO    Elapsed: 5,369.59 seconds [2012/08/09 17:45:44 CEST]
INFO  Add: 55,550,000 Data (Batch: 52 / Avg: 8,785)
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2882)
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
        at java.lang.StringBuilder.append(StringBuilder.java:119)
        at com.hp.hpl.jena.tdb.lib.NodeLib.hash(NodeLib.java:160)
        at com.hp.hpl.jena.tdb.lib.NodeLib.setHash(NodeLib.java:116)
        at com.hp.hpl.jena.tdb.nodetable.NodeTableNative.accessIndex(NodeTableNative.java:124)
        at com.hp.hpl.jena.tdb.nodetable.NodeTableNative._idForNode(NodeTableNative.java:117)
        at com.hp.hpl.jena.tdb.nodetable.NodeTableNative.getAllocateNodeId(NodeTableNative.java:83)
        at com.hp.hpl.jena.tdb.nodetable.NodeTableCache._idForNode(NodeTableCache.java:123)
        at com.hp.hpl.jena.tdb.nodetable.NodeTableCache.getAllocateNodeId(NodeTableCache.java:83)
        at com.hp.hpl.jena.tdb.nodetable.NodeTableWrapper.getAllocateNodeId(NodeTableWrapper.java:43)
        at com.hp.hpl.jena.tdb.nodetable.NodeTableInline.getAllocateNodeId(NodeTableInline.java:51)
        at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder$NodeTableBuilder.send(CmdNodeTableBuilder.java:223)
        at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder$NodeTableBuilder.send(CmdNodeTableBuilder.java:190)
        at org.openjena.riot.lang.LangNTuple.runParser(LangNTuple.java:71)
        at org.openjena.riot.lang.LangBase.parse(LangBase.java:43)
        at org.openjena.riot.RiotLoader.readQuads(RiotLoader.java:206)
        at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.exec(CmdNodeTableBuilder.java:168)
        at arq.cmdline.CmdMain.mainMethod(CmdMain.java:101)
        at arq.cmdline.CmdMain.mainRun(CmdMain.java:63)
        at arq.cmdline.CmdMain.mainRun(CmdMain.java:50)
        at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.main(CmdNodeTableBuilder.java:79)

Any idea what a good value for -Xmx for 1B+ triples would be?

I will try with 16384 now.

Regards,

Michael Brunnbauer

-- 
++  Michael Brunnbauer
++  netEstate GmbH
++  Geisenhausener Straße 11a
++  81379 München
++  Tel +49 89 32 19 77 80
++  Fax +49 89 32 19 77 89 
++  E-Mail brunni@netestate.de
++  http://www.netestate.de/
++
++  Sitz: München, HRB Nr.142452 (Handelsregister B München)
++  USt-IdNr. DE221033342
++  Geschäftsführer: Michael Brunnbauer, Franz Brunnbauer
++  Prokurist: Dipl. Kfm. (Univ.) Markus Hendel

Re: tdbdump Exception

Posted by Andy Seaborne <an...@apache.org>.
On 30/07/12 13:33, Michael Brunnbauer wrote:
>
> Hello Andy,
>
> On Sun, Jul 29, 2012 at 05:22:57PM +0100, Andy Seaborne wrote:
>> How big are the node* files (node2id.dat, .idn, nodes.dat) in the
>> resulting database in this case?
>
> node2id.dat 9470738432 bytes

9,470,738,432 => 9G

> node2id.idn 50331648 bytes

50,331,648 => 50M

Much less than RAM size.

> nodes.dat 20182577027 bytes

This file is written sequentially and isn't read during loading so 
should not be an issue.

In 64 bit mode, the B+Tree node2id is a memory mapped file and the OS 
takes care of paging+caching the data.

I think that use of

   JVM_ARGS="-Xmx32768M -server"

is in fact making things worse: the heap grows to 32G, reducing the 
space available to the OS for mmap files.  So it is squeezing out the OS 
managed mmap files and the result is that there is little real RAM 
devoted to caching the node table.

2G heap should be enough IIRC (caveat long literals).
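
(Aside, to illustrate why -Xmx matters less for the indexes than for the
heap-based caches: a memory-mapped file is served from the OS page cache,
outside the Java heap. A minimal sketch, mapping the first few megabytes of a
file such as node2id.dat:)

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    // Minimal sketch: a memory-mapped file is backed by the OS page cache,
    // not by the Java heap, so -Xmx does not govern how much of it stays in RAM.
    public class MmapExample {
        public static void main(String[] args) throws Exception {
            RandomAccessFile raf = new RandomAccessFile("node2id.dat", "r");
            FileChannel ch = raf.getChannel();
            long len = Math.min(ch.size(), 8 * 1024 * 1024);   // map the first 8MB
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, len);
            System.out.println("mapped " + buf.capacity() + " bytes outside the heap");
            ch.close();
            raf.close();
        }
    }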

	Andy

>
> Regards,
>
> Michael Brunnbauer
>


Re: tdbdump Exception

Posted by Michael Brunnbauer <br...@netestate.de>.
Hello Andy,

On Sun, Jul 29, 2012 at 05:22:57PM +0100, Andy Seaborne wrote:
> How big are the node* files (node2id.dat, .idn, nodes.dat) in the 
> resulting database in this case?

node2id.dat 9470738432 bytes
node2id.idn 50331648 bytes
nodes.dat 20182577027 bytes

Regards,

Michael Brunnbauer

-- 
++  Michael Brunnbauer
++  netEstate GmbH
++  Geisenhausener Straße 11a
++  81379 München
++  Tel +49 89 32 19 77 80
++  Fax +49 89 32 19 77 89 
++  E-Mail brunni@netestate.de
++  http://www.netestate.de/
++
++  Sitz: München, HRB Nr.142452 (Handelsregister B München)
++  USt-IdNr. DE221033342
++  Geschäftsführer: Michael Brunnbauer, Franz Brunnbauer
++  Prokurist: Dipl. Kfm. (Univ.) Markus Hendel

Re: tdbdump Exception

Posted by Andy Seaborne <an...@apache.org>.
On 24/07/12 12:24, Michael Brunnbauer wrote:
>
> Hello Andy,
>
> On Tue, Jul 24, 2012 at 01:13:59PM +0200, Michael Brunnbauer wrote:
>> BTW: Here is some output from tdbloader2 for this TDB which shows that
>> the tdbloader2 data phase runtime gets quite non-linear for very big datasets.
>> I called tdbloader2 with JVM_ARGS="-Xmx32768M -server" and it did not seem to
>> run into memory problems.
>
> I should be more specific here: Whenever I watched it after 10^9 quads it was
> doing disk IO (i think mostly writes, probably to node2id.dat and nodes.dat).
> Would it be possible to generate node2id.dat and nodes.dat without random
> access ?

(see also tdbloader4)

Yes - it looks like it is the node table: part of it is a B+Tree of hash 
(128 bits) to NodeId, which is used to see if a node has already been 
encountered.  There is a cache - maybe it needs to be greatly increased in 
size, or a more explicit in-memory structure needs to front the node table 
for bulk loading.  At query time, this isn't such an important lookup.
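
A rough sketch of the "in-memory structure fronting the node table" idea - the
interface and class names below are invented for illustration and are not TDB
classes:

    import java.util.HashMap;
    import java.util.Map;

    // Sketch: keep already-seen terms in memory during bulk load so that only
    // the first occurrence of a term has to hit the on-disk B+Tree.
    public class BulkLoadNodeTable {

        // Stand-in for the persistent hash(128 bit) -> NodeId B+Tree.
        public interface PersistentNodeTable {
            long getOrAllocateNodeId(String nodeLexicalForm);
        }

        private final PersistentNodeTable disk;
        private final Map<String, Long> seen = new HashMap<String, Long>();

        public BulkLoadNodeTable(PersistentNodeTable disk) {
            this.disk = disk;
        }

        public long getAllocateNodeId(String nodeLexicalForm) {
            Long id = seen.get(nodeLexicalForm);
            if (id == null) {
                id = disk.getOrAllocateNodeId(nodeLexicalForm);
                seen.put(nodeLexicalForm, id);
            }
            return id;
        }
    }

The catch, of course, is that for 10^9+ distinct nodes such a map is itself
many gigabytes, which is the same heap-versus-page-cache trade-off discussed
elsewhere in this thread.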

How big are the node* files (node2id.dat, .idn, nodes.dat) in the 
resulting database in this case?

	Andy

> Regards,
>
> Michael Brunnbauer
>


Re: tdbdump Exception

Posted by Michael Brunnbauer <br...@netestate.de>.
Hello Andy,

On Tue, Jul 24, 2012 at 01:13:59PM +0200, Michael Brunnbauer wrote:
> BTW: Here is some output from tdbloader2 for this TDB which shows that 
> the tdbloader2 data phase runtime gets quite non-linear for very big datasets.
> I called tdbloader2 with JVM_ARGS="-Xmx32768M -server" and it did not seem to
> run into memory problems.

I should be more specific here: Whenever I watched it after 10^9 quads it was
doing disk IO (I think mostly writes, probably to node2id.dat and nodes.dat).
Would it be possible to generate node2id.dat and nodes.dat without random 
access?

Regards,

Michael Brunnbauer

-- 
++  Michael Brunnbauer
++  netEstate GmbH
++  Geisenhausener Straße 11a
++  81379 München
++  Tel +49 89 32 19 77 80
++  Fax +49 89 32 19 77 89 
++  E-Mail brunni@netestate.de
++  http://www.netestate.de/
++
++  Sitz: München, HRB Nr.142452 (Handelsregister B München)
++  USt-IdNr. DE221033342
++  Geschäftsführer: Michael Brunnbauer, Franz Brunnbauer
++  Prokurist: Dipl. Kfm. (Univ.) Markus Hendel

Re: tdbdump Exception

Posted by Andy Seaborne <an...@apache.org>.
On 07/08/12 11:30, Michael Brunnbauer wrote:
>
> Hello Andy,
>
> On Sun, Jul 29, 2012 at 09:28:47PM +0100, Andy Seaborne wrote:
>>> (A) problem found : timezones with non-zero minutes.
>>> Recorded as JENA-287
>> Fixed - data on disk is not affected.  It was a bug in reconstructing
>> the date time.
>
> Thank you again!
>
> Will there be a new Jena release soon ? Is it safe to replace
> lib/jena-tdb-0.9.2.jar in the Jena 2.7.2 distribution with a development
> snapshot (for the tools in bin/) ?

There is a release later this week.  The build is done and approved; the 
release manager will push it to Maven and the mirrors soon.

This does not include the stats.opt fix.  That is in the next round of 
snapshots - make sure it's dated today or later e.g.

jena-fuseki-0.2.5-20120807.122151-4-distribution.zip

The best way of working is to take a consistent build (apache-jena or 
jena-fuseki), e.g. all from one SNAPSHOT, in case there are any cross-module 
changes (unusual but possible).

If you want to stick to formal releases, then it would be best to 
upgrade to Jena 2.7.3 and then add in jena-tdb-0.9.4-SNAPSHOT -- or use 
jena-fuseki-SNAPSHOT.jar, which can be used as a single jar of all of Jena 
and its dependencies like xerces and slf4j.  That's how I work on a newly 
built remote machine with the minimum of setup - copy in Fuseki and use 
e.g.

    java -cp jena-fuseki.jar tdb.tdbquery ....

	Andy



>
> Regards,
>
> Michael Brunnbauer
>


Re: tdbdump Exception

Posted by Michael Brunnbauer <br...@netestate.de>.
Hello Andy,

On Sun, Jul 29, 2012 at 09:28:47PM +0100, Andy Seaborne wrote:
> >(A) problem found : timezones with non-zero minutes.
> >Recorded as JENA-287
> Fixed - data on disk is not affected.  It was a bug in reconstructing 
> the date time.

Thank you again!

Will there be a new Jena release soon? Is it safe to replace 
lib/jena-tdb-0.9.2.jar in the Jena 2.7.2 distribution with a development
snapshot (for the tools in bin/)?

Regards,

Michael Brunnbauer

-- 
++  Michael Brunnbauer
++  netEstate GmbH
++  Geisenhausener Straße 11a
++  81379 München
++  Tel +49 89 32 19 77 80
++  Fax +49 89 32 19 77 89 
++  E-Mail brunni@netestate.de
++  http://www.netestate.de/
++
++  Sitz: München, HRB Nr.142452 (Handelsregister B München)
++  USt-IdNr. DE221033342
++  Geschäftsführer: Michael Brunnbauer, Franz Brunnbauer
++  Prokurist: Dipl. Kfm. (Univ.) Markus Hendel

Re: tdbdump Exception

Posted by Andy Seaborne <an...@apache.org>.
On 29/07/12 17:34, Andy Seaborne wrote:
> On 29/07/12 17:05, Andy Seaborne wrote:
>> I've put some debugging in so that the term being unpacked is printed
>> out.
>>
>> It looks like it is the timezone.
>>
>>      Andy
>
> (A) problem found : timezones with non-zero minutes.
>
> Recorded as JENA-287

Fixed - data on disk is not affected.  It was a bug in reconstructing 
the date time.

It affects negative offsets with non-zero minutes.

	Andy

No, TDB is not table-driven here, but according to the web there are 
apparently four of them:

HNT = NST (Heure Normale de Terre-Neuve == Newfoundland Standard Time)

HAT = NDT (Heure Avancée de Terre-Neuve == Newfoundland Daylight  Time)

HLV = VET (Hora Legal de Venezuela == Venezuelan Standard Time)

MART (Marquesas Time)

http://www.timeanddate.com/library/abbreviations/timezones/
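
As a worked example of how such an offset can go wrong during decoding (an
illustration only, not necessarily the exact JENA-287 code path): Newfoundland
is -03:30, i.e. -210 minutes, and Java keeps the sign on both the division and
the remainder, so a naive split yields a negative minutes field that an
unsigned formatter will reject.

    // Illustration: splitting a negative offset with non-zero minutes.
    public class TzOffsetExample {
        public static void main(String[] args) {
            int offsetInMinutes = -(3 * 60 + 30);    // -03:30, Newfoundland (NST)

            int hours   = offsetInMinutes / 60;      // -3
            int minutes = offsetInMinutes % 60;      // -30 in Java, not +30

            // Formatting 'minutes' as an unsigned two-digit value would fail here;
            // the sign has to be stripped and applied to the offset as a whole.
            System.out.printf("%+03d:%02d%n", hours, Math.abs(minutes));  // -03:30
        }
    }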



Re: tdbdump Exception

Posted by Andy Seaborne <an...@apache.org>.
On 29/07/12 17:05, Andy Seaborne wrote:
> I've put some debugging in so that the term being unpacked is printed out.
>
> It looks like it is the timezone.
>
>      Andy

(A) problem found : timezones with non-zero minutes.

Recorded as JENA-287

	Andy



Re: tdbdump Exception

Posted by Andy Seaborne <an...@apache.org>.
I've put some debugging in so that the term being unpacked is printed out.

It looks like it is the timezone.

	Andy
