Posted to users@jena.apache.org by Andy Seaborne <an...@epimorphics.com> on 2011/03/02 15:38:51 UTC

Re: OutOfMemoryError while loading datagraphs

Hi Frank,

On 28/02/11 14:48, Frank Budinsky wrote:
>
> Hi Andy,
>
> I did some further analysis of my OutOfMemeoryError problem, and this is
> what I've discovered. The problem seems to be that there is one instance of
> class NodeTupleTableConcrete that contains an ever growing set of tuples
> which eventually uses up all the available heap space and then crashes.
>
> To be more specific, this field in class TupleTable:
>
>      private final TupleIndex[] indexes ;
>
> seems to contain 6 continually growing TupleIndexRecord instances
> (BPlusTrees). From my measurements, this seems to eat up approximately 1G
> of heap for every 1M triples in the Dataset (i.e., about 1K per datagraph).
> So, to load my 100K datagraphs (~10M total triples) it would seem to need
> 10G of heap space.

There are 6 indexes for named graphs (see the files GSPO etc).  TDB uses 
total indexing, which puts a lot of work at load time but means any 
lookup needed is always done with an index scan.  The code can run with 
fewer indexes - the minimum is one - but that is not exposed in the 
configuration.

Each index holds quads (4 NodeIds, a NodeId is 64 bits on disk).  As the 
index grows the data goes to disk.  There is a finite LRU cache in front 
of each index.

Does your dataset have a location?  If it has no location, it's all 
in-memory with a RAM-disk-like structure.  This is for small-scale 
testing only - it really does read and write blocks out of the RAM disk 
by copy, to give strict disk-like semantics.
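
For illustration, a minimal sketch of the two cases (a sketch, assuming 
TDB 0.8.x's TDBFactory methods; the directory name is a placeholder):

[[
import com.hp.hpl.jena.query.Dataset ;
import com.hp.hpl.jena.tdb.TDBFactory ;

// Disk-backed: indexes and the node table live in files in this directory.
Dataset onDisk = TDBFactory.createDataset("DB1") ;   // placeholder location

// No location: the RAM-disk-like structure - small-scale testing only.
Dataset inMemory = TDBFactory.createDataset() ;
]]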

There is also a NodeTable mapping between NodeId and Node (Jena's 
graph-level RDF Term class).  This has a cache in front of it.

The long-ish literals may be the problem.  The node table cache is 
fixed-number, not bounded by size.

The sizes of the caches are controlled by:

SystemTDB.Node2NodeIdCacheSize
SystemTDB.NodeId2NodeCacheSize

These are not easy to control: either (1) get the source code and 
alter the default values, or (2) see the code in SystemTDB that reads a 
properties file.
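
As a minimal sketch, you can at least print the compiled-in values to see 
what your build uses (assuming the TDB 0.8.x package name for SystemTDB):

[[
import com.hp.hpl.jena.tdb.sys.SystemTDB ;

public class ShowCacheSizes {
    public static void main(String[] args) {
        // The fixed (entry-count, not byte-bounded) node table cache sizes.
        System.out.println("Node2NodeIdCacheSize = " + SystemTDB.Node2NodeIdCacheSize) ;
        System.out.println("NodeId2NodeCacheSize = " + SystemTDB.NodeId2NodeCacheSize) ;
    }
}
]]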

If you can send me a copy of the data, I can try loading it here.

> Does this make sense? How is it supposed to work? Shouldn't the triples
> from previously loaded named graphs be eligible for GC when I'm loading the
> next named graph? Could it be that I'm holding onto something that's
> preventing GC in the TupleTable?
>
> Also, after looking more carefully at the resources being indexed, I
> noticed that many of them do have relatively large literals (100s of
> characters). I also noticed that when using Fuseki to load the resources I
> get lots of warning messages like this, on the console:
>
>     Lexical form 'We are currently doing
> this:<br></br><br></br>workspaceConnection.replaceComponents
> (replaceComponents, replaceSource, falses, false,
> monitor);<br></br><br></br>the new way of doing it would be something
> like:<br></br><br></br><br></br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
> ArrayList&lt;IComponentOp&gt; replaceOps = new
> ArrayList&lt;IComponentOp&gt;();<br></br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
>   for (Iterator iComponents = components.iterator(); iComponents.hasNext();)
> {<br></br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
>   IComponentHandle componentHandle = (IComponentHandle) iComponents.next
> ();<br></br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
>   replaceOps.add(promotionTargetConnection.componentOpFactory
> ().replaceComponent
> (componentHandle,<br></br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
>   buildWorkspaceConnection,
> false));<br></br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br></br><br></br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
>   promotionTargetConnection.applyComponentOperations(replaceOps, monitor);'
> not valid for datatype
> http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral
>
> Could this be part of the problem?

No - it's a different issue.  This is something coming from the parser.

RDF XMLLiterals have special rules - they must follow
exclusive canonical XML, which means, amongst a lot of other things, they 
have to be a single XML node.  The rules for exclusive Canonical XML are 
really quite strict (e.g. attributes in alphabetical order).

http://www.w3.org/TR/rdf-concepts/#section-XMLLiteral

If you want to store XML or HTML fragments, you can't use RDF 
XMLLiterals very easily - you have to mangle them to conform to the 
rules.  I suggest storing them either as plain strings or inventing your 
own datatype.
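
A minimal sketch of the plain-string route (the resource and property 
names here are placeholders, not from your data):

[[
import com.hp.hpl.jena.rdf.model.* ;

Model m = ModelFactory.createDefaultModel() ;
Resource r = m.createResource("http://example/workitem/1") ;         // placeholder
Property desc = m.createProperty("http://example/ns#description") ;  // placeholder

// A plain literal: the markup is just characters, so no canonical-XML
// rules apply.
r.addProperty(desc, "We are currently doing this:<br></br>...") ;
]]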

You can run the parser on its own using
"riotcmd.riot --validate FILE ..."


	Andy

>
> Thanks,
> Frank.

Re: Bad Jena TDB performance

Posted by Andy Seaborne <an...@epimorphics.com>.

On 25/03/11 18:25, Frank Budinsky wrote:
>
> Hi Andy and Brian,
>
> Thank you both for your speedy replies! Your input, combined, has enabled
> me to get my performance numbers into the right ballpark.
>
> When running Andy's version of the test, I noticed significantly better
> performance than I had previously been getting:
>
> Max mem: 1,820M
> DIRECT mode
> log4j:WARN No appenders could be found for logger
> (com.hp.hpl.jena.sparql.mgt.ARQMgt).
> log4j:WARN Please initialize the log4j system properly.
>> Starting test: Fri Mar 25 08:21:07 PDT 2011
>> Initial number of indexed graphs: 0
> 100 at: Fri Mar 25 08:21:11 PDT 2011
> 200 at: Fri Mar 25 08:21:13 PDT 2011
> 300 at: Fri Mar 25 08:21:15 PDT 2011
> 400 at: Fri Mar 25 08:21:17 PDT 2011
> 1200 at: Fri Mar 25 08:21:34 PDT 2011
> ...
> 99700 at: Fri Mar 25 08:55:41 PDT 2011
> 99800 at: Fri Mar 25 08:55:43 PDT 2011
> 99900 at: Fri Mar 25 08:55:45 PDT 2011
> 100000 at: Fri Mar 25 08:55:47 PDT 2011
>> Done at: Fri Mar 25 08:55:47 PDT 2011
> 100,000 graphs in 2,080.58 sec
>
> Although much better than the 4 hours I was seeing previously, it's still
> 3.5 times slower than Andy's numbers. Is there any chance the warning
> messages about log4j might affect performance?

No - I can't see any reason why it would make any difference.

Drop a log4j.properties file in the current directory and it should go 
away.  TDB, and all my Jena software, plays a game of hunt-the-config-file 
for "log4j.properties" that includes the current directory.

> Looking at the difference between Andy's and my original test, I see this
> call is the only difference:
>
>              SystemTDB.setFileMode(FileMode.direct) ;
>
> What exactly does that do, and should I always be configuring TDB this way?
>

It forces the file handling to be in the style of the 32 bit version, as
per your original setup. 64 bit uses memory mapped files (32 bit mmap I/O
is limited to 1.5G total file size, across all files).
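
A minimal sketch of where the call goes - it has to run before any dataset 
is created (import paths as in TDB 0.8.x; the directory is a placeholder):

[[
import com.hp.hpl.jena.query.Dataset ;
import com.hp.hpl.jena.tdb.TDBFactory ;
import com.hp.hpl.jena.tdb.base.block.FileMode ;
import com.hp.hpl.jena.tdb.sys.SystemTDB ;

SystemTDB.setFileMode(FileMode.direct) ;             // before creating any dataset
Dataset dataset = TDBFactory.createDataset("DB1") ;  // placeholder location
]]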

If it makes a difference, then the memory mapped I/O is slow on your
machine. This might be explained by the fact it isn't Windows Server. I
guess you are running some anti-virus as well - I have no idea how that
interacts with memory mapped file I/O but it may be bad.

> I tried running without this call and, as expected, it seems to be back to
> the 4 hour performance:
>
> Max mem: 1,820M

Make the heap size larger for FileMode.direct - maybe 2-3G; you have an 
8G machine.  It might help GC but, as you can see, I was using a 
1G heap.  The cache sizes are fixed currently.
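
For example (a sketch - the classpath stands in for your own setup):

[[
java -Xmx3000M -cp <classpath> reports.ReportOutOfMemoryManyGraphsTDB
]]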

>> Starting test: Fri Mar 25 09:00:50 PDT 2011
> log4j:WARN No appenders could be found for logger
> (com.hp.hpl.jena.tdb.info).
> log4j:WARN Please initialize the log4j system properly.
>> Initial number of indexed graphs: 0
> 100 at: Fri Mar 25 09:00:55 PDT 2011
> 200 at: Fri Mar 25 09:00:57 PDT 2011
> 300 at: Fri Mar 25 09:00:59 PDT 2011
> 400 at: Fri Mar 25 09:01:01 PDT 2011
> 500 at: Fri Mar 25 09:01:03 PDT 2011
> ...
> 39400 at: Fri Mar 25 09:51:23 PDT 2011
> 39500 at: Fri Mar 25 09:51:38 PDT 2011
> 39600 at: Fri Mar 25 09:51:52 PDT 2011
> 39700 at: Fri Mar 25 09:52:06 PDT 2011
> 39800 at: Fri Mar 25 09:52:21 PDT 2011
> ...
>> Done at: About 4 hours from start time
>
> One interesting difference between my slow running case and Brian's, is
> that mine always starts out pretty fast, but gradually slows down. Notice
> that it was taking about 2 seconds per 100 at the start, but it was up to
> 15 seconds at 39800. I didn't let it run to completion, this time, but I
> remember from a previous run that it was taking about 2 min per 100 at the
> end. Any idea why it might slow down over time, like this, when not using
> direct mode?

None - memory mapped I/O should be better, not worse.  (I haven't used 
it much on Windows, but the server/desktop versions do make a difference, 
as does AV.)

It's as if the machine is OS-swapping as the caches grow, which, given 
your configuration, really should not be happening.

>
> Anyway, the next thing I tried was to change the code as Brian suggested.
> That also had a big impact for me:
>
> Max mem: 1,820M
> DIRECT mode
> log4j:WARN No appenders could be found for logger
> (com.hp.hpl.jena.sparql.mgt.ARQMgt).
> log4j:WARN Please initialize the log4j system properly.
>> Starting test: Fri Mar 25 09:56:03 PDT 2011
>> Initial number of indexed graphs: 0
> 100 at: Fri Mar 25 09:56:06 PDT 2011
> 200 at: Fri Mar 25 09:56:07 PDT 2011
> 300 at: Fri Mar 25 09:56:09 PDT 2011
> 400 at: Fri Mar 25 09:56:10 PDT 2011
> 500 at: Fri Mar 25 09:56:11 PDT 2011
> ...
> 99700 at: Fri Mar 25 10:09:27 PDT 2011
> 99800 at: Fri Mar 25 10:09:27 PDT 2011
> 99900 at: Fri Mar 25 10:09:28 PDT 2011
> 100000 at: Fri Mar 25 10:09:29 PDT 2011
>> Done at: Fri Mar 25 10:09:35 PDT 2011
> 100,000 graphs in 812.30 sec

Brian's results are rather strange.  TDB does nothing special for 
model.add(Model), so my guess is that it is about timing/processing effects, 
or merely that the triples come out of the model in a "better" order 
than the streaming from the parser.

	Andy

>
> With that change, I'm now only about 30% slower than Andy's number. Maybe
> that's attributable to Windows vs Linux or the hardware differences. I'm
> running it on:
>
> Intel Xeon E5345 2.33 GHz (2 processors)
> 8 GB RAM
> 300 GB HDD
> Windows 7 Enterprise SP1
>
> Does anybody know how a Xeon 5345 should compare to i5 or i7 processors, or
> how much difference there might be between Linux vs Windows 7?
>
> Thanks again for your help.
>
> Frank

Re: Bad Jena TDB performance

Posted by Brian McBride <br...@epimorphics.com>.
On 27/03/2011 18:56, Brian McBride wrote:
> I haven't been able to track down the block manager code actually in 
> use as  I'm having trouble checking ARQ out of SVN, but Andy likely 
> knows off the top of his head whether this is plausible.
s/block manager/event manager/

Brian


Re: Bad Jena TDB performance

Posted by Brian McBride <br...@epimorphics.com>.

On 28/03/2011 09:17, Andy Seaborne wrote:
[...]
>>
>> If the TDB default event handler then did something (like sync to disk?)
>> that the memory model does not, this could explain the difference in
>> performance. I have put a profiler on the test program and it reports
>> that the test program is spending a lot more time in
>> BlockManagerFile.force() when it is reading directly in to TDB than when
>> it is going via a memory model. So there is some evidence that this is
>> what is happening.
>>
>> I haven't been able to track down the block manager code actually in use
>> as I'm having trouble checking ARQ out of SVN, but Andy likely knows off
>> the top of his head whether this is plausible.
>
> > s/block manager/event manager/
>
> Could be - the model.add(model) will go via the BulkUpdateHandler (I 
> think).  TDB's BulkUpdateHandler inherits from SimpleBulkUpdateHandler 
> for insertion.
Yes.  The event is not issued by the update handler but by ARP.
>
> Could you try putting a break point in dataset.sync and see what the 
> call stack is when it gets hit?  That'll tell you who is causing the 
> sync.
Done.  ARP issues the finishRead event.  This leads to 
com.hp.hpl.jena.tdb.graph.GraphSyncListener.finishRead() which does a 
sync.  Something is(was) attaching a GraphSyncListener to the event 
manager for TDB graphs.

>
> There used to be (up to v 0.8.9? not in the last snapshot build) a 
> sync wrapper that called sync() every n'000 triples added.
I think it is ARP issuing the finishRead event that is the trigger for 
the sync.
>   It's not in the development codebase.  All hidden implicit syncs 
> should now be removed.  They were causing problems for a user who was 
> tracking whether the DB on disk was dirty or not.
>
> Brian, Frank, which versions are you running?
I've been using the latest from the main Maven repository:

tdb: 0.8.9
arq: 2.8.7
jena: 2.6.4

I've checked with the latest from CVS/SVN taken today.   That does not 
do the sync call and is faster when the parser is reading directly into 
the TDB dataset.  So this issue is already fixed in the head version.
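
With the implicit syncs gone, callers sync explicitly.  A minimal sketch 
(assuming TDB's TDB.sync(Dataset) helper; loadOneGraph is a hypothetical 
stand-in for the per-graph work):

[[
import com.hp.hpl.jena.tdb.TDB ;

for (int i = 1 ; i <= NUM_GRAPHS ; i++) {
    loadOneGraph(dataset, i) ;                // hypothetical helper
    if ( i % 10000 == 0 ) TDB.sync(dataset) ; // optional periodic sync
}
TDB.sync(dataset) ;                           // one explicit sync at the end
]]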

Brian


Re: Bad Jena TDB performance

Posted by Brian McBride <br...@epimorphics.com>.

On 28/03/2011 09:17, Andy Seaborne wrote:

> > as I'm having trouble checking ARQ out of SVN,
>
> What sort of trouble?
Checkout failed part way through, twice.  Can't remember the error.  
Decided I'd try it again on another machine this morning before 
reporting the problem.   It checked out fine.  So it was either a temporary 
glitch or something to do with my home machine/network.

Brian


Re: Bad Jena TDB performance

Posted by Brian McBride <br...@epimorphics.com>.
Hi Frank,

On 28/03/2011 16:04, Frank Budinsky wrote:
> Brian, have you tried running it without setting direct mode? I'm wondering
> if there's a problem with memory mapped I/O on Windows?
I understand that TDB operates differently on 32 bit and 64 bit OS's.  I 
should have said earlier that I am running on a 64 bit version of 
Windows 7.   Frank, I am assuming you are running 32 bit rather than 64 
bit?  I'm afraid I don't have easy access to a 32 bit Windows OS to 
try things out on to get comparable results to yours.

With the 'current' release versions i.e.

tdb: 0.8.9
arq: 2.8.7
jena: 2.6.4

Running first without the load-through-a-memory-model trick:

Running without direct mode switched on:
[[
Max mem: 1,820M
 > Starting test: Mon Mar 28 16:56:44 BST 2011
  WARN [main] (SetupTDB.java:755) - No BGP optimizer
 > Initial number of indexed graphs: 0
100 at: Mon Mar 28 16:57:09 BST 2011
200 at: Mon Mar 28 16:57:36 BST 2011
300 at: Mon Mar 28 16:58:03 BST 2011
400 at: Mon Mar 28 16:58:30 BST 2011
...
]]

Running with direct mode switched on:
[[
Max mem: 1,820M
DIRECT mode
 > Starting test: Mon Mar 28 16:59:32 BST 2011
  WARN [main] (SetupTDB.java:755) - No BGP optimizer
 > Initial number of indexed graphs: 0
100 at: Mon Mar 28 17:00:12 BST 2011
200 at: Mon Mar 28 17:01:07 BST 2011
300 at: Mon Mar 28 17:02:00 BST 2011
400 at: Mon Mar 28 17:02:47 BST 2011
...
]]

So, on my 64 bit Windows 7 machine, turning DIRECT mode on seems to slow 
it down with the released versions.

Now running with the load through a memory model:

[[
100 at: Mon Mar 28 17:07:55 BST 2011
200 at: Mon Mar 28 17:07:56 BST 2011
300 at: Mon Mar 28 17:07:56 BST 2011
400 at: Mon Mar 28 17:07:57 BST 2011
500 at: Mon Mar 28 17:07:58 BST 2011
600 at: Mon Mar 28 17:07:58 BST 2011
700 at: Mon Mar 28 17:07:59 BST 2011
800 at: Mon Mar 28 17:07:59 BST 2011
900 at: Mon Mar 28 17:08:00 BST 2011
1000 at: Mon Mar 28 17:08:00 BST 2011
1100 at: Mon Mar 28 17:08:01 BST 2011
...
]]

[[
Max mem: 1,820M
DIRECT mode
 > Starting test: Mon Mar 28 17:05:46 BST 2011
  WARN [main] (SetupTDB.java:755) - No BGP optimizer
 > Initial number of indexed graphs: 0
100 at: Mon Mar 28 17:05:48 BST 2011
200 at: Mon Mar 28 17:05:49 BST 2011
300 at: Mon Mar 28 17:05:50 BST 2011
400 at: Mon Mar 28 17:05:51 BST 2011
500 at: Mon Mar 28 17:05:51 BST 2011
600 at: Mon Mar 28 17:05:52 BST 2011
700 at: Mon Mar 28 17:05:53 BST 2011
800 at: Mon Mar 28 17:05:53 BST 2011
...
]]

Not a lot of discernible difference.

Now with the latest development versions from SVN, loading directly into 
the TDB model:

[[
Max mem: 1,820M
 > Starting test: Mon Mar 28 17:10:23 BST 2011
 > Initial number of indexed graphs: 0
100 at: Mon Mar 28 17:10:26 BST 2011
200 at: Mon Mar 28 17:10:27 BST 2011
300 at: Mon Mar 28 17:10:28 BST 2011
400 at: Mon Mar 28 17:10:28 BST 2011
500 at: Mon Mar 28 17:10:29 BST 2011
600 at: Mon Mar 28 17:10:30 BST 2011
700 at: Mon Mar 28 17:10:30 BST 2011
800 at: Mon Mar 28 17:10:31 BST 2011
900 at: Mon Mar 28 17:10:31 BST 2011
1000 at: Mon Mar 28 17:10:32 BST 2011
1100 at: Mon Mar 28 17:10:32 BST 2011
1200 at: Mon Mar 28 17:10:33 BST 2011
1300 at: Mon Mar 28 17:10:33 BST 2011
1400 at: Mon Mar 28 17:10:34 BST 2011
1500 at: Mon Mar 28 17:10:34 BST 2011
...
]]

[[
100 at: Mon Mar 28 17:11:34 BST 2011
200 at: Mon Mar 28 17:11:35 BST 2011
300 at: Mon Mar 28 17:11:35 BST 2011
400 at: Mon Mar 28 17:11:36 BST 2011
500 at: Mon Mar 28 17:11:37 BST 2011
600 at: Mon Mar 28 17:11:37 BST 2011
700 at: Mon Mar 28 17:11:38 BST 2011
800 at: Mon Mar 28 17:11:39 BST 2011
900 at: Mon Mar 28 17:11:39 BST 2011
1000 at: Mon Mar 28 17:11:40 BST 2011
1100 at: Mon Mar 28 17:11:40 BST 2011
1200 at: Mon Mar 28 17:11:41 BST 2011
1300 at: Mon Mar 28 17:11:41 BST 2011
1400 at: Mon Mar 28 17:11:42 BST 2011
1500 at: Mon Mar 28 17:11:42 BST 2011
...
]]


> Thanks,
> Frank.
>
> Andy Seaborne<an...@epimorphics.com> wrote on 03/28/2011 04:17:49 AM:
>
>>
>> On 27/03/11 18:56, Brian McBride wrote:
>>> On 25/03/2011 18:25, Frank Budinsky wrote:
>>>> [...]
>>>
>>> I have a hypothesis for why.
>>>
>>> When ARP, the RDF/XML parser, finishes reading a file it issues a
>>> finishedRead event to the event handler in the graph it is reading into.
>>> If the TDB default event handler then did something (like sync to disk?)
>>> that the memory model does not, this could explain the difference in
>>> performance. I have put a profiler on the test program and it reports
>>> that the test program is spending a lot more time in
>>> BlockManagerFile.force() when it is reading directly in to TDB than when
>>> it is going via a memory model. So there is some evidence that this is
>>> what is happening.
>>>
>>> I haven't been able to track down the block manager code actually in use
>>> as I'm having trouble checking ARQ out of SVN, but Andy likely knows off
>>> the top of his head whether this is plausible.
>> [...]
>>>> Brian McBride<br...@epimorphics.com>  wrote on 03/25/2011 12:15:30 PM:
>>>>
>>>>>
>>>>> Hi Andy, Frank,
>>>>>
>>>>> On 25/03/2011 14:13, Andy Seaborne wrote:
>>>>>> I ran today:
>>>>>>
>>>>>> https://jena.svn.sourceforge.net/svnroot/jena/TDB/trunk/src-dev/
>>>>> reports/ReportOutOfMemoryManyGraphsTDB.java
>>>>> [[
>>>>>
>>>>> I would very much appreciate it if others on this mailing list could
>>>>> also
>>>>> give it a quick try. I'd like to know if it's just me (and my
>>>> colleagues),
>>>>> or is there some kind of pattern to explain this huge difference.
>>>>> ]]
>>>>>
>>>>>
>>>>> Having a windows 7 box and a couple of spare minutes I tried running
>>>>> Andy's version of the test code. I got performance akin to what Frank
>>>>> reports.
>>>>>
>>>>> [[
>>>>> Max mem: 1,820M
>>>>> DIRECT mode
>>>>>> Starting test: Fri Mar 25 16:06:36 GMT 2011
>>>>> WARN [main] (SetupTDB.java:755) - No BGP optimizer
>>>>>> Initial number of indexed graphs: 0
>>>>> 100 at: Fri Mar 25 16:07:17 GMT 2011
>>>>> 200 at: Fri Mar 25 16:08:05 GMT 2011
>>>>> 300 at: Fri Mar 25 16:08:56 GMT 2011
>>>>> ...
>>>>> ]]
>>>>>
>>>>> However, it runs a lot faster, if instead of reading the input stream
>>>>> directly into TDB, I read it into a memory model first and then add
> that
>>>>> model to a TDB model. Then I get performance similar to that which
> Andy
>>>>> reports. i.e.
>>>>>
>>>>> [[
>>>>> Max mem: 1,820M
>>>>> DIRECT mode
>>>>>> Starting test: Fri Mar 25 15:55:48 GMT 2011
>>>>> WARN [main] (SetupTDB.java:755) - No BGP optimizer
>>>>>> Initial number of indexed graphs: 0
>>>>> 100 at: Fri Mar 25 15:55:51 GMT 2011
>>>>> 200 at: Fri Mar 25 15:55:51 GMT 2011
>>>>> 300 at: Fri Mar 25 15:55:52 GMT 2011
>>>>> ...
>>>>>
>>>>> 100000 at: Fri Mar 25 16:04:47 GMT 2011
>>>>>> Done at: Fri Mar 25 16:04:55 GMT 2011
>>>>> 100,000 graphs in 547.00 sec
>>>>> ]]
>>>>>
>>>>> Frank, you might try replacing
>>>>>
>>>>> [[
>>>>> InputStream instream = getGraph(i); // the RDF graph to load
>>>>> dataset.getLock().enterCriticalSection(Lock.WRITE);
>>>>> try {
>>>>>     Model model = dataset.getNamedModel(
>>>>>         "https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/" + i);
>>>>>     model.read(instream, null);
>>>>>     //model.close();
>>>>> } finally { dataset.getLock().leaveCriticalSection(); }
>>>>> if (++i % 100 == 0) System.out.println(i/100 + "00 at: " + new java.util.Date());
>>>>> instream.close();
>>>>> }
>>>>> ]]
>>>>>
>>>>> with
>>>>> [[
>>>>> InputStream instream = getGraph(i); // the RDF graph to load
>>>>> Model m = ModelFactory.createDefaultModel();
>>>>> m.read(instream, null);
>>>>>
>>>>> dataset.getLock().enterCriticalSection(Lock.WRITE);
>>>>> try {
>>>>>     Model model = dataset.getNamedModel(
>>>>>         "https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/" + i);
>>>>>     // model.read(instream, null);
>>>>>     model.add(m);
>>>>>     //model.close();
>>>>> } finally { dataset.getLock().leaveCriticalSection(); }
>>>>> if (++i % 100 == 0) System.out.println(i/100 + "00 at: " + new java.util.Date());
>>>>> instream.close();
>>>>> }
>>>>> ]]
>>>>>
>>>>> I am running, I think, the latest Maven versions of Jena etc. on
>>>>> Windows 7 on an Intel i7 with 8GB. Tests were run in Eclipse.
>>>>>
>>>>> Windows 7 SP1
>>>>> Jena 2.6.4
>>>>> ARQ 2.8.7
>>>>> TDB 0.8.9
>>>>>
>>>>> Hoping this might be helpful,
>>>>> Brian
>>>>>
>>>>>
>>>>>
>>>>>> in Eclipse
>>>>>> in direct mode.
>>>>>>
>>>>>> It has some configuration choices you might like to try.
>>>>>>
>>>>>> Max mem: 910M
>>>>>> DIRECT mode
>>>>>>> Starting test: Fri Mar 25 13:57:02 GMT 2011
>>>>>>> Initial number of indexed graphs: 0
>>>>>> 100 at: Fri Mar 25 13:57:04 GMT 2011
>>>>>> 200 at: Fri Mar 25 13:57:04 GMT 2011
>>>>>> 300 at: Fri Mar 25 13:57:05 GMT 2011
>>>>>> 400 at: Fri Mar 25 13:57:06 GMT 2011
>>>>>> 500 at: Fri Mar 25 13:57:06 GMT 2011
>>>>>> 600 at: Fri Mar 25 13:57:07 GMT 2011
>>>>>> 700 at: Fri Mar 25 13:57:07 GMT 2011
>>>>>> 800 at: Fri Mar 25 13:57:08 GMT 2011
>>>>>> 900 at: Fri Mar 25 13:57:08 GMT 2011
>>>>>> 1000 at: Fri Mar 25 13:57:09 GMT 2011
>>>>>> ....
>>>>>> 98000 at: Fri Mar 25 14:06:47 GMT 2011
>>>>>> 98100 at: Fri Mar 25 14:06:47 GMT 2011
>>>>>> 98200 at: Fri Mar 25 14:06:48 GMT 2011
>>>>>> 98300 at: Fri Mar 25 14:06:48 GMT 2011
>>>>>> 98400 at: Fri Mar 25 14:06:49 GMT 2011
>>>>>> 98500 at: Fri Mar 25 14:06:50 GMT 2011
>>>>>> 98600 at: Fri Mar 25 14:06:50 GMT 2011
>>>>>> 98700 at: Fri Mar 25 14:06:52 GMT 2011
>>>>>> 98800 at: Fri Mar 25 14:06:52 GMT 2011
>>>>>> 98900 at: Fri Mar 25 14:06:53 GMT 2011
>>>>>> 99000 at: Fri Mar 25 14:06:53 GMT 2011
>>>>>> 99100 at: Fri Mar 25 14:06:54 GMT 2011
>>>>>> 99200 at: Fri Mar 25 14:06:55 GMT 2011
>>>>>> 99300 at: Fri Mar 25 14:06:55 GMT 2011
>>>>>> 99400 at: Fri Mar 25 14:06:56 GMT 2011
>>>>>> 99500 at: Fri Mar 25 14:06:56 GMT 2011
>>>>>> 99600 at: Fri Mar 25 14:06:57 GMT 2011
>>>>>> 99700 at: Fri Mar 25 14:06:58 GMT 2011
>>>>>> 99800 at: Fri Mar 25 14:06:59 GMT 2011
>>>>>> 99900 at: Fri Mar 25 14:07:00 GMT 2011
>>>>>> 100000 at: Fri Mar 25 14:07:00 GMT 2011
>>>>>>> Done at: Fri Mar 25 14:07:04 GMT 2011
>>>>>> 100,000 graphs in 601.98 sec
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 25/03/11 13:50, Frank Budinsky wrote:
>>>>>>> Hi Andy and all,
>>>>>>>
>>>>>>> I finally managed to get a relatively powerful machine set up
>>>>>> Details?
>>>>>>
>>>>>>> and I reran the test program I sent you, but unfortunately, it still
>>>>>>> runs orders of magnitude slower than the numbers you got when you
>>>>>>> tried it. The hardware I used this time is similar to what you are
>>>>>>> using; the only significant difference is that it's running Windows 7
>>>>>>> instead of Ubuntu. I know Linux is somewhat faster than Windows, but I
>>>>>>> don't expect we can blame Microsoft for a change from 573.87 seconds
>>>>>>> to about 4 hours :-)
>>>>>>>
>>>>>>> Are you sure that your numbers are correct? How big was the TDB
>>>>>>> database on
>>>>>>> disk at the end of the run?
>>>>>> 3.9G DB1
>>>>>>
>>>>>>> Do you have any other ideas what may be wrong
>>>>>>> with my configuration?
>>>>>> Windows server or desktop? (server is better at I/O)
>>>>>> 64 bit?
>>>>>> Is there a virus checker?
>>>>>>
>>>>>>> I would very much appreciate it if others on this mailing list could
>>>>>>> also give it a quick try. I'd like to know if it's just me (and my
>>>>>>> colleagues), or is there some kind of pattern to explain this huge
>>>>>>> difference.
>>>>>>>
>>>>>>> Here is the simple test program again (inlined this time, since Apache
>>>>>>> seems to throw away attachments). To run it, just change the TDB_DIR
>>>>>>> constant to some empty directory. The test program loads 100,000
>>>>>>> datagraphs (about 100 triples each -> 10M triples total). It prints a
>>>>>>> message on the console at every 100, so if it's taking seconds for
>>>>>>> each println, you'll know very quickly that it will take hours to run,
>>>>>>> instead of a few minutes, like Andy has seen.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Frank.

Re: Bad Jena TDB performance

Posted by Brian McBride <br...@epimorphics.com>.
On 28/03/2011 18:38, Frank Budinsky wrote:
> Hi Brian,
>
> Thanks for summarizing the situation so clearly. It all adds up, except for
> one thing (see below).
Yeah - but that was pretty fundamental.

What JVM are you running?  Is it a 64 bit JVM?

Brian


Re: Bad Jena TDB performance

Posted by Andy Seaborne <an...@epimorphics.com>.

On 31/03/11 15:00, Frank Budinsky wrote:
> Hi Brian,
>
> On my Xeon server the number of graphs makes a big difference in the
> test: at first (e.g., the first 5k) the performance of direct and
> memory mapped modes is basically the same, but as it continues to run,
> direct mode stays constant while the memory mapped mode keeps getting
> slower and slower.
>
> Btw, I tried various JVMs and the times vary somewhat, but do not change
> the really slow time I see with memory mapped mode. It really looks like
> memory mapped mode just doesn't work properly on my particular hardware??
>

Brian made an interesting discovery about OS disk caching settings.

http://mail-archives.apache.org/mod_mbox/incubator-jena-users/201103.mbox/%3C4D9313D8.2030701@epimorphics.com%3E

> Frank.

I'm using ext4 on Ubuntu.

(There are various discussions about issues with ext3, out-of-order 
writes, and disk corruption.)

As previously mentioned: in the development version of TDB, fewer sync() 
operations are done - only explicit calls to sync cause the DB to sync. 
There is a snapshot build with those changes in.

	Andy

Re: Bad Jena TDB performance

Posted by Frank Budinsky <fr...@ca.ibm.com>.
Hi Brian,

On my Xeon server the number of graphs makes a big difference in the test:
at first (e.g., the first 5k) the performance of direct and memory
mapped modes is basically the same, but as it continues to run, direct mode
stays constant while the memory mapped mode keeps getting slower and slower.

Btw, I tried various JVMs and the times vary somewhat, but do not change
the really slow time I see with memory mapped mode. It really looks like
memory mapped mode just doesn't work properly on my particular hardware??

Frank.


                                                                                                                           
From: Brian McBride <br...@epimorphics.com>
Date: 03/29/2011 09:48 AM
Subject: Re: Bad Jena TDB performance

[...]

Re: Bad Jena TDB performance

Posted by Brian McBride <br...@epimorphics.com>.
Hi Frank,

I have rerun some of the tests on two different Windows 7 boxes.  This 
is with direct reading of the file into the TDB model.  To keep the run 
times down I have just loaded 5k graphs.

On my laptop:
Direct   max-mem   time (sec)
N        908M       889.03
Y        908M       824.93
N        1820M      911.90
Y        1820M      701.92


As you can see there is a speedup when setting direct mode.

The results I reported yesterday were from my desktop machine.  I reran 
those tests today and ran them for longer (5k graphs)

Direct   max-mem   time (sec)
N        1820M     1743.83
Y        1820M     2397.50


These are similar to the results I reported yesterday, but they are not 
at all what I would expect compared to the results on my laptop.

On the same OS, same VM, same application code, and same heap size, my 
desktop is running very much slower than my laptop, and slows down rather 
than speeds up when I turn on DIRECT mode.  Both machines have i7 
processors.  The laptop has 4GB of RAM and the desktop 8GB.

I cannot at present account for this.

You report loading 100k graphs in 2080.58 secs in DIRECT mode.  Neither 
of my machines is coming close to that speed - yours is much faster.

Brian

On 28/03/2011 18:38, Frank Budinsky wrote:
> Hi Brian,
>
> Thanks for summarizing the situation so clearly. It all adds up, except for
> one thing (see below).
>
> Brian McBride<br...@epimorphics.com>  wrote on 03/28/2011 01:03:13 PM:
>
>> [image removed]
>>
>> Re: Bad Jena TDB performance
>>
>> Brian McBride
>>
>> to:
>>
>> jena-users
>>
>> 03/28/2011 01:03 PM
>>
>> Please respond to jena-users
>>
>> Hi Frank,
>>
>> Separating out commentary from data, I've put this in a different message.
>>
>> Here is what I think is happening:
>>
>> 1.  You are on a 32 bit OS, and on a 32 bit OS DIRECT mode enables TDB
>> to go faster.
> It is really behaving like this is the case, only it's not. I'm running on
> a 64 bit OS. Since you are also running on Windows 7 64 bit and are not
> seeing the same slow behavior in non-DIRECT mode, it seems like the issue
> must be something specific with my configuration and not a more general
> Windows 7 problem.
>
>> 2. The current released version of Jena/ARQ/TDB does a SYNC at the end
>> of every model.read into a TDB model, so with model.read() you are doing
>> 100,000 SYNCs plus one at the end.  model.add does not do a SYNC so if
>> you use that you are only doing one SYNC and that is why things go
>> faster that way.
>>
>> 3. If you do both DIRECT mode and model.add you get the benefit of both
>> faster TDB performance on 32 bit OS's and avoid 100k SYNCs.
>>
>> 4. You reported loading 100k graphs in 2080 secs in direct mode.  Andy
>> reported getting the same in 601  secs.  If Andy was running the test on
>> the current development version (plausible as he is a developer), that
>> is not doing the 100k extra SYNCs and that is why he got a figure much
>> closer to your DIRECT mode + model.add().
>>
>> 5. I'm loading 100k graphs in about 540s.  You are taking 812.   Could
>> that difference be plausibly accounted for by hardware difference plus
>> the fact I am running a 64 bit OS?  Or do you think there is more to it?
> I think the difference (540 vs 812) probably can be explained by the
> hardware difference.
>
> Thanks again for all your help with this.
>
> Frank.
>
>> Brian
>>
>>
>> On 28/03/2011 16:04, Frank Budinsky wrote:
>>>> Brian, Frank, which versions are you running?
>>> I'm running the same version as Brian.
>>>
>>> tdb: 0.8.9
>>> arq: 2.8.7
>>> jena: 2.6.4
>>>
>>> I tried disabling anti-virus - no difference.
>>>
>>> I tried bumping up the heap to 3G - no difference.
>>>
>>> I tried running without setting direct mode, but using Brian's change,
>>> i.e., model.add() instead of model.read() - this runs slower than with
>>> both Brian's change and direct mode, but faster than if I run with
>>> direct mode without Brian's change:
>>>
>>> Only direct mode:
>>> 100,000 graphs in 2,080.58 sec
>>>
>>> Only model.add():
>>> 100,000 graphs in 1,246.25 sec
>>>
>>> Both direct mode and model.add():
>>> 100,000 graphs in 812.30 sec
>>>
>>> Brian, have you tried running it without setting direct mode? I'm
>>> wondering if there's a problem with memory mapped I/O on Windows?
>>>
>>> Thanks,
>>> Frank.
>>>
>>> Andy Seaborne<an...@epimorphics.com> wrote on 03/28/2011 04:17:49 AM:
>>> [...]
>>>>>>> if (++i % 100 == 0) System.out.println(i/100 + "00 at:
>>>>>>> " + new java.util.Date());
>>>>>>> instream.close();
>>>>>>> }
>>>>>>> ]]
>>>>>>>
>>>>>>> I am running I think the latest Maven versions of Jena etc on
> Windows
>>> 7
>>>>>>> on an Intel i7 with 8GB. Tests were run in Eclipse.
>>>>>>>
>>>>>>> Windows 7 SP1
>>>>>>> Jena 2.6.4
>>>>>>> ARQ 2.8.7
>>>>>>> TDB 0.8.9
>>>>>>>
>>>>>>> Hoping this might be helpful,
>>>>>>> Brian
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> in Eclispe
>>>>>>>> in direct mode.
>>>>>>>>
>>>>>>>> It has some configuration choices you might like to try.
>>>>>>>>
>>>>>>>> Max mem: 910M
>>>>>>>> DIRECT mode
>>>>>>>>> Starting test: Fri Mar 25 13:57:02 GMT 2011
>>>>>>>>> Initial number of indexed graphs: 0
>>>>>>>> 100 at: Fri Mar 25 13:57:04 GMT 2011
>>>>>>>> 200 at: Fri Mar 25 13:57:04 GMT 2011
>>>>>>>> 300 at: Fri Mar 25 13:57:05 GMT 2011
>>>>>>>> 400 at: Fri Mar 25 13:57:06 GMT 2011
>>>>>>>> 500 at: Fri Mar 25 13:57:06 GMT 2011
>>>>>>>> 600 at: Fri Mar 25 13:57:07 GMT 2011
>>>>>>>> 700 at: Fri Mar 25 13:57:07 GMT 2011
>>>>>>>> 800 at: Fri Mar 25 13:57:08 GMT 2011
>>>>>>>> 900 at: Fri Mar 25 13:57:08 GMT 2011
>>>>>>>> 1000 at: Fri Mar 25 13:57:09 GMT 2011
>>>>>>>> ....
>>>>>>>> 98000 at: Fri Mar 25 14:06:47 GMT 2011
>>>>>>>> 98100 at: Fri Mar 25 14:06:47 GMT 2011
>>>>>>>> 98200 at: Fri Mar 25 14:06:48 GMT 2011
>>>>>>>> 98300 at: Fri Mar 25 14:06:48 GMT 2011
>>>>>>>> 98400 at: Fri Mar 25 14:06:49 GMT 2011
>>>>>>>> 98500 at: Fri Mar 25 14:06:50 GMT 2011
>>>>>>>> 98600 at: Fri Mar 25 14:06:50 GMT 2011
>>>>>>>> 98700 at: Fri Mar 25 14:06:52 GMT 2011
>>>>>>>> 98800 at: Fri Mar 25 14:06:52 GMT 2011
>>>>>>>> 98900 at: Fri Mar 25 14:06:53 GMT 2011
>>>>>>>> 99000 at: Fri Mar 25 14:06:53 GMT 2011
>>>>>>>> 99100 at: Fri Mar 25 14:06:54 GMT 2011
>>>>>>>> 99200 at: Fri Mar 25 14:06:55 GMT 2011
>>>>>>>> 99300 at: Fri Mar 25 14:06:55 GMT 2011
>>>>>>>> 99400 at: Fri Mar 25 14:06:56 GMT 2011
>>>>>>>> 99500 at: Fri Mar 25 14:06:56 GMT 2011
>>>>>>>> 99600 at: Fri Mar 25 14:06:57 GMT 2011
>>>>>>>> 99700 at: Fri Mar 25 14:06:58 GMT 2011
>>>>>>>> 99800 at: Fri Mar 25 14:06:59 GMT 2011
>>>>>>>> 99900 at: Fri Mar 25 14:07:00 GMT 2011
>>>>>>>> 100000 at: Fri Mar 25 14:07:00 GMT 2011
>>>>>>>>> Done at: Fri Mar 25 14:07:04 GMT 2011
>>>>>>>> 100,000 graphs in 601.98 sec
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 25/03/11 13:50, Frank Budinsky wrote:
>>>>>>>>> Hi Andy and all,
>>>>>>>>>
>>>>>>>>> I finally managed to get a relatively powerful machine set up
>>>>>>>> Details?
>>>>>>>>
>>>>>>>>> and I reran
>>>>>>>>> the test program I sent you, but unfortunately, it still runs
>>> orders
>>>>>> of
>>>>>>>>> magnitude slower than the numbers you got when you tried it. The
>>>>>>>>> hardware I
>>>>>>>>> used this time is similar to what you are using, the only
>>> significant
>>>>>>>>> difference is that it's running Window's 7 instead of Ubuntu. I
>>> know
>>>>>>>>> Linux
>>>>>>>>> is somewhat faster than Windows, but I don't expect we can blame
>>>>>>>>> Microsoft
>>>>>>>>> for a change from 573.87 seconds to about 4 hours :-)
>>>>>>>>>
>>>>>>>>> Are you sure that your numbers are correct? How big was the TDB
>>>>>>>>> database on
>>>>>>>>> disk at the end of the run?
>>>>>>>> 3.9G DB1
>>>>>>>>
>>>>>>>>> Do you have any other ideas what may be wrong
>>>>>>>>> with my configuration?
>>>>>>>> Windows server or desktop? (server is better at I/O)
>>>>>>>> 64 bit?
>>>>>>>> Is there a virus checker?
>>>>>>>>
>>>>>>>>> I would very much appreciate it if others on this mailing list
>>> could
>>>>>>>>> also
>>>>>>>>> give it a quick try. I'd like to know if it's just me (and my
>>>>>>>>> colleagues),
>>>>>>>>> or is there some kind of pattern to explain this huge difference.
>>>>>>>>>
>>>>>>>>> Here is the simple test program again (inlined this time, since
>>> Apache
>>>>>>>>> seems to throw away attachments). To run it, just change the
>>> TDB_DIR
>>>>>>>>> constant to some empty directory. The test program loads100,000
>>>>>>>>> datagraphs
>>>>>>>>> (about 100 triples each ->   10M triples total). It prints a
> message
>>>>>>>>> on the
>>>>>>>>> console at every 100, so if it's taking seconds for each println,
>>>>>> you'll
>>>>>>>>> know very quickly that it will take hours to run, instead of a
> few
>>>>>>>>> minutes,
>>>>>>>>> like Andy has seen.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Frank.

Re: Bad Jena TDB performance

Posted by Frank Budinsky <fr...@ca.ibm.com>.
Hi Brian,

Thanks for summarizing the situation so clearly. It all adds up, except for
one thing (see below).

Brian McBride <br...@epimorphics.com> wrote on 03/28/2011 01:03:13 PM:

> Hi Frank,
>
> Separating out commentary from data, I've put this in a different
message.
>
> Here is what I think is happening:
>
> 1.  You are on a 32 bit OS, and on a 32 bit OS DIRECT mode enables TDB
> to go faster.

It certainly behaves as if that were the case, but it's not: I'm running on
a 64 bit OS. Since you are also running on Windows 7 64 bit and are not
seeing the same slow behavior in non-DIRECT mode, it seems like the issue
must be something specific with my configuration and not a more general
Windows 7 problem.

>
> 2. The current released version of Jena/ARQ/TDB does a SYNC at the end
> of every model.read into a TDB model, so with model.read() you are doing
> 100,000 SYNCs plus one at the end.  model.add does not do a SYNC, so if
> you use that you are only doing one SYNC and that is why things go
> faster that way.
>
> 3. If you do both DIRECT mode and model.add you get the benefit of both
> faster TDB performance on 32 bit OS's and avoid 100k SYNCs.
>
> 4. You reported loading 100k graphs in 2080 secs in direct mode.  Andy
> reported getting the same in 601  secs.  If Andy was running the test on
> the current development version (plausible as he is a developer), that
> is not doing the 100k extra SYNCs and that is why he got a figure much
> closer to your DIRECT mode + model.add().
>
> 5. I'm loading 100k graphs in about 540s.  You are taking 812.  Could
> that difference be plausibly accounted for by hardware difference plus
> the fact I am running a 64 bit OS?  Or do you think there is more to it?

I think the difference (540 vs 812) probably can be explained by the
hardware difference.

Thanks again for all your help with this.

Frank.

>
> Brian
>
>
> On 28/03/2011 16:04, Frank Budinsky wrote:
> > [...]

Re: Bad Jena TDB performance

Posted by Brian McBride <br...@epimorphics.com>.
Hi Frank,

Separating out commentary from data, I've put this in a different message.

Here is what I think is happening:

1.  You are on a 32 bit OS, and on a 32 bit OS DIRECT mode enables TDB 
to go faster.

2. The current released version of Jena/ARQ/TDB does a SYNC at the end 
of every model.read into a TDB model, so with model.read() you are doing 
100,000 SYNCs plus one at the end.  model.add does not do a SYNC, so if 
you use that you are only doing one SYNC and that is why things go 
faster that way.

3. If you do both DIRECT mode and model.add you get the benefit of both 
faster TDB performance on 32 bit OS's and avoid 100k SYNCs (see the sketch 
after this list).

4. You reported loading 100k graphs in 2080 secs in direct mode.  Andy 
reported getting the same in 601 secs.  If Andy was running the test on 
the current development version (plausible as he is a developer), that 
is not doing the 100k extra SYNCs and that is why he got a figure much 
closer to your DIRECT mode + model.add().

5. I'm loading 100k graphs in about 540s.  You are taking 812.  Could 
that difference be plausibly accounted for by hardware difference plus 
the fact I am running a 64 bit OS?  Or do you think there is more to it?
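
In outline, the batched pattern looks like this (a minimal sketch against 
the 0.8.9-era API, reusing the test program's TDB_DIR and getGraph; 
GRAPH_BASE is a stand-in for the long graph URI prefix, not a name from 
the test program):

[[
import java.io.InputStream;

import com.hp.hpl.jena.query.Dataset;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.tdb.TDB;
import com.hp.hpl.jena.tdb.TDBFactory;

Dataset dataset = TDBFactory.createDataset(TDB_DIR);
for (int i = 0; i < 100000; i++) {
    InputStream instream = getGraph(i);
    // Parse into a plain memory model: the parser's end-of-read event
    // lands on the in-memory graph, so TDB sees no per-file sync.
    Model m = ModelFactory.createDefaultModel();
    m.read(instream, null);
    instream.close();
    // One bulk add into the named TDB model.
    dataset.getNamedModel(GRAPH_BASE + i).add(m);
}
// A single explicit sync at the end, instead of 100,000 implicit ones.
TDB.sync(dataset);
]]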

Brian


On 28/03/2011 16:04, Frank Budinsky wrote:
> [...]

Re: Bad Jena TDB performance

Posted by Frank Budinsky <fr...@ca.ibm.com>.
> Brian, Frank, which versions are you running?

I'm running the same version as Brian.

tdb: 0.8.9
arq: 2.8.7
jena: 2.6.4

I tried disabling anti-virus - no difference.

I tried bumping up the heap to 3G - no difference.

I tried running without setting direct mode, but using Brian's change,
i.e., model.add() instead of model.read() - this runs slower than with both
Brian's change and direct mode, but faster than if I run with direct mode
without Brian's change:

Only direct mode:
100,000 graphs in 2,080.58 sec

Only model.add():
100,000 graphs in 1,246.25 sec

Both direct mode and model.add():
100,000 graphs in 812.30 sec

Brian, have you tried running it without setting direct mode? I'm wondering
if there's a problem with memory mapped I/O on Windows?

Thanks,
Frank.
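
P.S. For anyone reproducing these runs: the mapped/direct choice can be 
forced in code before the first dataset is opened. A sketch, assuming the 
SystemTDB.setFileMode hook and FileMode enum of the 0.8.x codebase:

[[
import com.hp.hpl.jena.tdb.base.block.FileMode;
import com.hp.hpl.jena.tdb.sys.SystemTDB;

// Must run before any TDB dataset is created in this JVM.
// FileMode.mapped (the 64-bit default) exercises the memory-mapped path;
// FileMode.direct uses plain file I/O plus TDB's own block caches.
SystemTDB.setFileMode(FileMode.direct);
]]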

Andy Seaborne <an...@epimorphics.com> wrote on 03/28/2011 04:17:49
AM:

> [...]

Re: Bad Jena TDB performance

Posted by Andy Seaborne <an...@epimorphics.com>.

On 27/03/11 18:56, Brian McBride wrote:
> On 25/03/2011 18:25, Frank Budinsky wrote:
> [...]
>> One interesting difference between my slow-running case and Brian's is
>> that mine always starts out pretty fast, but gradually slows down. Notice
>> that it was taking about 2 seconds per 100 at the start, but it was up to
>> 15 seconds at 39800. I didn't let it run to completion, this time, but I
>> remember from a previous run that it was taking about 2 min per 100 at the
>> end. Any idea why it might slow down over time, like this, when not using
>> direct mode?
>>
>> Anyway, the next thing I tried was to change the code as Brian
>> suggested.
>> That also had a big impact for me:
> I have a hypothesis for why.
>
> When ARP, the RDF/XML parser, finishes reading a file it issues a
> finishedRead event to the event handler in the graph it is reading into.
>
> If the TDB default event handler then did something (like sync to disk?)
> that the memory model does not, this could explain the difference in
> performance. I have put a profiler on the test program and it reports
> that the test program is spending a lot more time in
> BlockManagerFile.force() when it is reading directly into TDB than when
> it is going via a memory model. So there is some evidence that this is
> what is happening.
>
> I haven't been able to track down the block manager code actually in use
> as I'm having trouble checking ARQ out of SVN, but Andy likely knows off
> the top of his head whether this is plausible.

 > s/block manager/event manager/

Could be - the model.add(model) will go via the BulkUpdateHandler (I 
think).  TDB's BulkUpdateHandler inherits from SimpleBulkUpdateHandler 
for insertion.

Could you try putting a break point in dataset.sync and see what the 
call stack is when it gets hit?  That'll tell you who is causing the sync.

There used to be (up to v 0.8.9? not in the last snapshot build) a sync 
wrapper that sync()s every n'000 triples added.  It's not in the 
development codebase.  All hidden implicit syncs should now be removed. 
They were causing problems for a user who was tracking whether the DB 
on disk was dirty or not.

Brian, Frank, which versions are you running?

 > as I'm having trouble checking ARQ out of SVN,

What sort of trouble?

	Andy


>
> Brian
>
>
>
>> Max mem: 1,820M
>> DIRECT mode
>> log4j:WARN No appenders could be found for logger
>> (com.hp.hpl.jena.sparql.mgt.ARQMgt).
>> log4j:WARN Please initialize the log4j system properly.
>>> Starting test: Fri Mar 25 09:56:03 PDT 2011
>>> Initial number of indexed graphs: 0
>> 100 at: Fri Mar 25 09:56:06 PDT 2011
>> 200 at: Fri Mar 25 09:56:07 PDT 2011
>> 300 at: Fri Mar 25 09:56:09 PDT 2011
>> 400 at: Fri Mar 25 09:56:10 PDT 2011
>> 500 at: Fri Mar 25 09:56:11 PDT 2011
>> ...
>> 99700 at: Fri Mar 25 10:09:27 PDT 2011
>> 99800 at: Fri Mar 25 10:09:27 PDT 2011
>> 99900 at: Fri Mar 25 10:09:28 PDT 2011
>> 100000 at: Fri Mar 25 10:09:29 PDT 2011
>>> Done at: Fri Mar 25 10:09:35 PDT 2011
>> 100,000 graphs in 812.30 sec
>>
>> With that change, I'm now only about 30% slower than Andy's number. Maybe
>> that's attributable to Windows vs Linux or the hardware differences. I'm
>> running it on:
>>
>> Intel Xeon E5345 2.33 GHz (2 processors)
>> 8 GB RAM
>> 300 GB HDD
>> Windows 7 Enterprise SP1
>>
>> Does anybody know how a Xeon 5345 should compare to i5 or i7
>> processors, or
>> how much difference there might be between Linux and Windows 7?
>>
>> Thanks again for your help.
>>
>> Frank
>>
>> Brian McBride<br...@epimorphics.com> wrote on 03/25/2011 12:15:30 PM:
>>
>>> Hi Andy, Frank,
>>>
>>> On 25/03/2011 14:13, Andy Seaborne wrote:
>>>> I ran today:
>>>>
>>>> https://jena.svn.sourceforge.net/svnroot/jena/TDB/trunk/src-dev/reports/ReportOutOfMemoryManyGraphsTDB.java
>>> [[
>>>
>>> I would very much appreciate it if others on this mailing list could
>>> also
>>> give it a quick try. I'd like to know if it's just me (and my colleagues),
>>> or is there some kind of pattern to explain this huge difference.
>>> ]]
>>>
>>>
>>> Having a Windows 7 box and a couple of spare minutes I tried running
>>> Andy's version of the test code. I got performance akin to what Frank
>>> reports.
>>>
>>> [[
>>> Max mem: 1,820M
>>> DIRECT mode
>>> > Starting test: Fri Mar 25 16:06:36 GMT 2011
>>> WARN [main] (SetupTDB.java:755) - No BGP optimizer
>>> > Initial number of indexed graphs: 0
>>> 100 at: Fri Mar 25 16:07:17 GMT 2011
>>> 200 at: Fri Mar 25 16:08:05 GMT 2011
>>> 300 at: Fri Mar 25 16:08:56 GMT 2011
>>> ...
>>> ]]
>>>
>>> However, it runs a lot faster if, instead of reading the input stream
>>> directly into TDB, I read it into a memory model first and then add that
>>> model to a TDB model. Then I get performance similar to what Andy
>>> reports, i.e.
>>>
>>> [[
>>> Max mem: 1,820M
>>> DIRECT mode
>>> > Starting test: Fri Mar 25 15:55:48 GMT 2011
>>> WARN [main] (SetupTDB.java:755) - No BGP optimizer
>>> > Initial number of indexed graphs: 0
>>> 100 at: Fri Mar 25 15:55:51 GMT 2011
>>> 200 at: Fri Mar 25 15:55:51 GMT 2011
>>> 300 at: Fri Mar 25 15:55:52 GMT 2011
>>> ...
>>>
>>> 100000 at: Fri Mar 25 16:04:47 GMT 2011
>>> > Done at: Fri Mar 25 16:04:55 GMT 2011
>>> 100,000 graphs in 547.00 sec
>>> ]]
>>>
>>> Frank, you might try replacing
>>>
>>> [[
>>> InputStream instream = getGraph(i);   // the RDF graph to load
>>> dataset.getLock().enterCriticalSection(Lock.WRITE);
>>> try {
>>>     Model model = dataset.getNamedModel(
>>>         "https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/" + i);
>>>     model.read(instream, null);
>>>     //model.close();
>>> } finally { dataset.getLock().leaveCriticalSection() ; }
>>> if (++i % 100 == 0) System.out.println(i/100 + "00 at: " + new java.util.Date());
>>> instream.close();
>>> }
>>> ]]
>>>
>>> with
>>> [[
>>> InputStream instream = getGraph(i);   // the RDF graph to load
>>> Model m = ModelFactory.createDefaultModel();
>>> m.read(instream, null);
>>>
>>> dataset.getLock().enterCriticalSection(Lock.WRITE);
>>> try {
>>>     Model model = dataset.getNamedModel(
>>>         "https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/" + i);
>>>     // model.read(instream, null);
>>>     model.add(m);
>>>     //model.close();
>>> } finally { dataset.getLock().leaveCriticalSection() ; }
>>> if (++i % 100 == 0) System.out.println(i/100 + "00 at: " + new java.util.Date());
>>> instream.close();
>>> }
>>> ]]
>>>
>>> I am running, I think, the latest Maven versions of Jena etc. on Windows 7
>>> on an Intel i7 with 8GB. Tests were run in Eclipse.
>>>
>>> Windows 7 SP1
>>> Jena 2.6.4
>>> ARQ 2.8.7
>>> TDB 0.8.9
>>>
>>> Hoping this might be helpful,
>>> Brian
>>>
>>>
>>>
>>>> in Eclipse
>>>> in direct mode.
>>>>
>>>> It has some configuration choices you might like to try.
>>>>
>>>> Max mem: 910M
>>>> DIRECT mode
>>>>> Starting test: Fri Mar 25 13:57:02 GMT 2011
>>>>> Initial number of indexed graphs: 0
>>>> 100 at: Fri Mar 25 13:57:04 GMT 2011
>>>> 200 at: Fri Mar 25 13:57:04 GMT 2011
>>>> 300 at: Fri Mar 25 13:57:05 GMT 2011
>>>> 400 at: Fri Mar 25 13:57:06 GMT 2011
>>>> 500 at: Fri Mar 25 13:57:06 GMT 2011
>>>> 600 at: Fri Mar 25 13:57:07 GMT 2011
>>>> 700 at: Fri Mar 25 13:57:07 GMT 2011
>>>> 800 at: Fri Mar 25 13:57:08 GMT 2011
>>>> 900 at: Fri Mar 25 13:57:08 GMT 2011
>>>> 1000 at: Fri Mar 25 13:57:09 GMT 2011
>>>> ....
>>>> 98000 at: Fri Mar 25 14:06:47 GMT 2011
>>>> 98100 at: Fri Mar 25 14:06:47 GMT 2011
>>>> 98200 at: Fri Mar 25 14:06:48 GMT 2011
>>>> 98300 at: Fri Mar 25 14:06:48 GMT 2011
>>>> 98400 at: Fri Mar 25 14:06:49 GMT 2011
>>>> 98500 at: Fri Mar 25 14:06:50 GMT 2011
>>>> 98600 at: Fri Mar 25 14:06:50 GMT 2011
>>>> 98700 at: Fri Mar 25 14:06:52 GMT 2011
>>>> 98800 at: Fri Mar 25 14:06:52 GMT 2011
>>>> 98900 at: Fri Mar 25 14:06:53 GMT 2011
>>>> 99000 at: Fri Mar 25 14:06:53 GMT 2011
>>>> 99100 at: Fri Mar 25 14:06:54 GMT 2011
>>>> 99200 at: Fri Mar 25 14:06:55 GMT 2011
>>>> 99300 at: Fri Mar 25 14:06:55 GMT 2011
>>>> 99400 at: Fri Mar 25 14:06:56 GMT 2011
>>>> 99500 at: Fri Mar 25 14:06:56 GMT 2011
>>>> 99600 at: Fri Mar 25 14:06:57 GMT 2011
>>>> 99700 at: Fri Mar 25 14:06:58 GMT 2011
>>>> 99800 at: Fri Mar 25 14:06:59 GMT 2011
>>>> 99900 at: Fri Mar 25 14:07:00 GMT 2011
>>>> 100000 at: Fri Mar 25 14:07:00 GMT 2011
>>>>> Done at: Fri Mar 25 14:07:04 GMT 2011
>>>> 100,000 graphs in 601.98 sec
>>>>
>>>>
>>>>
>>>> On 25/03/11 13:50, Frank Budinsky wrote:
>>>>> Hi Andy and all,
>>>>>
>>>>> I finally managed to get a relatively powerful machine set up
>>>> Details?
>>>>
>>>>> and I reran
>>>>> the test program I sent you, but unfortunately, it still runs orders
>> of
>>>>> magnitude slower than the numbers you got when you tried it. The
>>>>> hardware I
>>>>> used this time is similar to what you are using, the only significant
>>>>> difference is that it's running Windows 7 instead of Ubuntu. I know
>>>>> Linux
>>>>> is somewhat faster than Windows, but I don't expect we can blame
>>>>> Microsoft
>>>>> for a change from 573.87 seconds to about 4 hours :-)
>>>>>
>>>>> Are you sure that your numbers are correct? How big was the TDB
>>>>> database on
>>>>> disk at the end of the run?
>>>> 3.9G DB1
>>>>
>>>>> Do you have any other ideas what may be wrong
>>>>> with my configuration?
>>>> Windows server or desktop? (server is better at I/O)
>>>> 64 bit?
>>>> Is there a virus checker?
>>>>
>>>>> I would very much appreciate it if others on this mailing list could
>>>>> also
>>>>> give it a quick try. I'd like to know if it's just me (and my
>>>>> colleagues),
>>>>> or is there some kind of pattern to explain this huge difference.
>>>>>
>>>>> Here is the simple test program again (inlined this time, since Apache
>>>>> seems to throw away attachments). To run it, just change the TDB_DIR
>>>>> constant to some empty directory. The test program loads 100,000
>>>>> datagraphs
>>>>> (about 100 triples each -> 10M triples total). It prints a message
>>>>> on the
>>>>> console at every 100, so if it's taking seconds for each println,
>> you'll
>>>>> know very quickly that it will take hours to run, instead of a few
>>>>> minutes,
>>>>> like Andy has seen.
>>>>>
>>>>> Thanks,
>>>>> Frank.
>

Re: Bad Jena TDB performance

Posted by Brian McBride <br...@epimorphics.com>.
On 25/03/2011 18:25, Frank Budinsky wrote:
[...]
> One interesting difference between my slow running case and Brian's, is
> that mine always starts out pretty fast, but gradually slows down. Notice
> that it was taking about 2 seconds per 100 at the start, but it was up to
> 15 seconds at 39800. I didn't let it run to completion, this time, but I
> remember from a previous run that it was taking about 2 min per 100 at the
> end. Any idea why it might slow down over time, like this, when not using
> direct mode?
>
> Anyway, the next thing I tried was to change the code as Brian suggested.
> That also had a big impact for me:
I have a hypothesis for why.

When ARP, the RDF/XML parser, finishes reading a file, it issues a 
finishedRead event to the event handler in the graph it is reading into.

If the TDB default event handler then did something (like sync to disk?) 
that the memory model does not, this could explain the difference in 
performance.  I have put a profiler on the test program and it reports 
that the test program is spending a lot more time in 
BlockManagerFile.force() when it is reading directly into TDB than when 
it is going via a memory model.  So there is some evidence that this is 
what is happening.
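
One way to probe this hypothesis directly is to register a listener for the finish-read event on the graph being loaded. A minimal sketch — the GraphEvents.startRead/finishRead constants and the GraphListener callback signatures are my recollection of the Jena 2.6.x graph API, so treat it as illustrative rather than definitive:

[[
import java.util.Iterator;
import java.util.List;

import com.hp.hpl.jena.graph.Graph;
import com.hp.hpl.jena.graph.GraphEvents;
import com.hp.hpl.jena.graph.GraphListener;
import com.hp.hpl.jena.graph.Triple;

// Throwaway listener: register it on the graph being loaded and watch how
// often the parser fires finishRead. Whether TDB's own handler syncs to
// disk at that point is the unconfirmed part of the hypothesis.
public class FinishReadLogger implements GraphListener {
    public void notifyEvent(Graph source, Object value) {
        if (GraphEvents.finishRead.equals(value))
            System.out.println("finishRead fired on " + source);
    }
    // The remaining callbacks are irrelevant to this check.
    public void notifyAddTriple(Graph g, Triple t) {}
    public void notifyAddArray(Graph g, Triple[] triples) {}
    public void notifyAddList(Graph g, List<Triple> triples) {}
    public void notifyAddIterator(Graph g, Iterator<Triple> it) {}
    public void notifyAddGraph(Graph g, Graph added) {}
    public void notifyDeleteTriple(Graph g, Triple t) {}
    public void notifyDeleteArray(Graph g, Triple[] triples) {}
    public void notifyDeleteList(Graph g, List<Triple> triples) {}
    public void notifyDeleteIterator(Graph g, Iterator<Triple> it) {}
    public void notifyDeleteGraph(Graph g, Graph removed) {}
}

// Usage: model.getGraph().getEventManager().register(new FinishReadLogger());
]]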

I haven't been able to track down the block manager code actually in use 
as  I'm having trouble checking ARQ out of SVN, but Andy likely knows 
off the top of his head whether this is plausible.

Brian



> Max mem: 1,820M
> DIRECT mode
> log4j:WARN No appenders could be found for logger
> (com.hp.hpl.jena.sparql.mgt.ARQMgt).
> log4j:WARN Please initialize the log4j system properly.
>> Starting test: Fri Mar 25 09:56:03 PDT 2011
>> Initial number of indexed graphs: 0
> 100 at: Fri Mar 25 09:56:06 PDT 2011
> 200 at: Fri Mar 25 09:56:07 PDT 2011
> 300 at: Fri Mar 25 09:56:09 PDT 2011
> 400 at: Fri Mar 25 09:56:10 PDT 2011
> 500 at: Fri Mar 25 09:56:11 PDT 2011
> ...
> 99700 at: Fri Mar 25 10:09:27 PDT 2011
> 99800 at: Fri Mar 25 10:09:27 PDT 2011
> 99900 at: Fri Mar 25 10:09:28 PDT 2011
> 100000 at: Fri Mar 25 10:09:29 PDT 2011
>> Done at: Fri Mar 25 10:09:35 PDT 2011
> 100,000 graphs in 812.30 sec
>
> With that change, I'm now only about 30% slower than Andy's number. Maybe
> that's attributable to Windows vs Linux or the hardware differences. I'm
> running it on:
>
> Intel Xeon E5345 2.33 GHz (2 processors)
> 8 GB RAM
> 300 GB HDD
> Windows 7 Enterprise SP1
>
> Does anybody know how a Xeon 5345 should compare to i5 or i7 processors, or
> how much difference there might be between Linux and Windows 7?
>
> Thanks again for your help.
>
> Frank
>
> Brian McBride<br...@epimorphics.com>  wrote on 03/25/2011 12:15:30 PM:
>
>> Re: Bad Jena TDB performance
>>
>> Brian McBride
>>
>> to:
>>
>> jena-users
>>
>> 03/25/2011 12:16 PM
>>
>> Please respond to jena-users
>>
>> Hi Andy, Frank,
>>
>> On 25/03/2011 14:13, Andy Seaborne wrote:
>>> I ran today:
>>>
>>> https://jena.svn.sourceforge.net/svnroot/jena/TDB/trunk/src-dev/reports/ReportOutOfMemoryManyGraphsTDB.java
>> [[
>>
>> I would very much appreciate it if others on this mailing list could also
>> give it a quick try. I'd like to know if it's just me (and my
> colleagues),
>> or is there some kind of pattern to explain this huge difference.
>> ]]
>>
>>
>> Having a Windows 7 box and a couple of spare minutes I tried running
>> Andy's version of the test code.  I got performance akin to what Frank
>> reports.
>>
>> [[
>> Max mem: 1,820M
>> DIRECT mode
>>   >  Starting test: Fri Mar 25 16:06:36 GMT 2011
>>    WARN [main] (SetupTDB.java:755) - No BGP optimizer
>>   >  Initial number of indexed graphs: 0
>> 100 at: Fri Mar 25 16:07:17 GMT 2011
>> 200 at: Fri Mar 25 16:08:05 GMT 2011
>> 300 at: Fri Mar 25 16:08:56 GMT 2011
>> ...
>> ]]
>>
>> However, it runs a lot faster, if instead of reading the input stream
>> directly into TDB, I read it into a memory model first and then add that
>> model to a TDB model.  Then I get performance similar to that which Andy
>> reports. i.e.
>>
>> [[
>> Max mem: 1,820M
>> DIRECT mode
>>   >  Starting test: Fri Mar 25 15:55:48 GMT 2011
>>    WARN [main] (SetupTDB.java:755) - No BGP optimizer
>>   >  Initial number of indexed graphs: 0
>> 100 at: Fri Mar 25 15:55:51 GMT 2011
>> 200 at: Fri Mar 25 15:55:51 GMT 2011
>> 300 at: Fri Mar 25 15:55:52 GMT 2011
>> ...
>>
>> 100000 at: Fri Mar 25 16:04:47 GMT 2011
>>   >  Done at: Fri Mar 25 16:04:55 GMT 2011
>> 100,000 graphs in 547.00 sec
>> ]]
>>
>> Frank, you might try replacing
>>
>> [[
>>                  InputStream instream = getGraph(i); // the RDF graph to load
>>                   dataset.getLock().enterCriticalSection(Lock.WRITE);
>>                   try {
>>                       Model model =
>> dataset.getNamedModel("https://jazz.net/jazz/resource/itemName/
>> com.ibm.team.workitem.WorkItem/"
>> + i);
>>                       model.read(instream, null);
>>                       //model.close();
>>                   } finally { dataset.getLock().leaveCriticalSection() ; }
>>                   if (++i % 100 == 0) System.out.println(i/100 + "00 at: " + new java.util.Date());
>>                   instream.close();
>>               }
>> ]]
>>
>> with
>> [[
>>                  InputStream instream = getGraph(i); // the RDF graph to load
>>                   Model m = ModelFactory.createDefaultModel();
>>                   m.read(instream, null);
>>
>>                   dataset.getLock().enterCriticalSection(Lock.WRITE);
>>                   try {
>>                       Model model =
>> dataset.getNamedModel("https://jazz.net/jazz/resource/itemName/
>> com.ibm.team.workitem.WorkItem/"
>> + i);
>> //                    model.read(instream, null);
>>                       model.add(m);
>>                       //model.close();
>>                   } finally { dataset.getLock().leaveCriticalSection() ; }
>>                   if (++i % 100 == 0) System.out.println(i/100 + "00 at: " + new java.util.Date());
>>                   instream.close();
>>               }
>> ]]
>>
>> I am running, I think, the latest Maven versions of Jena etc. on Windows 7
>> on an Intel i7 with 8GB.  Tests were run in Eclipse.
>>
>> Windows 7 SP1
>> Jena 2.6.4
>> ARQ 2.8.7
>> TDB 0.8.9
>>
>> Hoping this might be helpful,
>> Brian
>>
>>
>>
>>> in Eclipse
>>> in direct mode.
>>>
>>> It has some configuration choices you might like to try.
>>>
>>> Max mem: 910M
>>> DIRECT mode
>>>> Starting test: Fri Mar 25 13:57:02 GMT 2011
>>>> Initial number of indexed graphs: 0
>>> 100 at: Fri Mar 25 13:57:04 GMT 2011
>>> 200 at: Fri Mar 25 13:57:04 GMT 2011
>>> 300 at: Fri Mar 25 13:57:05 GMT 2011
>>> 400 at: Fri Mar 25 13:57:06 GMT 2011
>>> 500 at: Fri Mar 25 13:57:06 GMT 2011
>>> 600 at: Fri Mar 25 13:57:07 GMT 2011
>>> 700 at: Fri Mar 25 13:57:07 GMT 2011
>>> 800 at: Fri Mar 25 13:57:08 GMT 2011
>>> 900 at: Fri Mar 25 13:57:08 GMT 2011
>>> 1000 at: Fri Mar 25 13:57:09 GMT 2011
>>> ....
>>> 98000 at: Fri Mar 25 14:06:47 GMT 2011
>>> 98100 at: Fri Mar 25 14:06:47 GMT 2011
>>> 98200 at: Fri Mar 25 14:06:48 GMT 2011
>>> 98300 at: Fri Mar 25 14:06:48 GMT 2011
>>> 98400 at: Fri Mar 25 14:06:49 GMT 2011
>>> 98500 at: Fri Mar 25 14:06:50 GMT 2011
>>> 98600 at: Fri Mar 25 14:06:50 GMT 2011
>>> 98700 at: Fri Mar 25 14:06:52 GMT 2011
>>> 98800 at: Fri Mar 25 14:06:52 GMT 2011
>>> 98900 at: Fri Mar 25 14:06:53 GMT 2011
>>> 99000 at: Fri Mar 25 14:06:53 GMT 2011
>>> 99100 at: Fri Mar 25 14:06:54 GMT 2011
>>> 99200 at: Fri Mar 25 14:06:55 GMT 2011
>>> 99300 at: Fri Mar 25 14:06:55 GMT 2011
>>> 99400 at: Fri Mar 25 14:06:56 GMT 2011
>>> 99500 at: Fri Mar 25 14:06:56 GMT 2011
>>> 99600 at: Fri Mar 25 14:06:57 GMT 2011
>>> 99700 at: Fri Mar 25 14:06:58 GMT 2011
>>> 99800 at: Fri Mar 25 14:06:59 GMT 2011
>>> 99900 at: Fri Mar 25 14:07:00 GMT 2011
>>> 100000 at: Fri Mar 25 14:07:00 GMT 2011
>>>> Done at: Fri Mar 25 14:07:04 GMT 2011
>>> 100,000 graphs in 601.98 sec
>>>
>>>
>>>
>>> On 25/03/11 13:50, Frank Budinsky wrote:
>>>> Hi Andy and all,
>>>>
>>>> I finally managed to get a relatively powerful machine set up
>>> Details?
>>>
>>>> and I reran
>>>> the test program I sent you, but unfortunately, it still runs orders
> of
>>>> magnitude slower than the numbers you got when you tried it. The
>>>> hardware I
>>>> used this time is similar to what you are using, the only significant
>>>> difference is that it's running Windows 7 instead of Ubuntu. I know
>>>> Linux
>>>> is somewhat faster than Windows, but I don't expect we can blame
>>>> Microsoft
>>>> for a change from 573.87 seconds to about 4 hours :-)
>>>>
>>>> Are you sure that your numbers are correct? How big was the TDB
>>>> database on
>>>> disk at the end of the run?
>>> 3.9G    DB1
>>>
>>>> Do you have any other ideas what may be wrong
>>>> with my configuration?
>>> Windows server or desktop? (server is better at I/O)
>>> 64 bit?
>>> Is there a virus checker?
>>>
>>>> I would very much appreciate it if others on this mailing list could
>>>> also
>>>> give it a quick try. I'd like to know if it's just me (and my
>>>> colleagues),
>>>> or is there some kind of pattern to explain this huge difference.
>>>>
>>>> Here is the simple test program again (inlined this time, since Apache
>>>> seems to throw away attachments). To run it, just change the TDB_DIR
>>>> constant to some empty directory. The test program loads 100,000
>>>> datagraphs
>>>> (about 100 triples each ->   10M triples total). It prints a message
>>>> on the
>>>> console at every 100, so if it's taking seconds for each println,
> you'll
>>>> know very quickly that it will take hours to run, instead of a few
>>>> minutes,
>>>> like Andy has seen.
>>>>
>>>> Thanks,
>>>> Frank.


Re: Bad Jena TDB performance

Posted by Frank Budinsky <fr...@ca.ibm.com>.
Hi Andy and Brian,

Thank you both for your speedy replies! Your input, combined, has enabled
me to get my performance numbers into the right ballpark.

When running Andy's version of the test, I noticed significantly better
performance than I had previously been getting:

Max mem: 1,820M
DIRECT mode
log4j:WARN No appenders could be found for logger
(com.hp.hpl.jena.sparql.mgt.ARQMgt).
log4j:WARN Please initialize the log4j system properly.
> Starting test: Fri Mar 25 08:21:07 PDT 2011
> Initial number of indexed graphs: 0
100 at: Fri Mar 25 08:21:11 PDT 2011
200 at: Fri Mar 25 08:21:13 PDT 2011
300 at: Fri Mar 25 08:21:15 PDT 2011
400 at: Fri Mar 25 08:21:17 PDT 2011
1200 at: Fri Mar 25 08:21:34 PDT 2011
...
99700 at: Fri Mar 25 08:55:41 PDT 2011
99800 at: Fri Mar 25 08:55:43 PDT 2011
99900 at: Fri Mar 25 08:55:45 PDT 2011
100000 at: Fri Mar 25 08:55:47 PDT 2011
> Done at: Fri Mar 25 08:55:47 PDT 2011
100,000 graphs in 2,080.58 sec

Although much better than the 4 hours I was seeing previously, it's still
3.5 times slower than Andy's numbers. Is there any chance the warning
messages about log4j might affect performance?
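
The log4j warnings only mean that no appender is configured, so log output is being dropped; they are printed once at startup and should not cost anything measurable. If you want them gone, a minimal sketch using log4j's standard BasicConfigurator:

[[
import org.apache.log4j.BasicConfigurator;

// Call once at startup (e.g. first line of main()): installs a default
// console appender and silences the "No appenders could be found" warning.
BasicConfigurator.configure();
]]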

Looking at the difference between Andy's and my original test, I see this
call is the only difference:

            SystemTDB.setFileMode(FileMode.direct) ;

What exactly does that do, and should I always be configuring TDB this way?
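
For reference: direct mode makes TDB do its file I/O through plain java.io with its own in-heap block caches, instead of the memory-mapped files it defaults to on 64-bit JVMs. A minimal sketch of where the call goes — package locations are assumed from TDB 0.8.x, and it must run before the dataset is created:

[[
import com.hp.hpl.jena.query.Dataset;
import com.hp.hpl.jena.tdb.TDBFactory;
import com.hp.hpl.jena.tdb.base.file.FileMode;   // assumed package location
import com.hp.hpl.jena.tdb.sys.SystemTDB;        // assumed package location

public class DirectModeExample {
    public static void main(String[] args) {
        // Switch TDB from memory-mapped files to direct (java.io) file
        // access with its own block caches; must precede dataset creation.
        SystemTDB.setFileMode(FileMode.direct);
        Dataset dataset = TDBFactory.createDataset("C:/temp/tdb_demo"); // hypothetical path
        System.out.println("> Indexed graphs: " + dataset.asDatasetGraph().size());
        dataset.close();
    }
}
]]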

I tried running without this call and, as expected, it seems to be back to
the 4 hour performance:

Max mem: 1,820M
> Starting test: Fri Mar 25 09:00:50 PDT 2011
log4j:WARN No appenders could be found for logger
(com.hp.hpl.jena.tdb.info).
log4j:WARN Please initialize the log4j system properly.
> Initial number of indexed graphs: 0
100 at: Fri Mar 25 09:00:55 PDT 2011
200 at: Fri Mar 25 09:00:57 PDT 2011
300 at: Fri Mar 25 09:00:59 PDT 2011
400 at: Fri Mar 25 09:01:01 PDT 2011
500 at: Fri Mar 25 09:01:03 PDT 2011
...
39400 at: Fri Mar 25 09:51:23 PDT 2011
39500 at: Fri Mar 25 09:51:38 PDT 2011
39600 at: Fri Mar 25 09:51:52 PDT 2011
39700 at: Fri Mar 25 09:52:06 PDT 2011
39800 at: Fri Mar 25 09:52:21 PDT 2011
...
> Done at: About 4 hours from start time

One interesting difference between my slow running case and Brian's, is
that mine always starts out pretty fast, but gradually slows down. Notice
that it was taking about 2 seconds per 100 at the start, but it was up to
15 seconds at 39800. I didn't let it run to completion, this time, but I
remember from a previous run that it was taking about 2 min per 100 at the
end. Any idea why it might slow down over time, like this, when not using
direct mode?

Anyway, the next thing I tried was to change the code as Brian suggested.
That also had a big impact for me:

Max mem: 1,820M
DIRECT mode
log4j:WARN No appenders could be found for logger
(com.hp.hpl.jena.sparql.mgt.ARQMgt).
log4j:WARN Please initialize the log4j system properly.
> Starting test: Fri Mar 25 09:56:03 PDT 2011
> Initial number of indexed graphs: 0
100 at: Fri Mar 25 09:56:06 PDT 2011
200 at: Fri Mar 25 09:56:07 PDT 2011
300 at: Fri Mar 25 09:56:09 PDT 2011
400 at: Fri Mar 25 09:56:10 PDT 2011
500 at: Fri Mar 25 09:56:11 PDT 2011
...
99700 at: Fri Mar 25 10:09:27 PDT 2011
99800 at: Fri Mar 25 10:09:27 PDT 2011
99900 at: Fri Mar 25 10:09:28 PDT 2011
100000 at: Fri Mar 25 10:09:29 PDT 2011
> Done at: Fri Mar 25 10:09:35 PDT 2011
100,000 graphs in 812.30 sec

With that change, I'm now only about 30% slower than Andy's number. Maybe
that's attributable to Windows vs Linux or the hardware differences. I'm
running it on:

Intel Xeon E5345 2.33 GHz (2 processors)
8 GB RAM
300 GB HDD
Windows 7 Enterprise SP1

Does anybody know how a Xeon 5345 should compare to i5 or i7 processors, or
how much difference there might be between Linux and Windows 7?

Thanks again for your help.

Frank

Brian McBride <br...@epimorphics.com> wrote on 03/25/2011 12:15:30 PM:

> Re: Bad Jena TDB performance
>
> Brian McBride
>
> to:
>
> jena-users
>
> 03/25/2011 12:16 PM
>
> Please respond to jena-users
>
> Hi Andy, Frank,
>
> On 25/03/2011 14:13, Andy Seaborne wrote:
> > I ran today:
> >
> > https://jena.svn.sourceforge.net/svnroot/jena/TDB/trunk/src-dev/reports/ReportOutOfMemoryManyGraphsTDB.java
> >
>
> [[
>
> I would very much appreciate it if others on this mailing list could also
> give it a quick try. I'd like to know if it's just me (and my
colleagues),
> or is there some kind of pattern to explain this huge difference.
> ]]
>
>
> Having a Windows 7 box and a couple of spare minutes I tried running
> Andy's version of the test code.  I got performance akin to what Frank
> reports.
>
> [[
> Max mem: 1,820M
> DIRECT mode
>  > Starting test: Fri Mar 25 16:06:36 GMT 2011
>   WARN [main] (SetupTDB.java:755) - No BGP optimizer
>  > Initial number of indexed graphs: 0
> 100 at: Fri Mar 25 16:07:17 GMT 2011
> 200 at: Fri Mar 25 16:08:05 GMT 2011
> 300 at: Fri Mar 25 16:08:56 GMT 2011
> ...
> ]]
>
> However, it runs a lot faster, if instead of reading the input stream
> directly into TDB, I read it into a memory model first and then add that
> model to a TDB model.  Then I get performance similar to that which Andy
> reports. i.e.
>
> [[
> Max mem: 1,820M
> DIRECT mode
>  > Starting test: Fri Mar 25 15:55:48 GMT 2011
>   WARN [main] (SetupTDB.java:755) - No BGP optimizer
>  > Initial number of indexed graphs: 0
> 100 at: Fri Mar 25 15:55:51 GMT 2011
> 200 at: Fri Mar 25 15:55:51 GMT 2011
> 300 at: Fri Mar 25 15:55:52 GMT 2011
> ...
>
> 100000 at: Fri Mar 25 16:04:47 GMT 2011
>  > Done at: Fri Mar 25 16:04:55 GMT 2011
> 100,000 graphs in 547.00 sec
> ]]
>
> Frank, you might try replacing
>
> [[
>                 InputStream instream = getGraph(i); // the RDF graph to load
>                  dataset.getLock().enterCriticalSection(Lock.WRITE);
>                  try {
>                      Model model =
> dataset.getNamedModel("https://jazz.net/jazz/resource/itemName/
> com.ibm.team.workitem.WorkItem/"
> + i);
>                      model.read(instream, null);
>                      //model.close();
>                  } finally { dataset.getLock().leaveCriticalSection() ; }
>                  if (++i % 100 == 0) System.out.println(i/100 + "00 at: " + new java.util.Date());
>                  instream.close();
>              }
> ]]
>
> with
> [[
>                 InputStream instream = getGraph(i); // the RDF graph to load
>                  Model m = ModelFactory.createDefaultModel();
>                  m.read(instream, null);
>
>                  dataset.getLock().enterCriticalSection(Lock.WRITE);
>                  try {
>                      Model model =
> dataset.getNamedModel("https://jazz.net/jazz/resource/itemName/
> com.ibm.team.workitem.WorkItem/"
> + i);
> //                    model.read(instream, null);
>                      model.add(m);
>                      //model.close();
>                  } finally { dataset.getLock().leaveCriticalSection() ; }
>                  if (++i % 100 == 0) System.out.println(i/100 + "00 at: " + new java.util.Date());
>                  instream.close();
>              }
> ]]
>
> I am running, I think, the latest Maven versions of Jena etc. on Windows 7
> on an Intel i7 with 8GB.  Tests were run in Eclipse.
>
> Windows 7 SP1
> Jena 2.6.4
> ARQ 2.8.7
> TDB 0.8.9
>
> Hoping this might be helpful,
> Brian
>
>
>
> >
> > in Eclipse
> > in direct mode.
> >
> > It has some configuration choices you might like to try.
> >
> > Max mem: 910M
> > DIRECT mode
> > > Starting test: Fri Mar 25 13:57:02 GMT 2011
> > > Initial number of indexed graphs: 0
> > 100 at: Fri Mar 25 13:57:04 GMT 2011
> > 200 at: Fri Mar 25 13:57:04 GMT 2011
> > 300 at: Fri Mar 25 13:57:05 GMT 2011
> > 400 at: Fri Mar 25 13:57:06 GMT 2011
> > 500 at: Fri Mar 25 13:57:06 GMT 2011
> > 600 at: Fri Mar 25 13:57:07 GMT 2011
> > 700 at: Fri Mar 25 13:57:07 GMT 2011
> > 800 at: Fri Mar 25 13:57:08 GMT 2011
> > 900 at: Fri Mar 25 13:57:08 GMT 2011
> > 1000 at: Fri Mar 25 13:57:09 GMT 2011
> > ....
> > 98000 at: Fri Mar 25 14:06:47 GMT 2011
> > 98100 at: Fri Mar 25 14:06:47 GMT 2011
> > 98200 at: Fri Mar 25 14:06:48 GMT 2011
> > 98300 at: Fri Mar 25 14:06:48 GMT 2011
> > 98400 at: Fri Mar 25 14:06:49 GMT 2011
> > 98500 at: Fri Mar 25 14:06:50 GMT 2011
> > 98600 at: Fri Mar 25 14:06:50 GMT 2011
> > 98700 at: Fri Mar 25 14:06:52 GMT 2011
> > 98800 at: Fri Mar 25 14:06:52 GMT 2011
> > 98900 at: Fri Mar 25 14:06:53 GMT 2011
> > 99000 at: Fri Mar 25 14:06:53 GMT 2011
> > 99100 at: Fri Mar 25 14:06:54 GMT 2011
> > 99200 at: Fri Mar 25 14:06:55 GMT 2011
> > 99300 at: Fri Mar 25 14:06:55 GMT 2011
> > 99400 at: Fri Mar 25 14:06:56 GMT 2011
> > 99500 at: Fri Mar 25 14:06:56 GMT 2011
> > 99600 at: Fri Mar 25 14:06:57 GMT 2011
> > 99700 at: Fri Mar 25 14:06:58 GMT 2011
> > 99800 at: Fri Mar 25 14:06:59 GMT 2011
> > 99900 at: Fri Mar 25 14:07:00 GMT 2011
> > 100000 at: Fri Mar 25 14:07:00 GMT 2011
> > > Done at: Fri Mar 25 14:07:04 GMT 2011
> > 100,000 graphs in 601.98 sec
> >
> >
> >
> > On 25/03/11 13:50, Frank Budinsky wrote:
> >>
> >> Hi Andy and all,
> >>
> >> I finally managed to get a relatively powerful machine set up
> >
> > Details?
> >
> >> and I reran
> >> the test program I sent you, but unfortunately, it still runs orders
of
> >> magnitude slower than the numbers you got when you tried it. The
> >> hardware I
> >> used this time is similar to what you are using, the only significant
> >> difference is that it's running Windows 7 instead of Ubuntu. I know
> >> Linux
> >> is somewhat faster than Windows, but I don't expect we can blame
> >> Microsoft
> >> for a change from 573.87 seconds to about 4 hours :-)
> >>
> >> Are you sure that your numbers are correct? How big was the TDB
> >> database on
> >> disk at the end of the run?
> >
> > 3.9G    DB1
> >
> >> Do you have any other ideas what may be wrong
> >> with my configuration?
> >
> > Windows server or desktop? (server is better at I/O)
> > 64 bit?
> > Is there a virus checker?
> >
> >> I would very much appreciate it if others on this mailing list could
> >> also
> >> give it a quick try. I'd like to know if it's just me (and my
> >> colleagues),
> >> or is there some kind of pattern to explain this huge difference.
> >>
> >> Here is the simple test program again (inlined this time, since Apache
> >> seems to throw away attachments). To run it, just change the TDB_DIR
> >> constant to some empty directory. The test program loads 100,000
> >> datagraphs
> >> (about 100 triples each ->  10M triples total). It prints a message
> >> on the
> >> console at every 100, so if it's taking seconds for each println,
you'll
> >> know very quickly that it will take hours to run, instead of a few
> >> minutes,
> >> like Andy has seen.
> >>
> >> Thanks,
> >> Frank.

Re: Bad Jena TDB performance

Posted by Brian McBride <br...@epimorphics.com>.
Hi Andy, Frank,

On 25/03/2011 14:13, Andy Seaborne wrote:
> I ran today:
>
> https://jena.svn.sourceforge.net/svnroot/jena/TDB/trunk/src-dev/reports/ReportOutOfMemoryManyGraphsTDB.java 
>

[[

I would very much appreciate it if others on this mailing list could also
give it a quick try. I'd like to know if it's just me (and my colleagues),
or is there some kind of pattern to explain this huge difference.
]]


Having a Windows 7 box and a couple of spare minutes I tried running 
Andy's version of the test code.  I got performance akin to what Frank 
reports.

[[
Max mem: 1,820M
DIRECT mode
 > Starting test: Fri Mar 25 16:06:36 GMT 2011
  WARN [main] (SetupTDB.java:755) - No BGP optimizer
 > Initial number of indexed graphs: 0
100 at: Fri Mar 25 16:07:17 GMT 2011
200 at: Fri Mar 25 16:08:05 GMT 2011
300 at: Fri Mar 25 16:08:56 GMT 2011
...
]]

However, it runs a lot faster, if instead of reading the input stream 
directly into TDB, I read it into a memory model first and then add that 
model to a TDB model.  Then I get performance similar to that which Andy 
reports. i.e.

[[
Max mem: 1,820M
DIRECT mode
 > Starting test: Fri Mar 25 15:55:48 GMT 2011
  WARN [main] (SetupTDB.java:755) - No BGP optimizer
 > Initial number of indexed graphs: 0
100 at: Fri Mar 25 15:55:51 GMT 2011
200 at: Fri Mar 25 15:55:51 GMT 2011
300 at: Fri Mar 25 15:55:52 GMT 2011
...

100000 at: Fri Mar 25 16:04:47 GMT 2011
 > Done at: Fri Mar 25 16:04:55 GMT 2011
100,000 graphs in 547.00 sec
]]

Frank, you might try replacing

[[
                InputStream instream = getGraph(i); // the RDF graph to load
                 dataset.getLock().enterCriticalSection(Lock.WRITE);
                 try {
                     Model model = 
dataset.getNamedModel("https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/" 
+ i);
                     model.read(instream, null);
                     //model.close();
                 } finally { dataset.getLock().leaveCriticalSection() ; }
                 if (++i % 100 == 0) System.out.println(i/100 + "00 at: " + new java.util.Date());
                 instream.close();
             }
]]

with
[[
                InputStream instream = getGraph(i); // the RDF graph to load
                 Model m = ModelFactory.createDefaultModel();
                 m.read(instream, null);

                 dataset.getLock().enterCriticalSection(Lock.WRITE);
                 try {
                     Model model = 
dataset.getNamedModel("https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/" 
+ i);
//                    model.read(instream, null);
                     model.add(m);
                     //model.close();
                 } finally { dataset.getLock().leaveCriticalSection() ; }
                 if (++i % 100 == 0) System.out.println(i/100 + "00 at: " + new java.util.Date());
                 instream.close();
             }
]]

I am running, I think, the latest Maven versions of Jena etc. on Windows 7 
on an Intel i7 with 8GB.  Tests were run in Eclipse.

Windows 7 SP1
Jena 2.6.4
ARQ 2.8.7
TDB 0.8.9

Hoping this might be helpful,
Brian



>
> in Eclipse
> in direct mode.
>
> It has some configuration choices you might like to try.
>
> Max mem: 910M
> DIRECT mode
> > Starting test: Fri Mar 25 13:57:02 GMT 2011
> > Initial number of indexed graphs: 0
> 100 at: Fri Mar 25 13:57:04 GMT 2011
> 200 at: Fri Mar 25 13:57:04 GMT 2011
> 300 at: Fri Mar 25 13:57:05 GMT 2011
> 400 at: Fri Mar 25 13:57:06 GMT 2011
> 500 at: Fri Mar 25 13:57:06 GMT 2011
> 600 at: Fri Mar 25 13:57:07 GMT 2011
> 700 at: Fri Mar 25 13:57:07 GMT 2011
> 800 at: Fri Mar 25 13:57:08 GMT 2011
> 900 at: Fri Mar 25 13:57:08 GMT 2011
> 1000 at: Fri Mar 25 13:57:09 GMT 2011
> ....
> 98000 at: Fri Mar 25 14:06:47 GMT 2011
> 98100 at: Fri Mar 25 14:06:47 GMT 2011
> 98200 at: Fri Mar 25 14:06:48 GMT 2011
> 98300 at: Fri Mar 25 14:06:48 GMT 2011
> 98400 at: Fri Mar 25 14:06:49 GMT 2011
> 98500 at: Fri Mar 25 14:06:50 GMT 2011
> 98600 at: Fri Mar 25 14:06:50 GMT 2011
> 98700 at: Fri Mar 25 14:06:52 GMT 2011
> 98800 at: Fri Mar 25 14:06:52 GMT 2011
> 98900 at: Fri Mar 25 14:06:53 GMT 2011
> 99000 at: Fri Mar 25 14:06:53 GMT 2011
> 99100 at: Fri Mar 25 14:06:54 GMT 2011
> 99200 at: Fri Mar 25 14:06:55 GMT 2011
> 99300 at: Fri Mar 25 14:06:55 GMT 2011
> 99400 at: Fri Mar 25 14:06:56 GMT 2011
> 99500 at: Fri Mar 25 14:06:56 GMT 2011
> 99600 at: Fri Mar 25 14:06:57 GMT 2011
> 99700 at: Fri Mar 25 14:06:58 GMT 2011
> 99800 at: Fri Mar 25 14:06:59 GMT 2011
> 99900 at: Fri Mar 25 14:07:00 GMT 2011
> 100000 at: Fri Mar 25 14:07:00 GMT 2011
> > Done at: Fri Mar 25 14:07:04 GMT 2011
> 100,000 graphs in 601.98 sec
>
>
>
> On 25/03/11 13:50, Frank Budinsky wrote:
>>
>> Hi Andy and all,
>>
>> I finally managed to get a relatively powerful machine set up
>
> Details?
>
>> and I reran
>> the test program I sent you, but unfortunately, it still runs orders of
>> magnitude slower than the numbers you got when you tried it. The 
>> hardware I
>> used this time is similar to what you are using, the only significant
>> difference is that it's running Windows 7 instead of Ubuntu. I know 
>> Linux
>> is somewhat faster than Windows, but I don't expect we can blame 
>> Microsoft
>> for a change from 573.87 seconds to about 4 hours :-)
>>
>> Are you sure that your numbers are correct? How big was the TDB 
>> database on
>> disk at the end of the run?
>
> 3.9G    DB1
>
>> Do you have any other ideas what may be wrong
>> with my configuration?
>
> Windows server or desktop? (server is better at I/O)
> 64 bit?
> Is there a virus checker?
>
>> I would very much appreciate it if others on this mailing list could 
>> also
>> give it a quick try. I'd like to know if it's just me (and my 
>> colleagues),
>> or is there some kind of pattern to explain this huge difference.
>>
>> Here is the simple test program again (inlined this time, since Apache
>> seems to throw away attachments). To run it, just change the TDB_DIR
>> constant to some empty directory. The test program loads 100,000 
>> datagraphs
>> (about 100 triples each ->  10M triples total). It prints a message 
>> on the
>> console at every 100, so if it's taking seconds for each println, you'll
>> know very quickly that it will take hours to run, instead of a few 
>> minutes,
>> like Andy has seen.
>>
>> Thanks,
>> Frank.

Re: Bad Jena TDB performance

Posted by Andy Seaborne <an...@epimorphics.com>.
I ran today:

https://jena.svn.sourceforge.net/svnroot/jena/TDB/trunk/src-dev/reports/ReportOutOfMemoryManyGraphsTDB.java

in Eclipse
in direct mode.

It has some configuration choices you might like to try.

Max mem: 910M
DIRECT mode
 > Starting test: Fri Mar 25 13:57:02 GMT 2011
 > Initial number of indexed graphs: 0
100 at: Fri Mar 25 13:57:04 GMT 2011
200 at: Fri Mar 25 13:57:04 GMT 2011
300 at: Fri Mar 25 13:57:05 GMT 2011
400 at: Fri Mar 25 13:57:06 GMT 2011
500 at: Fri Mar 25 13:57:06 GMT 2011
600 at: Fri Mar 25 13:57:07 GMT 2011
700 at: Fri Mar 25 13:57:07 GMT 2011
800 at: Fri Mar 25 13:57:08 GMT 2011
900 at: Fri Mar 25 13:57:08 GMT 2011
1000 at: Fri Mar 25 13:57:09 GMT 2011
....
98000 at: Fri Mar 25 14:06:47 GMT 2011
98100 at: Fri Mar 25 14:06:47 GMT 2011
98200 at: Fri Mar 25 14:06:48 GMT 2011
98300 at: Fri Mar 25 14:06:48 GMT 2011
98400 at: Fri Mar 25 14:06:49 GMT 2011
98500 at: Fri Mar 25 14:06:50 GMT 2011
98600 at: Fri Mar 25 14:06:50 GMT 2011
98700 at: Fri Mar 25 14:06:52 GMT 2011
98800 at: Fri Mar 25 14:06:52 GMT 2011
98900 at: Fri Mar 25 14:06:53 GMT 2011
99000 at: Fri Mar 25 14:06:53 GMT 2011
99100 at: Fri Mar 25 14:06:54 GMT 2011
99200 at: Fri Mar 25 14:06:55 GMT 2011
99300 at: Fri Mar 25 14:06:55 GMT 2011
99400 at: Fri Mar 25 14:06:56 GMT 2011
99500 at: Fri Mar 25 14:06:56 GMT 2011
99600 at: Fri Mar 25 14:06:57 GMT 2011
99700 at: Fri Mar 25 14:06:58 GMT 2011
99800 at: Fri Mar 25 14:06:59 GMT 2011
99900 at: Fri Mar 25 14:07:00 GMT 2011
100000 at: Fri Mar 25 14:07:00 GMT 2011
 > Done at: Fri Mar 25 14:07:04 GMT 2011
100,000 graphs in 601.98 sec



On 25/03/11 13:50, Frank Budinsky wrote:
>
> Hi Andy and all,
>
> I finally managed to get a relatively powerful machine set up

Details?

> and I reran
> the test program I sent you, but unfortunately, it still runs orders of
> magnitude slower than the numbers you got when you tried it. The hardware I
> used this time is similar to what you are using, the only significant
> difference is that it's running Windows 7 instead of Ubuntu. I know Linux
> is somewhat faster than Windows, but I don't expect we can blame Microsoft
> for a change from 573.87 seconds to about 4 hours :-)
>
> Are you sure that your numbers are correct? How big was the TDB database on
> disk at the end of the run?

3.9G	DB1

> Do you have any other ideas what may be wrong
> with my configuration?

Windows server or desktop? (server is better at I/O)
64 bit?
Is there a virus checker?

> I would very much appreciate it if others on this mailing list could also
> give it a quick try. I'd like to know if it's just me (and my colleagues),
> or is there some kind of pattern to explain this huge difference.
>
> Here is the simple test program again (inlined this time, since Apache
> seems to throw away attachments). To run it, just change the TDB_DIR
> constant to some empty directory. The test program loads 100,000 datagraphs
> (about 100 triples each ->  10M triples total). It prints a message on the
> console at every 100, so if it's taking seconds for each println, you'll
> know very quickly that it will take hours to run, instead of a few minutes,
> like Andy has seen.
>
> Thanks,
> Frank.

Bad Jena TDB performance

Posted by Frank Budinsky <fr...@ca.ibm.com>.
Hi Andy and all,

I finally managed to get a relatively powerful machine set up and I reran
the test program I sent you, but unfortunately, it still runs orders of
magnitude slower than the numbers you got when you tried it. The hardware I
used this time is similar to what you are using, the only significant
difference is that it's running Windows 7 instead of Ubuntu. I know Linux
is somewhat faster than Windows, but I don't expect we can blame Microsoft
for a change from 573.87 seconds to about 4 hours :-)

Are you sure that your numbers are correct? How big was the TDB database on
disk at the end of the run? Do you have any other ideas what may be wrong
with my configuration?

I would very much appreciate it if others on this mailing list could also
give it a quick try. I'd like to know if it's just me (and my colleagues),
or is there some kind of pattern to explain this huge difference.

Here is the simple test program again (inlined this time, since Apache
seems to throw away attachments). To run it, just change the TDB_DIR
constant to some empty directory. The test program loads 100,000 datagraphs
(about 100 triples each -> 10M triples total). It prints a message on the
console at every 100, so if it's taking seconds for each println, you'll
know very quickly that it will take hours to run, instead of a few minutes,
like Andy has seen.

Thanks,
Frank.

>>>>>>>> BEGIN TDBOutOfMemoryTest.java
package com.ibm.lld.test;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

import com.hp.hpl.jena.query.Dataset;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.shared.Lock;
import com.hp.hpl.jena.tdb.TDB;
import com.hp.hpl.jena.tdb.TDBFactory;

public class TDBOutOfMemoryTest
{
      public static final String TDB_DIR = "D:/work/relm/outofmem_jena_DB";
      public static final int NOGRAPHS = 100000; // Number of data graphs to load

      public static void main( String[] args ) {
            System.out.println("> Starting test: " + new java.util.Date());
            Dataset dataset = TDBFactory.createDataset(TDB_DIR);
            System.out.println("> Initial number of indexed graphs: " +
dataset.asDatasetGraph().size());
            try {
                  for (int i=0; i<NOGRAPHS; ) {
                        InputStream instream = getGraph(i); // the RDF graph to load
                        dataset.getLock().enterCriticalSection(Lock.WRITE);
                        try {
                              Model model = dataset.getNamedModel
("https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/"
+ i);
                              model.read(instream, null);
                              //model.close();
                        } finally { dataset.getLock().leaveCriticalSection() ; }
                        if (++i % 100 == 0) System.out.println(i/100 + "00 at: " + new java.util.Date());
                        instream.close();
                  }
                  TDB.sync(dataset);
                  dataset.close();
                  System.out.println("> Done at: " + new java.util.Date());
            }
            catch (IOException e) {
                  System.out.println("> Failed: " + e.getMessage());
            }
      }

      private static InputStream getGraph(int no) {
            String graph = GRAPH_TEMPLATE.replaceAll("%NUMBER%",
String.valueOf(no));
            return new ByteArrayInputStream(graph.getBytes());
      }

      private static final String GRAPH_TEMPLATE =
            "<rdf:RDF\n" +
            "    xmlns:rdf=
\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\"\n" +
            "    xmlns:j.0=\"http://open-services.net/ns/core#\"\n" +
            "    xmlns:j.1=\"http://open-services.net/ns/cm-x#\"\n" +
            "    xmlns:j.2=\"http://purl.org/dc/terms/\"\n" +
            "    xmlns:j.3=\"http://open-services.net/ns/cm#\"\n" +
            "    xmlns:j.5=
\"http://jazz.net/xmlns/prod/jazz/rtc/ext/1.0/\"\n" +
            "    xmlns:j.4=\"http://jazz.net/xmlns/prod/jazz/rtc/cm/1.0/\"
> \n" +
            "  <rdf:Description rdf:nodeID=\"A0\">\n" +
            "    <j.2:title>@mandrew</j.2:title>\n" +
            "    <rdf:type rdf:resource=
\"http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement\"/>\n" +
            "    <rdf:object rdf:resource=
\"https://jazz.net/jazz/oslc/automation/persons/_y4JlcPYJEdqU64Cr2VV0dQ\"/>
\n" +
            "    <rdf:predicate rdf:resource=
\"http://jazz.net/xmlns/prod/jazz/rtc/cm/1.0/com.ibm.team.workitem.linktype.textualReference.textuallyReferenced
\"/>\n" +
            "    <rdf:subject rdf:resource=
\"https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/%NUMBER%\"/>
\n" +
            "  </rdf:Description>\n" +
            "  <rdf:Description rdf:about=
\"https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/%NUMBER%\">
\n" +
            "    <j.2:title rdf:parseType=\"Literal\">Process REST Service
doesn't scale</j.2:title>\n" +
            "    <rdf:type rdf:resource=
\"http://open-services.net/ns/cm#ChangeRequest\"/>\n" +
            "
<j.4:com.ibm.team.workitem.linktype.textualReference.textuallyReferenced
rdf:resource=
\"https://jazz.net/jazz/oslc/automation/persons/_P_wUELLTEduhAusIxeOxbA\"/>
\n" +
            "
<j.4:com.ibm.team.workitem.linktype.textualReference.textuallyReferenced
rdf:resource=
\"https://jazz.net/jazz/oslc/automation/persons/_gTuTMG62Edu8R4joT9P1Ug\"/>
\n" +
            "
<j.4:com.ibm.team.workitem.linktype.textualReference.textuallyReferenced
rdf:resource=
\"https://jazz.net/jazz/oslc/automation/persons/_y4JlcPYJEdqU64Cr2VV0dQ\"/>
\n" +
            "
<j.4:com.ibm.team.workitem.linktype.textualReference.textuallyReferenced
rdf:resource=
\"https://jazz.net/jazz/oslc/automation/persons/_vCJP8OeKEduR89vYjZT89g\"/>
\n" +
            "
<j.4:com.ibm.team.workitem.linktype.textualReference.textuallyReferenced
rdf:resource=
\"https://jazz.net/sandbox02/ccm/service/com.ibm.team.process.internal.common.service.IProcessRestService/processAreasForUser?userId=shilpat
\"/>\n" +
            "
<j.4:com.ibm.team.workitem.linktype.textualReference.textuallyReferenced
rdf:resource=
\"https://jazz.net/jazz/oslc/automation/persons/_DziEAHHfEdyLLb7t1B32_A\"/>
\n" +
            "
<j.4:com.ibm.team.workitem.linktype.textualReference.textuallyReferenced
rdf:resource=
\"https://jazz.net/wiki/bin/view/Main/DraftTeamProcessRestApi#Project_Areas_collection
\"/>\n" +
            "
<j.4:com.ibm.team.workitem.linktype.textualReference.textuallyReferenced
rdf:resource=
\"https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/154737
\"/>\n" +
            "    <j.3:fixed>false</j.3:fixed>\n" +
            "    <j.0:discussion rdf:resource=
\"https://jazz.net/jazz/oslc/workitems/_p0Lr8D7SEeCl0bUDoWAOSQ/rtc_cm:comments
\"/>\n" +
            "    <j.2:contributor rdf:resource=
\"https://jazz.net/jazz/oslc/users/_y4JlcPYJEdqU64Cr2VV0dQ\"/>\n" +
            "    <j.5:client rdf:resource=
\"https://jazz.net/jazz/oslc/enumerations/_Q2fMII8EEd2Q-OW8dr3S5w/client/client.literal.l12
\"/>\n" +
            "    <j.4:plannedFor rdf:resource=
\"https://jazz.net/jazz/oslc/iterations/_VceosAh8EeC72Mz-78YBKQ\"/>\n" +
            "    <j.2:modified>2011-02-23T22:33:45.764Z</j.2:modified>\n" +
            "    <j.5:contextId>_Q2fMII8EEd2Q-OW8dr3S5w</j.5:contextId>\n"
+
            "    <j.4:timeSheet rdf:resource=
\"https://jazz.net/jazz/oslc/workitems/_p0Lr8D7SEeCl0bUDoWAOSQ/rtc_cm:timeSheet
\"/>\n" +
            "    <j.4:filedAgainst rdf:resource=
\"https://jazz.net/jazz/resource/itemOid/com.ibm.team.workitem.Category/_YNQI4I8FEd2Q-OW8dr3S5w
\"/>\n" +
            "    <j.3:verified>false</j.3:verified>\n" +
            "    <j.4:correctedEstimate></j.4:correctedEstimate>\n" +
            "    <j.1:priority rdf:resource=
\"https://jazz.net/jazz/oslc/enumerations/_Q2fMII8EEd2Q-OW8dr3S5w/priority/4
\"/>\n" +
            "    <j.3:approved>false</j.3:approved>\n" +
            "    <j.3:status>In Progress</j.3:status>\n" +
            "    <j.2:type>Defect</j.2:type>\n" +
            "    <j.4:modifiedBy rdf:resource=
\"https://jazz.net/jazz/oslc/users/_y4JlcPYJEdqU64Cr2VV0dQ\"/>\n" +
            "    <j.2:created>2011-02-22T22:35:15.682Z</j.2:created>\n" +
            "    <j.4:timeSpent></j.4:timeSpent>\n" +
            "    <j.3:closed>false</j.3:closed>\n" +
            "    <j.0:shortTitle rdf:parseType=\"Literal\">Defect
%NUMBER%</j.0:shortTitle>\n" +
            "    <j.4:state rdf:resource=
\"https://jazz.net/jazz/oslc/workflows/_Q2fMII8EEd2Q-OW8dr3S5w/states/bugzillaWorkflow/2
\"/>\n" +
            "    <j.4:estimate></j.4:estimate>\n" +
            "    <j.2:identifier>%NUMBER%</j.2:identifier>\n" +
            "    <j.2:description rdf:datatype=
\"http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral\">On Jazz.net, we
have a 3.0 &amp;quot;sandbox&amp;quot; deployed (CCM+JTS) which allows any
jazz.net user to create a project to try out RTC.&amp;nbsp; We're seeing
massive performance problems due to an apparent scalability problem in
process.&amp;nbsp; Currently, the sandbox has &amp;gt; 100 projects
created. This is causing the following
issues:&lt;br&gt;&lt;/br&gt;&lt;br&gt;&lt;/br&gt;1) Sandbox home page loads
a list of the current user's projects by calling process rest service with
user id.&amp;nbsp; This request takes &amp;gt; 60
seconds.&lt;br&gt;&lt;/br&gt;2) CCM app gets stuck on
&amp;quot;Loading...&amp;quot; for &amp;gt; 60 seconds, spinning on the
request to InitializationData.&amp;nbsp; InitData is waiting to get the
response from process's initializer (which is doing a lookup based on the
name of the project).&lt;br&gt;&lt;/br&gt;3) Home menu hangs for a long
time waiting for the list of projects to populate (There's also a UI
scaleability issue in the home menu... the number of projects exceeds
available space in the viewport, but that's another
item).&lt;br&gt;&lt;/br&gt;&lt;br&gt;&lt;/br&gt;This is a blocker. Process
must be able to scale to hundreds, possibly thousands of projects, without
slowing down the loading of the web UI.</j.2:description>\n" +
            "    <j.4:foundIn rdf:resource=
\"https://jazz.net/jazz/resource/itemOid/com.ibm.team.workitem.Deliverable/_kuFJcPDhEd-1FumPcb1epw
\"/>\n" +
            "    <j.5:howfound rdf:resource=
\"https://jazz.net/jazz/oslc/enumerations/_Q2fMII8EEd2Q-OW8dr3S5w/howfound/howfound.literal.l3
\"/>\n" +
            "    <j.2:subject>service</j.2:subject>\n" +
            "    <j.1:severity rdf:resource=
\"https://jazz.net/jazz/oslc/enumerations/_Q2fMII8EEd2Q-OW8dr3S5w/severity/5
\"/>\n" +
            "    <j.4:type rdf:resource=
\"https://jazz.net/jazz/oslc/types/_Q2fMII8EEd2Q-OW8dr3S5w/defect\"/>\n" +
            "    <j.1:project rdf:resource=
\"https://jazz.net/jazz/oslc/projectareas/_Q2fMII8EEd2Q-OW8dr3S5w\"/>\n" +
            "    <j.2:creator rdf:resource=
\"https://jazz.net/jazz/oslc/users/_gTuTMG62Edu8R4joT9P1Ug\"/>\n" +
            "    <j.4:teamArea rdf:resource=
\"https://jazz.net/jazz/oslc/teamareas/_ER2xcI8FEd2Q-OW8dr3S5w\"/>\n" +
            "    <j.3:reviewed>false</j.3:reviewed>\n" +
            "    <j.5:archived>false</j.5:archived>\n" +
            "    <j.4:resolvedBy rdf:resource=
\"https://jazz.net/jazz/oslc/users/_YNh4MOlsEdq4xpiOKg5hvA\"/>\n" +
            "    <j.5:os rdf:resource=
\"https://jazz.net/jazz/oslc/enumerations/_Q2fMII8EEd2Q-OW8dr3S5w/OS/OS.literal.l1
\"/>\n" +
            "    <j.3:inprogress>true</j.3:inprogress>\n" +
            "    <j.0:serviceProvider rdf:resource=
\"https://jazz.net/jazz/oslc/contexts/_Q2fMII8EEd2Q-OW8dr3S5w/workitems/services
\"/>\n" +
            "    <j.4:progressTracking rdf:resource=
\"https://jazz.net/jazz/oslc/workitems/_p0Lr8D7SEeCl0bUDoWAOSQ/progressTracking
\"/>\n" +
            "
<j.4:com.ibm.team.filesystem.workitems.change_set.com.ibm.team.scm.ChangeSet
 rdf:resource=
\"https://jazz.net/jazz/resource/itemOid/com.ibm.team.scm.ChangeSet/_p2Q08T7lEeC50ZOFeYh_9w
\"/>\n" +
            "
<j.4:com.ibm.team.filesystem.workitems.change_set.com.ibm.team.scm.ChangeSet
 rdf:resource=
\"https://jazz.net/jazz/resource/itemOid/com.ibm.team.scm.ChangeSet/_S7U3gT8XEeC50ZOFeYh_9w
\"/>\n" +
            "  </rdf:Description>\n" +
            "  <rdf:Description rdf:nodeID=\"A1\">\n" +
            "
<j.2:title>https://jazz.net/sandbox02/ccm/service/com.ibm.team.process.internal.common.service.IProcessRestService/processAreasForUser?userId=shilpat</j.2:title>
\n" +
            "    <rdf:type rdf:resource=
\"http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement\"/>\n" +
            "    <rdf:object rdf:resource=
\"https://jazz.net/sandbox02/ccm/service/com.ibm.team.process.internal.common.service.IProcessRestService/processAreasForUser?userId=shilpat
\"/>\n" +
            "    <rdf:predicate rdf:resource=
\"http://jazz.net/xmlns/prod/jazz/rtc/cm/1.0/com.ibm.team.workitem.linktype.textualReference.textuallyReferenced
\"/>\n" +
            "    <rdf:subject rdf:resource=
\"https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/%NUMBER%\"/>
\n" +
            "  </rdf:Description>\n" +
            "  <rdf:Description rdf:nodeID=\"A2\">\n" +
            "    <j.2:title>@packham</j.2:title>\n" +
            "    <rdf:type rdf:resource=
\"http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement\"/>\n" +
            "    <rdf:object rdf:resource=
\"https://jazz.net/jazz/oslc/automation/persons/_vCJP8OeKEduR89vYjZT89g\"/>
\n" +
            "    <rdf:predicate rdf:resource=
\"http://jazz.net/xmlns/prod/jazz/rtc/cm/1.0/com.ibm.team.workitem.linktype.textualReference.textuallyReferenced
\"/>\n" +
            "    <rdf:subject rdf:resource=
\"https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/%NUMBER%\"/>
\n" +
            "  </rdf:Description>\n" +
            "  <rdf:Description rdf:nodeID=\"A3\">\n" +
            "    <j.2:title>154737: Replace
ProjectAreaWebUIInitializionData with a dynamic module</j.2:title>\n" +
            "    <rdf:type rdf:resource=
\"http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement\"/>\n" +
            "    <rdf:object rdf:resource=
\"https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/154737
\"/>\n" +
            "    <rdf:predicate rdf:resource=
\"http://jazz.net/xmlns/prod/jazz/rtc/cm/1.0/com.ibm.team.workitem.linktype.textualReference.textuallyReferenced
\"/>\n" +
            "    <rdf:subject rdf:resource=
\"https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/%NUMBER%\"/>
\n" +
            "  </rdf:Description>\n" +
            "  <rdf:Description rdf:nodeID=\"A4\">\n" +
            "    <j.2:title>@retchles</j.2:title>\n" +
            "    <rdf:type rdf:resource=
\"http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement\"/>\n" +
            "    <rdf:object rdf:resource=
\"https://jazz.net/jazz/oslc/automation/persons/_gTuTMG62Edu8R4joT9P1Ug\"/>
\n" +
            "    <rdf:predicate rdf:resource=
\"http://jazz.net/xmlns/prod/jazz/rtc/cm/1.0/com.ibm.team.workitem.linktype.textualReference.textuallyReferenced
\"/>\n" +
            "    <rdf:subject rdf:resource=
\"https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/%NUMBER%\"/>
\n" +
            "  </rdf:Description>\n" +
            "  <rdf:Description rdf:nodeID=\"A5\">\n" +
            "    <j.2:title>Changes in Process - &lt;No Comment&gt; - Jared
Burns - Feb 23, 2011 1:36 AM</j.2:title>\n" +
            "    <rdf:type rdf:resource=
\"http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement\"/>\n" +
            "    <rdf:object rdf:resource=
\"https://jazz.net/jazz/resource/itemOid/com.ibm.team.scm.ChangeSet/_S7U3gT8XEeC50ZOFeYh_9w
\"/>\n" +
            "    <rdf:predicate rdf:resource=
\"http://jazz.net/xmlns/prod/jazz/rtc/cm/1.0/com.ibm.team.filesystem.workitems.change_set.com.ibm.team.scm.ChangeSet
\"/>\n" +
            "    <rdf:subject rdf:resource=
\"https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/%NUMBER%\"/>
\n" +
            "  </rdf:Description>\n" +
            "  <rdf:Description rdf:nodeID=\"A6\">\n" +
            "    <j.2:title>@mjarvis</j.2:title>\n" +
            "    <rdf:type rdf:resource=
\"http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement\"/>\n" +
            "    <rdf:object rdf:resource=
\"https://jazz.net/jazz/oslc/automation/persons/_P_wUELLTEduhAusIxeOxbA\"/>
\n" +
            "    <rdf:predicate rdf:resource=
\"http://jazz.net/xmlns/prod/jazz/rtc/cm/1.0/com.ibm.team.workitem.linktype.textualReference.textuallyReferenced
\"/>\n" +
            "    <rdf:subject rdf:resource=
\"https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/%NUMBER%\"/>
\n" +
            "  </rdf:Description>\n" +
            "  <rdf:Description rdf:nodeID=\"A7\">\n" +
            "    <j.2:title>Changes in Process - Performance test - Martha
Andrews - Feb 22, 2011 9:45 PM</j.2:title>\n" +
            "    <rdf:type rdf:resource=
\"http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement\"/>\n" +
            "    <rdf:object rdf:resource=
\"https://jazz.net/jazz/resource/itemOid/com.ibm.team.scm.ChangeSet/_p2Q08T7lEeC50ZOFeYh_9w
\"/>\n" +
            "    <rdf:predicate rdf:resource=
\"http://jazz.net/xmlns/prod/jazz/rtc/cm/1.0/com.ibm.team.filesystem.workitems.change_set.com.ibm.team.scm.ChangeSet
\"/>\n" +
            "    <rdf:subject rdf:resource=
\"https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/%NUMBER%\"/>
\n" +
            "  </rdf:Description>\n" +
            "  <rdf:Description rdf:nodeID=\"A8\">\n" +
            "    <j.2:title>@storaskar</j.2:title>\n" +
            "    <rdf:type rdf:resource=
\"http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement\"/>\n" +
            "    <rdf:object rdf:resource=
\"https://jazz.net/jazz/oslc/automation/persons/_DziEAHHfEdyLLb7t1B32_A\"/>
\n" +
            "    <rdf:predicate rdf:resource=
\"http://jazz.net/xmlns/prod/jazz/rtc/cm/1.0/com.ibm.team.workitem.linktype.textualReference.textuallyReferenced
\"/>\n" +
            "    <rdf:subject rdf:resource=
\"https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/%NUMBER%\"/>
\n" +
            "  </rdf:Description>\n" +
            "  <rdf:Description rdf:nodeID=\"A9\">\n" +
            "
<j.2:title>https://jazz.net/wiki/bin/view/Main/DraftTeamProcessRestApi#Project_Areas_collection</j.2:title>
\n" +
            "    <rdf:type rdf:resource=
\"http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement\"/>\n" +
            "    <rdf:object rdf:resource=
\"https://jazz.net/wiki/bin/view/Main/DraftTeamProcessRestApi#Project_Areas_collection
\"/>\n" +
            "    <rdf:predicate rdf:resource=
\"http://jazz.net/xmlns/prod/jazz/rtc/cm/1.0/com.ibm.team.workitem.linktype.textualReference.textuallyReferenced
\"/>\n" +
            "    <rdf:subject rdf:resource=
\"https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/%NUMBER%\"/>
\n" +
            "  </rdf:Description>\n" +
            "</rdf:RDF>\n";

}
>>>>>>>> END TDBOutOfMemoryTest.java
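
For reference, a minimal sketch of the same main loop with both fixes from this thread applied — direct file mode selected before the dataset is created, and each graph parsed into a plain memory model before being added to the TDB-backed model. It reuses TDB_DIR, NOGRAPHS and getGraph() from the program above; the extra imports (ModelFactory, SystemTDB, FileMode) are assumed from Jena 2.6.x / TDB 0.8.x:

[[
            SystemTDB.setFileMode(FileMode.direct);          // before creating the dataset
            Dataset dataset = TDBFactory.createDataset(TDB_DIR);
            for (int i = 0; i < NOGRAPHS; ) {
                  InputStream instream = getGraph(i);
                  Model m = ModelFactory.createDefaultModel(); // parse outside TDB first...
                  m.read(instream, null);
                  dataset.getLock().enterCriticalSection(Lock.WRITE);
                  try {
                        Model model = dataset.getNamedModel(
                              "https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/" + i);
                        model.add(m);                          // ...then add in one bulk operation
                  } finally { dataset.getLock().leaveCriticalSection(); }
                  if (++i % 100 == 0) System.out.println(i/100 + "00 at: " + new java.util.Date());
                  instream.close();
            }
            TDB.sync(dataset);
            dataset.close();
]]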



Andy Seaborne <an...@epimorphics.com> wrote on 03/09/2011 09:46:58
AM:

> Re: OutOfMemoryError while loading datagraphs
>
> Andy Seaborne
>
> to:
>
> jena-users
>
> 03/09/2011 09:48 AM
>
> Please respond to jena-users
>
>
>
> On 08/03/11 18:45, Frank Budinsky wrote:
> >
> > Hi Andy,
> >
> > I never actually tried to run it past 100,000 graphs, so I don't know
what
> > will happen.
> >
> > I just finished running it with -Xmx1200m, and as you thought it
should, it
> > did run (with 100,000 graphs) to completion. It took just under 4 hours
to
> > run on my Thinkpad T61 laptop running Windows XP and with 3G of RAM. I
> > noticed that it was loading about 700 graphs/minute at the start,
around
> > 550 graphs/minute through most of the run (up till about 90,000 graphs),
but
> > then got significantly slower towards the end (i.e., only about 55
> > graphs/minute for the last 1000). You mentioned that your JVM was 1.6
G. Is
> > that something you can config? I noticed that the total memory of my
java
> > process peaked at about 1.2 G.
>
> No - it might just be an artifact of using a 64-bit JVM (OpenJDK).
> The heap size (max memory) was 1044M in main() I think.
>
> > I'll try running the test again with 500,000 graphs to see if it gets
all
> > the way through. Since it runs so slow, I'll let it run overnight and
see
> > what happens.
>
> It was faster on my machine, but I was using a desktop, and disk speeds on
> portables are notoriously slow for database use.
>
> > How long does it take to load the 100,000 graphs for you?  I assume
this
> > runs much faster on your hardware.
>
> 573.87 seconds and one cup of coffee running inside Eclipse.
>
> My desktop is a quad core i5 CPU 760  @ 2.80GHz, ext4 filing system in
> Ubuntu 10.10.  SATA IDE disk; 7200 RPM.
>
> > I'm wondering if there's a minimum
> > hardware requirement to run TDB, especially if it's being used to load
tens
> > or hundreds of millions of triples? It would be nice to set
expectations
> > for what kind of hardware is needed for this.
>
> Yes, but it's very hard to set expectations.  If it's doing subject ->
> properties lookup, it scales much better than GROUP BY and BI queries.
>
> SSDs are the next new thing.
>
> The bulk loader is faster; and you can load on one machine and run
> queries on another.
>
>    Andy
>
> >
> > Thanks,
> > Frank.
> >
> > Andy Seaborne<an...@epimorphics.com>  wrote on 03/08/2011
11:53:06
> > AM:
> >
> >> Re: OutOfMemoryError while loading datagraphs
> >>
> >> Andy Seaborne
> >>
> >> to:
> >>
> >> jena-users
> >>
> >> 03/08/2011 11:53 AM
> >>
> >> Please respond to jena-users
> >>
> >>
> >>
> >> On 08/03/11 15:05, Frank Budinsky wrote:
> >>>
> >>>>> I tried increasing the amount of memory, but that just increased
the
> >>> number
> >>>>> of calls that succeed (e.g., 10000 vs 2000) before getting the
> >>> exception.
> >>>>
> >>>> What sizes of heap are you using?
> >>>
> >>> I've been experimenting with various heap sizes, noticing that the
> > bigger I
> >>> make it the longer it runs
> >>> before crashing. When using the default (small) heap size
(-Xmx64m)
> > the
> >>> test program fails at about 5800 graphs. If I bump it all the way up
to
> >>> -Xmx1200m, like you did, I suspect I will also be able to run the
test
> > to
> >>> completion (100,000 graphs), but it takes very long to run (more than
2
> >>> hours on my machine). I'm guessing this is also running much faster
for
> >>> you?
> >>>
> >>> Extrapolating from what you're saying it looks like I would need a
heap
> > of
> >>> 6G, or so, to hit my original target of 500,000 graphs (about 50M
> > triples
> >>> total). Does that sound right? That is, needing to run with such a
huge
> >>> heap?
> >>
> >> No - the caches are bounded.  Once they reach steady state, there is
no
> >> further growth and no fixed scale limit.  I've run a trial with
500,000
> >> with -Xmx1200M and it works for me.  JVM is 1.6G.  The caches were
still
> >> filling up in your test case.
> >>
> >> The caches are LRU by slots, which is a bit crude for the node cache
as
> >> nodes vary in size.  Index files have fixed used units (blocks - they
> >> are 8Kbytes).
> >>
> >> The default settings are supposed to work for heap of 1.2G -- it's
what
> >> the scripts set the heap to.
> >>
> >> The caches were still filling up in your test case.
> >>
> >>     Andy
> >>
> >>>
> >>> Thanks,
> >>> Frank.
> >>>
> >>> Andy Seaborne<an...@epimorphics.com>   wrote on 03/08/2011
> > 09:23:50
> >>> AM:
> >>>
> >>>> Re: OutOfMemoryError while loading datagraphs
> >>>>
> >>>> Andy Seaborne
> >>>>
> >>>> to:
> >>>>
> >>>> jena-users
> >>>>
> >>>> 03/08/2011 09:24 AM
> >>>>
> >>>> Please respond to jena-users
> >>>>
> >>>> (Frank sent me the detached file)
> >>>>
> >>>> Frank,
> >>>>
> >>>> I'm on a 64 bit machine, but I'm setting direct mode and limiting
the
> >>>> Java heap size with -Xmx
> >>>>
> >>>> With a heap of 1200M, java reports 1066M max memory, and the test
> > runs.
> >>>> With a heap of 500M, java reports 444M max memory, and the test
stops
> > at
> >>>> 11800.
> >>>>
> >>>> Things will be a little different for 32 bit but should be
> > approximately
> >>>> the same.  TDB is doing the same things.
> >>>>
> >>>> Tweaking the block cache sizes (sorry, magic needed) down to 5000
> > (read,
> >>>> default 10000) and 1000 (write, default 2000), it runs at 500M, but
> >>> slower.
> >>>>
> >>>> There are quite a few files for named graphs so small changes in
cache
> >>>> size get multiplied (x12, I think).
> >>>>
> >>>>> I tried increasing the amount of memory, but that just increased
the
> >>> number
> >>>>> of calls that succeed (e.g., 10000 vs 2000) before getting the
> >>> exception.
> >>>>
> >>>> What sizes of heap are you using?
> >>>>
> >>>>      Andy
> >>>>
> >>>> On 07/03/11 18:34, Frank Budinsky wrote:
> >>>>> Hi Andy,
> >>>>>
> >>>>> I created a simple standalone test program that roughly simulates
> > what
> >>>>> my application is doing and it also crashes with the same
> >>>>> OutOfMemoryError exception. I've attached it here. Would it be
> > possible
> >>>>> for you to give it a try?
> >>>>>
> >>>>> /(See attached file: TDBOutOfMemoryTest.java)/
> >>>>>
> >>>>> Just change TDB_DIR to some new empty database location and run. It
> >>>>> get's the OutOfMemoryError at around 5800 graphs when I run it with
> >>>>> default VM params.
> >>>>>
> >>>>> Thanks,
> >>>>> Frank.
> >>>>>
> >>>>>
> >>>>> Andy Seaborne<an...@epimorphics.com>   wrote on 03/02/2011
> >>>>> 09:38:51 AM:
> >>>>>
> >>>>>    >   [image removed]
> >>>>>    >
> >>>>>    >   Re: OutOfMemoryError while loading datagraphs
> >>>>>    >
> >>>>>    >   Andy Seaborne
> >>>>>    >
> >>>>>    >   to:
> >>>>>    >
> >>>>>    >   jena-users
> >>>>>    >
> >>>>>    >   03/02/2011 09:41 AM
> >>>>>    >
> >>>>>    >   Please respond to jena-users
> >>>>>    >
> >>>>>    >   Hi Frank,
> >>>>>    >
> >>>>>    >   On 28/02/11 14:48, Frank Budinsky wrote:
> >>>>>    >   >
> >>>>>    >   >   Hi Andy,
> >>>>>    >   >
> >>>>>    >   >   I did some further analysis of my OutOfMemeoryError
problem,
> > and
> >>>>> this is
> >>>>>    >   >   what I've discovered. The problem seems to be that there
is
> > one
> >>>>> instance of
> >>>>>    >   >   class NodeTupleTableConcrete that contains an ever
growing
> > set of
> >>>>> tuples
> >>>>>    >   >   which eventually uses up all the available heap space
and
> > then
> >>> crashes.
> >>>>>    >   >
> >>>>>    >   >   To be more specific, this field in class TupleTable:
> >>>>>    >   >
> >>>>>    >   >   private final TupleIndex[] indexes ;
> >>>>>    >   >
> >>>>>    >   >   seems to contain 6 continually growing TupleIndexRecord
> > instances
> >>>>>    >   >   (BPlusTrees). From my measurements, this seems to eat up
> >>>>> approximately 1G
> >>>>>    >   >   of heap for every 1M triples in the Dataset (i.e., about
1K
> > per
> >>>>> datagraph).
> >>>>>    >   >   So, to load my 100K datagraphs (~10M total triples) it
would
> > seem
> >>>>> to need
> >>>>>    >   >   10G of heap space.
> >>>>>    >
> >>>>>    >   There are 6 indexes for named graphs (see the files GSPO
etc).
> > TDB
> >>> uses
> >>>>>    >   total indexing which puts a lot of work at load time but
means
> > any
> >>>>>    >   lookup needed is always done with an index scan. The code
can
> > run
> >>> with
> >>>>>    >   less indexes - the minimum is one - but that is no exposed
in
> > the
> >>>>>    >   configuration.
> >>>>>    >
> >>>>>    >   Each index holds quads (4 NodeIds, a NodeId is 64 bits on
disk).
> > As
> >>> the
> >>>>>    >   index grows the data goes to disk. There is a finite LRU
cache
> > in
> >>> front
> >>>>>    >   of each index.
> >>>>>    >
> >>>>>    >   Does your dataset have a location? If has no location, it's
all
> >>>>>    >   in-memory with a RAM-disk like structure. This is for
> > small-scale
> >>>>>    >   testing only - it really does read and write blocks out of
the
> > RAM
> >>> disk
> >>>>>    >   by copy to give strict disk-like semantics.
> >>>>>    >
> >>>>>    >   There is also a NodeTable mapping between NodeId and Node
> > (Jena's
> >>>>>    >   graph-level RDF Term class). This has a cache in front of
it .
> >>>>>    >   readPropertiesFile
> >>>>>    >   The long-ish literals maybe the problem. The node table
cache is
> >>>>>    >   fixed-number, not bounded by size.
> >>>>>    >
> >>>>>    >   The sizeof the caches are controlled by:
> >>>>>    >
> >>>>>    >   SystemTDB.Node2NodeIdCacheSize
> >>>>>    >   SystemTDB.NodeId2NodeCacheSize
> >>>>>    >
> >>>>>    >   These are not easy to control but either (1) get the source
code
> > and
> >>>>>    >   alter the default values (2) see the code in SystemTDB that
uses
> > a
> >>>>>    >   properties file.
> >>>>>    >
> >>>>>    >   If you can end me a copy of the data, I can try loading it
here.
> >>>>>    >
> >>>>>    >   >   Does this make sense? How is it supposed to work?
Shouldn't
> > the
> >>> triples
> >>>>>    >   >   from previously loaded named graphs be eligable for GC
when
> > I'm
> >>>>> loading the
> >>>>>    >   >   next named graph? Could it be that I'm holding
ontosomething
> >>> that's
> >>>>>    >   >   preventing GC in the TupleTable?
> >>>>>    >   >
> >>>>>    >   >   Also, after looking more carefully at the resources
being
> > indexed,
> >>> I
> >>>>>    >   >   noticed that many of them do have relatively large
literals
> > (100s
> >>> of
> >>>>>    >   >   characters). I also noticed that when using Fuseki to
load
> > the
> >>>>> resources I
> >>>>>    >   >   get lots of warning messages like this, on the console:
> >>>>>    >   >
> >>>>>    >   >   Lexical form 'We are currently doing
> >>>>>    >   >
this:<br></br><br></br>workspaceConnection.replaceComponents
> >>>>>    >   >   (replaceComponents, replaceSource, falses, false,
> >>>>>    >   >   monitor);<br></br><br></br>the new way of doing it would
be
> >>> something
> >>>>>    >   >
> >>>>>
> >>>
> >
like:<br></br><br></br><br></br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
> >>>>>    >   >   ArrayList&lt;IComponentOp&gt; replaceOps = new
> >>>>>    >   >   ArrayList&lt;IComponentOp&gt;();<br></
> >>>>>    >   br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
> >>>>>    >   >   for (Iterator iComponents = components.iterator();
> >>>>> iComponents.hasNext();)
> >>>>>    >   >
> >>>>>
> >>>
> >
{<br></br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;

> >
> >>>
> >>>>>    >   >   IComponentHandle componentHandle = (IComponentHandle)
> >>> iComponents.next
> >>>>>    >   >   ();<br></
> >>>>>    >
> >>>>>
> >>>>
> >>>
> >>
> >
>
br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;

> >
> >>>
> >>>>>    >   >   replaceOps.add
(promotionTargetConnection.componentOpFactory
> >>>>>    >   >   ().replaceComponent
> >>>>>    >   >   (componentHandle,<br></
> >>>>>    >
> >>>>>
> >>>>
> >>>
> >>
> >
>
br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;

> >
> >>>
> >>>>>    >   >   buildWorkspaceConnection,
> >>>>>    >   >
false));<br></br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }
> >>>>>    >   <br></br><br></br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
> >>>>>    >   >   promotionTargetConnection.applyComponentOperations
> > (replaceOps,
> >>>>> monitor);'
> >>>>>    >   >   not valid for datatype
> >>>>>    >   >   http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral
> >>>>>    >   >
> >>>>>    >   >   Could this be part of the problem?
> >>>>>    >
> >>>>>    >   No - it's a different issue. This is something coming from
the
> >>> parser.
> >>>>>    >
> >>>>>    >   RDF XMLLiterals have special rules - they must follow
> >>>>>    >   exclusive canonical XML, which means, amongst a lot of other
> > thigs,
> >>> they
> >>>>>    >   have to be a single XML node. The rules for exclusive
Canonical
> > XML
> >>> are
> >>>>>    >   really quite strict (e.g. attributes in alphabetical order).
> >>>>>    >
> >>>>>    >   http://www.w3.org/TR/rdf-concepts/#section-XMLLiteral
> >>>>>    >
> >>>>>    >   If you want to store XML or HTML fragments, you can't use
RDF
> >>>>>    >   XMLLiterals very easily - you have to mangle them to conform
to
> > the
> >>>>>    >   rules. I suggest either as strings or invent your own
datatype.
> >>>>>    >
> >>>>>    >   You can run the parser on it's own using
> >>>>>    >   "riotcmd.riot --validate FILE ..."
> >>>>>    >
> >>>>>    >
> >>>>>    >   Andy
> >>>>>    >
> >>>>>    >   >
> >>>>>    >   >   Thanks,
> >>>>>    >   >   Frank.
> >>>>>

Re: OutOfMemoryError while loading datagraphs

Posted by Andy Seaborne <an...@epimorphics.com>.

On 08/03/11 18:45, Frank Budinsky wrote:
>
> Hi Andy,
>
> I never actually tried to run it past 100,000 graphs, so I don't know what
> will happen.
>
> I just finished running it with -xmx1200m, and as you thought it should, it
> did run (with 100,000 graphs) to completion. It took just under 4 hours to
> run on my Thinkpad T61 laptop running Windows XP and with 3G of RAM. I
> noticed that it was loading about 700 graphs/minute at the start, around
> 550 graphs/minute through most of the run (up till about 90,000 graphs), but
> then got significantly slower towards the end (i.e., only about 55
> graphs/minute for the last 1000). You mentioned that your JVM was 1.6 G. Is
> that something you can configure? I noticed that the total memory of my java
> process peaked at about 1.2 G.

No - it might just be an artifact of using a 64 bit JVM (OpenJDK).
The heap size (max memory) was 1044M in main(), I think.
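
A minimal way to read the figure being quoted here (standard Java, nothing
TDB-specific; the printed value is what "max memory" refers to above):

    long maxM = Runtime.getRuntime().maxMemory() / (1024 * 1024);
    System.out.println("Max memory: " + maxM + "M");  // e.g. ~1066M for -Xmx1200m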

> I'll try running the test again with 500,000 graphs to see if it gets all
> the way through. Since it runs so slowly, I'll let it run overnight and see
> what happens.

It was faster on my machine, but I was using a desktop, and disk speeds on
portables are notoriously slow for database use.

> How long does it take to load the 100,000 graphs for you?  I assume this
> runs much faster on your hardware.

573.87 seconds and one cup of coffee running inside Eclipse.

My desktop is a quad core i5 CPU 760 @ 2.80GHz, with an ext4 file system
on Ubuntu 10.10 and a 7200 RPM SATA disk.

> I'm wondering if there's a minimum
> hardware requirement to run TDB, especially if it's being used to load tens
> or hundreds of millions of triples. It would be nice to set expectations
> for what kind of hardware is needed for this.

Yes, but it's very hard to set expectations.  If it's doing subject ->
properties lookups, it scales much better than GROUP BY and BI queries.

SSDs are the next new thing.

The bulk loader is faster, and you can load on one machine and run
queries on another.
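
As a sketch of that workflow, assuming the tdbloader script shipped with
the TDB distribution (exact flags and accepted formats vary by version):

    tdbloader --loc=/data/DB data.nt   # bulk load on the loading machine
    # then copy the /data/DB directory to the query machine and point
    # the application (or Fuseki) at it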

	Andy

>
> Thanks,
> Frank.

Re: OutOfMemoryError while loading datagraphs

Posted by Frank Budinsky <fr...@ca.ibm.com>.
Hi Andy,

I never actually tried to run it past 100,000 graphs, so I don't know what
will happen.

I just finished running it with -xmx1200m, and as you thought it should, it
did run (with 100,000 graphs) to completion. It took just under 4 hours to
run on my Thinkpad T61 laptop running Windows XP and with 3G of RAM. I
noticed that it was loading about 700 graphs/minute at the start, around
550 graphs/minute through most of the run (up till about 90,000 graphs), but
then got significantly slower towards the end (i.e., only about 55
graphs/minute for the last 1000). You mentioned that your JVM was 1.6 G. Is
that something you can configure? I noticed that the total memory of my java
process peaked at about 1.2 G.

I'll try running the test again with 500,000 graphs to see if it gets all
the way through. Since it runs so slowly, I'll let it run overnight and see
what happens.

How long does it take to load the 100,000 graphs for you?  I assume this
runs much faster on your hardware. I'm wondering if there's a minimum
hardware requirement to run TDB, especially if it's being used to load tens
or hundreds of millions of triples. It would be nice to set expectations
for what kind of hardware is needed for this.

Thanks,
Frank.


Re: OutOfMemoryError while loading datagraphs

Posted by Andy Seaborne <an...@epimorphics.com>.

On 08/03/11 15:05, Frank Budinsky wrote:
>
>>> I tried increasing the amount of memory, but that just increased the number
>>> of calls that succeed (e.g., 10000 vs 2000) before getting the exception.
>>
>> What sizes of heap are you using?
>
> I've been experimenting with various heap sizes, noticing that the bigger I
> make it, the longer it runs before crashing. When using the default (small)
> heap size (-xmx64m) the
> test program fails at about 5800 graphs. If I bump it all the way up to
> -xmx1200m, like you did, I suspect I will also be able to run the test to
> completion (100,000 graphs), but it takes a very long time to run (more than 2
> hours on my machine). I'm guessing this is also running much faster for
> you?
>
> Extrapolating from what you're saying it looks like I would need a heap of
> 6G, or so, to hit my original target of 500,000 graphs (about 50M triples
> total). Does that sound right? That is, needing to run with such a huge
> heap?

No - the caches are bounded.  Once they reach steady state, there is no 
further growth and no fixed scale limit.  I've run a trial with 500,000 
with -Xmx1200M and it works for me.  JVM is 1.6G.

The caches are LRU by slots, which is a bit crude for the node cache as 
nodes vary in size.  Index files have fixed-size units (blocks - they 
are 8 Kbytes).
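
A sketch of what "LRU by slots" means in practice - illustrative only,
not TDB's actual cache class. The cache bounds the number of entries,
not the bytes they occupy, so a few thousand large literals can take far
more heap than the same number of small nodes:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Illustrative slot-bounded LRU cache (bounds entry count, not byte size).
    class SlotLruCache<K, V> extends LinkedHashMap<K, V> {
        private final int maxSlots;
        SlotLruCache(int maxSlots) {
            super(16, 0.75f, true);       // access-order iteration = LRU
            this.maxSlots = maxSlots;
        }
        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > maxSlots;     // evict the oldest entry once full,
        }                                 // regardless of how big entries are
    }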

The default settings are supposed to work for a heap of 1.2G -- it's what 
the scripts set the heap to.

The caches were still filling up in your test case.

	Andy


Re: OutOfMemoryError while loading datagraphs

Posted by Frank Budinsky <fr...@ca.ibm.com>.
> > I tried increasing the amount of memory, but that just increased the number
> > of calls that succeed (e.g., 10000 vs 2000) before getting the exception.
>
> What sizes of heap are you using?

I've been experimenting with various heap sizes, noticing that the bigger I
make it, the longer it runs before crashing. When using the default (small)
heap size (-xmx64m) the
test program fails at about 5800 graphs. If I bump it all the way up to
-xmx1200m, like you did, I suspect I will also be able to run the test to
completion (100,000 graphs), but it takes a very long time to run (more than 2
hours on my machine). I'm guessing this is also running much faster for
you?
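
For concreteness, the runs being compared are of the form (classpath
omitted; the graph counts are the figures reported in this thread):

    java -Xmx64m   TDBOutOfMemoryTest   # OutOfMemoryError at ~5800 graphs
    java -Xmx500m  TDBOutOfMemoryTest   # stopped at ~11800 in Andy's run
    java -Xmx1200m TDBOutOfMemoryTest   # runs to completion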

Extrapolating from what you're saying it looks like I would need a heap of
6G, or so, to hit my original target of 500,000 graphs (about 50M triples
total). Does that sound right? That is, needing to run with such a huge
heap?

Thanks,
Frank.


Re: OutOfMemoryError while loading datagraphs

Posted by Andy Seaborne <an...@epimorphics.com>.
(Frank sent me the detached file)

Frank,

I'm on a 64 bit machine, but I'm setting direct mode and limiting the 
Java heap size with -Xmx.

With a heap of 1200M, java reports 1066M max memory, and the test runs.
With a heap of 500M, java reports 444M max memory, and the test stops at 
11800.

Things will be a little different for 32 bit but should be approximately 
the same.  TDB is doing the same things.

Tweaking the block cache sizes (sorry, magic needed) down to 5000 (read, 
default 10000) and 1000 (write, default 2000), it runs at 500M, but slower.

There are quite a few files for named graphs so small changes in cache 
size get multiplied (x12, I think).
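
Rough worst-case arithmetic for those numbers, assuming 12 files each
with a full read and write cache of 8 Kbyte blocks (in practice not
every cache fills):

    default:  (10000 + 2000) blocks x 8 KB x 12 files  =  ~1.1 GB
    tweaked:  (5000 + 1000)  blocks x 8 KB x 12 files  =  ~0.55 GB

which lines up with the defaults being sized for a heap of about 1.2G.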

> I tried increasing the amount of memory, but that just increased the number
> of calls that succeed (e.g., 10000 vs 2000) before getting the exception.

What sizes of heap are you using?

	Andy

On 07/03/11 18:34, Frank Budinsky wrote:
> Hi Andy,
>
> I created a simple standalone test program that roughly simulates what
> my application is doing and it also crashes with the same
> OutOfMemoryError exception. I've attached it here. Would it be possible
> for you to give it a try?
>
> /(See attached file: TDBOutOfMemoryTest.java)/
>
> Just change TDB_DIR to some new empty database location and run. It
> get's the OutOfMemoryError at around 5800 graphs when I run it with
> default VM params.
>
> Thanks,
> Frank.
>
>
> Andy Seaborne <an...@epimorphics.com> wrote on 03/02/2011
> 09:38:51 AM:
>
>  > [image removed]
>  >
>  > Re: OutOfMemoryError while loading datagraphs
>  >
>  > Andy Seaborne
>  >
>  > to:
>  >
>  > jena-users
>  >
>  > 03/02/2011 09:41 AM
>  >
>  > Please respond to jena-users
>  >
>  > Hi Frank,
>  >
>  > On 28/02/11 14:48, Frank Budinsky wrote:
>  > >
>  > > Hi Andy,
>  > >
>  > > I did some further analysis of my OutOfMemeoryError problem, and
> this is
>  > > what I've discovered. The problem seems to be that there is one
> instance of
>  > > class NodeTupleTableConcrete that contains an ever growing set of
> tuples
>  > > which eventually uses up all the available heap space and then crashes.
>  > >
>  > > To be more specific, this field in class TupleTable:
>  > >
>  > > private final TupleIndex[] indexes ;
>  > >
>  > > seems to contain 6 continually growing TupleIndexRecord instances
>  > > (BPlusTrees). From my measurements, this seems to eat up
> approximately 1G
>  > > of heap for every 1M triples in the Dataset (i.e., about 1K per
> datagraph).
>  > > So, to load my 100K datagraphs (~10M total triples) it would seem
> to need
>  > > 10G of heap space.
>  >
>  > There are 6 indexes for named graphs (see the files GSPO etc). TDB uses
>  > total indexing which puts a lot of work at load time but means any
>  > lookup needed is always done with an index scan. The code can run with
>  > less indexes - the minimum is one - but that is no exposed in the
>  > configuration.
>  >
>  > Each index holds quads (4 NodeIds, a NodeId is 64 bits on disk). As the
>  > index grows the data goes to disk. There is a finite LRU cache in front
>  > of each index.
>  >
>  > Does your dataset have a location? If has no location, it's all
>  > in-memory with a RAM-disk like structure. This is for small-scale
>  > testing only - it really does read and write blocks out of the RAM disk
>  > by copy to give strict disk-like semantics.
>  >
>  > There is also a NodeTable mapping between NodeId and Node (Jena's
>  > graph-level RDF Term class). This has a cache in front of it .
>  > readPropertiesFile
>  > The long-ish literals maybe the problem. The node table cache is
>  > fixed-number, not bounded by size.
>  >
>  > The sizeof the caches are controlled by:
>  >
>  > SystemTDB.Node2NodeIdCacheSize
>  > SystemTDB.NodeId2NodeCacheSize
>  >
>  > These are not easy to control but either (1) get the source code and
>  > alter the default values (2) see the code in SystemTDB that uses a
>  > properties file.
>  >
>  > If you can end me a copy of the data, I can try loading it here.
>  >
>  > > Does this make sense? How is it supposed to work? Shouldn't the triples
>  > > from previously loaded named graphs be eligable for GC when I'm
> loading the
>  > > next named graph? Could it be that I'm holding onto something that's
>  > > preventing GC in the TupleTable?
>  > >
>  > > Also, after looking more carefully at the resources being indexed, I
>  > > noticed that many of them do have relatively large literals (100s of
>  > > characters). I also noticed that when using Fuseki to load the
> resources I
>  > > get lots of warning messages like this, on the console:
>  > >
>  > > Lexical form 'We are currently doing
>  > > this:<br></br><br></br>workspaceConnection.replaceComponents
>  > > (replaceComponents, replaceSource, falses, false,
>  > > monitor);<br></br><br></br>the new way of doing it would be something
>  > >
> like:<br></br><br></br><br></br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
>  > > ArrayList&lt;IComponentOp&gt; replaceOps = new
>  > > ArrayList&lt;IComponentOp&gt;();<br></
>  > br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
>  > > for (Iterator iComponents = components.iterator();
> iComponents.hasNext();)
>  > >
> {<br></br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
>  > > IComponentHandle componentHandle = (IComponentHandle) iComponents.next
>  > > ();<br></
>  >
> br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
>  > > replaceOps.add(promotionTargetConnection.componentOpFactory
>  > > ().replaceComponent
>  > > (componentHandle,<br></
>  >
> br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
>  > > buildWorkspaceConnection,
>  > > false));<br></br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }
>  > <br></br><br></br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
>  > > promotionTargetConnection.applyComponentOperations(replaceOps,
> monitor);'
>  > > not valid for datatype
>  > > http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral
>  > >
>  > > Could this be part of the problem?
>  >
>  > No - it's a different issue. This is something coming from the parser.
>  >
>  > RDF XMLLiterals have special rules - they must follow
>  > exclusive canonical XML, which means, amongst a lot of other things, they
>  > have to be a single XML node. The rules for exclusive Canonical XML are
>  > really quite strict (e.g. attributes in alphabetical order).
>  >
>  > http://www.w3.org/TR/rdf-concepts/#section-XMLLiteral
>  >
>  > If you want to store XML or HTML fragments, you can't use RDF
>  > XMLLiterals very easily - you have to mangle them to conform to the
>  > rules. I suggest either storing them as strings or inventing your own datatype.
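
For example, keeping the markup as a plain string literal avoids the
canonicalization rules entirely; a small sketch (the resource and property
names here are made up):

    import com.hp.hpl.jena.rdf.model.*;

    Model m = ModelFactory.createDefaultModel();
    Resource r = m.createResource("http://example.org/workitem/1");
    // A plain (untyped) literal stores the HTML verbatim - no XML rules apply.
    r.addProperty(m.createProperty("http://example.org/description"),
                  "We are currently doing this:<br></br>...");
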
>  >
>  > You can run the parser on its own using
>  > "riotcmd.riot --validate FILE ..."
>  >
>  >
>  > Andy
>  >
>  > >
>  > > Thanks,
>  > > Frank.
>

Re: OutOfMemoryError while loading datagraphs

Posted by Frank Budinsky <fr...@ca.ibm.com>.
Hi Andy,

I created a simple standalone test program that roughly simulates what my
application is doing and it also crashes with the same OutOfMemoryError
exception. I've attached it here. Would it be possible for you to give it a
try?

(See attached file: TDBOutOfMemoryTest.java)

Just change TDB_DIR to some new empty database location and run. It gets
the OutOfMemoryError at around 5800 graphs when I run it with default VM
params.
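
The attached file itself is not reproduced in the archive; the general shape
of such a test is roughly the following (a reconstruction, not the attached
file - the class name, URIs, and counts are all illustrative):

    import com.hp.hpl.jena.query.Dataset;
    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.Resource;
    import com.hp.hpl.jena.tdb.TDBFactory;

    public class TDBOutOfMemorySketch {
        static final String TDB_DIR = "/tmp/tdb-test"; // change to an empty directory

        public static void main(String[] args) {
            Dataset dataset = TDBFactory.createDataset(TDB_DIR);
            // One small named graph per iteration, mimicking many datagraphs.
            for (int i = 0; i < 100000; i++) {
                Model m = dataset.getNamedModel("http://example.org/graph/" + i);
                Resource r = m.createResource("http://example.org/resource/" + i);
                for (int j = 0; j < 100; j++) {
                    r.addProperty(m.createProperty("http://example.org/p" + j),
                                  "a moderately long literal value " + i + "/" + j);
                }
            }
            dataset.close();
        }
    }
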

Thanks,
Frank.

