Posted to dev@stanbol.apache.org by David Riccitelli <da...@insideout.io> on 2012/03/16 10:30:57 UTC

clerezza.rdf.jena.tdb.storage filling up with ontonet files

Dears,

As I ran into disk issues, I found that this folder:
 sling/felix/bundleXXX/data/tdb-data/mgraph

where XXX is the number of the bundle:
 Clerezza - SCB Jena TDB Storage Provider
org.apache.clerezza.rdf.jena.tdb.storage

took almost 70 GB of disk space (then the disk space was
exhausted).

These are some of the files I found inside:
193M ./ontonet%3A%3Ainputstream%3Aontology889
193M ./ontonet%3A%3Ainputstream%3Aontology1041
193M ./ontonet%3A%3Ainputstream%3Aontology395
193M ./ontonet%3A%3Ainputstream%3Aontology363
193M ./ontonet%3A%3Ainputstream%3Aontology661
193M ./ontonet%3A%3Ainputstream%3Aontology786
193M ./ontonet%3A%3Ainputstream%3Aontology608
193M ./ontonet%3A%3Ainputstream%3Aontology213
193M ./ontonet%3A%3Ainputstream%3Aontology188
193M ./ontonet%3A%3Ainputstream%3Aontology602


Any clues?

Thanks,
David Riccitelli

********************************************************************************
InsideOut10 s.r.l.
P.IVA: IT-11381771002
Fax: +39 0110708239
---
LinkedIn: http://it.linkedin.com/in/riccitelli
Twitter: ziodave
---
Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
********************************************************************************

Re: clerezza.rdf.jena.tdb.storage filling up with ontonet files

Posted by Reto Bachmann-Gmür <re...@apache.org>.
On Thu, Apr 5, 2012 at 1:16 PM, Rupert Westenthaler <
rupert.westenthaler@gmail.com> wrote:

>
>
> Sent from my iPad
>
> On 05.04.2012 at 12:59, Reto Bachmann-Gmür <re...@apache.org> wrote:
>
> > Hi Rupert,
> >
> > I like your proposal but would suggest:
> > - SingleDatasetTdbTcProvider should not need a directory configured
>
> No problem with that. One can use the configuration policy OPTIONAL; then
> OSGi will create a default instance with the default directory, while it
> would still be possible for users to create additional instances with
> manually configured directories.
>
> WDYT
>
Not sure why one would need to create multiple SingleDatasetTdbTcProviders.
With the current limitation of TcManager I don't think there's much use in
having multiple instances, but of course I wouldn't mind having this
possibility.

Cheers,
Reto

Re: clerezza.rdf.jena.tdb.storage filling up with ontonet files

Posted by Rupert Westenthaler <ru...@gmail.com>.

Sent from my iPad

On 05.04.2012 at 12:59, Reto Bachmann-Gmür <re...@apache.org> wrote:

> Hi Rupert,
> 
> I like your proposal but would suggest:
> - SingleDatasetTdbTcProvider should not need a directory configured

No problem with that. One can use the configuration policy OPTIONAL; then OSGi will create a default instance with the default directory, while it would still be possible for users to create additional instances with manually configured directories.

WDYT
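[Editor's note] The fallback behaviour described above can be sketched without OSGi machinery. This is a minimal illustration only: the method name `resolveDataDirectory`, the property key `tdb-dir`, and the default path are hypothetical stand-ins, not actual Clerezza configuration keys; a real provider would receive the configuration from the OSGi Configuration Admin service.

```java
import java.util.Map;

// Sketch: a provider instance uses a manually configured directory when
// one is present, otherwise the default location (the "default instance"
// that the OPTIONAL configuration policy would produce).
public class TdbDirectoryConfig {

    static final String DEFAULT_DIR = "tdb-data/single-dataset";

    /** Returns the configured directory, or the default when none is set. */
    public static String resolveDataDirectory(Map<String, Object> config) {
        Object dir = config.get("tdb-dir");
        if (dir instanceof String && !((String) dir).isEmpty()) {
            return (String) dir;   // manually configured instance
        }
        return DEFAULT_DIR;        // default instance
    }

    public static void main(String[] args) {
        System.out.println(resolveDataDirectory(Map.of()));
        System.out.println(resolveDataDirectory(Map.of("tdb-dir", "/data/tdb-a")));
    }
}
```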

> - SingleDatasetTdbTcProvider should have the higher weight and thus be the
> one used by default

That's fine with me.

> I think there might be use cases where you want a graph to be isolated from
> the rest, but I think the default behaviour should be the more performant
> and less memory-expensive SingleDatasetTdbTcProvider.
> 
> We could add a tool to clerezza that allows creating an MGraph in a
> tc-provider other than the one with the highest weight.
> 

+1

best
Rupert

> Cheers,
> Reto
> 
> On Fri, Mar 16, 2012 at 2:10 PM, Rupert Westenthaler <
> rupert.westenthaler@gmail.com> wrote:
> 
>> Hi David, stanbol & clerezza community
>> 
>> Short summary of the situation:
>> 
>> The Ontonet component generates a lot of MGraphs using the Jena TDB
>> provider. This causes the disk consumption and the number of open files to
>> explode. See the quoted emails for details.
>> 
>> 
>> @Stanbol we are already discussing how to avoid the creation of so many
>> graphs.
>> 
>> 
>> @Clerezza the observed behavior of the TDB provider is also very dangerous
>> (at least for typical use cases in Apache Stanbol).
>> 
>> Even though it targets a different issue, CLEREZZA-467 [1] may provide a
>> possible solution for that, as it suggests using named graphs instead of
>> isolated TDB instances for creating MGraphs.
>> 
>> To be honest this would be the optimal solution for our usage of Clerezza
>> in Stanbol. However I assume that for a semantic CMS it is safer to use
>> different TDB datasets.
>> 
>> Because of that I  would like to make the following proposal that
>> hopefully covers both the needs of Apache Stanbol and Apache Clerezza.
>> 
>> 1. AbstractTdbTcProvider: providing most of the functionality needed to
>> store Clerezza MGraphs in Jena TDB
>> 
>> 2. TdbTcProvider: The same as now, but extending the abstract one. It
>> follows the currently used methodology of mapping Clerezza graphs to
>> separate TDB datasets.
>> 
>> 3. SingleDatasetTdbTcProvider: TDB provider variant that stores all
>> MGraphs in a single TDB dataset. This provider should also support
>> "configurationFactory=true" (multiple instances). Each instance would use a
>> different TDB dataset to store its MGraphs.
>> 
>> By default the SingleDatasetTdbTcProvider would be inactive, because it
>> requires a configuration of the directory for the TDB dataset as well as a
>> name (that can be used in filters). This ensures full backward
>> compatibility.
>> 
>> In environments, such as Stanbol, where you want to store multiple graphs
>> in the same TDB dataset, you would need to provide a configuration for the
>> SingleDatasetTdbTcProvider. Here you have two possible usage scenarios:
>> 
>> * if you just need a single TDB dataset that stores all MGraphs, then you
>> can assign a high enough service.ranking to the SingleDatasetTdbTcProvider
>> and normally use the TcManager to create your graphs.
>> * if you want to use single TDB datasets or a mix of TdbTcProvider and
>> SingleDatasetTdbTcProvider instances, you will need to use appropriate filters.
>> 
>> 
>> WDYT
>> Rupert
>> 
>> 
>> [1] https://issues.apache.org/jira/browse/CLEREZZA-467
>> 
>> On 16.03.2012, at 10:44, Rupert Westenthaler wrote:
>> 
>>> Hi David, all
>>> 
>>> this could be the explanation for the failed build on the Jenkins server
>> when the SEO configuration for the Refactor engine was used in the default
>> configuration of the Full launcher
>>> 
>>> see http://markmail.org/message/sprwklaobdjankig for details.
>>> 
>>> To me it looks as if the RefactorEngine creates multiple
>> Jena TDB instances for the various created MGraphs. One needs to know that even
>> for an empty graph Jena TDB creates ~200 MB of index files. So it is
>> important to map multiple MGraphs to different named graphs of the same
>> Jena TDB store.
>>> 
>>> I have no idea how Clerezza manages this or how Ontonet creates MGraphs,
>> but I hope this can help in tracing this down.
>>> 
>>> best
>>> Rupert
>>> 
>>> On 16.03.2012, at 10:30, David Riccitelli wrote:
>>> 
>>>> Dears,
>>>> 
>>>> As I ran into disk issues, I found that this folder:
>>>> sling/felix/bundleXXX/data/tdb-data/mgraph
>>>> 
>>>> where XXX is the number of the bundle:
>>>> Clerezza - SCB Jena TDB Storage Provider
>>>> org.apache.clerezza.rdf.jena.tdb.storage
>>>> 
>>>> took almost 70 GB of disk space (then the disk space was
>>>> exhausted).
>>>> 
>>>> These are some of the files I found inside:
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology889
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology1041
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology395
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology363
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology661
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology786
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology608
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology213
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology188
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology602
>>>> 
>>>> 
>>>> Any clues?
>>>> 
>>>> Thanks,
>>>> David Riccitelli
>>>> 
>>>> 
>> ********************************************************************************
>>>> InsideOut10 s.r.l.
>>>> P.IVA: IT-11381771002
>>>> Fax: +39 0110708239
>>>> ---
>>>> LinkedIn: http://it.linkedin.com/in/riccitelli
>>>> Twitter: ziodave
>>>> ---
>>>> Layar Partner Network<
>> http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1
>>> 
>>>> 
>> ********************************************************************************
>>> 
>> 
>> 

Re: clerezza.rdf.jena.tdb.storage filling up with ontonet files

Posted by Reto Bachmann-Gmür <re...@apache.org>.
Hi Rupert,

I like your proposal but would suggest:
- SingleDatasetTdbTcProvider should not need a directory configured
- SingleDatasetTdbTcProvider should have the higher weight and thus be the
one used by default

I think there might be use cases where you want a graph to be isolated from
the rest, but I think the default behaviour should be the more performant
and less memory-expensive SingleDatasetTdbTcProvider.

We could add a tool to clerezza that allows creating an MGraph in a
tc-provider other than the one with the highest weight.
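[Editor's note] The "highest weight wins" default described above can be sketched as follows. The `Provider` record and the `select` helper are simplified stand-ins, not the real Clerezza TcProvider/TcManager API; the sketch only illustrates that, by default, the provider with the highest weight (service.ranking) handles graph creation.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Sketch of weight-based provider selection: among all registered
// providers, the one with the highest weight is used by default.
public class ProviderSelection {

    record Provider(String name, int weight) {}

    /** Picks the provider with the highest weight. */
    public static Optional<Provider> select(List<Provider> providers) {
        return providers.stream().max(Comparator.comparingInt(Provider::weight));
    }

    public static void main(String[] args) {
        List<Provider> providers = List.of(
                new Provider("TdbTcProvider", 100),
                new Provider("SingleDatasetTdbTcProvider", 200));
        // With the higher weight, the single-dataset provider wins.
        System.out.println(select(providers).orElseThrow().name());
    }
}
```

A "create in a specific provider" tool would then simply bypass `select` and address one provider directly.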

Cheers,
Reto

On Fri, Mar 16, 2012 at 2:10 PM, Rupert Westenthaler <
rupert.westenthaler@gmail.com> wrote:

> Hi David, stanbol & clerezza community
>
> Short summary of the situation:
>
> The Ontonet component generates a lot of MGraphs using the Jena TDB
> provider. This causes the disk consumption and the number of open files to
> explode. See the quoted emails for details.
>
>
> @Stanbol we are already discussing how to avoid the creation of so many
> graphs.
>
>
> @Clerezza the observed behavior of the TDB provider is also very dangerous
> (at least for typical use cases in Apache Stanbol).
>
> Even though it targets a different issue, CLEREZZA-467 [1] may provide a
> possible solution for that, as it suggests using named graphs instead of
> isolated TDB instances for creating MGraphs.
>
> To be honest this would be the optimal solution for our usage of Clerezza
> in Stanbol. However I assume that for a semantic CMS it is safer to use
> different TDB datasets.
>
> Because of that I  would like to make the following proposal that
> hopefully covers both the needs of Apache Stanbol and Apache Clerezza.
>
> 1. AbstractTdbTcProvider: providing most of the functionality needed to
> store Clerezza MGraphs in Jena TDB
>
> 2. TdbTcProvider: The same as now, but extending the abstract one. It
> follows the currently used methodology of mapping Clerezza graphs to
> separate TDB datasets.
>
> 3. SingleDatasetTdbTcProvider: TDB provider variant that stores all
> MGraphs in a single TDB dataset. This provider should also support
> "configurationFactory=true" (multiple instances). Each instance would use a
> different TDB dataset to store its MGraphs.
>
> By default the SingleDatasetTdbTcProvider would be inactive, because it
> requires a configuration of the directory for the TDB dataset as well as a
> name (that can be used in filters). This ensures full backward
> compatibility.
>
> In environments, such as Stanbol, where you want to store multiple graphs
> in the same TDB dataset, you would need to provide a configuration for the
> SingleDatasetTdbTcProvider. Here you have two possible usage scenarios:
>
> * if you just need a single TDB dataset that stores all MGraphs, then you
> can assign a high enough service.ranking to the SingleDatasetTdbTcProvider
> and normally use the TcManager to create your graphs.
> * if you want to use single TDB datasets or a mix of TdbTcProvider and
> SingleDatasetTdbTcProvider instances, you will need to use appropriate filters.
>
>
> WDYT
> Rupert
>
>
> [1] https://issues.apache.org/jira/browse/CLEREZZA-467
>
> On 16.03.2012, at 10:44, Rupert Westenthaler wrote:
>
> > Hi David, all
> >
> > this could be the explanation for the failed build on the Jenkins server
> when the SEO configuration for the Refactor engine was used in the default
> configuration of the Full launcher
> >
> > see http://markmail.org/message/sprwklaobdjankig for details.
> >
> > To me it looks as if the RefactorEngine creates multiple
> Jena TDB instances for the various created MGraphs. One needs to know that even
> for an empty graph Jena TDB creates ~200 MB of index files. So it is
> important to map multiple MGraphs to different named graphs of the same
> Jena TDB store.
> >
> > I have no idea how Clerezza manages this or how Ontonet creates MGraphs,
> but I hope this can help in tracing this down.
> >
> > best
> > Rupert
> >
> > On 16.03.2012, at 10:30, David Riccitelli wrote:
> >
> >> Dears,
> >>
> >> As I ran into disk issues, I found that this folder:
> >> sling/felix/bundleXXX/data/tdb-data/mgraph
> >>
> >> where XXX is the number of the bundle:
> >> Clerezza - SCB Jena TDB Storage Provider
> >> org.apache.clerezza.rdf.jena.tdb.storage
> >>
> >> took almost 70 GB of disk space (then the disk space was
> >> exhausted).
> >>
> >> These are some of the files I found inside:
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology889
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology1041
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology395
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology363
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology661
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology786
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology608
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology213
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology188
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology602
> >>
> >>
> >> Any clues?
> >>
> >> Thanks,
> >> David Riccitelli
> >>
> >>
> ********************************************************************************
> >> InsideOut10 s.r.l.
> >> P.IVA: IT-11381771002
> >> Fax: +39 0110708239
> >> ---
> >> LinkedIn: http://it.linkedin.com/in/riccitelli
> >> Twitter: ziodave
> >> ---
> >> Layar Partner Network<
> http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1
> >
> >>
> ********************************************************************************
> >
>
>

Re: In-memory copies of graphs returned for Jena base TcProviders (was: PrivilegedMGraphWrapper#getGraph() create in-memory copy)

Posted by Rupert Westenthaler <ru...@gmail.com>.
On 21.03.2012, at 16:27, Daniel Spicar wrote:

> Hi Rupert,
> 
> Your findings sound quite serious to me. From a quick check I can confirm
> your findings. It seems TDB-backed read-only Graphs are in fact in-memory
> SimpleTripleCollections. I didn't implement this functionality originally,
> so I am not an authority on it though ;)

> About the problem of the MGraph's getGraph method: Intuitively I would
> approach the problem by creating a Wrapper (Decorator) for MGraphs that
> returns an "Immutable" Graph. This would return a Graph to the user that
> forwards read access to the MGraph and prevents write access. However the
> backing graph will be an MGraph.

I am also in favor of the decorator pattern, but it is not quite the same as creating an immutable copy, because with a decorator, components with a reference to the decorated MGraph could still modify it. This would, theoretically, introduce the need to use read locks on Graphs (e.g. to protect iterators over the Graph from changes in the backing MGraph).

Creating a real immutable copy of a graph is already possible by calling TcProvider.createGraph(TripleCollection tc). I do not see the necessity to duplicate this in the MGraph API.
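[Editor's note] The decorator-versus-copy distinction can be illustrated with a minimal stand-in interface (this is not the Clerezza TripleCollection/MGraph/Graph API; triples are modeled as plain strings for brevity). The decorator rejects writes but still reflects changes made through the decorated instance, which is exactly why read locks would become necessary.

```java
import java.util.ArrayList;
import java.util.Iterator;

// Sketch: a read-only decorator forwards reads and rejects writes, but
// the backing collection can still change underneath it.
public class ReadOnlyDecoratorDemo {

    interface TripleCollection extends Iterable<String> {
        boolean add(String triple);
    }

    // Stand-in for an in-memory MGraph.
    static class SimpleMGraph extends ArrayList<String> implements TripleCollection {}

    /** Decorator that forwards read access and rejects write access. */
    static TripleCollection readOnly(TripleCollection wrapped) {
        return new TripleCollection() {
            public Iterator<String> iterator() { return wrapped.iterator(); }
            public boolean add(String triple) {
                throw new UnsupportedOperationException("read-only view");
            }
        };
    }

    public static void main(String[] args) {
        SimpleMGraph backing = new SimpleMGraph();
        TripleCollection view = readOnly(backing);
        backing.add("<s> <p> <o>");   // the "immutable" view still sees this
        System.out.println(view.iterator().next());
    }
}
```

An immutable copy (as TcProvider.createGraph makes) would not see the later `backing.add`; the decorator does.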

> 
> In general: I don't know much about TDB's inner workings; does it offer
> read-only graphs? And if so, what are the benefits of using them? (I assume
> more efficient synchronization.) If there is such a thing, implementing
> native access to TDB's read-only graphs would definitely be something great.
> 

AFAIK TDB does not provide such a feature, but the TcProvider implementations for TDB take care of that, as they do not allow creating an MGraph over a graph that was created as a Graph (that is true for both the "TdbTcProvider" and the "SingleTdbDatasetTcProvider"). The only possibility to change a TDB model that was created for a Graph would therefore be to directly access the TDB dataset (outside of Clerezza).

BTW a simple workaround to avoid the creation of an in-memory copy for TDB graphs is to instantiate the JenaGraphAdaptor by using:

MGraph jenaAdapter = new JenaGraphAdaptor(model.getGraph()) {
    /**
     * Ensure that no in-memory copies are created for read-only
     * Jena Graphs.
     */
    @Override
    public Graph getGraph() {
        return new SimpleGraph(this, true);
    }
};
Graph graph = jenaAdapter.getGraph();

when get/createGraph is called on the TcProvider, as the constructor "SimpleGraph(TripleCollection tc, boolean tripleCollectionWillNeverChange)" does not create a copy of the passed TripleCollection.
For now I use this with the SingleTdbDatasetTcProvider.

best
Rupert


> Daniel
> 
> On 20 March 2012 09:49, Rupert Westenthaler
> <ru...@gmail.com>wrote:
> 
>> Hi again
>> 
>> Just noticed that the
>> "org.apache.clerezza.rdf.jena.storage.JenaGraphAdaptor" does the exact same
>> by extending "org.apache.clerezza.rdf.core.impl.AbstractMGraph".
>> 
>> This means that all Graphs returned by the Jena TDB provider
>> (org.apache.clerezza.rdf.jena.tdb.storage.TdbTcProvider) are in fact
>> in-memory copies. This would not be necessary as the TdbTcProvider already
>> ensures that a Graph can not be opened as MGraph.
>> 
>> To avoid such copies one would need to refactor the JenaGraphAdaptor so
>> that one can create both a "JenaMGraphAdaptor" and a read-only
>> "JenaGraphAdaptor". JenaMGraphAdaptor.getGraph() would still need to create
>> an in-memory copy, but the "JenaGraphAdaptor" would allow to avoid this.
>> TcProvider implementations that instantiate ""JenaGraphAdaptor" would need
>> to ensure themselves that the underlining JenaGraph is not modified.
>> 
>> This is of special importance to the SingleTdbDatasetTcProvider as I am
>> planing to add support for exposing the "urn:x-arq:UnionGraph" via the
>> TcProvider.getGraph(..) method. Creating in-memory copies of the union
>> graph over all named models within the TDB store is not feasible.
>> 
>> best
>> Rupert
>> 
>> 
>> On 20.03.2012, at 08:42, Rupert Westenthaler wrote:
>> 
>>> Hi all,
>>> 
>>> While working on the SingleTdbDatasetTcProvider I noticed that the
>>> 
>>>   PrivilegedMGraphWrapper#getGraph()
>>> 
>>> calls
>>> 
>>>      public Graph getGraph() {
>>>              return new SimpleGraph(this);
>>>      }
>>> 
>>> If I am right this causes an in-memory copy of the the wrapped MGraph to
>> be created. Is there a special reason for that or should that?
>>> 
>>> I would rather expect an PrivilegedGraphWrapper  wrapping the graph
>> returned by the wrapped MGraph to be returned. Something like.
>>> 
>>>      public Graph getGraph() {
>>>              return AccessController.doPrivileged(new
>> PrivilegedAction<Graph>() {
>>> 
>>>                      @Override
>>>                      public Graph run() {
>>>                              return new
>> PrivilegedGraphWrapper(wrapped.getGraph());
>>>                      }
>>>              });
>>>      }
>>> 
>>> Maybe one would even like to have only a single PrivilegedGraphWrapper
>> that is created on the first call to getGraph()
>>> 
>>> best
>>> Rupert
>>> 
>> 
>> 


Re: In-memory copies of graphs returned for Jena base TcProviders (was: PrivilegedMGraphWrapper#getGraph() create in-memory copy)

Posted by Daniel Spicar <ds...@apache.org>.
Hi Rupert,

Your findings sound quite serious to me. From a quick check I can confirm
your findings. It seems TDB-backed read-only Graphs are in fact in-memory
SimpleTripleCollections. I didn't implement this functionality originally,
so I am not an authority on it though ;)

About the problem of the MGraph's getGraph method: Intuitively I would
approach the problem by creating a Wrapper (Decorator) for MGraphs that
returns an "Immutable" Graph. This would return a Graph to the user that
forwards read access to the MGraph and prevents write access. However the
backing graph will be an MGraph.

In general: I don't know much about TDB's inner workings; does it offer
read-only graphs? And if so, what are the benefits of using them? (I assume
more efficient synchronization.) If there is such a thing, implementing
native access to TDB's read-only graphs would definitely be something great.

Daniel

On 20 March 2012 09:49, Rupert Westenthaler
<ru...@gmail.com>wrote:

> Hi again
>
> Just noticed that the
> "org.apache.clerezza.rdf.jena.storage.JenaGraphAdaptor" does exactly the same
> by extending "org.apache.clerezza.rdf.core.impl.AbstractMGraph".
>
> This means that all Graphs returned by the Jena TDB provider
> (org.apache.clerezza.rdf.jena.tdb.storage.TdbTcProvider) are in fact
> in-memory copies. This would not be necessary as the TdbTcProvider already
> ensures that a Graph can not be opened as MGraph.
>
> To avoid such copies one would need to refactor the JenaGraphAdaptor so
> that one can create both a "JenaMGraphAdaptor" and a read-only
> "JenaGraphAdaptor". JenaMGraphAdaptor.getGraph() would still need to create
> an in-memory copy, but the "JenaGraphAdaptor" would allow avoiding this.
> TcProvider implementations that instantiate "JenaGraphAdaptor" would need
> to ensure themselves that the underlying JenaGraph is not modified.
>
> This is of special importance to the SingleTdbDatasetTcProvider, as I am
> planning to add support for exposing the "urn:x-arq:UnionGraph" via the
> TcProvider.getGraph(..) method. Creating in-memory copies of the union
> graph over all named models within the TDB store is not feasible.
>
> best
> Rupert
>
>
> On 20.03.2012, at 08:42, Rupert Westenthaler wrote:
>
> > Hi all,
> >
> > While working on the SingleTdbDatasetTcProvider I noticed that the
> >
> >    PrivilegedMGraphWrapper#getGraph()
> >
> > calls
> >
> >       public Graph getGraph() {
> >               return new SimpleGraph(this);
> >       }
> >
> > If I am right this causes an in-memory copy of the wrapped MGraph to
> be created. Is there a special reason for that, or should that be changed?
> >
> > I would rather expect a PrivilegedGraphWrapper wrapping the graph
> returned by the wrapped MGraph to be returned. Something like:
> >
> >       public Graph getGraph() {
> >               return AccessController.doPrivileged(new
> PrivilegedAction<Graph>() {
> >
> >                       @Override
> >                       public Graph run() {
> >                               return new
> PrivilegedGraphWrapper(wrapped.getGraph());
> >                       }
> >               });
> >       }
> >
> > Maybe one would even like to have only a single PrivilegedGraphWrapper
> that is created on the first call to getGraph()
> >
> > best
> > Rupert
> >
>
>

In-memory copies of graphs returned for Jena base TcProviders (was: PrivilegedMGraphWrapper#getGraph() create in-memory copy)

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi again

Just noticed that the "org.apache.clerezza.rdf.jena.storage.JenaGraphAdaptor" does exactly the same by extending "org.apache.clerezza.rdf.core.impl.AbstractMGraph".

This means that all Graphs returned by the Jena TDB provider (org.apache.clerezza.rdf.jena.tdb.storage.TdbTcProvider) are in fact in-memory copies. This would not be necessary as the TdbTcProvider already ensures that a Graph can not be opened as MGraph.

To avoid such copies one would need to refactor the JenaGraphAdaptor so that one can create both a "JenaMGraphAdaptor" and a read-only "JenaGraphAdaptor". JenaMGraphAdaptor.getGraph() would still need to create an in-memory copy, but the "JenaGraphAdaptor" would allow avoiding this. TcProvider implementations that instantiate "JenaGraphAdaptor" would need to ensure themselves that the underlying JenaGraph is not modified.
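[Editor's note] A minimal sketch of that proposed split, with simplified stand-in types (plain lists instead of Jena models; the class names only mirror the proposal, they are not the real adaptor classes): the mutable adaptor's getGraph() must snapshot, while the read-only adaptor can hand out an uncopied view.

```java
import java.util.Collections;
import java.util.List;

public class AdaptorSplitSketch {

    /** Read-only adaptor: the caller guarantees the backing data never changes,
     *  so getGraph() can return an uncopied (wrapped) view. */
    static class ReadOnlyAdaptor {
        final List<String> backing;
        ReadOnlyAdaptor(List<String> backing) { this.backing = backing; }
        List<String> getGraph() {
            return Collections.unmodifiableList(backing); // no copy needed
        }
    }

    /** Mutable adaptor: the backing data may change, so getGraph() must
     *  take an in-memory snapshot to stay truly immutable. */
    static class MutableAdaptor extends ReadOnlyAdaptor {
        MutableAdaptor(List<String> backing) { super(backing); }
        @Override List<String> getGraph() {
            return List.copyOf(backing);                  // in-memory copy
        }
    }
}
```

The snapshot from the mutable adaptor is frozen at call time; the read-only adaptor's view stays cheap because it relies on the provider's guarantee that the underlying data never changes.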

This is of special importance to the SingleTdbDatasetTcProvider, as I am planning to add support for exposing the "urn:x-arq:UnionGraph" via the TcProvider.getGraph(..) method. Creating in-memory copies of the union graph over all named models within the TDB store is not feasible.

best
Rupert


On 20.03.2012, at 08:42, Rupert Westenthaler wrote:

> Hi all,
> 
> While working on the SingleTdbDatasetTcProvider I noticed that the 
> 
>    PrivilegedMGraphWrapper#getGraph()
> 
> calls
> 
> 	public Graph getGraph() {
> 		return new SimpleGraph(this);
> 	}
> 
> If I am right this causes an in-memory copy of the wrapped MGraph to be created. Is there a special reason for that, or should that be changed?
> 
> I would rather expect a PrivilegedGraphWrapper wrapping the graph returned by the wrapped MGraph to be returned. Something like:
> 
> 	public Graph getGraph() {
> 		return AccessController.doPrivileged(new PrivilegedAction<Graph>() {
> 
> 			@Override
> 			public Graph run() {
> 				return new PrivilegedGraphWrapper(wrapped.getGraph());
> 			}
> 		});
> 	}
> 
> Maybe one would even like to have only a single PrivilegedGraphWrapper that is created on the first call to getGraph()
> 
> best
> Rupert
> 


PrivilegedMGraphWrapper#getGraph() creates in-memory copy (was Re: clerezza.rdf.jena.tdb.storage filling up with ontonet files)

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi all,

While working on the SingleTdbDatasetTcProvider I noticed that the 

    PrivilegedMGraphWrapper#getGraph()

calls

	public Graph getGraph() {
		return new SimpleGraph(this);
	}

If I am right, this causes an in-memory copy of the wrapped MGraph to be created. Is there a special reason for that, or should it be changed?

I would rather expect a PrivilegedGraphWrapper wrapping the graph returned by the wrapped MGraph. Something like:

	public Graph getGraph() {
		return AccessController.doPrivileged(new PrivilegedAction<Graph>() {

			@Override
			public Graph run() {
				return new PrivilegedGraphWrapper(wrapped.getGraph());
			}
		});
	}

Maybe one would even like to have only a single PrivilegedGraphWrapper that is created on the first call to getGraph()
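
A sketch of that lazy, create-once variant. Object stands in for the PrivilegedGraphWrapper here, and the counter exists only so the create-once behaviour is observable; this is an illustration of the idea, not the actual Clerezza code:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Lazily creates a single (stand-in) graph wrapper on the first getGraph()
// call. Sharing one instance is safe because a Clerezza Graph is immutable.
class LazyWrapperHolder {
    static final AtomicInteger created = new AtomicInteger(); // for observation only

    private volatile Object graphWrapper; // published safely via volatile

    Object getGraph() {
        Object result = graphWrapper;
        if (result == null) {
            synchronized (this) { // double-checked locking: at most one instance
                result = graphWrapper;
                if (result == null) {
                    created.incrementAndGet();
                    // in the real code: new PrivilegedGraphWrapper(wrapped.getGraph())
                    result = new Object();
                    graphWrapper = result;
                }
            }
        }
        return result;
    }
}
```

Every call after the first returns the same instance, so repeated getGraph() calls no longer allocate anything.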

best
Rupert

On 19.03.2012, at 10:45, Daniel Spicar wrote:

> There are a couple of things to keep in mind. I think they are both handled
> on a higher layer and should work transparently but it's good to keep it in
> mind.
> 1. Graph permissions need to work. I think they work via the graph
> URI/name, so they may be handled transparently.



> 2. Make sure rdf.storage.externalizer works with your solution.
> 
> Best,
> Daniel
> 
> On 19 March 2012 09:16, Hasan Hasan <ha...@trialox.org> wrote:
> 
>> Hi all,
>> 
>> I generally agree to extend Clerezza to be able to support multiple
>> requirements. Thus, I see the necessity of SingleDatasetTdbTcProvide.
>> Although I am bit unhappy, due to the fact, that application developers
>> have to be aware of this.
>> Note that, new clerezza instances (at least my own build) do not anymore
>> generate 200 MB of index files for empty graphs, but merely 200K.
>> 
>> Regards
>> Hasan
>> 
>> 
>> On Fri, Mar 16, 2012 at 2:10 PM, Rupert Westenthaler <
>> rupert.westenthaler@gmail.com> wrote:
>> 
>>> Hi David, stanbol & clerezza community
>>> 
>>> Short summary of the situation:
>>> 
>>> The Ontonet component generate a lot of MGraphs using the Jena TDB
>>> provider. This causes the disc consumption and number of open files to
>>> explode. See the quoted emails for details
>>> 
>>> 
>>> @Stanbol  we are already discussion how to avoid the creation of such
>> many
>>> graphs
>>> 
>>> 
>>> @Clerezza the observed behavior of the TDB provider is also very
>> dangerous
>>> (at least for typical use cases in Apache Stanbol).
>>> 
>>> Even targeting at a different CLEREZZA-467 maybe provides a possible
>>> solution for that as it suggests to use named graphs instead of isolated
>>> TDB instances for creating MGraphs.
>>> 
>>> To be honest this would be the optimal solution for our usages of
>> Clerezza
>>> in Stanbol. However I assume that for a semantic CMS it is saver to use
>>> different TDB datasets.
>>> 
>>> Because of that I  would like to make the following proposal that
>>> hopefully covers both the needs of Apache Stanbol and Apache Clerezza.
>>> 
>>> 1. AbstractTdbTcProvider: providing most of the functionality needed to
>>> store Clerezza MGraphs in Jena TDB
>>> 
>>> 2. TdbTcProvider: The same as now but now extending the abstract one. I
>>> follows the currently used methodology to map Clerezza graphs to separate
>>> TDB datasets
>>> 
>>> 3. SingleDatasetTdbTcProvider: Tdb provider variant that stores all
>>> MGraphs in a single TDB dataset. This provider should also support
>>> "configurationFactory=true" (multiple instances). each instance would
>> use a
>>> different TDB dataset to store its MGrpahs.
>>> 
>>> By default the SingleDatasetTdbTcProvider would be inactive, because it
>>> requires a configuration of the directory for the  TDB dataset as well
>> as a
>>> name (that can be used in Filters). This ensures full backward
>>> compatibility.
>>> 
>>> In environment - such as Stanbol - where you want to store multiple
>> graphs
>>> in the same TDB dataset you would need to provide a configuration for the
>>> SingleDatasetTdbTcProvider. Here you have two possible usage scenarios:
>>> 
>>> * if you just need a single TDB dataset that stores all MGraphs, than you
>>> can assign a high enough service.ranking to the
>> SingleDatasetTdbTcProvider
>>> and normally use the TcManager to create your graphs.
>>> * if you want to use single TDB datasets or a mix of the TdbTcProvider
>> and
>>> SingleDatasetTdbTcProvider's you will need to use according filters.
>>> 
>>> 
>>> WDYT
>>> Rupert
>>> 
>>> 
>>> [1] https://issues.apache.org/jira/browse/CLEREZZA-467
>>> 
>>> On 16.03.2012, at 10:44, Rupert Westenthaler wrote:
>>> 
>>>> Hi David, all
>>>> 
>>>> this could be the explanation for the failed build on the Jenkins
>> server
>>> when the SEO configuration for the Refactor engine was used in the
>> default
>>> configuration of the Full launcher
>>>> 
>>>> see http://markmail.org/message/sprwklaobdjankig for details.
>>>> 
>>>> For me that looks like as if the RefactorEngine does create multiple
>>> Jena TDB instances for various created MGraphs. One needs to know the
>> even
>>> for an empty graph Jena TDB creates ~200MByte of index files. So it is
>>> important to map multiple MGraphs to different named graphs of the same
>>> Jena TDB store.
>>>> 
>>>> I have no Idea how Clerezza manages this or how Ontonet creates
>> MGraphs,
>>> but I hope this can help in tracing this down.
>>>> 
>>>> best
>>>> Rupert
>>>> 
>>>> On 16.03.2012, at 10:30, David Riccitelli wrote:
>>>> 
>>>>> Dears,
>>>>> 
>>>>> As I ran into disk issues, I found that this folder:
>>>>> sling/felix/bundleXXX/data/tdb-data/mgraph
>>>>> 
>>>>> where XX is the bundle of:
>>>>> Clerezza - SCB Jena TDB Storage Provider
>>>>> org.apache.clerezza.rdf.jena.tdb.storage
>>>>> 
>>>>> took almost 70 gbytes of disk space (then the disk space has been
>>>>> exhausted).
>>>>> 
>>>>> These are some of the files I found inside:
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology889
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology1041
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology395
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology363
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology661
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology786
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology608
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology213
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology188
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology602
>>>>> 
>>>>> 
>>>>> Any clues?
>>>>> 
>>>>> Thanks,
>>>>> David Riccitelli
>>>>> 
>>>>> 
>>> 
>> ********************************************************************************
>>>>> InsideOut10 s.r.l.
>>>>> P.IVA: IT-11381771002
>>>>> Fax: +39 0110708239
>>>>> ---
>>>>> LinkedIn: http://it.linkedin.com/in/riccitelli
>>>>> Twitter: ziodave
>>>>> ---
>>>>> Layar Partner Network<
>>> 
>> http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1
>>>> 
>>>>> 
>>> 
>> ********************************************************************************
>>>> 
>>> 
>>> 
>> 


Re: clerezza.rdf.jena.tdb.storage filling up with ontonet files

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi,

On 19.03.2012, at 09:16, Hasan Hasan wrote:

> Hi all,
> 
> I generally agree to extend Clerezza to be able to support multiple
> requirements. Thus, I see the necessity of SingleDatasetTdbTcProvide.
> Although I am bit unhappy, due to the fact, that application developers
> have to be aware of this.

Good documentation should help with that. 

> Note that, new clerezza instances (at least my own build) do not anymore
> generate 200 MB of index files for empty graphs, but merely 200K.
> 

I tested this against both

* The Stanbol 0.9.0-incubating RC 3 and
* The unit tests of "rdf.jena.tdb.storage" trunk

in both cases the TDB directories were > 200 MByte.

After spending some time with different Google queries I was able to find

    http://tech.groups.yahoo.com/group/jena-dev/message/46144

which nicely describes the observed behavior. 

But even if we put this down to a bug in how Mac OS handles sparse files, there is still the problem of the exploding number of open files, which will kill the JVM (and possibly even the host system).


On 19.03.2012, at 10:45, Daniel Spicar wrote:

> Hi Rupert,
> 
> I ran into a similar problem when I worked on a Jena SDB storage provider
> (not have to create separate databases for each Clerezza graph). Back then
> I didn't create a proper solution so I am interested in your approach. From
> what you described it sounds good to me.
> 

I created https://issues.apache.org/jira/browse/CLEREZZA-691 for the SingleDatasetTdbTcProvider.

I have implemented a SingleDatasetTdbTcProvider over the weekend. It already passes the MGraph-related tests, but still fails the TcProvider tests, as I need to add support for using the same graph name for both an MGraph and a Graph (as required by the TcProviderTest). 
Is this really necessary, or only something that is accidentally used by the TcProviderTest? It is something that cannot be "natively" supported when using a single dataset, as named-graph names MUST be unique.
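
One conceivable way around that restriction -- purely an idea sketch, with suffixes I made up, not what the current patch does -- would be to derive the unique named-graph URI from the logical name plus the triple-collection type:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a single TDB dataset: named-graph URI -> graph content.
// Named-graph names must be unique within the dataset, so an MGraph and a
// Graph that share a logical name are stored under type-suffixed URIs.
// The "#mgraph"/"#graph" suffixes are invented for this sketch.
class SingleDatasetNaming {
    private final Map<String, String> dataset = new HashMap<>();

    private static String datasetKey(String name, boolean mutable) {
        return name + (mutable ? "#mgraph" : "#graph");
    }

    void store(String name, boolean mutable, String content) {
        dataset.put(datasetKey(name, mutable), content);
    }

    String load(String name, boolean mutable) {
        return dataset.get(datasetKey(name, mutable));
    }
}
```

With such a mapping the TcProviderTest requirement could be met without violating the uniqueness constraint, at the cost of the stored named-graph URIs no longer matching the public graph names exactly.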

Currently I am developing this as part of "rdf.jena.tdb.storage", as I think there is no need for a separate module for a second variant of a TcProvider that is based on the same underlying technology.

As soon as it passes the same set of tests as used for the "TdbTcProvider" I will share the code. I would also like to test it within Apache Stanbol, but this could be hard, as I would need to change the Clerezza dependencies in the trunk from the Clerezza release to the current SNAPSHOT versions.

Would you prefer a patch, or should I commit directly to trunk? An issue branch seems not to be needed, as these additions will not affect current functionality. WDYT?


> There are a couple of things to keep in mind. I think they are both handled
> on a higher layer and should work transparently but it's good to keep it in
> mind.
> 1. Graph permissions need to work. I think they work via the graph
> URI/name, so they may be handled transparently.
> 2. Make sure rdf.storage.externalizer works with your solution.
> 

I have never used those things. I will have a look, but it would be wise if someone with more knowledge could validate this after I have provided a first version.

best
Rupert

> Best,
> Daniel
> 
> On 19 March 2012 09:16, Hasan Hasan <ha...@trialox.org> wrote:
> 
>> Hi all,
>> 
>> I generally agree to extend Clerezza to be able to support multiple
>> requirements. Thus, I see the necessity of SingleDatasetTdbTcProvide.
>> Although I am bit unhappy, due to the fact, that application developers
>> have to be aware of this.
>> Note that, new clerezza instances (at least my own build) do not anymore
>> generate 200 MB of index files for empty graphs, but merely 200K.
>> 
>> Regards
>> Hasan
>> 
>> 
>> On Fri, Mar 16, 2012 at 2:10 PM, Rupert Westenthaler <
>> rupert.westenthaler@gmail.com> wrote:
>> 
>>> Hi David, stanbol & clerezza community
>>> 
>>> Short summary of the situation:
>>> 
>>> The Ontonet component generate a lot of MGraphs using the Jena TDB
>>> provider. This causes the disc consumption and number of open files to
>>> explode. See the quoted emails for details
>>> 
>>> 
>>> @Stanbol  we are already discussion how to avoid the creation of such
>> many
>>> graphs
>>> 
>>> 
>>> @Clerezza the observed behavior of the TDB provider is also very
>> dangerous
>>> (at least for typical use cases in Apache Stanbol).
>>> 
>>> Even targeting at a different CLEREZZA-467 maybe provides a possible
>>> solution for that as it suggests to use named graphs instead of isolated
>>> TDB instances for creating MGraphs.
>>> 
>>> To be honest this would be the optimal solution for our usages of
>> Clerezza
>>> in Stanbol. However I assume that for a semantic CMS it is saver to use
>>> different TDB datasets.
>>> 
>>> Because of that I  would like to make the following proposal that
>>> hopefully covers both the needs of Apache Stanbol and Apache Clerezza.
>>> 
>>> 1. AbstractTdbTcProvider: providing most of the functionality needed to
>>> store Clerezza MGraphs in Jena TDB
>>> 
>>> 2. TdbTcProvider: The same as now but now extending the abstract one. I
>>> follows the currently used methodology to map Clerezza graphs to separate
>>> TDB datasets
>>> 
>>> 3. SingleDatasetTdbTcProvider: Tdb provider variant that stores all
>>> MGraphs in a single TDB dataset. This provider should also support
>>> "configurationFactory=true" (multiple instances). each instance would
>> use a
>>> different TDB dataset to store its MGrpahs.
>>> 
>>> By default the SingleDatasetTdbTcProvider would be inactive, because it
>>> requires a configuration of the directory for the  TDB dataset as well
>> as a
>>> name (that can be used in Filters). This ensures full backward
>>> compatibility.
>>> 
>>> In environment - such as Stanbol - where you want to store multiple
>> graphs
>>> in the same TDB dataset you would need to provide a configuration for the
>>> SingleDatasetTdbTcProvider. Here you have two possible usage scenarios:
>>> 
>>> * if you just need a single TDB dataset that stores all MGraphs, than you
>>> can assign a high enough service.ranking to the
>> SingleDatasetTdbTcProvider
>>> and normally use the TcManager to create your graphs.
>>> * if you want to use single TDB datasets or a mix of the TdbTcProvider
>> and
>>> SingleDatasetTdbTcProvider's you will need to use according filters.
>>> 
>>> 
>>> WDYT
>>> Rupert
>>> 
>>> 
>>> [1] https://issues.apache.org/jira/browse/CLEREZZA-467
>>> 
>>> On 16.03.2012, at 10:44, Rupert Westenthaler wrote:
>>> 
>>>> Hi David, all
>>>> 
>>>> this could be the explanation for the failed build on the Jenkins
>> server
>>> when the SEO configuration for the Refactor engine was used in the
>> default
>>> configuration of the Full launcher
>>>> 
>>>> see http://markmail.org/message/sprwklaobdjankig for details.
>>>> 
>>>> For me that looks like as if the RefactorEngine does create multiple
>>> Jena TDB instances for various created MGraphs. One needs to know the
>> even
>>> for an empty graph Jena TDB creates ~200MByte of index files. So it is
>>> important to map multiple MGraphs to different named graphs of the same
>>> Jena TDB store.
>>>> 
>>>> I have no Idea how Clerezza manages this or how Ontonet creates
>> MGraphs,
>>> but I hope this can help in tracing this down.
>>>> 
>>>> best
>>>> Rupert
>>>> 
>>>> On 16.03.2012, at 10:30, David Riccitelli wrote:
>>>> 
>>>>> Dears,
>>>>> 
>>>>> As I ran into disk issues, I found that this folder:
>>>>> sling/felix/bundleXXX/data/tdb-data/mgraph
>>>>> 
>>>>> where XX is the bundle of:
>>>>> Clerezza - SCB Jena TDB Storage Provider
>>>>> org.apache.clerezza.rdf.jena.tdb.storage
>>>>> 
>>>>> took almost 70 gbytes of disk space (then the disk space has been
>>>>> exhausted).
>>>>> 
>>>>> These are some of the files I found inside:
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology889
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology1041
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology395
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology363
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology661
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology786
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology608
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology213
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology188
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology602
>>>>> 
>>>>> 
>>>>> Any clues?
>>>>> 
>>>>> Thanks,
>>>>> David Riccitelli
>>>>> 
>>>>> 
>>> 
>> ********************************************************************************
>>>>> InsideOut10 s.r.l.
>>>>> P.IVA: IT-11381771002
>>>>> Fax: +39 0110708239
>>>>> ---
>>>>> LinkedIn: http://it.linkedin.com/in/riccitelli
>>>>> Twitter: ziodave
>>>>> ---
>>>>> Layar Partner Network<
>>> 
>> http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1
>>>> 
>>>>> 
>>> 
>> ********************************************************************************
>>>> 
>>> 
>>> 
>> 


Re: clerezza.rdf.jena.tdb.storage filling up with ontonet files

Posted by Daniel Spicar <ds...@apache.org>.
Hi Rupert,

I ran into a similar problem when I worked on a Jena SDB storage provider
(to avoid having to create separate databases for each Clerezza graph). Back then
I didn't create a proper solution, so I am interested in your approach. From
what you described, it sounds good to me.

There are a couple of things to keep in mind. I think they are both handled
on a higher layer and should work transparently, but it's good to keep them in
mind.
1. Graph permissions need to work. I think they work via the graph
URI/name, so they may be handled transparently.
2. Make sure rdf.storage.externalizer works with your solution.

Best,
Daniel

On 19 March 2012 09:16, Hasan Hasan <ha...@trialox.org> wrote:

> Hi all,
>
> I generally agree to extend Clerezza to be able to support multiple
> requirements. Thus, I see the necessity of SingleDatasetTdbTcProvide.
> Although I am bit unhappy, due to the fact, that application developers
> have to be aware of this.
> Note that, new clerezza instances (at least my own build) do not anymore
> generate 200 MB of index files for empty graphs, but merely 200K.
>
> Regards
> Hasan
>
>
> On Fri, Mar 16, 2012 at 2:10 PM, Rupert Westenthaler <
> rupert.westenthaler@gmail.com> wrote:
>
> > Hi David, stanbol & clerezza community
> >
> > Short summary of the situation:
> >
> > The Ontonet component generate a lot of MGraphs using the Jena TDB
> > provider. This causes the disc consumption and number of open files to
> > explode. See the quoted emails for details
> >
> >
> > @Stanbol  we are already discussion how to avoid the creation of such
> many
> > graphs
> >
> >
> > @Clerezza the observed behavior of the TDB provider is also very
> dangerous
> > (at least for typical use cases in Apache Stanbol).
> >
> > Even targeting at a different CLEREZZA-467 maybe provides a possible
> > solution for that as it suggests to use named graphs instead of isolated
> > TDB instances for creating MGraphs.
> >
> > To be honest this would be the optimal solution for our usages of
> Clerezza
> > in Stanbol. However I assume that for a semantic CMS it is saver to use
> > different TDB datasets.
> >
> > Because of that I  would like to make the following proposal that
> > hopefully covers both the needs of Apache Stanbol and Apache Clerezza.
> >
> > 1. AbstractTdbTcProvider: providing most of the functionality needed to
> > store Clerezza MGraphs in Jena TDB
> >
> > 2. TdbTcProvider: The same as now but now extending the abstract one. I
> > follows the currently used methodology to map Clerezza graphs to separate
> > TDB datasets
> >
> > 3. SingleDatasetTdbTcProvider: Tdb provider variant that stores all
> > MGraphs in a single TDB dataset. This provider should also support
> > "configurationFactory=true" (multiple instances). each instance would
> use a
> > different TDB dataset to store its MGrpahs.
> >
> > By default the SingleDatasetTdbTcProvider would be inactive, because it
> > requires a configuration of the directory for the  TDB dataset as well
> as a
> > name (that can be used in Filters). This ensures full backward
> > compatibility.
> >
> > In environment - such as Stanbol - where you want to store multiple
> graphs
> > in the same TDB dataset you would need to provide a configuration for the
> > SingleDatasetTdbTcProvider. Here you have two possible usage scenarios:
> >
> > * if you just need a single TDB dataset that stores all MGraphs, than you
> > can assign a high enough service.ranking to the
> SingleDatasetTdbTcProvider
> > and normally use the TcManager to create your graphs.
> > * if you want to use single TDB datasets or a mix of the TdbTcProvider
> and
> > SingleDatasetTdbTcProvider's you will need to use according filters.
> >
> >
> > WDYT
> > Rupert
> >
> >
> > [1] https://issues.apache.org/jira/browse/CLEREZZA-467
> >
> > On 16.03.2012, at 10:44, Rupert Westenthaler wrote:
> >
> > > Hi David, all
> > >
> > > this could be the explanation for the failed build on the Jenkins
> server
> > when the SEO configuration for the Refactor engine was used in the
> default
> > configuration of the Full launcher
> > >
> > > see http://markmail.org/message/sprwklaobdjankig for details.
> > >
> > > For me that looks like as if the RefactorEngine does create multiple
> > Jena TDB instances for various created MGraphs. One needs to know the
> even
> > for an empty graph Jena TDB creates ~200MByte of index files. So it is
> > important to map multiple MGraphs to different named graphs of the same
> > Jena TDB store.
> > >
> > > I have no Idea how Clerezza manages this or how Ontonet creates
> MGraphs,
> > but I hope this can help in tracing this down.
> > >
> > > best
> > > Rupert
> > >
> > > On 16.03.2012, at 10:30, David Riccitelli wrote:
> > >
> > >> Dears,
> > >>
> > >> As I ran into disk issues, I found that this folder:
> > >> sling/felix/bundleXXX/data/tdb-data/mgraph
> > >>
> > >> where XX is the bundle of:
> > >> Clerezza - SCB Jena TDB Storage Provider
> > >> org.apache.clerezza.rdf.jena.tdb.storage
> > >>
> > >> took almost 70 gbytes of disk space (then the disk space has been
> > >> exhausted).
> > >>
> > >> These are some of the files I found inside:
> > >> 193M ./ontonet%3A%3Ainputstream%3Aontology889
> > >> 193M ./ontonet%3A%3Ainputstream%3Aontology1041
> > >> 193M ./ontonet%3A%3Ainputstream%3Aontology395
> > >> 193M ./ontonet%3A%3Ainputstream%3Aontology363
> > >> 193M ./ontonet%3A%3Ainputstream%3Aontology661
> > >> 193M ./ontonet%3A%3Ainputstream%3Aontology786
> > >> 193M ./ontonet%3A%3Ainputstream%3Aontology608
> > >> 193M ./ontonet%3A%3Ainputstream%3Aontology213
> > >> 193M ./ontonet%3A%3Ainputstream%3Aontology188
> > >> 193M ./ontonet%3A%3Ainputstream%3Aontology602
> > >>
> > >>
> > >> Any clues?
> > >>
> > >> Thanks,
> > >> David Riccitelli
> > >>
> > >>
> >
> ********************************************************************************
> > >> InsideOut10 s.r.l.
> > >> P.IVA: IT-11381771002
> > >> Fax: +39 0110708239
> > >> ---
> > >> LinkedIn: http://it.linkedin.com/in/riccitelli
> > >> Twitter: ziodave
> > >> ---
> > >> Layar Partner Network<
> >
> http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1
> > >
> > >>
> >
> ********************************************************************************
> > >
> >
> >
>

Re: clerezza.rdf.jena.tdb.storage filling up with ontonet files

Posted by Alessandro Adamou <ad...@cs.unibo.it>.
Hi Rupert, here are a few more numbers:

In the same setting I loaded the NCI ontology from 
http://www.mindswap.org/2003/CancerOntology/ (about 400k triples, 
lightly axiomatized with DL flavor ALE).

on the SingleTdbDatasetTcProvider the storage directory grew by 156 MiB 
(192 -> 348)

on the TdbTcProvider the newly created dir was 76 MiB above the initial 
capacity (192 -> 268)

Then I bzipped both directories to see if it was partly "filling" the 
initial 192 MiB:
- the SingleTdbDatasetTcProvider one shrunk to ~25 MiB
- the TdbTcProvider one shrunk to ~17 MiB

I guess this overhead is due to having to store many more quadruples 
for the named graphs. I noticed that the quad index files 
(GOSP|GPOS|GSPO|OSPG|POSG|SPOG).dat, which I assume store quadruples, are 
each 4 times as large in the SingleTdbDatasetTcProvider database, 
whereas the triple index files (OSP|POS|SPO).dat were the same size. I guess 
this redundancy is the price paid for fast access.

Perhaps mine is a fuzzy interpretation though? Still, it looks pretty 
good to me.

Best,

Alessandro


----------

On 4/4/12 7:31 PM, Rupert Westenthaler wrote:
> On 04.04.2012, at 19:18, Alessandro Adamou wrote:
>
>> Hi Rupert, all,
>>
>> just telling you that I have tried the SingleTdbDatasetTcProvider on the field with one of my use cases which involves many small ontologies (content design patterns).
>>
>> I've created ~20 graphs totalling about 500 triples
>>
>> On OS X 10.6.8 (on HFS+ filesystem with journalling) the database grew from an initial 184MiB to 248MiB
>>
>> I am yet to test large graphs, so I cannot tell if the overhead is given by named graph indexes or the triple storage, but this is already a big leap from the TdbTcProvider.
>>
> Thx for testing.
>
>> Did you already commit this component to rdf.jena.tdb.storage ?
>>
> No not yet, but I have made some improvements and fixed some bugs since the last patch attached to the Issue. I hope I will have some time to finish this later this week.
>
> best
> Rupert
>
>> Best,
>>
>> Alessandro
>>
>> On 3/19/12 9:16 AM, Hasan Hasan wrote:
>>> Hi all,
>>>
>>> I generally agree to extend Clerezza to be able to support multiple
>>> requirements. Thus, I see the necessity of SingleDatasetTdbTcProvide.
>>> Although I am bit unhappy, due to the fact, that application developers
>>> have to be aware of this.
>>> Note that, new clerezza instances (at least my own build) do not anymore
>>> generate 200 MB of index files for empty graphs, but merely 200K.
>>>
>>> Regards
>>> Hasan
>>>
>>>
>>> On Fri, Mar 16, 2012 at 2:10 PM, Rupert Westenthaler<
>>> rupert.westenthaler@gmail.com>   wrote:
>>>
>>>> Hi David, stanbol&   clerezza community
>>>>
>>>> Short summary of the situation:
>>>>
>>>> The Ontonet component generate a lot of MGraphs using the Jena TDB
>>>> provider. This causes the disc consumption and number of open files to
>>>> explode. See the quoted emails for details
>>>>
>>>>
>>>> @Stanbol  we are already discussion how to avoid the creation of such many
>>>> graphs
>>>>
>>>>
>>>> @Clerezza the observed behavior of the TDB provider is also very dangerous
>>>> (at least for typical use cases in Apache Stanbol).
>>>>
>>>> Even targeting at a different CLEREZZA-467 maybe provides a possible
>>>> solution for that as it suggests to use named graphs instead of isolated
>>>> TDB instances for creating MGraphs.
>>>>
>>>> To be honest this would be the optimal solution for our usages of Clerezza
>>>> in Stanbol. However I assume that for a semantic CMS it is saver to use
>>>> different TDB datasets.
>>>>
>>>> Because of that I  would like to make the following proposal that
>>>> hopefully covers both the needs of Apache Stanbol and Apache Clerezza.
>>>>
>>>> 1. AbstractTdbTcProvider: providing most of the functionality needed to
>>>> store Clerezza MGraphs in Jena TDB
>>>>
>>>> 2. TdbTcProvider: The same as now but now extending the abstract one. I
>>>> follows the currently used methodology to map Clerezza graphs to separate
>>>> TDB datasets
>>>>
>>>> 3. SingleDatasetTdbTcProvider: Tdb provider variant that stores all
>>>> MGraphs in a single TDB dataset. This provider should also support
>>>> "configurationFactory=true" (multiple instances). each instance would use a
>>>> different TDB dataset to store its MGrpahs.
>>>>
>>>> By default the SingleDatasetTdbTcProvider would be inactive, because it
>>>> requires a configuration of the directory for the  TDB dataset as well as a
>>>> name (that can be used in Filters). This ensures full backward
>>>> compatibility.
>>>>
>>>> In environment - such as Stanbol - where you want to store multiple graphs
>>>> in the same TDB dataset you would need to provide a configuration for the
>>>> SingleDatasetTdbTcProvider. Here you have two possible usage scenarios:
>>>>
>>>> * if you just need a single TDB dataset that stores all MGraphs, than you
>>>> can assign a high enough service.ranking to the SingleDatasetTdbTcProvider
>>>> and normally use the TcManager to create your graphs.
>>>> * if you want to use single TDB datasets or a mix of the TdbTcProvider and
>>>> SingleDatasetTdbTcProvider's you will need to use according filters.
>>>>
>>>>
>>>> WDYT
>>>> Rupert
>>>>
>>>>
>>>> [1] https://issues.apache.org/jira/browse/CLEREZZA-467
>>>>
>>>> On 16.03.2012, at 10:44, Rupert Westenthaler wrote:
>>>>
>>>>> Hi David, all
>>>>>
>>>>> this could be the explanation for the failed build on the Jenkins server
>>>> when the SEO configuration for the Refactor engine was used in the default
>>>> configuration of the Full launcher
>>>>> see http://markmail.org/message/sprwklaobdjankig for details.
>>>>>
>>>>> For me that looks like as if the RefactorEngine does create multiple
>>>> Jena TDB instances for various created MGraphs. One needs to know the even
>>>> for an empty graph Jena TDB creates ~200MByte of index files. So it is
>>>> important to map multiple MGraphs to different named graphs of the same
>>>> Jena TDB store.
>>>>> I have no Idea how Clerezza manages this or how Ontonet creates MGraphs,
>>>> but I hope this can help in tracing this down.
>>>>> best
>>>>> Rupert
>>>>>
>>>>> On 16.03.2012, at 10:30, David Riccitelli wrote:
>>>>>
>>>>>> Dears,
>>>>>>
>>>>>> As I ran into disk issues, I found that this folder:
>>>>>> sling/felix/bundleXXX/data/tdb-data/mgraph
>>>>>>
>>>>>> where XX is the bundle of:
>>>>>> Clerezza - SCB Jena TDB Storage Provider
>>>>>> org.apache.clerezza.rdf.jena.tdb.storage
>>>>>>
>>>>>> took almost 70 gbytes of disk space (then the disk space has been
>>>>>> exhausted).
>>>>>>
>>>>>> These are some of the files I found inside:
>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology889
>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology1041
>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology395
>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology363
>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology661
>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology786
>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology608
>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology213
>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology188
>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology602
>>>>>>
>>>>>>
>>>>>> Any clues?
>>>>>>
>>>>>> Thanks,
>>>>>> David Riccitelli
>>>>>>
>>>>>>
>>>> ********************************************************************************
>>>>>> InsideOut10 s.r.l.
>>>>>> P.IVA: IT-11381771002
>>>>>> Fax: +39 0110708239
>>>>>> ---
>>>>>> LinkedIn: http://it.linkedin.com/in/riccitelli
>>>>>> Twitter: ziodave
>>>>>> ---
>>>>>> Layar Partner Network<
>>>> http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1
>>>> ********************************************************************************
>>>>
>>
>> -- 
>> M.Sc. Alessandro Adamou
>>
>> Alma Mater Studiorum - Università di Bologna
>> Department of Computer Science
>> Mura Anteo Zamboni 7, 40127 Bologna - Italy
>>
>> Semantic Technology Laboratory (STLab)
>> Institute for Cognitive Science and Technology (ISTC)
>> National Research Council (CNR)
>> Via Nomentana 56, 00161 Rome - Italy
>>
>>
>> "I will give you everything, so long as you do not demand anything."
>> (Ettore Petrolini, 1930)
>>
>> Not sent from my iSnobTechDevice
>>
>


-- 
M.Sc. Alessandro Adamou

Alma Mater Studiorum - Università di Bologna
Department of Computer Science
Mura Anteo Zamboni 7, 40127 Bologna - Italy

Semantic Technology Laboratory (STLab)
Institute for Cognitive Science and Technology (ISTC)
National Research Council (CNR)
Via Nomentana 56, 00161 Rome - Italy


"I will give you everything, so long as you do not demand anything."
(Ettore Petrolini, 1930)

Not sent from my iSnobTechDevice


Re: clerezza.rdf.jena.tdb.storage filling up with ontonet files

Posted by Alessandro Adamou <ad...@cs.unibo.it>.
Hi Rupert, here are a few more numbers:

on the same setting I loaded the NCI ontology from 
http://www.mindswap.org/2003/CancerOntology/ (about 400k triples, 
lightly axiomatized with DL flavor ALE)

on the SingleTdbDatasetTcProvider the storage directory grew by 156 MiB 
(192 -> 348)

on the TdbTcProvider the newly created dir was 76 MiB above the initial 
capacity (192 -> 268)

Then I bzipped both directories to see if it was partly "filling" the 
initial 192 MiB :
- the SingleTdbDatasetTcProvider one shrunk to ~25 MiB
- the TdbTcProvider one shrunk to ~17 MiB

I guess this overhead is due to having to store a lot more quadruples 
due to the named graphs. I noticed that the files 
(GOSP|GPOS|GSPO|OSPG|POSG|SPOG).dat which I assume store quadruples are 
each 4 times as large in the SingleTdbDatasetTcProvider database, 
whereas the triples (OSP|POS|SPO).dat were the same size. I guess this 
redundancy is the price paid for fast access.

Perhaps mine is a fuzzy interpretation though? Still, it looks pretty 
good to me.
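
The growth figures above can be turned into a rough per-triple estimate. This is a back-of-the-envelope sketch only: the byte counts come straight from the numbers in this message, and attributing all growth to the index files is my assumption — TDB's internal allocation is coarser than a simple division suggests.

```python
# Rough per-triple storage estimate from the NCI test figures above.
# Assumption: growth is dominated by the index .dat files; TDB internals
# are more complex than this simple division implies.

MIB = 1024 * 1024
TRIPLES = 400_000  # approximate size of the NCI ontology

# SingleTdbDatasetTcProvider: 192 MiB -> 348 MiB
shared_growth = (348 - 192) * MIB
# TdbTcProvider (one dataset per graph): 192 MiB -> 268 MiB
separate_growth = (268 - 192) * MIB

shared_per_triple = shared_growth / TRIPLES      # roughly 409 bytes/triple
separate_per_triple = separate_growth / TRIPLES  # roughly 199 bytes/triple

# The shared dataset maintains six quad indexes (SPOG, POSG, OSPG, GSPO,
# GPOS, GOSP) on top of the three triple indexes (SPO, POS, OSP), so about
# twice the raw growth per triple is consistent with the observation above.
print(round(shared_per_triple), round(separate_per_triple))
```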

Best,

Alessandro


----------

On 4/4/12 7:31 PM, Rupert Westenthaler wrote:
> On 04.04.2012, at 19:18, Alessandro Adamou wrote:
>
>> Hi Rupert, all,
>>
>> just telling you that I have tried the SingleTdbDatasetTcProvider on the field with one of my use cases which involves many small ontologies (content design patterns).
>>
>> I've created ~20 graphs totalling about 500 triples
>>
>> On OS X 10.6.8 (on HFS+ filesystem with journalling) the database grew from an initial 184MiB to 248MiB
>>
>> I am yet to test large graphs, so I cannot tell if the overhead is given by named graph indexes or the triple storage, but this is already a big leap from the TdbTcProvider.
>>
> Thx for testing.
>
>> Did you already commit this component to rdf.jena.tdb.storage ?
>>
> No not yet, but I have made some improvements and fixed some bugs since the last patch attached to the Issue. I hope I will have some time to finish this later this week.
>
> best
> Rupert
>
>> Best,
>>
>> Alessandro
>>
>> On 3/19/12 9:16 AM, Hasan Hasan wrote:
>>> Hi all,
>>>
>>> I generally agree to extend Clerezza to be able to support multiple
>>> requirements. Thus, I see the necessity of SingleDatasetTdbTcProvide.
>>> Although I am bit unhappy, due to the fact, that application developers
>>> have to be aware of this.
>>> Note that, new clerezza instances (at least my own build) do not anymore
>>> generate 200 MB of index files for empty graphs, but merely 200K.
>>>
>>> Regards
>>> Hasan
>>>
>>>
>>> On Fri, Mar 16, 2012 at 2:10 PM, Rupert Westenthaler<
>>> rupert.westenthaler@gmail.com>   wrote:
>>>
>>>> Hi David, stanbol&   clerezza community
>>>>
>>>> Short summary of the situation:
>>>>
>>>> The Ontonet component generate a lot of MGraphs using the Jena TDB
>>>> provider. This causes the disc consumption and number of open files to
>>>> explode. See the quoted emails for details
>>>>
>>>>
>>>> @Stanbol  we are already discussion how to avoid the creation of such many
>>>> graphs
>>>>
>>>>
>>>> @Clerezza the observed behavior of the TDB provider is also very dangerous
>>>> (at least for typical use cases in Apache Stanbol).
>>>>
>>>> Even targeting at a different CLEREZZA-467 maybe provides a possible
>>>> solution for that as it suggests to use named graphs instead of isolated
>>>> TDB instances for creating MGraphs.
>>>>
>>>> To be honest this would be the optimal solution for our usages of Clerezza
>>>> in Stanbol. However I assume that for a semantic CMS it is saver to use
>>>> different TDB datasets.
>>>>
>>>> Because of that I  would like to make the following proposal that
>>>> hopefully covers both the needs of Apache Stanbol and Apache Clerezza.
>>>>
>>>> 1. AbstractTdbTcProvider: providing most of the functionality needed to
>>>> store Clerezza MGraphs in Jena TDB
>>>>
>>>> 2. TdbTcProvider: The same as now but now extending the abstract one. I
>>>> follows the currently used methodology to map Clerezza graphs to separate
>>>> TDB datasets
>>>>
>>>> 3. SingleDatasetTdbTcProvider: Tdb provider variant that stores all
>>>> MGraphs in a single TDB dataset. This provider should also support
>>>> "configurationFactory=true" (multiple instances). each instance would use a
>>>> different TDB dataset to store its MGrpahs.
>>>>
>>>> By default the SingleDatasetTdbTcProvider would be inactive, because it
>>>> requires a configuration of the directory for the  TDB dataset as well as a
>>>> name (that can be used in Filters). This ensures full backward
>>>> compatibility.
>>>>
>>>> In environment - such as Stanbol - where you want to store multiple graphs
>>>> in the same TDB dataset you would need to provide a configuration for the
>>>> SingleDatasetTdbTcProvider. Here you have two possible usage scenarios:
>>>>
>>>> * if you just need a single TDB dataset that stores all MGraphs, than you
>>>> can assign a high enough service.ranking to the SingleDatasetTdbTcProvider
>>>> and normally use the TcManager to create your graphs.
>>>> * if you want to use single TDB datasets or a mix of the TdbTcProvider and
>>>> SingleDatasetTdbTcProvider's you will need to use according filters.
>>>>
>>>>
>>>> WDYT
>>>> Rupert
>>>>
>>>>
>>>> [1] https://issues.apache.org/jira/browse/CLEREZZA-467
>>>>
>>>> On 16.03.2012, at 10:44, Rupert Westenthaler wrote:
>>>>
>>>>> Hi David, all
>>>>>
>>>>> this could be the explanation for the failed build on the Jenkins server
>>>> when the SEO configuration for the Refactor engine was used in the default
>>>> configuration of the Full launcher
>>>>> see http://markmail.org/message/sprwklaobdjankig for details.
>>>>>
>>>>> For me that looks like as if the RefactorEngine does create multiple
>>>> Jena TDB instances for various created MGraphs. One needs to know the even
>>>> for an empty graph Jena TDB creates ~200MByte of index files. So it is
>>>> important to map multiple MGraphs to different named graphs of the same
>>>> Jena TDB store.
>>>>> I have no Idea how Clerezza manages this or how Ontonet creates MGraphs,
>>>> but I hope this can help in tracing this down.
>>>>> best
>>>>> Rupert
>>>>>
>>>>> On 16.03.2012, at 10:30, David Riccitelli wrote:
>>>>>
>>>>>> Dears,
>>>>>>
>>>>>> As I ran into disk issues, I found that this folder:
>>>>>> sling/felix/bundleXXX/data/tdb-data/mgraph
>>>>>>
>>>>>> where XX is the bundle of:
>>>>>> Clerezza - SCB Jena TDB Storage Provider
>>>>>> org.apache.clerezza.rdf.jena.tdb.storage
>>>>>>
>>>>>> took almost 70 gbytes of disk space (then the disk space has been
>>>>>> exhausted).
>>>>>>
>>>>>> These are some of the files I found inside:
>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology889
>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology1041
>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology395
>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology363
>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology661
>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology786
>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology608
>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology213
>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology188
>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology602
>>>>>>
>>>>>>
>>>>>> Any clues?
>>>>>>
>>>>>> Thanks,
>>>>>> David Riccitelli
>>>>>>
>>>>>>
>>>> ********************************************************************************
>>>>>> InsideOut10 s.r.l.
>>>>>> P.IVA: IT-11381771002
>>>>>> Fax: +39 0110708239
>>>>>> ---
>>>>>> LinkedIn: http://it.linkedin.com/in/riccitelli
>>>>>> Twitter: ziodave
>>>>>> ---
>>>>>> Layar Partner Network<
>>>> http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1
>>>> ********************************************************************************
>>>>
>>
>> -- 
>> M.Sc. Alessandro Adamou
>>
>> Alma Mater Studiorum - Università di Bologna
>> Department of Computer Science
>> Mura Anteo Zamboni 7, 40127 Bologna - Italy
>>
>> Semantic Technology Laboratory (STLab)
>> Institute for Cognitive Science and Technology (ISTC)
>> National Research Council (CNR)
>> Via Nomentana 56, 00161 Rome - Italy
>>
>>
>> "I will give you everything, so long as you do not demand anything."
>> (Ettore Petrolini, 1930)
>>
>> Not sent from my iSnobTechDevice
>>
>


-- 
M.Sc. Alessandro Adamou

Alma Mater Studiorum - Università di Bologna
Department of Computer Science
Mura Anteo Zamboni 7, 40127 Bologna - Italy

Semantic Technology Laboratory (STLab)
Institute for Cognitive Science and Technology (ISTC)
National Research Council (CNR)
Via Nomentana 56, 00161 Rome - Italy


"I will give you everything, so long as you do not demand anything."
(Ettore Petrolini, 1930)

Not sent from my iSnobTechDevice


Re: clerezza.rdf.jena.tdb.storage filling up with ontonet files

Posted by Rupert Westenthaler <ru...@gmail.com>.
On 04.04.2012, at 19:18, Alessandro Adamou wrote:

> Hi Rupert, all,
> 
> just telling you that I have tried the SingleTdbDatasetTcProvider on the field with one of my use cases which involves many small ontologies (content design patterns).
> 
> I've created ~20 graphs totalling about 500 triples
> 
> On OS X 10.6.8 (on HFS+ filesystem with journalling) the database grew from an initial 184MiB to 248MiB
> 
> I am yet to test large graphs, so I cannot tell if the overhead is given by named graph indexes or the triple storage, but this is already a big leap from the TdbTcProvider.
> 

Thx for testing. 

> Did you already commit this component to rdf.jena.tdb.storage ?
> 

No not yet, but I have made some improvements and fixed some bugs since the last patch attached to the Issue. I hope I will have some time to finish this later this week.

best
Rupert

> Best,
> 
> Alessandro
> 
> On 3/19/12 9:16 AM, Hasan Hasan wrote:
>> Hi all,
>> 
>> I generally agree to extend Clerezza to be able to support multiple
>> requirements. Thus, I see the necessity of SingleDatasetTdbTcProvide.
>> Although I am bit unhappy, due to the fact, that application developers
>> have to be aware of this.
>> Note that, new clerezza instances (at least my own build) do not anymore
>> generate 200 MB of index files for empty graphs, but merely 200K.
>> 
>> Regards
>> Hasan
>> 
>> 
>> On Fri, Mar 16, 2012 at 2:10 PM, Rupert Westenthaler<
>> rupert.westenthaler@gmail.com>  wrote:
>> 
>>> Hi David, stanbol&  clerezza community
>>> 
>>> Short summary of the situation:
>>> 
>>> The Ontonet component generate a lot of MGraphs using the Jena TDB
>>> provider. This causes the disc consumption and number of open files to
>>> explode. See the quoted emails for details
>>> 
>>> 
>>> @Stanbol  we are already discussion how to avoid the creation of such many
>>> graphs
>>> 
>>> 
>>> @Clerezza the observed behavior of the TDB provider is also very dangerous
>>> (at least for typical use cases in Apache Stanbol).
>>> 
>>> Even targeting at a different CLEREZZA-467 maybe provides a possible
>>> solution for that as it suggests to use named graphs instead of isolated
>>> TDB instances for creating MGraphs.
>>> 
>>> To be honest this would be the optimal solution for our usages of Clerezza
>>> in Stanbol. However I assume that for a semantic CMS it is saver to use
>>> different TDB datasets.
>>> 
>>> Because of that I  would like to make the following proposal that
>>> hopefully covers both the needs of Apache Stanbol and Apache Clerezza.
>>> 
>>> 1. AbstractTdbTcProvider: providing most of the functionality needed to
>>> store Clerezza MGraphs in Jena TDB
>>> 
>>> 2. TdbTcProvider: The same as now but now extending the abstract one. I
>>> follows the currently used methodology to map Clerezza graphs to separate
>>> TDB datasets
>>> 
>>> 3. SingleDatasetTdbTcProvider: Tdb provider variant that stores all
>>> MGraphs in a single TDB dataset. This provider should also support
>>> "configurationFactory=true" (multiple instances). each instance would use a
>>> different TDB dataset to store its MGrpahs.
>>> 
>>> By default the SingleDatasetTdbTcProvider would be inactive, because it
>>> requires a configuration of the directory for the  TDB dataset as well as a
>>> name (that can be used in Filters). This ensures full backward
>>> compatibility.
>>> 
>>> In environment - such as Stanbol - where you want to store multiple graphs
>>> in the same TDB dataset you would need to provide a configuration for the
>>> SingleDatasetTdbTcProvider. Here you have two possible usage scenarios:
>>> 
>>> * if you just need a single TDB dataset that stores all MGraphs, than you
>>> can assign a high enough service.ranking to the SingleDatasetTdbTcProvider
>>> and normally use the TcManager to create your graphs.
>>> * if you want to use single TDB datasets or a mix of the TdbTcProvider and
>>> SingleDatasetTdbTcProvider's you will need to use according filters.
>>> 
>>> 
>>> WDYT
>>> Rupert
>>> 
>>> 
>>> [1] https://issues.apache.org/jira/browse/CLEREZZA-467
>>> 
>>> On 16.03.2012, at 10:44, Rupert Westenthaler wrote:
>>> 
>>>> Hi David, all
>>>> 
>>>> this could be the explanation for the failed build on the Jenkins server
>>> when the SEO configuration for the Refactor engine was used in the default
>>> configuration of the Full launcher
>>>> see http://markmail.org/message/sprwklaobdjankig for details.
>>>> 
>>>> For me that looks like as if the RefactorEngine does create multiple
>>> Jena TDB instances for various created MGraphs. One needs to know the even
>>> for an empty graph Jena TDB creates ~200MByte of index files. So it is
>>> important to map multiple MGraphs to different named graphs of the same
>>> Jena TDB store.
>>>> I have no Idea how Clerezza manages this or how Ontonet creates MGraphs,
>>> but I hope this can help in tracing this down.
>>>> best
>>>> Rupert
>>>> 
>>>> On 16.03.2012, at 10:30, David Riccitelli wrote:
>>>> 
>>>>> Dears,
>>>>> 
>>>>> As I ran into disk issues, I found that this folder:
>>>>> sling/felix/bundleXXX/data/tdb-data/mgraph
>>>>> 
>>>>> where XX is the bundle of:
>>>>> Clerezza - SCB Jena TDB Storage Provider
>>>>> org.apache.clerezza.rdf.jena.tdb.storage
>>>>> 
>>>>> took almost 70 gbytes of disk space (then the disk space has been
>>>>> exhausted).
>>>>> 
>>>>> These are some of the files I found inside:
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology889
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology1041
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology395
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology363
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology661
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology786
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology608
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology213
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology188
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology602
>>>>> 
>>>>> 
>>>>> Any clues?
>>>>> 
>>>>> Thanks,
>>>>> David Riccitelli
>>>>> 
>>>>> 
>>> ********************************************************************************
>>>>> InsideOut10 s.r.l.
>>>>> P.IVA: IT-11381771002
>>>>> Fax: +39 0110708239
>>>>> ---
>>>>> LinkedIn: http://it.linkedin.com/in/riccitelli
>>>>> Twitter: ziodave
>>>>> ---
>>>>> Layar Partner Network<
>>> http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1
>>> ********************************************************************************
>>> 
> 
> 
> -- 
> M.Sc. Alessandro Adamou
> 
> Alma Mater Studiorum - Università di Bologna
> Department of Computer Science
> Mura Anteo Zamboni 7, 40127 Bologna - Italy
> 
> Semantic Technology Laboratory (STLab)
> Institute for Cognitive Science and Technology (ISTC)
> National Research Council (CNR)
> Via Nomentana 56, 00161 Rome - Italy
> 
> 
> "I will give you everything, so long as you do not demand anything."
> (Ettore Petrolini, 1930)
> 
> Not sent from my iSnobTechDevice
> 
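In practice, Rupert's proposal above would come down to one factory configuration per TDB dataset. A hypothetical sketch of such a configuration follows — the PID and property names here are my assumptions, not the component's actual ones, which were still unfinished at the time of this message:

```
# Factory configuration for one SingleTdbDatasetTcProvider instance
# (hypothetical PID and property names)
org.apache.clerezza.rdf.jena.tdb.storage.SingleTdbDatasetTcProvider-ontonet
    tdb-dir         = ${sling.home}/datafiles/tdb-dataset
    dataset-name    = ontonet
    service.ranking = 1000
```

With a high enough `service.ranking`, graphs created through the plain `TcManager` API would land in this shared dataset; additional factory instances (or OSGi filters) would cover the mixed-provider scenario Rupert describes.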


Re: clerezza.rdf.jena.tdb.storage filling up with ontonet files

Posted by Alessandro Adamou <ad...@cs.unibo.it>.
Hi Rupert, all,

just telling you that I have tried the SingleTdbDatasetTcProvider on the 
field with one of my use cases which involves many small ontologies 
(content design patterns).

I've created ~20 graphs totalling about 500 triples

On OS X 10.6.8 (on HFS+ filesystem with journalling) the database grew 
from an initial 184MiB to 248MiB

I am yet to test large graphs, so I cannot tell if the overhead is given 
by named graph indexes or the triple storage, but this is already a big 
leap from the TdbTcProvider.

Did you already commit this component to rdf.jena.tdb.storage ?

Best,

Alessandro

On 3/19/12 9:16 AM, Hasan Hasan wrote:
> Hi all,
>
> I generally agree to extend Clerezza to be able to support multiple
> requirements. Thus, I see the necessity of SingleDatasetTdbTcProvide.
> Although I am bit unhappy, due to the fact, that application developers
> have to be aware of this.
> Note that, new clerezza instances (at least my own build) do not anymore
> generate 200 MB of index files for empty graphs, but merely 200K.
>
> Regards
> Hasan
>
>
> On Fri, Mar 16, 2012 at 2:10 PM, Rupert Westenthaler<
> rupert.westenthaler@gmail.com>  wrote:
>
>> Hi David, stanbol&  clerezza community
>>
>> Short summary of the situation:
>>
>> The Ontonet component generate a lot of MGraphs using the Jena TDB
>> provider. This causes the disc consumption and number of open files to
>> explode. See the quoted emails for details
>>
>>
>> @Stanbol  we are already discussion how to avoid the creation of such many
>> graphs
>>
>>
>> @Clerezza the observed behavior of the TDB provider is also very dangerous
>> (at least for typical use cases in Apache Stanbol).
>>
>> Even targeting at a different CLEREZZA-467 maybe provides a possible
>> solution for that as it suggests to use named graphs instead of isolated
>> TDB instances for creating MGraphs.
>>
>> To be honest this would be the optimal solution for our usages of Clerezza
>> in Stanbol. However I assume that for a semantic CMS it is saver to use
>> different TDB datasets.
>>
>> Because of that I  would like to make the following proposal that
>> hopefully covers both the needs of Apache Stanbol and Apache Clerezza.
>>
>> 1. AbstractTdbTcProvider: providing most of the functionality needed to
>> store Clerezza MGraphs in Jena TDB
>>
>> 2. TdbTcProvider: The same as now but now extending the abstract one. I
>> follows the currently used methodology to map Clerezza graphs to separate
>> TDB datasets
>>
>> 3. SingleDatasetTdbTcProvider: Tdb provider variant that stores all
>> MGraphs in a single TDB dataset. This provider should also support
>> "configurationFactory=true" (multiple instances). each instance would use a
>> different TDB dataset to store its MGrpahs.
>>
>> By default the SingleDatasetTdbTcProvider would be inactive, because it
>> requires a configuration of the directory for the  TDB dataset as well as a
>> name (that can be used in Filters). This ensures full backward
>> compatibility.
>>
>> In environment - such as Stanbol - where you want to store multiple graphs
>> in the same TDB dataset you would need to provide a configuration for the
>> SingleDatasetTdbTcProvider. Here you have two possible usage scenarios:
>>
>> * if you just need a single TDB dataset that stores all MGraphs, than you
>> can assign a high enough service.ranking to the SingleDatasetTdbTcProvider
>> and normally use the TcManager to create your graphs.
>> * if you want to use single TDB datasets or a mix of the TdbTcProvider and
>> SingleDatasetTdbTcProvider's you will need to use according filters.
>>
>>
>> WDYT
>> Rupert
>>
>>
>> [1] https://issues.apache.org/jira/browse/CLEREZZA-467
>>
>> On 16.03.2012, at 10:44, Rupert Westenthaler wrote:
>>
>>> Hi David, all
>>>
>>> this could be the explanation for the failed build on the Jenkins server
>> when the SEO configuration for the Refactor engine was used in the default
>> configuration of the Full launcher
>>> see http://markmail.org/message/sprwklaobdjankig for details.
>>>
>>> For me that looks like as if the RefactorEngine does create multiple
>> Jena TDB instances for various created MGraphs. One needs to know the even
>> for an empty graph Jena TDB creates ~200MByte of index files. So it is
>> important to map multiple MGraphs to different named graphs of the same
>> Jena TDB store.
>>> I have no Idea how Clerezza manages this or how Ontonet creates MGraphs,
>> but I hope this can help in tracing this down.
>>> best
>>> Rupert
>>>
>>> On 16.03.2012, at 10:30, David Riccitelli wrote:
>>>
>>>> Dears,
>>>>
>>>> As I ran into disk issues, I found that this folder:
>>>> sling/felix/bundleXXX/data/tdb-data/mgraph
>>>>
>>>> where XX is the bundle of:
>>>> Clerezza - SCB Jena TDB Storage Provider
>>>> org.apache.clerezza.rdf.jena.tdb.storage
>>>>
>>>> took almost 70 gbytes of disk space (then the disk space has been
>>>> exhausted).
>>>>
>>>> These are some of the files I found inside:
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology889
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology1041
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology395
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology363
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology661
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology786
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology608
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology213
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology188
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology602
>>>>
>>>>
>>>> Any clues?
>>>>
>>>> Thanks,
>>>> David Riccitelli
>>>>
>>>>
>> ********************************************************************************
>>>> InsideOut10 s.r.l.
>>>> P.IVA: IT-11381771002
>>>> Fax: +39 0110708239
>>>> ---
>>>> LinkedIn: http://it.linkedin.com/in/riccitelli
>>>> Twitter: ziodave
>>>> ---
>>>> Layar Partner Network<
>> http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1
>> ********************************************************************************
>>


-- 
M.Sc. Alessandro Adamou

Alma Mater Studiorum - Università di Bologna
Department of Computer Science
Mura Anteo Zamboni 7, 40127 Bologna - Italy

Semantic Technology Laboratory (STLab)
Institute for Cognitive Science and Technology (ISTC)
National Research Council (CNR)
Via Nomentana 56, 00161 Rome - Italy


"I will give you everything, so long as you do not demand anything."
(Ettore Petrolini, 1930)

Not sent from my iSnobTechDevice
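
The 20-graph test above implies a large per-graph saving over the one-dataset-per-graph approach. A sketch of the arithmetic (assuming linear growth, which TDB's chunked allocation does not strictly guarantee):

```python
# Per-graph disk cost implied by the 20-graph test above (a sketch;
# TDB allocates node tables and indexes in coarse chunks, so the
# marginal cost per graph is not actually linear).

GRAPHS = 20
shared_per_graph = (248 - 184) / GRAPHS  # MiB per graph, shared dataset
separate_per_graph = 193.0               # MiB per graph observed earlier
                                         # with one TDB dataset per graph

print(shared_per_graph)                              # 3.2 MiB per graph
print(round(separate_per_graph / shared_per_graph))  # ~60x difference
```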


>>>> exhausted).
>>>>
>>>> These are some of the files I found inside:
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology889
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology1041
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology395
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology363
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology661
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology786
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology608
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology213
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology188
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology602
>>>>
>>>>
>>>> Any clues?
>>>>
>>>> Thanks,
>>>> David Riccitelli
>>>>
>>>>
>> ********************************************************************************
>>>> InsideOut10 s.r.l.
>>>> P.IVA: IT-11381771002
>>>> Fax: +39 0110708239
>>>> ---
>>>> LinkedIn: http://it.linkedin.com/in/riccitelli
>>>> Twitter: ziodave
>>>> ---
>>>> Layar Partner Network<
>> http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1
>> ********************************************************************************
>>


-- 
M.Sc. Alessandro Adamou

Alma Mater Studiorum - Università di Bologna
Department of Computer Science
Mura Anteo Zamboni 7, 40127 Bologna - Italy

Semantic Technology Laboratory (STLab)
Institute for Cognitive Science and Technology (ISTC)
National Research Council (CNR)
Via Nomentana 56, 00161 Rome - Italy


"I will give you everything, so long as you do not demand anything."
(Ettore Petrolini, 1930)

Not sent from my iSnobTechDevice


Re: clerezza.rdf.jena.tdb.storage filling up with ontonet files

Posted by Hasan Hasan <ha...@trialox.org>.
Hi all,

I generally agree that Clerezza should be extended to support multiple
requirements, so I see the necessity of a SingleDatasetTdbTcProvider,
although I am a bit unhappy that application developers have to be aware
of this.
Note that new Clerezza instances (at least my own build) no longer
generate 200 MB of index files for empty graphs, but merely 200 KB.

Regards
Hasan


On Fri, Mar 16, 2012 at 2:10 PM, Rupert Westenthaler <
rupert.westenthaler@gmail.com> wrote:

> Hi David, stanbol & clerezza community
>
> Short summary of the situation:
>
> The Ontonet component generate a lot of MGraphs using the Jena TDB
> provider. This causes the disc consumption and number of open files to
> explode. See the quoted emails for details
>
>
> @Stanbol  we are already discussion how to avoid the creation of such many
> graphs
>
>
> @Clerezza the observed behavior of the TDB provider is also very dangerous
> (at least for typical use cases in Apache Stanbol).
>
> Even targeting at a different CLEREZZA-467 maybe provides a possible
> solution for that as it suggests to use named graphs instead of isolated
> TDB instances for creating MGraphs.
>
> To be honest this would be the optimal solution for our usages of Clerezza
> in Stanbol. However I assume that for a semantic CMS it is saver to use
> different TDB datasets.
>
> Because of that I  would like to make the following proposal that
> hopefully covers both the needs of Apache Stanbol and Apache Clerezza.
>
> 1. AbstractTdbTcProvider: providing most of the functionality needed to
> store Clerezza MGraphs in Jena TDB
>
> 2. TdbTcProvider: The same as now but now extending the abstract one. I
> follows the currently used methodology to map Clerezza graphs to separate
> TDB datasets
>
> 3. SingleDatasetTdbTcProvider: Tdb provider variant that stores all
> MGraphs in a single TDB dataset. This provider should also support
> "configurationFactory=true" (multiple instances). each instance would use a
> different TDB dataset to store its MGrpahs.
>
> By default the SingleDatasetTdbTcProvider would be inactive, because it
> requires a configuration of the directory for the  TDB dataset as well as a
> name (that can be used in Filters). This ensures full backward
> compatibility.
>
> In environment - such as Stanbol - where you want to store multiple graphs
> in the same TDB dataset you would need to provide a configuration for the
> SingleDatasetTdbTcProvider. Here you have two possible usage scenarios:
>
> * if you just need a single TDB dataset that stores all MGraphs, than you
> can assign a high enough service.ranking to the SingleDatasetTdbTcProvider
> and normally use the TcManager to create your graphs.
> * if you want to use single TDB datasets or a mix of the TdbTcProvider and
> SingleDatasetTdbTcProvider's you will need to use according filters.
>
>
> WDYT
> Rupert
>
>
> [1] https://issues.apache.org/jira/browse/CLEREZZA-467
>
> On 16.03.2012, at 10:44, Rupert Westenthaler wrote:
>
> > Hi David, all
> >
> > this could be the explanation for the failed build on the Jenkins server
> when the SEO configuration for the Refactor engine was used in the default
> configuration of the Full launcher
> >
> > see http://markmail.org/message/sprwklaobdjankig for details.
> >
> > For me that looks like as if the RefactorEngine does create multiple
> Jena TDB instances for various created MGraphs. One needs to know the even
> for an empty graph Jena TDB creates ~200MByte of index files. So it is
> important to map multiple MGraphs to different named graphs of the same
> Jena TDB store.
> >
> > I have no Idea how Clerezza manages this or how Ontonet creates MGraphs,
> but I hope this can help in tracing this down.
> >
> > best
> > Rupert
> >
> > On 16.03.2012, at 10:30, David Riccitelli wrote:
> >
> >> Dears,
> >>
> >> As I ran into disk issues, I found that this folder:
> >> sling/felix/bundleXXX/data/tdb-data/mgraph
> >>
> >> where XX is the bundle of:
> >> Clerezza - SCB Jena TDB Storage Provider
> >> org.apache.clerezza.rdf.jena.tdb.storage
> >>
> >> took almost 70 gbytes of disk space (then the disk space has been
> >> exhausted).
> >>
> >> These are some of the files I found inside:
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology889
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology1041
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology395
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology363
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology661
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology786
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology608
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology213
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology188
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology602
> >>
> >>
> >> Any clues?
> >>
> >> Thanks,
> >> David Riccitelli
> >>
> >>
> ********************************************************************************
> >> InsideOut10 s.r.l.
> >> P.IVA: IT-11381771002
> >> Fax: +39 0110708239
> >> ---
> >> LinkedIn: http://it.linkedin.com/in/riccitelli
> >> Twitter: ziodave
> >> ---
> >> Layar Partner Network<
> http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1
> >
> >>
> ********************************************************************************
> >
>
>

Re: clerezza.rdf.jena.tdb.storage filling up with ontonet files

Posted by Reto Bachmann-Gmür <re...@apache.org>.
Hi Rupert,

I like your proposal but would suggest:
- SingleDatasetTdbTcProvider should not need a directory configured
- SingleDatasetTdbTcProvider should have the higher weight and thus be the
one used by default

I think there might be use cases where you want a graph to be isolated from
the rest, but I think the default behaviour should be the more performant
and less memory-expensive SingleDatasetTdbTcProvider.

We could add a tool to Clerezza that allows creating an MGraph in a
TcProvider other than the one with the highest weight.

Cheers,
Reto
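To picture the weight-based selection Reto describes, here is a toy model: the highest-ranked provider wins by default, while a dedicated tool could still target a lower-ranked one explicitly. The class and provider names below are illustrative only, not Clerezza's actual API.

```java
import java.util.Comparator;
import java.util.List;

// Toy model of weight-based TcProvider selection: by default the
// provider with the highest weight is chosen to create new MGraphs.
public class ProviderSelection {

    record Provider(String name, int weight) {}

    /** Default behaviour: the highest-weight provider wins. */
    static Provider select(List<Provider> providers) {
        return providers.stream()
                .max(Comparator.comparingInt(Provider::weight))
                .orElseThrow();
    }

    public static void main(String[] args) {
        List<Provider> registered = List.of(
                new Provider("TdbTcProvider", 100),
                new Provider("SingleDatasetTdbTcProvider", 200));
        // prints "SingleDatasetTdbTcProvider"
        System.out.println(select(registered).name());
    }
}
```

A "create in a specific provider" tool would simply bypass `select` and address a provider by name or filter instead.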

On Fri, Mar 16, 2012 at 2:10 PM, Rupert Westenthaler <
rupert.westenthaler@gmail.com> wrote:

> Hi David, stanbol & clerezza community
>
> Short summary of the situation:
>
> The Ontonet component generate a lot of MGraphs using the Jena TDB
> provider. This causes the disc consumption and number of open files to
> explode. See the quoted emails for details
>
>
> @Stanbol  we are already discussion how to avoid the creation of such many
> graphs
>
>
> @Clerezza the observed behavior of the TDB provider is also very dangerous
> (at least for typical use cases in Apache Stanbol).
>
> Even targeting at a different CLEREZZA-467 maybe provides a possible
> solution for that as it suggests to use named graphs instead of isolated
> TDB instances for creating MGraphs.
>
> To be honest this would be the optimal solution for our usages of Clerezza
> in Stanbol. However I assume that for a semantic CMS it is saver to use
> different TDB datasets.
>
> Because of that I  would like to make the following proposal that
> hopefully covers both the needs of Apache Stanbol and Apache Clerezza.
>
> 1. AbstractTdbTcProvider: providing most of the functionality needed to
> store Clerezza MGraphs in Jena TDB
>
> 2. TdbTcProvider: The same as now but now extending the abstract one. I
> follows the currently used methodology to map Clerezza graphs to separate
> TDB datasets
>
> 3. SingleDatasetTdbTcProvider: Tdb provider variant that stores all
> MGraphs in a single TDB dataset. This provider should also support
> "configurationFactory=true" (multiple instances). each instance would use a
> different TDB dataset to store its MGrpahs.
>
> By default the SingleDatasetTdbTcProvider would be inactive, because it
> requires a configuration of the directory for the  TDB dataset as well as a
> name (that can be used in Filters). This ensures full backward
> compatibility.
>
> In environment - such as Stanbol - where you want to store multiple graphs
> in the same TDB dataset you would need to provide a configuration for the
> SingleDatasetTdbTcProvider. Here you have two possible usage scenarios:
>
> * if you just need a single TDB dataset that stores all MGraphs, than you
> can assign a high enough service.ranking to the SingleDatasetTdbTcProvider
> and normally use the TcManager to create your graphs.
> * if you want to use single TDB datasets or a mix of the TdbTcProvider and
> SingleDatasetTdbTcProvider's you will need to use according filters.
>
>
> WDYT
> Rupert
>
>
> [1] https://issues.apache.org/jira/browse/CLEREZZA-467
>
> On 16.03.2012, at 10:44, Rupert Westenthaler wrote:
>
> > Hi David, all
> >
> > this could be the explanation for the failed build on the Jenkins server
> when the SEO configuration for the Refactor engine was used in the default
> configuration of the Full launcher
> >
> > see http://markmail.org/message/sprwklaobdjankig for details.
> >
> > For me that looks like as if the RefactorEngine does create multiple
> Jena TDB instances for various created MGraphs. One needs to know the even
> for an empty graph Jena TDB creates ~200MByte of index files. So it is
> important to map multiple MGraphs to different named graphs of the same
> Jena TDB store.
> >
> > I have no Idea how Clerezza manages this or how Ontonet creates MGraphs,
> but I hope this can help in tracing this down.
> >
> > best
> > Rupert
> >
> > On 16.03.2012, at 10:30, David Riccitelli wrote:
> >
> >> Dears,
> >>
> >> As I ran into disk issues, I found that this folder:
> >> sling/felix/bundleXXX/data/tdb-data/mgraph
> >>
> >> where XX is the bundle of:
> >> Clerezza - SCB Jena TDB Storage Provider
> >> org.apache.clerezza.rdf.jena.tdb.storage
> >>
> >> took almost 70 gbytes of disk space (then the disk space has been
> >> exhausted).
> >>
> >> These are some of the files I found inside:
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology889
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology1041
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology395
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology363
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology661
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology786
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology608
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology213
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology188
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology602
> >>
> >>
> >> Any clues?
> >>
> >> Thanks,
> >> David Riccitelli
> >>
> >>
> ********************************************************************************
> >> InsideOut10 s.r.l.
> >> P.IVA: IT-11381771002
> >> Fax: +39 0110708239
> >> ---
> >> LinkedIn: http://it.linkedin.com/in/riccitelli
> >> Twitter: ziodave
> >> ---
> >> Layar Partner Network<
> http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1
> >
> >>
> ********************************************************************************
> >
>
>


Re: clerezza.rdf.jena.tdb.storage filling up with ontonet files

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi David, stanbol & clerezza community

Short summary of the situation:

The Ontonet component generates a lot of MGraphs using the Jena TDB provider. This causes disk consumption and the number of open files to explode. See the quoted emails for details.


@Stanbol we are already discussing how to avoid the creation of so many graphs.


@Clerezza the observed behavior of the TDB provider is also very dangerous (at least for typical use cases in Apache Stanbol).

Even though it targets a different issue, CLEREZZA-467 [1] may provide a possible solution, as it suggests using named graphs instead of isolated TDB instances for creating MGraphs.

To be honest, this would be the optimal solution for our usage of Clerezza in Stanbol. However, I assume that for a semantic CMS it is safer to use different TDB datasets.

Because of that I would like to make the following proposal, which hopefully covers both the needs of Apache Stanbol and Apache Clerezza.

1. AbstractTdbTcProvider: provides most of the functionality needed to store Clerezza MGraphs in Jena TDB.

2. TdbTcProvider: the same as now, but extending the abstract one. It follows the current methodology of mapping Clerezza graphs to separate TDB datasets.

3. SingleDatasetTdbTcProvider: a TDB provider variant that stores all MGraphs in a single TDB dataset. This provider should also support "configurationFactory=true" (multiple instances); each instance would use a different TDB dataset to store its MGraphs.

By default the SingleDatasetTdbTcProvider would be inactive, because it requires the configuration of a directory for the TDB dataset as well as a name (that can be used in filters). This ensures full backward compatibility.

In environments - such as Stanbol - where you want to store multiple graphs in the same TDB dataset, you would need to provide a configuration for the SingleDatasetTdbTcProvider. There are two possible usage scenarios:

* if you just need a single TDB dataset that stores all MGraphs, then you can assign a high enough service.ranking to the SingleDatasetTdbTcProvider and normally use the TcManager to create your graphs.
* if you want to use separate TDB datasets, or a mix of TdbTcProvider and SingleDatasetTdbTcProvider instances, you will need to use the corresponding filters.


WDYT
Rupert


[1] https://issues.apache.org/jira/browse/CLEREZZA-467
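To make the proposal concrete, a factory configuration for such a provider could look roughly like this in a Sling/Felix `.config` file. The PID and property names are hypothetical, derived from the proposal rather than from existing code:

```
# org.apache.clerezza.rdf.jena.tdb.storage.SingleDatasetTdbTcProvider-stanbol.config
tdb-dir="sling/tdb/stanbol-mgraphs"
name="stanbol"
service.ranking=I"1000"
```

With a high enough `service.ranking`, this instance would be the default target of TcManager graph creation; additional factory configurations (with different names and directories) could be addressed via filters.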

On 16.03.2012, at 10:44, Rupert Westenthaler wrote:

> Hi David, all
> 
> this could be the explanation for the failed build on the Jenkins server when the SEO configuration for the Refactor engine was used in the default configuration of the Full launcher
> 
> see http://markmail.org/message/sprwklaobdjankig for details.
> 
> For me that looks like as if the RefactorEngine does create multiple Jena TDB instances for various created MGraphs. One needs to know the even for an empty graph Jena TDB creates ~200MByte of index files. So it is important to map multiple MGraphs to different named graphs of the same Jena TDB store.
> 
> I have no Idea how Clerezza manages this or how Ontonet creates MGraphs, but I hope this can help in tracing this down.
> 
> best
> Rupert 
> 
> On 16.03.2012, at 10:30, David Riccitelli wrote:
> 
>> Dears,
>> 
>> As I ran into disk issues, I found that this folder:
>> sling/felix/bundleXXX/data/tdb-data/mgraph
>> 
>> where XX is the bundle of:
>> Clerezza - SCB Jena TDB Storage Provider
>> org.apache.clerezza.rdf.jena.tdb.storage
>> 
>> took almost 70 gbytes of disk space (then the disk space has been
>> exhausted).
>> 
>> These are some of the files I found inside:
>> 193M ./ontonet%3A%3Ainputstream%3Aontology889
>> 193M ./ontonet%3A%3Ainputstream%3Aontology1041
>> 193M ./ontonet%3A%3Ainputstream%3Aontology395
>> 193M ./ontonet%3A%3Ainputstream%3Aontology363
>> 193M ./ontonet%3A%3Ainputstream%3Aontology661
>> 193M ./ontonet%3A%3Ainputstream%3Aontology786
>> 193M ./ontonet%3A%3Ainputstream%3Aontology608
>> 193M ./ontonet%3A%3Ainputstream%3Aontology213
>> 193M ./ontonet%3A%3Ainputstream%3Aontology188
>> 193M ./ontonet%3A%3Ainputstream%3Aontology602
>> 
>> 
>> Any clues?
>> 
>> Thanks,
>> David Riccitelli
>> 
>> ********************************************************************************
>> InsideOut10 s.r.l.
>> P.IVA: IT-11381771002
>> Fax: +39 0110708239
>> ---
>> LinkedIn: http://it.linkedin.com/in/riccitelli
>> Twitter: ziodave
>> ---
>> Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
>> ********************************************************************************
> 


Re: clerezza.rdf.jena.tdb.storage filling up with ontonet files

Posted by Alessandro Adamou <ad...@cs.unibo.it>.
It *could* depend on the way the refactor engine is creating 
OntologyInputSource objects. Rather than using some of its readily 
available implementations, it's re-implementing the whole interface 
every time.

Alessandro


On 3/16/12 10:55 AM, David Riccitelli wrote:
> After one analysis job the following resources were used:
>
>     - disk space in sling/felix/bundle85/data/tdb-data/mgraph raised from
>     1.8M to 5.5M.
>     - open files by java process raised from 734 to 1.681.
>
> BR
> David
>
> On Fri, Mar 16, 2012 at 11:53 AM, David Riccitelli<da...@insideout.io>wrote:
>
>> Hoping these more details can help solve this issue: in order to restart
>> Stanbol I have to clear up the sling/felix/bundle85/data/tdb-data/mgraph/
>> folder otherwise these files get loaded at start-up and the server breaks
>> with 'Too many open files'.
>>
>> BR
>> David
>>
>>
>> On Fri, Mar 16, 2012 at 11:49 AM, David Riccitelli<da...@insideout.io>wrote:
>>
>>> I'll try to add more details on what is happening now: all the analysis
>>> jobs are failing because of this error: java.io.IOException: Too many open
>>> files.
>>>
>>> I found more than 1.300 ontonet files open by the *java* process:
>>> $ lsof  | grep "mgraph/ontonet" | wc -l
>>> 1323
>>>
>>> e.g.
>>>
>>> sling/felix/bundle85/data/tdb-data/mgraph/ontonet%3A%3Ainputstream%3Aontology49/OSP.idn
>>>
>>> sling/felix/bundle85/data/tdb-data/mgraph/ontonet%3A%3Ainputstream%3Aontology49/OSP.dat
>>>
>>> It's important to note that I am the only user on this stanbol instance
>>> and the error is raised at the second analysis.
>>>
>>> I think I can easily help you reproduce this issue in case.
>>>
>>> BR
>>> David
>>>
>>> On Fri, Mar 16, 2012 at 11:44 AM, Rupert Westenthaler<
>>> rupert.westenthaler@gmail.com>  wrote:
>>>
>>>> Hi David, all
>>>>
>>>> this could be the explanation for the failed build on the Jenkins server
>>>> when the SEO configuration for the Refactor engine was used in the default
>>>> configuration of the Full launcher
>>>>
>>>> see http://markmail.org/message/sprwklaobdjankig for details.
>>>>
>>>> For me that looks like as if the RefactorEngine does create multiple
>>>> Jena TDB instances for various created MGraphs. One needs to know the even
>>>> for an empty graph Jena TDB creates ~200MByte of index files. So it is
>>>> important to map multiple MGraphs to different named graphs of the same
>>>> Jena TDB store.
>>>>
>>>> I have no Idea how Clerezza manages this or how Ontonet creates MGraphs,
>>>> but I hope this can help in tracing this down.
>>>>
>>>> best
>>>> Rupert
>>>>
>>>> On 16.03.2012, at 10:30, David Riccitelli wrote:
>>>>
>>>>> Dears,
>>>>>
>>>>> As I ran into disk issues, I found that this folder:
>>>>> sling/felix/bundleXXX/data/tdb-data/mgraph
>>>>>
>>>>> where XX is the bundle of:
>>>>> Clerezza - SCB Jena TDB Storage Provider
>>>>> org.apache.clerezza.rdf.jena.tdb.storage
>>>>>
>>>>> took almost 70 gbytes of disk space (then the disk space has been
>>>>> exhausted).
>>>>>
>>>>> These are some of the files I found inside:
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology889
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology1041
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology395
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology363
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology661
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology786
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology608
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology213
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology188
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology602
>>>>>
>>>>>
>>>>> Any clues?
>>>>>
>>>>> Thanks,
>>>>> David Riccitelli
>>>>>
>>>>>
>>>> ********************************************************************************
>>>>> InsideOut10 s.r.l.
>>>>> P.IVA: IT-11381771002
>>>>> Fax: +39 0110708239
>>>>> ---
>>>>> LinkedIn: http://it.linkedin.com/in/riccitelli
>>>>> Twitter: ziodave
>>>>> ---
>>>>> Layar Partner Network<
>>>> http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1
>>>>>
>>>> ********************************************************************************
>>>>
>>>>
>>>
>>> --
>>> David Riccitelli
>>>
>>>
>>> ********************************************************************************
>>> InsideOut10 s.r.l.
>>> P.IVA: IT-11381771002
>>> Fax: +39 0110708239
>>> ---
>>> LinkedIn: http://it.linkedin.com/in/riccitelli
>>> Twitter: ziodave
>>> ---
>>> Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
>>>
>>> ********************************************************************************
>>>
>>>
>>
>> --
>> David Riccitelli
>>
>>
>> ********************************************************************************
>> InsideOut10 s.r.l.
>> P.IVA: IT-11381771002
>> Fax: +39 0110708239
>> ---
>> LinkedIn: http://it.linkedin.com/in/riccitelli
>> Twitter: ziodave
>> ---
>> Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
>>
>> ********************************************************************************
>>
>>
>


-- 
M.Sc. Alessandro Adamou

Alma Mater Studiorum - Università di Bologna
Department of Computer Science
Mura Anteo Zamboni 7, 40127 Bologna - Italy

Semantic Technology Laboratory (STLab)
Institute for Cognitive Science and Technology (ISTC)
National Research Council (CNR)
Via Nomentana 56, 00161 Rome - Italy


"I will give you everything, so long as you do not demand anything."
(Ettore Petrolini, 1930)

Not sent from my iSnobTechDevice


Re: clerezza.rdf.jena.tdb.storage filling up with ontonet files

Posted by David Riccitelli <da...@insideout.io>.
After one analysis job the following resource usage was observed:

   - disk space in sling/felix/bundle85/data/tdb-data/mgraph rose from
   1.8M to 5.5M.
   - open files held by the java process rose from 734 to 1,681.

BR
David
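For scale, the figures reported earlier in the thread (~70 GB of tdb-data, ~193 MB per graph directory) imply the store was holding on the order of a few hundred per-graph TDB datasets. A back-of-the-envelope check (pure arithmetic, not Stanbol code):

```java
// Rough estimate: how many isolated TDB datasets does ~70 GB of
// index files imply, at ~193 MB per (mostly empty) dataset?
public class TdbOverheadEstimate {

    /** Number of per-graph datasets that fit in the given disk budget (MB). */
    static long datasets(long totalMb, long perGraphMb) {
        return totalMb / perGraphMb;
    }

    public static void main(String[] args) {
        long totalMb = 70L * 1024; // ~70 GB reported, in MB
        // prints "~371 per-graph TDB datasets"
        System.out.println("~" + datasets(totalMb, 193) + " per-graph TDB datasets");
    }
}
```

Each such dataset also keeps several file handles open, which is consistent with the "Too many open files" failures seen after only a couple of analysis jobs.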

On Fri, Mar 16, 2012 at 11:53 AM, David Riccitelli <da...@insideout.io>wrote:

> Hoping these more details can help solve this issue: in order to restart
> Stanbol I have to clear up the sling/felix/bundle85/data/tdb-data/mgraph/
> folder otherwise these files get loaded at start-up and the server breaks
> with 'Too many open files'.
>
> BR
> David
>
>
> On Fri, Mar 16, 2012 at 11:49 AM, David Riccitelli <da...@insideout.io>wrote:
>
>> I'll try to add more details on what is happening now: all the analysis
>> jobs are failing because of this error: java.io.IOException: Too many open
>> files.
>>
>> I found more than 1.300 ontonet files open by the *java* process:
>> $ lsof  | grep "mgraph/ontonet" | wc -l
>> 1323
>>
>> e.g.
>>
>> sling/felix/bundle85/data/tdb-data/mgraph/ontonet%3A%3Ainputstream%3Aontology49/OSP.idn
>>
>> sling/felix/bundle85/data/tdb-data/mgraph/ontonet%3A%3Ainputstream%3Aontology49/OSP.dat
>>
>> It's important to note that I am the only user on this stanbol instance
>> and the error is raised at the second analysis.
>>
>> I think I can easily help you reproduce this issue in case.
>>
>> BR
>> David
>>
>> On Fri, Mar 16, 2012 at 11:44 AM, Rupert Westenthaler <
>> rupert.westenthaler@gmail.com> wrote:
>>
>>> Hi David, all
>>>
>>> this could be the explanation for the failed build on the Jenkins server
>>> when the SEO configuration for the Refactor engine was used in the default
>>> configuration of the Full launcher
>>>
>>> see http://markmail.org/message/sprwklaobdjankig for details.
>>>
>>> For me that looks like as if the RefactorEngine does create multiple
>>> Jena TDB instances for various created MGraphs. One needs to know the even
>>> for an empty graph Jena TDB creates ~200MByte of index files. So it is
>>> important to map multiple MGraphs to different named graphs of the same
>>> Jena TDB store.
>>>
>>> I have no Idea how Clerezza manages this or how Ontonet creates MGraphs,
>>> but I hope this can help in tracing this down.
>>>
>>> best
>>> Rupert
>>>
>>> On 16.03.2012, at 10:30, David Riccitelli wrote:
>>>
>>> > Dears,
>>> >
>>> > As I ran into disk issues, I found that this folder:
>>> > sling/felix/bundleXXX/data/tdb-data/mgraph
>>> >
>>> > where XX is the bundle of:
>>> > Clerezza - SCB Jena TDB Storage Provider
>>> > org.apache.clerezza.rdf.jena.tdb.storage
>>> >
>>> > took almost 70 gbytes of disk space (then the disk space has been
>>> > exhausted).
>>> >
>>> > These are some of the files I found inside:
>>> > 193M ./ontonet%3A%3Ainputstream%3Aontology889
>>> > 193M ./ontonet%3A%3Ainputstream%3Aontology1041
>>> > 193M ./ontonet%3A%3Ainputstream%3Aontology395
>>> > 193M ./ontonet%3A%3Ainputstream%3Aontology363
>>> > 193M ./ontonet%3A%3Ainputstream%3Aontology661
>>> > 193M ./ontonet%3A%3Ainputstream%3Aontology786
>>> > 193M ./ontonet%3A%3Ainputstream%3Aontology608
>>> > 193M ./ontonet%3A%3Ainputstream%3Aontology213
>>> > 193M ./ontonet%3A%3Ainputstream%3Aontology188
>>> > 193M ./ontonet%3A%3Ainputstream%3Aontology602
>>> >
>>> >
>>> > Any clues?
>>> >
>>> > Thanks,
>>> > David Riccitelli
>>> >
>>> >
>>> ********************************************************************************
>>> > InsideOut10 s.r.l.
>>> > P.IVA: IT-11381771002
>>> > Fax: +39 0110708239
>>> > ---
>>> > LinkedIn: http://it.linkedin.com/in/riccitelli
>>> > Twitter: ziodave
>>> > ---
>>> > Layar Partner Network<
>>> http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1
>>> >
>>> >
>>> ********************************************************************************
>>>
>>>
>>
>>
>> --
>> David Riccitelli
>>
>>
>> ********************************************************************************
>> InsideOut10 s.r.l.
>> P.IVA: IT-11381771002
>> Fax: +39 0110708239
>> ---
>> LinkedIn: http://it.linkedin.com/in/riccitelli
>> Twitter: ziodave
>> ---
>> Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
>>
>> ********************************************************************************
>>
>>
>
>
> --
> David Riccitelli
>
>
> ********************************************************************************
> InsideOut10 s.r.l.
> P.IVA: IT-11381771002
> Fax: +39 0110708239
> ---
> LinkedIn: http://it.linkedin.com/in/riccitelli
> Twitter: ziodave
> ---
> Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
>
> ********************************************************************************
>
>


-- 
David Riccitelli

********************************************************************************
InsideOut10 s.r.l.
P.IVA: IT-11381771002
Fax: +39 0110708239
---
LinkedIn: http://it.linkedin.com/in/riccitelli
Twitter: ziodave
---
Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
********************************************************************************

Re: clerezza.rdf.jena.tdb.storage filling up with ontonet files

Posted by David Riccitelli <da...@insideout.io>.
Hoping these additional details help solve the issue: in order to restart
Stanbol I have to clear out the sling/felix/bundle85/data/tdb-data/mgraph/
folder, otherwise these files are loaded at start-up and the server breaks
with 'Too many open files'.

BR
David

On Fri, Mar 16, 2012 at 11:49 AM, David Riccitelli <da...@insideout.io>wrote:

> I'll try to add more details on what is happening now: all the analysis
> jobs are failing because of this error: java.io.IOException: Too many open
> files.
>
> I found more than 1.300 ontonet files open by the *java* process:
> $ lsof  | grep "mgraph/ontonet" | wc -l
> 1323
>
> e.g.
>
> sling/felix/bundle85/data/tdb-data/mgraph/ontonet%3A%3Ainputstream%3Aontology49/OSP.idn
>
> sling/felix/bundle85/data/tdb-data/mgraph/ontonet%3A%3Ainputstream%3Aontology49/OSP.dat
>
> It's important to note that I am the only user on this stanbol instance
> and the error is raised at the second analysis.
>
> I think I can easily help you reproduce this issue in case.
>
> BR
> David
>
> On Fri, Mar 16, 2012 at 11:44 AM, Rupert Westenthaler <
> rupert.westenthaler@gmail.com> wrote:
>
>> Hi David, all
>>
>> this could be the explanation for the failed build on the Jenkins server
>> when the SEO configuration for the Refactor engine was used in the default
>> configuration of the Full launcher
>>
>> see http://markmail.org/message/sprwklaobdjankig for details.
>>
>> For me that looks like as if the RefactorEngine does create multiple Jena
>> TDB instances for various created MGraphs. One needs to know the even for
>> an empty graph Jena TDB creates ~200MByte of index files. So it is
>> important to map multiple MGraphs to different named graphs of the same
>> Jena TDB store.
>>
>> I have no Idea how Clerezza manages this or how Ontonet creates MGraphs,
>> but I hope this can help in tracing this down.
>>
>> best
>> Rupert
>>
>> On 16.03.2012, at 10:30, David Riccitelli wrote:
>>
>> > Dears,
>> >
>> > As I ran into disk issues, I found that this folder:
>> > sling/felix/bundleXXX/data/tdb-data/mgraph
>> >
>> > where XX is the bundle of:
>> > Clerezza - SCB Jena TDB Storage Provider
>> > org.apache.clerezza.rdf.jena.tdb.storage
>> >
>> > took almost 70 gbytes of disk space (then the disk space has been
>> > exhausted).
>> >
>> > These are some of the files I found inside:
>> > 193M ./ontonet%3A%3Ainputstream%3Aontology889
>> > 193M ./ontonet%3A%3Ainputstream%3Aontology1041
>> > 193M ./ontonet%3A%3Ainputstream%3Aontology395
>> > 193M ./ontonet%3A%3Ainputstream%3Aontology363
>> > 193M ./ontonet%3A%3Ainputstream%3Aontology661
>> > 193M ./ontonet%3A%3Ainputstream%3Aontology786
>> > 193M ./ontonet%3A%3Ainputstream%3Aontology608
>> > 193M ./ontonet%3A%3Ainputstream%3Aontology213
>> > 193M ./ontonet%3A%3Ainputstream%3Aontology188
>> > 193M ./ontonet%3A%3Ainputstream%3Aontology602
>> >
>> >
>> > Any clues?
>> >
>> > Thanks,
>> > David Riccitelli
>> >
>> >
>> ********************************************************************************
>> > InsideOut10 s.r.l.
>> > P.IVA: IT-11381771002
>> > Fax: +39 0110708239
>> > ---
>> > LinkedIn: http://it.linkedin.com/in/riccitelli
>> > Twitter: ziodave
>> > ---
>> > Layar Partner Network<
>> http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1
>> >
>> >
>> ********************************************************************************
>>
>>
>
>
> --
> David Riccitelli
>
>
> ********************************************************************************
> InsideOut10 s.r.l.
> P.IVA: IT-11381771002
> Fax: +39 0110708239
> ---
> LinkedIn: http://it.linkedin.com/in/riccitelli
> Twitter: ziodave
> ---
> Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
>
> ********************************************************************************
>
>


-- 
David Riccitelli

********************************************************************************
InsideOut10 s.r.l.
P.IVA: IT-11381771002
Fax: +39 0110708239
---
LinkedIn: http://it.linkedin.com/in/riccitelli
Twitter: ziodave
---
Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
********************************************************************************

Re: clerezza.rdf.jena.tdb.storage filling up with ontonet files

Posted by David Riccitelli <da...@insideout.io>.
Thanks Alessandro,

Please find the following:

- the text you are submitting for enhancement

text.txt attached.

- the recipe(s) you are using

seo_rules.sem attached.

- your enhancement chain configuration


   - tika ( required , TikaEngine)
   - langid ( required , LangIdEnhancementEngine)
   - ner ( required , NamedEntityExtractionEnhancementEngine)
   - dbpediaLinking ( required , NamedEntityTaggingEngine)
   - entityhubExtraction ( required , KeywordLinkingEngine)
   - seo_refactoring ( required , RefactorEnhancementEngine)

The seo_refactoring configuration is attached
(refactor-engine-configuration.png).

- anything else?

We're using a local DBpedia index (in sling/datafiles):
http://dev.iks-project.eu/downloads/stanbol-indices/dbpedia-3.6-insideOut10/dbpedia.solrindex.zip

Thanks,
David

On Fri, Mar 16, 2012 at 12:58 PM, Alessandro Adamou <ad...@cs.unibo.it>wrote:

> Hi David,
>
>
> On 3/16/12 10:49 AM, David Riccitelli wrote:
>
>> It's important to note that I am the only user on this stanbol instance
>> and
>> the error is raised at the second analysis.
>>
>> I think I can easily help you reproduce this issue in case.
>>
>
> Great, so I guess we would need:
>
> - the text you are submitting for enhancement
> - the recipe(s) you are using
> - your enhancement chain configuration
> - anything else?
>
> I am not the main Rules/Refactor Engine head, but perhaps I can help the
> engine create fewer persistent graphs.
>
> Best,
>
> Alessandro
>
>
>  On Fri, Mar 16, 2012 at 11:44 AM, Rupert Westenthaler<
>> rupert.westenthaler@gmail.com>  wrote:
>>
>>  Hi David, all
>>>
>>> this could be the explanation for the failed build on the Jenkins server
>>> when the SEO configuration for the Refactor engine was used in the
>>> default
>>> configuration of the Full launcher
>>>
>>> see http://markmail.org/message/sprwklaobdjankig for details.
>>>
>>> For me that looks like as if the RefactorEngine does create multiple Jena
>>> TDB instances for various created MGraphs. One needs to know the even for
>>> an empty graph Jena TDB creates ~200MByte of index files. So it is
>>> important to map multiple MGraphs to different named graphs of the same
>>> Jena TDB store.
>>>
>>> I have no Idea how Clerezza manages this or how Ontonet creates MGraphs,
>>> but I hope this can help in tracing this down.
>>>
>>> best
>>> Rupert
>>>
>>> On 16.03.2012, at 10:30, David Riccitelli wrote:
>>>
>>>  Dears,
>>>>
>>>> As I ran into disk issues, I found that this folder:
>>>> sling/felix/bundleXXX/data/tdb-data/mgraph
>>>>
>>>> where XX is the bundle of:
>>>> Clerezza - SCB Jena TDB Storage Provider
>>>> org.apache.clerezza.rdf.jena.tdb.storage
>>>>
>>>> took almost 70 gbytes of disk space (then the disk space has been
>>>> exhausted).
>>>>
>>>> These are some of the files I found inside:
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology889
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology1041
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology395
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology363
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology661
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology786
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology608
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology213
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology188
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology602
>>>>
>>>>
>>>> Any clues?
>>>>
>>>> Thanks,
>>>> David Riccitelli
>>>>
>>>>
>>>> ********************************************************************************
>>>
>>>> InsideOut10 s.r.l.
>>>> P.IVA: IT-11381771002
>>>> Fax: +39 0110708239
>>>> ---
>>>> LinkedIn: http://it.linkedin.com/in/riccitelli
>>>> Twitter: ziodave
>>>> ---
>>>> Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
>>>
>>>>
>>>> ********************************************************************************
>>>
>>>
>>>
>>
>
> --
> M.Sc. Alessandro Adamou
>
> Alma Mater Studiorum - Università di Bologna
> Department of Computer Science
> Mura Anteo Zamboni 7, 40127 Bologna - Italy
>
> Semantic Technology Laboratory (STLab)
> Institute for Cognitive Science and Technology (ISTC)
> National Research Council (CNR)
> Via Nomentana 56, 00161 Rome - Italy
>
>
> "I will give you everything, so long as you do not demand anything."
> (Ettore Petrolini, 1930)
>
> Not sent from my iSnobTechDevice
>
>


-- 
David Riccitelli

********************************************************************************
InsideOut10 s.r.l.
P.IVA: IT-11381771002
Fax: +39 0110708239
---
LinkedIn: http://it.linkedin.com/in/riccitelli
Twitter: ziodave
---
Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
********************************************************************************

Re: clerezza.rdf.jena.tdb.storage filling up with ontonet files

Posted by Alessandro Adamou <ad...@cs.unibo.it>.
Hi David,

On 3/16/12 10:49 AM, David Riccitelli wrote:
> It's important to note that I am the only user on this stanbol instance and
> the error is raised at the second analysis.
>
> I think I can easily help you reproduce this issue in case.

Great, so I guess we would need:

- the text you are submitting for enhancement
- the recipe(s) you are using
- your enhancement chain configuration
- anything else?

I am not the main Rules/Refactor Engine head, but perhaps I can help the 
engine create fewer persistent graphs.

Best,

Alessandro

> On Fri, Mar 16, 2012 at 11:44 AM, Rupert Westenthaler<
> rupert.westenthaler@gmail.com>  wrote:
>
>> Hi David, all
>>
>> this could be the explanation for the failed build on the Jenkins server
>> when the SEO configuration for the Refactor engine was used in the default
>> configuration of the Full launcher
>>
>> see http://markmail.org/message/sprwklaobdjankig for details.
>>
>> For me that looks like as if the RefactorEngine does create multiple Jena
>> TDB instances for various created MGraphs. One needs to know the even for
>> an empty graph Jena TDB creates ~200MByte of index files. So it is
>> important to map multiple MGraphs to different named graphs of the same
>> Jena TDB store.
>>
>> I have no Idea how Clerezza manages this or how Ontonet creates MGraphs,
>> but I hope this can help in tracing this down.
>>
>> best
>> Rupert
>>
>> On 16.03.2012, at 10:30, David Riccitelli wrote:
>>
>>> Dears,
>>>
>>> As I ran into disk issues, I found that this folder:
>>> sling/felix/bundleXXX/data/tdb-data/mgraph
>>>
>>> where XX is the bundle of:
>>> Clerezza - SCB Jena TDB Storage Provider
>>> org.apache.clerezza.rdf.jena.tdb.storage
>>>
>>> took almost 70 gbytes of disk space (then the disk space has been
>>> exhausted).
>>>
>>> These are some of the files I found inside:
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology889
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology1041
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology395
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology363
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology661
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology786
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology608
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology213
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology188
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology602
>>>
>>>
>>> Any clues?
>>>
>>> Thanks,
>>> David Riccitelli
>>>
>>>
>> ********************************************************************************
>>> InsideOut10 s.r.l.
>>> P.IVA: IT-11381771002
>>> Fax: +39 0110708239
>>> ---
>>> LinkedIn: http://it.linkedin.com/in/riccitelli
>>> Twitter: ziodave
>>> ---
>>> Layar Partner Network<
>> http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1
>>>
>> ********************************************************************************
>>
>>
>


-- 
M.Sc. Alessandro Adamou

Alma Mater Studiorum - Università di Bologna
Department of Computer Science
Mura Anteo Zamboni 7, 40127 Bologna - Italy

Semantic Technology Laboratory (STLab)
Institute for Cognitive Science and Technology (ISTC)
National Research Council (CNR)
Via Nomentana 56, 00161 Rome - Italy


"I will give you everything, so long as you do not demand anything."
(Ettore Petrolini, 1930)

Not sent from my iSnobTechDevice


Re: clerezza.rdf.jena.tdb.storage filling up with ontonet files

Posted by David Riccitelli <da...@insideout.io>.
I'll try to add more details on what is happening now: all the analysis
jobs are failing because of this error: java.io.IOException: Too many open
files.

I found more than 1,300 ontonet files open by the *java* process:
$ lsof  | grep "mgraph/ontonet" | wc -l
1323

e.g.
sling/felix/bundle85/data/tdb-data/mgraph/ontonet%3A%3Ainputstream%3Aontology49/OSP.idn
sling/felix/bundle85/data/tdb-data/mgraph/ontonet%3A%3Ainputstream%3Aontology49/OSP.dat
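
As a quick sanity check for the 'Too many open files' condition, the descriptor limit and the number of open TDB files can be inspected like this (a sketch only; the grep path and the 8192 limit are illustrative and must be adapted to the installation):

```shell
# Show the soft limit on open file descriptors for the current session
ulimit -n

# Count TDB index files currently held open (adjust the bundle path)
lsof 2>/dev/null | grep "mgraph/ontonet" | wc -l

# Temporarily raise the soft limit before restarting Stanbol;
# this fails silently if the hard limit is lower
ulimit -n 8192 2>/dev/null || true
```

Raising the limit only postpones the failure; the underlying fix is to stop creating one TDB dataset per graph.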

It's important to note that I am the only user on this stanbol instance and
the error is raised at the second analysis.

I can easily help you reproduce this issue if needed.

BR
David

On Fri, Mar 16, 2012 at 11:44 AM, Rupert Westenthaler <
rupert.westenthaler@gmail.com> wrote:

> Hi David, all
>
> this could be the explanation for the failed build on the Jenkins server
> when the SEO configuration for the Refactor engine was used in the default
> configuration of the Full launcher
>
> see http://markmail.org/message/sprwklaobdjankig for details.
>
> For me that looks like as if the RefactorEngine does create multiple Jena
> TDB instances for various created MGraphs. One needs to know the even for
> an empty graph Jena TDB creates ~200MByte of index files. So it is
> important to map multiple MGraphs to different named graphs of the same
> Jena TDB store.
>
> I have no Idea how Clerezza manages this or how Ontonet creates MGraphs,
> but I hope this can help in tracing this down.
>
> best
> Rupert
>
> On 16.03.2012, at 10:30, David Riccitelli wrote:
>
> > Dears,
> >
> > As I ran into disk issues, I found that this folder:
> > sling/felix/bundleXXX/data/tdb-data/mgraph
> >
> > where XX is the bundle of:
> > Clerezza - SCB Jena TDB Storage Provider
> > org.apache.clerezza.rdf.jena.tdb.storage
> >
> > took almost 70 gbytes of disk space (then the disk space has been
> > exhausted).
> >
> > These are some of the files I found inside:
> > 193M ./ontonet%3A%3Ainputstream%3Aontology889
> > 193M ./ontonet%3A%3Ainputstream%3Aontology1041
> > 193M ./ontonet%3A%3Ainputstream%3Aontology395
> > 193M ./ontonet%3A%3Ainputstream%3Aontology363
> > 193M ./ontonet%3A%3Ainputstream%3Aontology661
> > 193M ./ontonet%3A%3Ainputstream%3Aontology786
> > 193M ./ontonet%3A%3Ainputstream%3Aontology608
> > 193M ./ontonet%3A%3Ainputstream%3Aontology213
> > 193M ./ontonet%3A%3Ainputstream%3Aontology188
> > 193M ./ontonet%3A%3Ainputstream%3Aontology602
> >
> >
> > Any clues?
> >
> > Thanks,
> > David Riccitelli
> >
> >
> ********************************************************************************
> > InsideOut10 s.r.l.
> > P.IVA: IT-11381771002
> > Fax: +39 0110708239
> > ---
> > LinkedIn: http://it.linkedin.com/in/riccitelli
> > Twitter: ziodave
> > ---
> > Layar Partner Network<
> http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1
> >
> >
> ********************************************************************************
>
>


-- 
David Riccitelli

********************************************************************************
InsideOut10 s.r.l.
P.IVA: IT-11381771002
Fax: +39 0110708239
---
LinkedIn: http://it.linkedin.com/in/riccitelli
Twitter: ziodave
---
Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
********************************************************************************

Re: clerezza.rdf.jena.tdb.storage filling up with ontonet files

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi David, stanbol & clerezza community

Short summary of the situation:

The Ontonet component generates a lot of MGraphs using the Jena TDB provider. This causes disk consumption and the number of open files to explode. See the quoted emails for details


@Stanbol we are already discussing how to avoid the creation of so many graphs


@Clerezza the observed behavior of the TDB provider is also very dangerous (at least for typical use cases in Apache Stanbol).

Even though it targets a different issue, CLEREZZA-467 [1] may provide a possible solution, as it suggests using named graphs instead of isolated TDB instances for creating MGraphs.

To be honest this would be the optimal solution for our usage of Clerezza in Stanbol. However I assume that for a semantic CMS it is safer to use different TDB datasets.

Because of that I would like to make the following proposal, which hopefully covers the needs of both Apache Stanbol and Apache Clerezza.

1. AbstractTdbTcProvider: providing most of the functionality needed to store Clerezza MGraphs in Jena TDB

2. TdbTcProvider: the same as now, but extending the abstract one. It follows the currently used methodology of mapping Clerezza graphs to separate TDB datasets

3. SingleDatasetTdbTcProvider: a TDB provider variant that stores all MGraphs in a single TDB dataset. This provider should also support "configurationFactory=true" (multiple instances); each instance would use a different TDB dataset to store its MGraphs.

By default the SingleDatasetTdbTcProvider would be inactive, because it requires a configuration of the directory for the TDB dataset as well as a name (that can be used in filters). This ensures full backward compatibility.

In environments such as Stanbol, where you want to store multiple graphs in the same TDB dataset, you would need to provide a configuration for the SingleDatasetTdbTcProvider. Here you have two possible usage scenarios:

* if you just need a single TDB dataset that stores all MGraphs, then you can assign a high enough service.ranking to the SingleDatasetTdbTcProvider and normally use the TcManager to create your graphs.
* if you want to use single TDB datasets or a mix of TdbTcProvider and SingleDatasetTdbTcProvider instances, you will need to use appropriate filters.


WDYT
Rupert


[1] https://issues.apache.org/jira/browse/CLEREZZA-467
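
For illustration, a factory configuration for the proposed SingleDatasetTdbTcProvider could look roughly like this (a sketch only: the property names tdb-dir and name are hypothetical, since the component does not exist yet; service.ranking is the standard OSGi property):

```properties
# Hypothetical factory configuration: one provider instance maps all
# MGraphs to named graphs inside a single TDB dataset directory
tdb-dir = ${sling.home}/tdb-single-dataset
name = ontonet-store
# rank above the per-graph TdbTcProvider so TcManager prefers this instance
service.ranking = 1000
```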

On 16.03.2012, at 10:44, Rupert Westenthaler wrote:

> Hi David, all
> 
> this could be the explanation for the failed build on the Jenkins server when the SEO configuration for the Refactor engine was used in the default configuration of the Full launcher
> 
> see http://markmail.org/message/sprwklaobdjankig for details.
> 
> For me that looks like as if the RefactorEngine does create multiple Jena TDB instances for various created MGraphs. One needs to know the even for an empty graph Jena TDB creates ~200MByte of index files. So it is important to map multiple MGraphs to different named graphs of the same Jena TDB store.
> 
> I have no Idea how Clerezza manages this or how Ontonet creates MGraphs, but I hope this can help in tracing this down.
> 
> best
> Rupert 
> 
> On 16.03.2012, at 10:30, David Riccitelli wrote:
> 
>> Dears,
>> 
>> As I ran into disk issues, I found that this folder:
>> sling/felix/bundleXXX/data/tdb-data/mgraph
>> 
>> where XX is the bundle of:
>> Clerezza - SCB Jena TDB Storage Provider
>> org.apache.clerezza.rdf.jena.tdb.storage
>> 
>> took almost 70 gbytes of disk space (then the disk space has been
>> exhausted).
>> 
>> These are some of the files I found inside:
>> 193M ./ontonet%3A%3Ainputstream%3Aontology889
>> 193M ./ontonet%3A%3Ainputstream%3Aontology1041
>> 193M ./ontonet%3A%3Ainputstream%3Aontology395
>> 193M ./ontonet%3A%3Ainputstream%3Aontology363
>> 193M ./ontonet%3A%3Ainputstream%3Aontology661
>> 193M ./ontonet%3A%3Ainputstream%3Aontology786
>> 193M ./ontonet%3A%3Ainputstream%3Aontology608
>> 193M ./ontonet%3A%3Ainputstream%3Aontology213
>> 193M ./ontonet%3A%3Ainputstream%3Aontology188
>> 193M ./ontonet%3A%3Ainputstream%3Aontology602
>> 
>> 
>> Any clues?
>> 
>> Thanks,
>> David Riccitelli
>> 
>> ********************************************************************************
>> InsideOut10 s.r.l.
>> P.IVA: IT-11381771002
>> Fax: +39 0110708239
>> ---
>> LinkedIn: http://it.linkedin.com/in/riccitelli
>> Twitter: ziodave
>> ---
>> Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
>> ********************************************************************************
> 


Re: clerezza.rdf.jena.tdb.storage filling up with ontonet files

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi David, all

this could be the explanation for the failed build on the Jenkins server when the SEO configuration for the Refactor engine was used in the default configuration of the Full launcher

see http://markmail.org/message/sprwklaobdjankig for details.

For me it looks as if the RefactorEngine creates multiple Jena TDB instances for the various MGraphs it creates. One needs to know that even for an empty graph Jena TDB creates ~200 MByte of index files. So it is important to map multiple MGraphs to different named graphs of the same Jena TDB store.

I have no idea how Clerezza manages this or how Ontonet creates MGraphs, but I hope this can help in tracing this down.

best
Rupert 

On 16.03.2012, at 10:30, David Riccitelli wrote:

> Dears,
> 
> As I ran into disk issues, I found that this folder:
> sling/felix/bundleXXX/data/tdb-data/mgraph
> 
> where XX is the bundle of:
> Clerezza - SCB Jena TDB Storage Provider
> org.apache.clerezza.rdf.jena.tdb.storage
> 
> took almost 70 gbytes of disk space (then the disk space has been
> exhausted).
> 
> These are some of the files I found inside:
> 193M ./ontonet%3A%3Ainputstream%3Aontology889
> 193M ./ontonet%3A%3Ainputstream%3Aontology1041
> 193M ./ontonet%3A%3Ainputstream%3Aontology395
> 193M ./ontonet%3A%3Ainputstream%3Aontology363
> 193M ./ontonet%3A%3Ainputstream%3Aontology661
> 193M ./ontonet%3A%3Ainputstream%3Aontology786
> 193M ./ontonet%3A%3Ainputstream%3Aontology608
> 193M ./ontonet%3A%3Ainputstream%3Aontology213
> 193M ./ontonet%3A%3Ainputstream%3Aontology188
> 193M ./ontonet%3A%3Ainputstream%3Aontology602
> 
> 
> Any clues?
> 
> Thanks,
> David Riccitelli
> 
> ********************************************************************************
> InsideOut10 s.r.l.
> P.IVA: IT-11381771002
> Fax: +39 0110708239
> ---
> LinkedIn: http://it.linkedin.com/in/riccitelli
> Twitter: ziodave
> ---
> Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
> ********************************************************************************


Re: clerezza.rdf.jena.tdb.storage filling up with ontonet files

Posted by Alessandro Adamou <ad...@cs.unibo.it>.
Hi Rupert,

On 3/16/12 11:28 AM, Rupert Westenthaler wrote:
> I see here two solutions:
>
> (1) use a in-memory graph implementation (SimpleMGraph or IndexedMGraph): This would be the preferred way in case such graphs are guaranteed to be reasonable in size
>
> (2) create a single graph and reuse it for all loaded Ontologies: This would be the preferred way if such graphs are just a temporary cache but there could be situations where they can get to big for holding them in memory

Again, it's a matter of what should stay and what should go, but 
that's the Refactor engine's call (if it's that engine doing it).

The graphs used by Stanbol Rules should be about the same order of 
magnitude as the enhancement graph, so I guess (1) could be ok for those.

If the recipes belong to the engine configuration, they should stay as 
(2), and so should any metamodel, but I would recommend at least one 
graph for each engine configuration.

Best,
Alessandro


>> On 3/16/12 10:30 AM, David Riccitelli wrote:
>>> Dears,
>>>
>>> As I ran into disk issues, I found that this folder:
>>>   sling/felix/bundleXXX/data/tdb-data/mgraph
>>>
>>> where XX is the bundle of:
>>>   Clerezza - SCB Jena TDB Storage Provider
>>> org.apache.clerezza.rdf.jena.tdb.storage
>>>
>>> took almost 70 gbytes of disk space (then the disk space has been
>>> exhausted).
>>>
>>> These are some of the files I found inside:
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology889
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology1041
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology395
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology363
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology661
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology786
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology608
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology213
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology188
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology602
>>>
>>>
>>> Any clues?
>>>
>>> Thanks,
>>> David Riccitelli
>>>
>>> ********************************************************************************
>>> InsideOut10 s.r.l.
>>> P.IVA: IT-11381771002
>>> Fax: +39 0110708239
>>> ---
>>> LinkedIn: http://it.linkedin.com/in/riccitelli
>>> Twitter: ziodave
>>> ---
>>> Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
>>> ********************************************************************************
>>>
>>
>> -- 
>> M.Sc. Alessandro Adamou
>>
>> Alma Mater Studiorum - Università di Bologna
>> Department of Computer Science
>> Mura Anteo Zamboni 7, 40127 Bologna - Italy
>>
>> Semantic Technology Laboratory (STLab)
>> Institute for Cognitive Science and Technology (ISTC)
>> National Research Council (CNR)
>> Via Nomentana 56, 00161 Rome - Italy
>>
>>
>> "I will give you everything, so long as you do not demand anything."
>> (Ettore Petrolini, 1930)
>>
>> Not sent from my iSnobTechDevice
>>
>


-- 
M.Sc. Alessandro Adamou

Alma Mater Studiorum - Università di Bologna
Department of Computer Science
Mura Anteo Zamboni 7, 40127 Bologna - Italy

Semantic Technology Laboratory (STLab)
Institute for Cognitive Science and Technology (ISTC)
National Research Council (CNR)
Via Nomentana 56, 00161 Rome - Italy


"I will give you everything, so long as you do not demand anything."
(Ettore Petrolini, 1930)

Not sent from my iSnobTechDevice


Re: clerezza.rdf.jena.tdb.storage filling up with ontonet files

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi Alessandro

On 16.03.2012, at 10:56, Alessandro Adamou wrote:

> Hi David,
> 
> well, I guess that depends pretty much on how heavy the usage of OntoNet is in your Stanbol installation.
> 
> Those are graphs created when OntoNet has to load an ontology from its content rather than from a Web URI, so it cannot know the ontology ID earlier.
> 
> This happens e.g. by POSTing the ontology as the payload or by passing a GraphContentInputSource to the Java API.
> 
> Now I do not know why these graphs are created (perhaps the refactor engine could be loading some), but I do know that a Clerezza graph in Jena TDB occupies a LOT of disk space.
> 
> Suffice it to say that my bundle had stored nine graphs of <100 triples each. Their disk space was about 1.8 GB, but when I tried to make a zipfile out of it, it came out at about 2 MB!
> 

I see here two solutions:

(1) use an in-memory graph implementation (SimpleMGraph or IndexedMGraph): This would be the preferred way if such graphs are guaranteed to stay reasonably small

(2) create a single graph and reuse it for all loaded ontologies: This would be the preferred way if such graphs are just a temporary cache, but there could be situations where they get too big to hold in memory


WDYT
best
Rupert

> Alessandro
> 
> 
> On 3/16/12 10:30 AM, David Riccitelli wrote:
>> Dears,
>> 
>> As I ran into disk issues, I found that this folder:
>>  sling/felix/bundleXXX/data/tdb-data/mgraph
>> 
>> where XX is the bundle of:
>>  Clerezza - SCB Jena TDB Storage Provider
>> org.apache.clerezza.rdf.jena.tdb.storage
>> 
>> took almost 70 gbytes of disk space (then the disk space has been
>> exhausted).
>> 
>> These are some of the files I found inside:
>> 193M ./ontonet%3A%3Ainputstream%3Aontology889
>> 193M ./ontonet%3A%3Ainputstream%3Aontology1041
>> 193M ./ontonet%3A%3Ainputstream%3Aontology395
>> 193M ./ontonet%3A%3Ainputstream%3Aontology363
>> 193M ./ontonet%3A%3Ainputstream%3Aontology661
>> 193M ./ontonet%3A%3Ainputstream%3Aontology786
>> 193M ./ontonet%3A%3Ainputstream%3Aontology608
>> 193M ./ontonet%3A%3Ainputstream%3Aontology213
>> 193M ./ontonet%3A%3Ainputstream%3Aontology188
>> 193M ./ontonet%3A%3Ainputstream%3Aontology602
>> 
>> 
>> Any clues?
>> 
>> Thanks,
>> David Riccitelli
>> 
>> ********************************************************************************
>> InsideOut10 s.r.l.
>> P.IVA: IT-11381771002
>> Fax: +39 0110708239
>> ---
>> LinkedIn: http://it.linkedin.com/in/riccitelli
>> Twitter: ziodave
>> ---
>> Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
>> ********************************************************************************
>> 
> 
> 
> -- 
> M.Sc. Alessandro Adamou
> 
> Alma Mater Studiorum - Università di Bologna
> Department of Computer Science
> Mura Anteo Zamboni 7, 40127 Bologna - Italy
> 
> Semantic Technology Laboratory (STLab)
> Institute for Cognitive Science and Technology (ISTC)
> National Research Council (CNR)
> Via Nomentana 56, 00161 Rome - Italy
> 
> 
> "I will give you everything, so long as you do not demand anything."
> (Ettore Petrolini, 1930)
> 
> Not sent from my iSnobTechDevice
> 


Re: clerezza.rdf.jena.tdb.storage filling up with ontonet files

Posted by Alessandro Adamou <ad...@cs.unibo.it>.
I would expect the refactor engine to clean up any graphs resulting from 
an enhancement job, if any, and only preserve its non-volatile data 
(e.g. the recipes and the metamodel). Let me check what it does exactly.

Alessandro


On 3/16/12 11:02 AM, David Riccitelli wrote:
> Hi Alessandro,
>
> We're using the REST engines end-point to post analysis jobs. Should the
> ontonet files be deleted after an analysis?
>
> David
>
> On Fri, Mar 16, 2012 at 11:56 AM, Alessandro Adamou<ad...@cs.unibo.it>wrote:
>
>> Hi David,
>>
>> well, I guess that depends pretty much on how heavy the usage of OntoNet
>> is in your Stanbol installation.
>>
>> Those are graphs created when OntoNet has to load an ontology from its
>> content rather than from a Web URI, so it cannot know the ontology ID
>> earlier.
>>
>> This happens e.g. by POSTing the ontology as the payload or by passing a
>> GraphContentInputSource to the Java API.
>>
>> Now I do not know why these graphs are created (perhaps the refactor
>> engine could be loading some), but I do know that a Clerezza graph in Jena
>> TDB occupies a LOT of disk space.
>>
>> Suffice it to say that my bundled had stored nine graphs of<100 triples
>> each. Their disk space was about 1.8 GB, but when I tried to make a zipfile
>> out of it, it came out as about 2MB!
>>
>> Alessandro
>>
>>
>>
>> On 3/16/12 10:30 AM, David Riccitelli wrote:
>>
>>> Dears,
>>>
>>> As I ran into disk issues, I found that this folder:
>>>   sling/felix/bundleXXX/data/tdb-data/mgraph
>>>
>>> where XX is the bundle of:
>>>   Clerezza - SCB Jena TDB Storage Provider
>>> org.apache.clerezza.rdf.jena.tdb.storage
>>>
>>> took almost 70 gbytes of disk space (then the disk space has been
>>> exhausted).
>>>
>>> These are some of the files I found inside:
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology889
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology1041
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology395
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology363
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology661
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology786
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology608
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology213
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology188
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology602
>>>
>>>
>>> Any clues?
>>>
>>> Thanks,
>>> David Riccitelli
>>>
>>> ********************************************************************************
>>> InsideOut10 s.r.l.
>>> P.IVA: IT-11381771002
>>> Fax: +39 0110708239
>>> ---
>>> LinkedIn: http://it.linkedin.com/in/riccitelli
>>> Twitter: ziodave
>>> ---
>>> Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
>>> ********************************************************************************
>>>
>>>
>> --
>> M.Sc. Alessandro Adamou
>>
>> Alma Mater Studiorum - Università di Bologna
>> Department of Computer Science
>> Mura Anteo Zamboni 7, 40127 Bologna - Italy
>>
>> Semantic Technology Laboratory (STLab)
>> Institute for Cognitive Science and Technology (ISTC)
>> National Research Council (CNR)
>> Via Nomentana 56, 00161 Rome - Italy
>>
>>
>> "I will give you everything, so long as you do not demand anything."
>> (Ettore Petrolini, 1930)
>>
>> Not sent from my iSnobTechDevice
>>
>>
>


-- 
M.Sc. Alessandro Adamou

Alma Mater Studiorum - Università di Bologna
Department of Computer Science
Mura Anteo Zamboni 7, 40127 Bologna - Italy

Semantic Technology Laboratory (STLab)
Institute for Cognitive Science and Technology (ISTC)
National Research Council (CNR)
Via Nomentana 56, 00161 Rome - Italy


"I will give you everything, so long as you do not demand anything."
(Ettore Petrolini, 1930)

Not sent from my iSnobTechDevice


Re: clerezza.rdf.jena.tdb.storage filling up with ontonet files

Posted by David Riccitelli <da...@insideout.io>.
Hi Alessandro,

We're using the REST engines end-point to post analysis jobs. Should the
ontonet files be deleted after an analysis?

David

On Fri, Mar 16, 2012 at 11:56 AM, Alessandro Adamou <ad...@cs.unibo.it>wrote:

> Hi David,
>
> well, I guess that depends pretty much on how heavy the usage of OntoNet
> is in your Stanbol installation.
>
> Those are graphs created when OntoNet has to load an ontology from its
> content rather than from a Web URI, so it cannot know the ontology ID
> earlier.
>
> This happens e.g. by POSTing the ontology as the payload or by passing a
> GraphContentInputSource to the Java API.
>
> Now I do not know why these graphs are created (perhaps the refactor
> engine could be loading some), but I do know that a Clerezza graph in Jena
> TDB occupies a LOT of disk space.
>
> Suffice it to say that my bundled had stored nine graphs of <100 triples
> each. Their disk space was about 1.8 GB, but when I tried to make a zipfile
> out of it, it came out as about 2MB!
>
> Alessandro
>
>
>
> On 3/16/12 10:30 AM, David Riccitelli wrote:
>
>> Dears,
>>
>> As I ran into disk issues, I found that this folder:
>>  sling/felix/bundleXXX/data/tdb-data/mgraph
>>
>> where XX is the bundle of:
>>  Clerezza - SCB Jena TDB Storage Provider
>> org.apache.clerezza.rdf.jena.tdb.storage
>>
>> took almost 70 gbytes of disk space (then the disk space has been
>> exhausted).
>>
>> These are some of the files I found inside:
>> 193M ./ontonet%3A%3Ainputstream%3Aontology889
>> 193M ./ontonet%3A%3Ainputstream%3Aontology1041
>> 193M ./ontonet%3A%3Ainputstream%3Aontology395
>> 193M ./ontonet%3A%3Ainputstream%3Aontology363
>> 193M ./ontonet%3A%3Ainputstream%3Aontology661
>> 193M ./ontonet%3A%3Ainputstream%3Aontology786
>> 193M ./ontonet%3A%3Ainputstream%3Aontology608
>> 193M ./ontonet%3A%3Ainputstream%3Aontology213
>> 193M ./ontonet%3A%3Ainputstream%3Aontology188
>> 193M ./ontonet%3A%3Ainputstream%3Aontology602
>>
>>
>> Any clues?
>>
>> Thanks,
>> David Riccitelli
>>
>> ********************************************************************************
>> InsideOut10 s.r.l.
>> P.IVA: IT-11381771002
>> Fax: +39 0110708239
>> ---
>> LinkedIn: http://it.linkedin.com/in/riccitelli
>> Twitter: ziodave
>> ---
>> Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
>> ********************************************************************************
>>
>>
>
> --
> M.Sc. Alessandro Adamou
>
> Alma Mater Studiorum - Università di Bologna
> Department of Computer Science
> Mura Anteo Zamboni 7, 40127 Bologna - Italy
>
> Semantic Technology Laboratory (STLab)
> Institute for Cognitive Science and Technology (ISTC)
> National Research Council (CNR)
> Via Nomentana 56, 00161 Rome - Italy
>
>
> "I will give you everything, so long as you do not demand anything."
> (Ettore Petrolini, 1930)
>
> Not sent from my iSnobTechDevice
>
>


-- 
David Riccitelli

********************************************************************************
InsideOut10 s.r.l.
P.IVA: IT-11381771002
Fax: +39 0110708239
---
LinkedIn: http://it.linkedin.com/in/riccitelli
Twitter: ziodave
---
Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
********************************************************************************

Re: clerezza.rdf.jena.tdb.storage filling up with ontonet files

Posted by David Riccitelli <da...@insideout.io>.
The customized ruleset is working as well... I'll keep it running and make
sure it stays stable.

I experienced another issue, which is unrelated so I'll open a different
thread.

Thanks for your help!

David

On Sat, Mar 17, 2012 at 10:01 AM, David Riccitelli <da...@insideout.io>wrote:

> Hi Alessandro,
>
> It's much better now. Disk usage
> in sling/felix/bundle85/data/tdb-data/mgraph folder is steady at 1.4M.
>
> Open files were at ~700 a start-up, they increased up to ~1.600 after two
> tests. Now after each test they jump at ~2.100 and then decrease back to
> ~1.600.
>
> And it's also much faster than before.
>
> I'll continue testing now with our customized ruleset.
>
> BR
> David
>
>
> On Fri, Mar 16, 2012 at 7:38 PM, Alessandro Adamou <ad...@cs.unibo.it>wrote:
>
>> Hi David,
>>
>> after quite some work today I rewrote part of the Refactor Engine to
>> avoid creating useless graphs.
>>
>> Many were blank ontologies created along with the SEO scope. They are no
>> longer created.
>>
>> Many of the other graphs that you see are due to the fact that the engine
>> merges together the entity signatures into an OntoNet session. Every such
>> signature ends up resulting in its own ontology and therefore a graph in
>> Clerezza/TDB.
>>
>> I have not modified this second behaviour, but I have seen to it that the
>> refactor engine now destroys its own session *and its contents* when
>> computeEnhancements() completes. This means a lot of space occupied during
>> analysis but freed up right thereafter.
>>
>> It's more brutal than I wanted it to be, but a better implementation will
>> come up once I add a couple new features to OntoNet that should make the
>> process more reasonable.
>>
>> On the upside, the engine code is now smaller by some 250 lines.
>>
>> It would be super if you could update and try it out.
>>
>> Thanks
>>
>> Alessandro
>>
>> P.S. now I'm glad I added the "ontonet" prefix to those graph names...
>>
>>
>>
>> On 3/16/12 12:40 PM, David Riccitelli wrote:
>>
>>>  From what I've seen so far, yes. But it could depend on your engine
>>>> configuration using a richer set of rules.
>>>>
>>>
>>> Same thing happens when we use the default rules set (seo_rules.sem) from
>>> SVN.
>>>
>>> We did not customize any other part of the installation with the
>>> exception
>>> of loading a local DBpedia index in sling/datafiles.
>>>
>>> David
>>>
>>> On Fri, Mar 16, 2012 at 12:27 PM, Alessandro Adamou<ad...@cs.unibo.it>*
>>> *wrote:
>>>
>>>  On 3/16/12 11:16 AM, David Riccitelli wrote:
>>>>
>>>>  Is this issue happening to us only?
>>>>>
>>>>>   From what I've seen so far, yes. But it could depend on your engine
>>>> configuration using a richer set of rules.
>>>>
>>>> Alessandro
>>>>
>>>>  On Fri, Mar 16, 2012 at 12:12 PM, Alessandro Adamou<adamou@cs.unibo.it
>>>> >**
>>>>
>>>>> wrote:
>>>>>
>>>>>  One thing that it would be great to do is to detect the ontology ID
>>>>>
>>>>>> *before* creating the TripleCollection in Clerezza, so any mappings
>>>>>> could
>>>>>> be done before storing.
>>>>>>
>>>>>> But I don't know how this can be done with not so much code.
>>>>>>
>>>>>> Perhaps creating an IndexedGraph, exploring its content, then creating
>>>>>> the
>>>>>> Graph in the TcManager with the same content and the right graph name,
>>>>>> then
>>>>>> finally clearing the IndexedGraph could work.
>>>>>>
>>>>>> But it still means having twice the resource usage (disk+memory) for a
>>>>>> period.
>>>>>>
>>>>>> Alessandro
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 3/16/12 10:56 AM, Alessandro Adamou wrote:
>>>>>>
>>>>>>  Hi David,
>>>>>>
>>>>>>> well, I guess that depends pretty much on how heavy the usage of
>>>>>>> OntoNet
>>>>>>> is in your Stanbol installation.
>>>>>>>
>>>>>>> Those are graphs created when OntoNet has to load an ontology from
>>>>>>> its
>>>>>>> content rather than from a Web URI, so it cannot know the ontology ID
>>>>>>> earlier.
>>>>>>>
>>>>>>> This happens e.g. by POSTing the ontology as the payload or by
>>>>>>> passing a
>>>>>>> GraphContentInputSource to the Java API.
>>>>>>>
>>>>>>> Now I do not know why these graphs are created (perhaps the refactor
>>>>>>> engine could be loading some), but I do know that a Clerezza graph in
>>>>>>> Jena
>>>>>>> TDB occupies a LOT of disk space.
>>>>>>>
>>>>>>> Suffice it to say that my bundled had stored nine graphs of<100
>>>>>>> triples
>>>>>>> each. Their disk space was about 1.8 GB, but when I tried to make a
>>>>>>> zipfile
>>>>>>> out of it, it came out as about 2MB!
>>>>>>>
>>>>>>> Alessandro
>>>>>>>
>>>>>>>
>>>>>>> On 3/16/12 10:30 AM, David Riccitelli wrote:
>>>>>>>
>>>>>>>  Dears,
>>>>>>>
>>>>>>>> As I ran into disk issues, I found that this folder:
>>>>>>>>  sling/felix/bundleXXX/data/tdb-data/mgraph
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> where XX is the bundle of:
>>>>>>>>  Clerezza - SCB Jena TDB Storage Provider
>>>>>>>> org.apache.clerezza.rdf.jena.tdb.storage
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> took almost 70 gbytes of disk space (then the disk space has been
>>>>>>>> exhausted).
>>>>>>>>
>>>>>>>> These are some of the files I found inside:
>>>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology889
>>>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology1041
>>>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology395
>>>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology363
>>>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology661
>>>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology786
>>>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology608
>>>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology213
>>>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology188
>>>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology602
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Any clues?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> David Riccitelli
>>>>>>>>
>>>>>>>> ********************************************************************************
>>>>>>>>
>>>>>>>>
>>>>>>>> InsideOut10 s.r.l.
>>>>>>>> P.IVA: IT-11381771002
>>>>>>>> Fax: +39 0110708239
>>>>>>>> ---
>>>>>>>> LinkedIn: http://it.linkedin.com/in/riccitelli
>>>>>>>> Twitter: ziodave
>>>>>>>> ---
>>>>>>>> Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
>>>>>>>> ********************************************************************************
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>   --
>>>>>>>
>>>>>> M.Sc. Alessandro Adamou
>>>>>>
>>>>>> Alma Mater Studiorum - Università di Bologna
>>>>>> Department of Computer Science
>>>>>> Mura Anteo Zamboni 7, 40127 Bologna - Italy
>>>>>>
>>>>>> Semantic Technology Laboratory (STLab)
>>>>>> Institute for Cognitive Science and Technology (ISTC)
>>>>>> National Research Council (CNR)
>>>>>> Via Nomentana 56, 00161 Rome - Italy
>>>>>>
>>>>>>
>>>>>> "I will give you everything, so long as you do not demand anything."
>>>>>> (Ettore Petrolini, 1930)
>>>>>>
>>>>>> Not sent from my iSnobTechDevice
>>>>>>
>>>>>>
>>>>>>
>>>>>>  --
>>>> M.Sc. Alessandro Adamou
>>>>
>>>> Alma Mater Studiorum - Università di Bologna
>>>> Department of Computer Science
>>>> Mura Anteo Zamboni 7, 40127 Bologna - Italy
>>>>
>>>> Semantic Technology Laboratory (STLab)
>>>> Institute for Cognitive Science and Technology (ISTC)
>>>> National Research Council (CNR)
>>>> Via Nomentana 56, 00161 Rome - Italy
>>>>
>>>>
>>>> "I will give you everything, so long as you do not demand anything."
>>>> (Ettore Petrolini, 1930)
>>>>
>>>> Not sent from my iSnobTechDevice
>>>>
>>>>
>>>>
>>>
>>
>> --
>> M.Sc. Alessandro Adamou
>>
>> Alma Mater Studiorum - Università di Bologna
>> Department of Computer Science
>> Mura Anteo Zamboni 7, 40127 Bologna - Italy
>>
>> Semantic Technology Laboratory (STLab)
>> Institute for Cognitive Science and Technology (ISTC)
>> National Research Council (CNR)
>> Via Nomentana 56, 00161 Rome - Italy
>>
>>
>> "I will give you everything, so long as you do not demand anything."
>> (Ettore Petrolini, 1930)
>>
>> Not sent from my iSnobTechDevice
>>
>>
>
>
> --
> David Riccitelli
>
>
> ********************************************************************************
> InsideOut10 s.r.l.
> P.IVA: IT-11381771002
> Fax: +39 0110708239
> ---
> LinkedIn: http://it.linkedin.com/in/riccitelli
> Twitter: ziodave
> ---
> Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
>
> ********************************************************************************
>
>


-- 
David Riccitelli

********************************************************************************
InsideOut10 s.r.l.
P.IVA: IT-11381771002
Fax: +39 0110708239
---
LinkedIn: http://it.linkedin.com/in/riccitelli
Twitter: ziodave
---
Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
********************************************************************************

Re: clerezza.rdf.jena.tdb.storage filling up with ontonet files

Posted by David Riccitelli <da...@insideout.io>.
Hi Alessandro,

It's much better now. Disk usage in the
sling/felix/bundle85/data/tdb-data/mgraph folder is steady at 1.4M.

Open files were at ~700 at start-up; they increased to ~1,600 after two
tests. Now after each test they jump to ~2,100 and then drop back to
~1,600.

And it's also much faster than before.

I'll continue testing now with our customized ruleset.

BR
David
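For anyone wanting to reproduce these checks, the disk figure is just the recursive size of the mgraph folder (the bundle number varies per installation). A quick stdlib sketch, equivalent to `du -sb` on that path:

```python
import os

def dir_size_bytes(path):
    """Total size of all regular files under path, like `du -sb`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if os.path.isfile(fp):  # skip broken symlinks etc.
                total += os.path.getsize(fp)
    return total

# e.g., with the path from this thread (adjust the bundle number):
# print(dir_size_bytes("sling/felix/bundle85/data/tdb-data/mgraph"))
```

The open-file counts above came from watching the Stanbol JVM's file descriptors (e.g. with lsof) while running tests.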

On Fri, Mar 16, 2012 at 7:38 PM, Alessandro Adamou <ad...@cs.unibo.it>wrote:

> Hi David,
>
> after quite some work today I rewrote part of the Refactor Engine to avoid
> creating useless graphs.
>
> Many were blank ontologies created along with the SEO scope. They are no
> longer created.
>
> Many of the other graphs that you see are due to the fact that the engine
> merges together the entity signatures into an OntoNet session. Every such
> signature ends up resulting in its own ontology and therefore a graph in
> Clerezza/TDB.
>
> I have not modified this second behaviour, but I have seen to it that the
> refactor engine now destroys its own session *and its contents* when
> computeEnhancements() completes. This means a lot of space occupied during
> analysis but freed up right thereafter.
>
> It's more brutal than I wanted it to be, but a better implementation will
> come up once I add a couple new features to OntoNet that should make the
> process more reasonable.
>
> On the upside, the engine code is now smaller by some 250 lines.
>
> It would be super if you could update and try it out.
>
> Thanks
>
> Alessandro
>
> P.S. now I'm glad I added the "ontonet" prefix to those graph names...
>
>
>
> On 3/16/12 12:40 PM, David Riccitelli wrote:
>
>>  From what I've seen so far, yes. But it could depend on your engine
>>> configuration using a richer set of rules.
>>>
>>
>> Same thing happens when we use the default rules set (seo_rules.sem) from
>> SVN.
>>
>> We did not customize any other part of the installation with the exception
>> of loading a local DBpedia index in sling/datafiles.
>>
>> David
>>
>> On Fri, Mar 16, 2012 at 12:27 PM, Alessandro Adamou<ad...@cs.unibo.it>**
>> wrote:
>>
>>  On 3/16/12 11:16 AM, David Riccitelli wrote:
>>>
>>>  Is this issue happening to us only?
>>>>
>>>>   From what I've seen so far, yes. But it could depend on your engine
>>> configuration using a richer set of rules.
>>>
>>> Alessandro
>>>
>>>  On Fri, Mar 16, 2012 at 12:12 PM, Alessandro Adamou<adamou@cs.unibo.it
>>> >**
>>>
>>>> wrote:
>>>>
>>>>  One thing that it would be great to do is to detect the ontology ID
>>>>
>>>>> *before* creating the TripleCollection in Clerezza, so any mappings
>>>>> could
>>>>> be done before storing.
>>>>>
>>>>> But I don't know how this can be done with not so much code.
>>>>>
>>>>> Perhaps creating an IndexedGraph, exploring its content, then creating
>>>>> the
>>>>> Graph in the TcManager with the same content and the right graph name,
>>>>> then
>>>>> finally clearing the IndexedGraph could work.
>>>>>
>>>>> But it still means having twice the resource usage (disk+memory) for a
>>>>> period.
>>>>>
>>>>> Alessandro
>>>>>
>>>>>
>>>>>
>>>>> On 3/16/12 10:56 AM, Alessandro Adamou wrote:
>>>>>
>>>>>  Hi David,
>>>>>
>>>>>> well, I guess that depends pretty much on how heavy the usage of
>>>>>> OntoNet
>>>>>> is in your Stanbol installation.
>>>>>>
>>>>>> Those are graphs created when OntoNet has to load an ontology from its
>>>>>> content rather than from a Web URI, so it cannot know the ontology ID
>>>>>> earlier.
>>>>>>
>>>>>> This happens e.g. by POSTing the ontology as the payload or by
>>>>>> passing a
>>>>>> GraphContentInputSource to the Java API.
>>>>>>
>>>>>> Now I do not know why these graphs are created (perhaps the refactor
>>>>>> engine could be loading some), but I do know that a Clerezza graph in
>>>>>> Jena
>>>>>> TDB occupies a LOT of disk space.
>>>>>>
>>>>>> Suffice it to say that my bundled had stored nine graphs of<100
>>>>>> triples
>>>>>> each. Their disk space was about 1.8 GB, but when I tried to make a
>>>>>> zipfile
>>>>>> out of it, it came out as about 2MB!
>>>>>>
>>>>>> Alessandro
>>>>>>
>>>>>>
>>>>>> On 3/16/12 10:30 AM, David Riccitelli wrote:
>>>>>>
>>>>>>  Dears,
>>>>>>
>>>>>>> As I ran into disk issues, I found that this folder:
>>>>>>>  sling/felix/bundleXXX/data/tdb-data/mgraph
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> where XX is the bundle of:
>>>>>>>  Clerezza - SCB Jena TDB Storage Provider
>>>>>>> org.apache.clerezza.rdf.jena.tdb.storage
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> took almost 70 gbytes of disk space (then the disk space has been
>>>>>>> exhausted).
>>>>>>>
>>>>>>> These are some of the files I found inside:
>>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology889
>>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology1041
>>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology395
>>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology363
>>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology661
>>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology786
>>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology608
>>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology213
>>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology188
>>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology602
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Any clues?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> David Riccitelli
>>>>>>>
>>>>>>> ********************************************************************************
>>>>>>>
>>>>>>>
>>>>>>> InsideOut10 s.r.l.
>>>>>>> P.IVA: IT-11381771002
>>>>>>> Fax: +39 0110708239
>>>>>>> ---
>>>>>>> LinkedIn: http://it.linkedin.com/in/riccitelli
>>>>>>> Twitter: ziodave
>>>>>>> ---
>>>>>>> Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
>>>>>>> ********************************************************************************
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>   --
>>>>>>
>>>>> M.Sc. Alessandro Adamou
>>>>>
>>>>> Alma Mater Studiorum - Università di Bologna
>>>>> Department of Computer Science
>>>>> Mura Anteo Zamboni 7, 40127 Bologna - Italy
>>>>>
>>>>> Semantic Technology Laboratory (STLab)
>>>>> Institute for Cognitive Science and Technology (ISTC)
>>>>> National Research Council (CNR)
>>>>> Via Nomentana 56, 00161 Rome - Italy
>>>>>
>>>>>
>>>>> "I will give you everything, so long as you do not demand anything."
>>>>> (Ettore Petrolini, 1930)
>>>>>
>>>>> Not sent from my iSnobTechDevice
>>>>>
>>>>>
>>>>>
>>>>>  --
>>> M.Sc. Alessandro Adamou
>>>
>>> Alma Mater Studiorum - Università di Bologna
>>> Department of Computer Science
>>> Mura Anteo Zamboni 7, 40127 Bologna - Italy
>>>
>>> Semantic Technology Laboratory (STLab)
>>> Institute for Cognitive Science and Technology (ISTC)
>>> National Research Council (CNR)
>>> Via Nomentana 56, 00161 Rome - Italy
>>>
>>>
>>> "I will give you everything, so long as you do not demand anything."
>>> (Ettore Petrolini, 1930)
>>>
>>> Not sent from my iSnobTechDevice
>>>
>>>
>>>
>>
>
> --
> M.Sc. Alessandro Adamou
>
> Alma Mater Studiorum - Università di Bologna
> Department of Computer Science
> Mura Anteo Zamboni 7, 40127 Bologna - Italy
>
> Semantic Technology Laboratory (STLab)
> Institute for Cognitive Science and Technology (ISTC)
> National Research Council (CNR)
> Via Nomentana 56, 00161 Rome - Italy
>
>
> "I will give you everything, so long as you do not demand anything."
> (Ettore Petrolini, 1930)
>
> Not sent from my iSnobTechDevice
>
>


-- 
David Riccitelli

********************************************************************************
InsideOut10 s.r.l.
P.IVA: IT-11381771002
Fax: +39 0110708239
---
LinkedIn: http://it.linkedin.com/in/riccitelli
Twitter: ziodave
---
Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
********************************************************************************

Re: clerezza.rdf.jena.tdb.storage filling up with ontonet files

Posted by Alessandro Adamou <ad...@cs.unibo.it>.
Hi David,

after quite some work today I rewrote part of the Refactor Engine to 
avoid creating useless graphs.

Many were blank ontologies created along with the SEO scope. They are no 
longer created.

Many of the other graphs that you see are due to the fact that the 
engine merges together the entity signatures into an OntoNet session. 
Every such signature ends up resulting in its own ontology and therefore 
a graph in Clerezza/TDB.

I have not modified this second behaviour, but I have seen to it that 
the refactor engine now destroys its own session *and its contents* when 
computeEnhancements() completes. This means a lot of space occupied 
during analysis but freed up right thereafter.

It's more brutal than I wanted it to be, but a better implementation 
will come up once I add a couple new features to OntoNet that should 
make the process more reasonable.

On the upside, the engine code is now smaller by some 250 lines.

It would be super if you could update and try it out.

Thanks

Alessandro

P.S. now I'm glad I added the "ontonet" prefix to those graph names...
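(For anyone inspecting the folder: the file names under tdb-data/mgraph are just the percent-encoded graph names, so the offending graphs are easy to attribute. For example:)

```python
from urllib.parse import unquote

# The mgraph file names are percent-encoded Clerezza graph names;
# decoding one of the 193M entries from this thread:
name = "ontonet%3A%3Ainputstream%3Aontology889"
print(unquote(name))  # ontonet::inputstream:ontology889
```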


On 3/16/12 12:40 PM, David Riccitelli wrote:
>>  From what I've seen so far, yes. But it could depend on your engine
>> configuration using a richer set of rules.
>
> Same thing happens when we use the default rules set (seo_rules.sem) from
> SVN.
>
> We did not customize any other part of the installation with the exception
> of loading a local DBpedia index in sling/datafiles.
>
> David
>
> On Fri, Mar 16, 2012 at 12:27 PM, Alessandro Adamou<ad...@cs.unibo.it>wrote:
>
>> On 3/16/12 11:16 AM, David Riccitelli wrote:
>>
>>> Is this issue happening to us only?
>>>
>>  From what I've seen so far, yes. But it could depend on your engine
>> configuration using a richer set of rules.
>>
>> Alessandro
>>
>>   On Fri, Mar 16, 2012 at 12:12 PM, Alessandro Adamou<ad...@cs.unibo.it>**
>>> wrote:
>>>
>>>   One thing that it would be great to do is to detect the ontology ID
>>>> *before* creating the TripleCollection in Clerezza, so any mappings could
>>>> be done before storing.
>>>>
>>>> But I don't know how this can be done with not so much code.
>>>>
>>>> Perhaps creating an IndexedGraph, exploring its content, then creating
>>>> the
>>>> Graph in the TcManager with the same content and the right graph name,
>>>> then
>>>> finally clearing the IndexedGraph could work.
>>>>
>>>> But it still means having twice the resource usage (disk+memory) for a
>>>> period.
>>>>
>>>> Alessandro
>>>>
>>>>
>>>>
>>>> On 3/16/12 10:56 AM, Alessandro Adamou wrote:
>>>>
>>>>   Hi David,
>>>>> well, I guess that depends pretty much on how heavy the usage of OntoNet
>>>>> is in your Stanbol installation.
>>>>>
>>>>> Those are graphs created when OntoNet has to load an ontology from its
>>>>> content rather than from a Web URI, so it cannot know the ontology ID
>>>>> earlier.
>>>>>
>>>>> This happens e.g. by POSTing the ontology as the payload or by passing a
>>>>> GraphContentInputSource to the Java API.
>>>>>
>>>>> Now I do not know why these graphs are created (perhaps the refactor
>>>>> engine could be loading some), but I do know that a Clerezza graph in
>>>>> Jena
>>>>> TDB occupies a LOT of disk space.
>>>>>
>>>>> Suffice it to say that my bundled had stored nine graphs of<100 triples
>>>>> each. Their disk space was about 1.8 GB, but when I tried to make a
>>>>> zipfile
>>>>> out of it, it came out as about 2MB!
>>>>>
>>>>> Alessandro
>>>>>
>>>>>
>>>>> On 3/16/12 10:30 AM, David Riccitelli wrote:
>>>>>
>>>>>   Dears,
>>>>>> As I ran into disk issues, I found that this folder:
>>>>>>   sling/felix/bundleXXX/data/tdb-data/mgraph
>>>>>>
>>>>>>
>>>>>> where XX is the bundle of:
>>>>>>   Clerezza - SCB Jena TDB Storage Provider
>>>>>> org.apache.clerezza.rdf.jena.tdb.storage
>>>>>>
>>>>>>
>>>>>> took almost 70 gbytes of disk space (then the disk space has been
>>>>>> exhausted).
>>>>>>
>>>>>> These are some of the files I found inside:
>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology889
>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology1041
>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology395
>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology363
>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology661
>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology786
>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology608
>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology213
>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology188
>>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology602
>>>>>>
>>>>>>
>>>>>> Any clues?
>>>>>>
>>>>>> Thanks,
>>>>>> David Riccitelli
>>>>>>
>>>>>> ********************************************************************************
>>>>>>
>>>>>>
>>>>>> InsideOut10 s.r.l.
>>>>>> P.IVA: IT-11381771002
>>>>>> Fax: +39 0110708239
>>>>>> ---
>>>>>> LinkedIn: http://it.linkedin.com/in/riccitelli
>>>>>> Twitter: ziodave
>>>>>> ---
>>>>>> Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
>>>>>> ********************************************************************************
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>   --
>>>> M.Sc. Alessandro Adamou
>>>>
>>>> Alma Mater Studiorum - Università di Bologna
>>>> Department of Computer Science
>>>> Mura Anteo Zamboni 7, 40127 Bologna - Italy
>>>>
>>>> Semantic Technology Laboratory (STLab)
>>>> Institute for Cognitive Science and Technology (ISTC)
>>>> National Research Council (CNR)
>>>> Via Nomentana 56, 00161 Rome - Italy
>>>>
>>>>
>>>> "I will give you everything, so long as you do not demand anything."
>>>> (Ettore Petrolini, 1930)
>>>>
>>>> Not sent from my iSnobTechDevice
>>>>
>>>>
>>>>
>> --
>> M.Sc. Alessandro Adamou
>>
>> Alma Mater Studiorum - Università di Bologna
>> Department of Computer Science
>> Mura Anteo Zamboni 7, 40127 Bologna - Italy
>>
>> Semantic Technology Laboratory (STLab)
>> Institute for Cognitive Science and Technology (ISTC)
>> National Research Council (CNR)
>> Via Nomentana 56, 00161 Rome - Italy
>>
>>
>> "I will give you everything, so long as you do not demand anything."
>> (Ettore Petrolini, 1930)
>>
>> Not sent from my iSnobTechDevice
>>
>>
>


-- 
M.Sc. Alessandro Adamou

Alma Mater Studiorum - Università di Bologna
Department of Computer Science
Mura Anteo Zamboni 7, 40127 Bologna - Italy

Semantic Technology Laboratory (STLab)
Institute for Cognitive Science and Technology (ISTC)
National Research Council (CNR)
Via Nomentana 56, 00161 Rome - Italy


"I will give you everything, so long as you do not demand anything."
(Ettore Petrolini, 1930)

Not sent from my iSnobTechDevice


Re: clerezza.rdf.jena.tdb.storage filling up with ontonet files

Posted by David Riccitelli <da...@insideout.io>.
>
> From what I've seen so far, yes. But it could depend on your engine
> configuration using a richer set of rules.


The same thing happens when we use the default rule set (seo_rules.sem)
from SVN.

We did not customize any other part of the installation with the exception
of loading a local DBpedia index in sling/datafiles.

David

On Fri, Mar 16, 2012 at 12:27 PM, Alessandro Adamou <ad...@cs.unibo.it> wrote:

> On 3/16/12 11:16 AM, David Riccitelli wrote:
>
>> Is this issue happening to us only?
>>
>
> From what I've seen so far, yes. But it could depend on your engine
> configuration using a richer set of rules.
>
> Alessandro
>
>> On Fri, Mar 16, 2012 at 12:12 PM, Alessandro Adamou<ad...@cs.unibo.it>
>> wrote:
>>
>>  One thing that it would be great to do is to detect the ontology ID
>>> *before* creating the TripleCollection in Clerezza, so any mappings could
>>> be done before storing.
>>>
>>> But I don't know how this can be done with not so much code.
>>>
>>> Perhaps creating an IndexedGraph, exploring its content, then creating
>>> the
>>> Graph in the TcManager with the same content and the right graph name,
>>> then
>>> finally clearing the IndexedGraph could work.
>>>
>>> But it still means having twice the resource usage (disk+memory) for a
>>> period.
>>>
>>> Alessandro
>>>
>>>
>>>
>>> On 3/16/12 10:56 AM, Alessandro Adamou wrote:
>>>
>>>  Hi David,
>>>>
>>>> well, I guess that depends pretty much on how heavy the usage of OntoNet
>>>> is in your Stanbol installation.
>>>>
>>>> Those are graphs created when OntoNet has to load an ontology from its
>>>> content rather than from a Web URI, so it cannot know the ontology ID
>>>> earlier.
>>>>
>>>> This happens e.g. by POSTing the ontology as the payload or by passing a
>>>> GraphContentInputSource to the Java API.
>>>>
>>>> Now I do not know why these graphs are created (perhaps the refactor
>>>> engine could be loading some), but I do know that a Clerezza graph in
>>>> Jena
>>>> TDB occupies a LOT of disk space.
>>>>
>>>> Suffice it to say that my bundle had stored nine graphs of <100 triples
>>>> each. Their disk space was about 1.8 GB, but when I tried to make a
>>>> zipfile
>>>> out of it, it came out as about 2MB!
>>>>
>>>> Alessandro
>>>>
>>>>
>>>> On 3/16/12 10:30 AM, David Riccitelli wrote:
>>>>
>>>>  Dears,
>>>>>
>>>>> As I ran into disk issues, I found that this folder:
>>>>>  sling/felix/bundleXXX/data/tdb-data/mgraph
>>>>>
>>>>>
>>>>> where XX is the bundle of:
>>>>>  Clerezza - SCB Jena TDB Storage Provider
>>>>> org.apache.clerezza.rdf.jena.tdb.storage
>>>>>
>>>>>
>>>>> took almost 70 gbytes of disk space (then the disk space has been
>>>>> exhausted).
>>>>>
>>>>> These are some of the files I found inside:
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology889
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology1041
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology395
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology363
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology661
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology786
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology608
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology213
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology188
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology602
>>>>>
>>>>>
>>>>> Any clues?
>>>>>
>>>>> Thanks,
>>>>> David Riccitelli
>>>>>
>>>>> ********************************************************************************
>>>>>
>>>>>
>>>>> InsideOut10 s.r.l.
>>>>> P.IVA: IT-11381771002
>>>>> Fax: +39 0110708239
>>>>> ---
>>>>> LinkedIn: http://it.linkedin.com/in/riccitelli
>>>>> Twitter: ziodave
>>>>> ---
>>>>> Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
>>>>> ********************************************************************************
>>>>>
>>>>>
>>>>>
>>>>>
>>>>  --
>>> M.Sc. Alessandro Adamou
>>>
>>> Alma Mater Studiorum - Università di Bologna
>>> Department of Computer Science
>>> Mura Anteo Zamboni 7, 40127 Bologna - Italy
>>>
>>> Semantic Technology Laboratory (STLab)
>>> Institute for Cognitive Science and Technology (ISTC)
>>> National Research Council (CNR)
>>> Via Nomentana 56, 00161 Rome - Italy
>>>
>>>
>>> "I will give you everything, so long as you do not demand anything."
>>> (Ettore Petrolini, 1930)
>>>
>>> Not sent from my iSnobTechDevice
>>>
>>>
>>>
>>
>
> --
> M.Sc. Alessandro Adamou
>
> Alma Mater Studiorum - Università di Bologna
> Department of Computer Science
> Mura Anteo Zamboni 7, 40127 Bologna - Italy
>
> Semantic Technology Laboratory (STLab)
> Institute for Cognitive Science and Technology (ISTC)
> National Research Council (CNR)
> Via Nomentana 56, 00161 Rome - Italy
>
>
> "I will give you everything, so long as you do not demand anything."
> (Ettore Petrolini, 1930)
>
> Not sent from my iSnobTechDevice
>
>


-- 
David Riccitelli

********************************************************************************
InsideOut10 s.r.l.
P.IVA: IT-11381771002
Fax: +39 0110708239
---
LinkedIn: http://it.linkedin.com/in/riccitelli
Twitter: ziodave
---
Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
********************************************************************************

Re: clerezza.rdf.jena.tdb.storage filling up with ontonet files

Posted by Alessandro Adamou <ad...@cs.unibo.it>.
On 3/16/12 11:16 AM, David Riccitelli wrote:
> Is this issue happening to us only?

 From what I've seen so far, yes. But it could depend on your engine 
configuration using a richer set of rules.

Alessandro

> On Fri, Mar 16, 2012 at 12:12 PM, Alessandro Adamou<ad...@cs.unibo.it> wrote:
>
>> One thing that it would be great to do is to detect the ontology ID
>> *before* creating the TripleCollection in Clerezza, so any mappings could
>> be done before storing.
>>
>> But I don't know how this can be done with not so much code.
>>
>> Perhaps creating an IndexedGraph, exploring its content, then creating the
>> Graph in the TcManager with the same content and the right graph name, then
>> finally clearing the IndexedGraph could work.
>>
>> But it still means having twice the resource usage (disk+memory) for a
>> period.
>>
>> Alessandro
>>
>>
>>
>> On 3/16/12 10:56 AM, Alessandro Adamou wrote:
>>
>>> Hi David,
>>>
>>> well, I guess that depends pretty much on how heavy the usage of OntoNet
>>> is in your Stanbol installation.
>>>
>>> Those are graphs created when OntoNet has to load an ontology from its
>>> content rather than from a Web URI, so it cannot know the ontology ID
>>> earlier.
>>>
>>> This happens e.g. by POSTing the ontology as the payload or by passing a
>>> GraphContentInputSource to the Java API.
>>>
>>> Now I do not know why these graphs are created (perhaps the refactor
>>> engine could be loading some), but I do know that a Clerezza graph in Jena
>>> TDB occupies a LOT of disk space.
>>>
>>> Suffice it to say that my bundle had stored nine graphs of <100 triples
>>> each. Their disk space was about 1.8 GB, but when I tried to make a zipfile
>>> out of it, it came out as about 2MB!
>>>
>>> Alessandro
>>>
>>>
>>> On 3/16/12 10:30 AM, David Riccitelli wrote:
>>>
>>>> Dears,
>>>>
>>>> As I ran into disk issues, I found that this folder:
>>>>   sling/felix/bundleXXX/data/tdb-data/mgraph
>>>>
>>>> where XX is the bundle of:
>>>>   Clerezza - SCB Jena TDB Storage Provider
>>>> org.apache.clerezza.rdf.jena.tdb.storage
>>>>
>>>> took almost 70 gbytes of disk space (then the disk space has been
>>>> exhausted).
>>>>
>>>> These are some of the files I found inside:
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology889
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology1041
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology395
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology363
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology661
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology786
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology608
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology213
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology188
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology602
>>>>
>>>>
>>>> Any clues?
>>>>
>>>> Thanks,
>>>> David Riccitelli
>>>>
>>>> ************************************************************************************
>>>>
>>>> InsideOut10 s.r.l.
>>>> P.IVA: IT-11381771002
>>>> Fax: +39 0110708239
>>>> ---
>>>> LinkedIn: http://it.linkedin.com/in/riccitelli
>>>> Twitter: ziodave
>>>> ---
>>>> Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
>>>> ************************************************************************************
>>>>
>>>>
>>>>
>>>
>> --
>> M.Sc. Alessandro Adamou
>>
>> Alma Mater Studiorum - Università di Bologna
>> Department of Computer Science
>> Mura Anteo Zamboni 7, 40127 Bologna - Italy
>>
>> Semantic Technology Laboratory (STLab)
>> Institute for Cognitive Science and Technology (ISTC)
>> National Research Council (CNR)
>> Via Nomentana 56, 00161 Rome - Italy
>>
>>
>> "I will give you everything, so long as you do not demand anything."
>> (Ettore Petrolini, 1930)
>>
>> Not sent from my iSnobTechDevice
>>
>>
>


-- 
M.Sc. Alessandro Adamou

Alma Mater Studiorum - Università di Bologna
Department of Computer Science
Mura Anteo Zamboni 7, 40127 Bologna - Italy

Semantic Technology Laboratory (STLab)
Institute for Cognitive Science and Technology (ISTC)
National Research Council (CNR)
Via Nomentana 56, 00161 Rome - Italy


"I will give you everything, so long as you do not demand anything."
(Ettore Petrolini, 1930)

Not sent from my iSnobTechDevice


Re: clerezza.rdf.jena.tdb.storage filling up with ontonet files

Posted by David Riccitelli <da...@insideout.io>.
Is this issue happening to us only?

On Fri, Mar 16, 2012 at 12:12 PM, Alessandro Adamou <ad...@cs.unibo.it> wrote:

> One thing that it would be great to do is to detect the ontology ID
> *before* creating the TripleCollection in Clerezza, so any mappings could
> be done before storing.
>
> But I don't know how this can be done with not so much code.
>
> Perhaps creating an IndexedGraph, exploring its content, then creating the
> Graph in the TcManager with the same content and the right graph name, then
> finally clearing the IndexedGraph could work.
>
> But it still means having twice the resource usage (disk+memory) for a
> period.
>
> Alessandro
>
>
>
> On 3/16/12 10:56 AM, Alessandro Adamou wrote:
>
>> Hi David,
>>
>> well, I guess that depends pretty much on how heavy the usage of OntoNet
>> is in your Stanbol installation.
>>
>> Those are graphs created when OntoNet has to load an ontology from its
>> content rather than from a Web URI, so it cannot know the ontology ID
>> earlier.
>>
>> This happens e.g. by POSTing the ontology as the payload or by passing a
>> GraphContentInputSource to the Java API.
>>
>> Now I do not know why these graphs are created (perhaps the refactor
>> engine could be loading some), but I do know that a Clerezza graph in Jena
>> TDB occupies a LOT of disk space.
>>
>> Suffice it to say that my bundle had stored nine graphs of <100 triples
>> each. Their disk space was about 1.8 GB, but when I tried to make a zipfile
>> out of it, it came out as about 2MB!
>>
>> Alessandro
>>
>>
>> On 3/16/12 10:30 AM, David Riccitelli wrote:
>>
>>> Dears,
>>>
>>> As I ran into disk issues, I found that this folder:
>>>  sling/felix/bundleXXX/data/tdb-data/mgraph
>>>
>>> where XX is the bundle of:
>>>  Clerezza - SCB Jena TDB Storage Provider
>>> org.apache.clerezza.rdf.jena.tdb.storage
>>>
>>> took almost 70 gbytes of disk space (then the disk space has been
>>> exhausted).
>>>
>>> These are some of the files I found inside:
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology889
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology1041
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology395
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology363
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology661
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology786
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology608
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology213
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology188
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology602
>>>
>>>
>>> Any clues?
>>>
>>> Thanks,
>>> David Riccitelli
>>>
>>> ************************************************************************************
>>>
>>> InsideOut10 s.r.l.
>>> P.IVA: IT-11381771002
>>> Fax: +39 0110708239
>>> ---
>>> LinkedIn: http://it.linkedin.com/in/riccitelli
>>> Twitter: ziodave
>>> ---
>>> Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
>>> ************************************************************************************
>>>
>>>
>>>
>>
>>
>
> --
> M.Sc. Alessandro Adamou
>
> Alma Mater Studiorum - Università di Bologna
> Department of Computer Science
> Mura Anteo Zamboni 7, 40127 Bologna - Italy
>
> Semantic Technology Laboratory (STLab)
> Institute for Cognitive Science and Technology (ISTC)
> National Research Council (CNR)
> Via Nomentana 56, 00161 Rome - Italy
>
>
> "I will give you everything, so long as you do not demand anything."
> (Ettore Petrolini, 1930)
>
> Not sent from my iSnobTechDevice
>
>


-- 
David Riccitelli

********************************************************************************
InsideOut10 s.r.l.
P.IVA: IT-11381771002
Fax: +39 0110708239
---
LinkedIn: http://it.linkedin.com/in/riccitelli
Twitter: ziodave
---
Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
********************************************************************************

Re: clerezza.rdf.jena.tdb.storage filling up with ontonet files

Posted by Alessandro Adamou <ad...@cs.unibo.it>.
Hi Rupert,

I've been trying to implement your proposed solution for the Ontology ID 
lookahead with the MGraph wrapper.

I'm trying to keep it simple for now; later I will need to detect the 
[ontologyIRI, versionIRI] pair.

However, BufferedInputStream.mark(int) does not seem to honor the read 
limit for me. No matter what value I set (even -1), Parser.parse() 
always reads through the whole graph, and when I try to reset() the 
stream after finding the ontology ID I always get an 
IOException("Stream closed").

I tried values much greater and much smaller than the file size in 
bytes, and tried moving the ontology triple early and late in the file; 
no dice.

Perhaps I should just set a limit on the number of triples instead, but 
I wouldn't want to read through a 100 MiB file just to use the first 100 
triples for guessing the ID. That could be unavoidable anyway, since 
most formats require reading the last chunk of a file in order to 
"close" the RDF markup (such as a closing </rdf:RDF> tag), but perhaps a 
SAX-style streaming parser could work?

any clue?
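One possible explanation for the IOException is that the parser closes the stream it was handed, which also closes the underlying BufferedInputStream and invalidates the mark. A minimal, self-contained sketch of a workaround (the class and method names here are invented for illustration; Commons IO ships a similar CloseShieldInputStream): hand the parser a wrapper whose close() is a no-op, so the buffered stream stays open and reset() keeps working.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class CloseShieldDemo {

    /** Wrapper that swallows close(), so a consumer closing its input
     *  does not close (and thus invalidate) the underlying stream. */
    static class CloseShieldInputStream extends FilterInputStream {
        CloseShieldInputStream(InputStream in) { super(in); }
        @Override public void close() { /* deliberately a no-op */ }
    }

    /** Consume the whole stream through a close shield, then reset the
     *  underlying stream and return its first byte again. */
    static char reReadFirstByte() throws IOException {
        byte[] data = "<rdf:RDF>...</rdf:RDF>".getBytes("UTF-8");
        BufferedInputStream bIn = new BufferedInputStream(new ByteArrayInputStream(data));
        bIn.mark(data.length + 1); // read limit covers the whole input

        InputStream shielded = new CloseShieldInputStream(bIn);
        while (shielded.read() != -1) { /* a parser would read it all here */ }
        shielded.close(); // closes the shield only, bIn stays open

        bIn.reset(); // still valid: back to the marked beginning
        return (char) bIn.read();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(reReadFirstByte()); // prints '<'
    }
}
```

Whether this applies depends on what the Clerezza Parser actually does with the stream, so treat it as a hypothesis to test rather than a confirmed fix.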

Alessandro


On 3/16/12 11:50 AM, Rupert Westenthaler wrote:
> Hi Alessandro
>
> Something like this could work:
>
> The approach is to:
> * provide an MGraph wrapper that skips all triples other than the ones needed to determine the OntologyID
> * use a BufferedInputStream and mark the beginning
> * parse into your MGraph wrapper until you can determine the OntologyID
> * throw an exception to stop the parsing
> * reset the stream
> * process the OntologyID
> * if you need to import the parsed ontology, you can reuse the reset stream
>
> Here is how the code might look.
>
> class MyMGraph extends SimpleMGraph {
>
>       String ontologyId;
>
>      @Override
>      protected boolean performAdd(Triple triple) {
>
>            boolean added = false;
>            //filter the interesting triples
>            if(triple is interesting){
>                added = super.performAdd(triple);
>            }
>            //check the currently available triples for the ontology ID
>            checkOntologyId();
>
>           if(ontologyId != null){
>               throw new RuntimeException(); //stop importing
>           }
>           //TODO: add a limit to the number of triples you read
>           return added;
>      }
>
>      public String getOntologyId(){
>          return ontologyId;
>      }
>
> }
>
>
> If you use a BufferedInputStream you could do the following
>
> BufferedInputStream bIn = new BufferedInputStream(in);
> bIn.mark(Integer.MAX_VALUE); //set an appropriate limit
> MyMGraph  graph = new MyMGraph();
> try {
>      parser.parse(graph, bIn, rdfFormat);
> } catch(RuntimeException e){ /* parsing was aborted on purpose */ }
> if(graph.getOntologyId() != null){
>      bIn.reset(); //resets the stream to the marked start
>      //now do the logic you need to do
> } else { //no OntologyID found
>      //do some error handling
> }
>
>
> WDYT
> Rupert


-- 
M.Sc. Alessandro Adamou

Alma Mater Studiorum - Università di Bologna
Department of Computer Science
Mura Anteo Zamboni 7, 40127 Bologna - Italy

Semantic Technology Laboratory (STLab)
Institute for Cognitive Science and Technology (ISTC)
National Research Council (CNR)
Via Nomentana 56, 00161 Rome - Italy


"I will give you everything, so long as you do not demand anything."
(Ettore Petrolini, 1930)

Not sent from my iSnobTechDevice


Re: clerezza.rdf.jena.tdb.storage filling up with ontonet files

Posted by Alessandro Adamou <ad...@cs.unibo.it>.
Thank you Rupert, this looks like a good idea indeed!

There are some policies to be decided first, e.g. if we discover that a 
graph with that name is already stored, we have to decide whether to 
replace, add, merge etc. (see STANBOL-426) . and this also depends on 
what artifact has the "ownership" of that graph (a scope/space, a 
session, or nobody).

But it will be much easier to understand once owl:versionIRI support is 
complete in STANBOL-524

But this is for later. I will create a new ticket and link it with the 
above and STANBOL-518, then post your code sample there.

For now, it should be enough to solve the problem on the refactor engine 
level.

Best,

Alessandro

On 3/16/12 11:50 AM, Rupert Westenthaler wrote:
> Hi Alessandro
>
> Something like this could work:
>
> The approach is to:
> * provide an MGraph wrapper that skips all triples other than the ones needed to determine the OntologyID
> * use a BufferedInputStream and mark the beginning
> * parse into your MGraph wrapper until you can determine the OntologyID
> * throw an exception to stop the parsing
> * reset the stream
> * process the OntologyID
> * if you need to import the parsed ontology, you can reuse the reset stream
>
> Here is how the code might look.
>
> class MyMGraph extends SimpleMGraph {
>
>       String ontologyId;
>
>      @Override
>      protected boolean performAdd(Triple triple) {
>
>            boolean added = false;
>            //filter the interesting triples
>            if(triple is interesting){
>                added = super.performAdd(triple);
>            }
>            //check the currently available triples for the ontology ID
>            checkOntologyId();
>
>           if(ontologyId != null){
>               throw new RuntimeException(); //stop importing
>           }
>           //TODO: add a limit to the number of triples you read
>           return added;
>      }
>
>      public String getOntologyId(){
>          return ontologyId;
>      }
>
> }
>
>
> If you use a BufferedInputStream you could do the following
>
> BufferedInputStream bIn = new BufferedInputStream(in);
> bIn.mark(Integer.MAX_VALUE); //set an appropriate limit
> MyMGraph  graph = new MyMGraph();
> try {
>      parser.parse(graph, bIn, rdfFormat);
> } catch(RuntimeException e){ /* parsing was aborted on purpose */ }
> if(graph.getOntologyId() != null){
>      bIn.reset(); //resets the stream to the marked start
>      //now do the logic you need to do
> } else { //no OntologyID found
>      //do some error handling
> }
>
>
> WDYT
> Rupert
>
> On 16.03.2012, at 11:12, Alessandro Adamou wrote:
>
>> One thing that it would be great to do is to detect the ontology ID *before* creating the TripleCollection in Clerezza, so any mappings could be done before storing.
>>
>> But I don't know how this can be done with not so much code.
>>
>> Perhaps creating an IndexedGraph, exploring its content, then creating the Graph in the TcManager with the same content and the right graph name, then finally clearing the IndexedGraph could work.
>>
>> But it still means having twice the resource usage (disk+memory) for a period.
>>
>> Alessandro
>>
>>
>> On 3/16/12 10:56 AM, Alessandro Adamou wrote:
>>> Hi David,
>>>
>>> well, I guess that depends pretty much on how heavy the usage of OntoNet is in your Stanbol installation.
>>>
>>> Those are graphs created when OntoNet has to load an ontology from its content rather than from a Web URI, so it cannot know the ontology ID earlier.
>>>
>>> This happens e.g. by POSTing the ontology as the payload or by passing a GraphContentInputSource to the Java API.
>>>
>>> Now I do not know why these graphs are created (perhaps the refactor engine could be loading some), but I do know that a Clerezza graph in Jena TDB occupies a LOT of disk space.
>>>
>>> Suffice it to say that my bundle had stored nine graphs of <100 triples
>>>
>>> Alessandro
>>>
>>>
>>> On 3/16/12 10:30 AM, David Riccitelli wrote:
>>>> Dears,
>>>>
>>>> As I ran into disk issues, I found that this folder:
>>>>   sling/felix/bundleXXX/data/tdb-data/mgraph
>>>>
>>>> where XX is the bundle of:
>>>>   Clerezza - SCB Jena TDB Storage Provider
>>>> org.apache.clerezza.rdf.jena.tdb.storage
>>>>
>>>> took almost 70 gbytes of disk space (then the disk space has been
>>>> exhausted).
>>>>
>>>> These are some of the files I found inside:
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology889
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology1041
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology395
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology363
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology661
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology786
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology608
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology213
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology188
>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology602
>>>>
>>>>
>>>> Any clues?
>>>>
>>>> Thanks,
>>>> David Riccitelli
>>>>
>>>> ********************************************************************************
>>>> InsideOut10 s.r.l.
>>>> P.IVA: IT-11381771002
>>>> Fax: +39 0110708239
>>>> ---
>>>> LinkedIn: http://it.linkedin.com/in/riccitelli
>>>> Twitter: ziodave
>>>> ---
>>>> Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
>>>> ********************************************************************************
>>>>
>>>
>>
>> -- 
>> M.Sc. Alessandro Adamou
>>
>> Alma Mater Studiorum - Università di Bologna
>> Department of Computer Science
>> Mura Anteo Zamboni 7, 40127 Bologna - Italy
>>
>> Semantic Technology Laboratory (STLab)
>> Institute for Cognitive Science and Technology (ISTC)
>> National Research Council (CNR)
>> Via Nomentana 56, 00161 Rome - Italy
>>
>>
>> "I will give you everything, so long as you do not demand anything."
>> (Ettore Petrolini, 1930)
>>
>> Not sent from my iSnobTechDevice
>>
>


-- 
M.Sc. Alessandro Adamou

Alma Mater Studiorum - Università di Bologna
Department of Computer Science
Mura Anteo Zamboni 7, 40127 Bologna - Italy

Semantic Technology Laboratory (STLab)
Institute for Cognitive Science and Technology (ISTC)
National Research Council (CNR)
Via Nomentana 56, 00161 Rome - Italy


"I will give you everything, so long as you do not demand anything."
(Ettore Petrolini, 1930)

Not sent from my iSnobTechDevice


Re: clerezza.rdf.jena.tdb.storage filling up with ontonet files

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi Alessandro

Something like this could work:

The approach is to:
* provide an MGraph wrapper that skips all triples other than the ones needed to determine the OntologyID
* use a BufferedInputStream and mark the beginning
* parse into your MGraph wrapper until you can determine the OntologyID
* throw an exception to stop the parsing
* reset the stream
* process the OntologyID
* if you need to import the parsed ontology, you can reuse the reset stream

Here is how the code might look.

class MyMGraph extends SimpleMGraph {

     String ontologyId;

    @Override
    protected boolean performAdd(Triple triple) {

          boolean added = false;
          //filter the interesting triples
          if(triple is interesting){
              added = super.performAdd(triple);
          }
          //check the currently available triples for the ontology ID
          checkOntologyId();

         if(ontologyId != null){
             throw new RuntimeException(); //stop importing
         }
         //TODO: add a limit to the number of triples you read
         return added;
    }

    public String getOntologyId(){
        return ontologyId;
    }

}


If you use a BufferedInputStream you could do the following

BufferedInputStream bIn = new BufferedInputStream(in);
bIn.mark(Integer.MAX_VALUE); //set an appropriate limit
MyMGraph  graph = new MyMGraph();
try {
    parser.parse(graph,inputStream,rdfFormat)
} catch(RuntimeException e){ }
if(graph.getOntologyId() != null){
    bIn.reset(); //reset set the stream to the start
    //now do the logic you need to do
} else { //No OntologyID found
    //do some error handling
}


WDYT
Rupert
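The mark/reset contract this snippet relies on can be exercised in isolation. Below is a self-contained sketch (class and method names are invented for the example): the read limit passed to mark() must cover every byte consumed before reset(), and a huge limit such as Integer.MAX_VALUE allows BufferedInputStream to grow its buffer up to the whole stream size, so a bounded value is preferable for large inputs.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;

public class MarkResetDemo {

    /** Read the whole input once (as the OntologyID lookahead would),
     *  then reset and return the first n bytes as a string. */
    static String peekTwice(byte[] data, int n) throws IOException {
        // Small initial buffer: BufferedInputStream grows it as needed,
        // but only while the mark's read limit has not been exceeded.
        BufferedInputStream bIn = new BufferedInputStream(new ByteArrayInputStream(data), 16);
        bIn.mark(data.length); // limit must cover all bytes read before reset()
        while (bIn.read() != -1) { /* first pass over the full stream */ }
        bIn.reset();           // back to the marked position
        byte[] head = new byte[n];
        int read = bIn.read(head, 0, n);
        return new String(head, 0, read, "UTF-8");
    }

    public static void main(String[] args) throws IOException {
        System.out.println(peekTwice("<rdf:RDF xmlns=\"...\"/>".getBytes("UTF-8"), 8)); // prints "<rdf:RDF"
    }
}
```

If the first pass reads more bytes than the limit given to mark(), the mark is invalidated and reset() throws an IOException, which is one failure mode worth ruling out when debugging the lookahead.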

On 16.03.2012, at 11:12, Alessandro Adamou wrote:

> One thing that it would be great to do is to detect the ontology ID *before* creating the TripleCollection in Clerezza, so any mappings could be done before storing.
> 
> But I don't know how this can be done with not so much code.
> 
> Perhaps creating an IndexedGraph, exploring its content, then creating the Graph in the TcManager with the same content and the right graph name, then finally clearing the IndexedGraph could work.
> 
> But it still means having twice the resource usage (disk+memory) for a period.
> 
> Alessandro
> 
> 
> On 3/16/12 10:56 AM, Alessandro Adamou wrote:
>> Hi David,
>> 
>> well, I guess that depends pretty much on how heavy the usage of OntoNet is in your Stanbol installation.
>> 
>> Those are graphs created when OntoNet has to load an ontology from its content rather than from a Web URI, so it cannot know the ontology ID earlier.
>> 
>> This happens e.g. by POSTing the ontology as the payload or by passing a GraphContentInputSource to the Java API.
>> 
>> Now I do not know why these graphs are created (perhaps the refactor engine could be loading some), but I do know that a Clerezza graph in Jena TDB occupies a LOT of disk space.
>> 
>> Suffice it to say that my bundle had stored nine graphs of <100 triples
>> 
>> Alessandro
>> 
>> 
>> On 3/16/12 10:30 AM, David Riccitelli wrote:
>>> Dears,
>>> 
>>> As I ran into disk issues, I found that this folder:
>>>  sling/felix/bundleXXX/data/tdb-data/mgraph
>>> 
>>> where XX is the bundle of:
>>>  Clerezza - SCB Jena TDB Storage Provider
>>> org.apache.clerezza.rdf.jena.tdb.storage
>>> 
>>> took almost 70 gbytes of disk space (then the disk space has been
>>> exhausted).
>>> 
>>> These are some of the files I found inside:
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology889
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology1041
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology395
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology363
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology661
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology786
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology608
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology213
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology188
>>> 193M ./ontonet%3A%3Ainputstream%3Aontology602
>>> 
>>> 
>>> Any clues?
>>> 
>>> Thanks,
>>> David Riccitelli
>>> 
>>> ******************************************************************************** 
>>> InsideOut10 s.r.l.
>>> P.IVA: IT-11381771002
>>> Fax: +39 0110708239
>>> ---
>>> LinkedIn: http://it.linkedin.com/in/riccitelli
>>> Twitter: ziodave
>>> ---
>>> Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
>>> ******************************************************************************** 
>>> 
>> 
>> 
> 
> 
> -- 
> M.Sc. Alessandro Adamou
> 
> Alma Mater Studiorum - Università di Bologna
> Department of Computer Science
> Mura Anteo Zamboni 7, 40127 Bologna - Italy
> 
> Semantic Technology Laboratory (STLab)
> Institute for Cognitive Science and Technology (ISTC)
> National Research Council (CNR)
> Via Nomentana 56, 00161 Rome - Italy
> 
> 
> "I will give you everything, so long as you do not demand anything."
> (Ettore Petrolini, 1930)
> 
> Not sent from my iSnobTechDevice
> 


Re: clerezza.rdf.jena.tdb.storage filling up with ontonet files

Posted by Alessandro Adamou <ad...@cs.unibo.it>.
One thing that it would be great to do is to detect the ontology ID 
*before* creating the TripleCollection in Clerezza, so any mappings 
could be done before storing.

But I don't know how this can be done without much code.

Perhaps creating an IndexedGraph, exploring its content, then creating 
the Graph in the TcManager with the same content and the right graph 
name, then finally clearing the IndexedGraph could work.

But it still means having twice the resource usage (disk+memory) for a 
period.
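
The staging idea above could be sketched roughly as follows. This is only an illustration, not tested code: `IndexedMGraph` is from Stanbol commons, `TcManager` and `Parser` are Clerezza APIs whose signatures should be double-checked, and `OntologyUtils.extractOntologyId` is a hypothetical helper standing in for whatever logic inspects the triples for the real ontology ID.

```java
import java.io.InputStream;

import org.apache.clerezza.rdf.core.MGraph;
import org.apache.clerezza.rdf.core.UriRef;
import org.apache.clerezza.rdf.core.access.TcManager;
import org.apache.clerezza.rdf.core.serializedform.Parser;
import org.apache.stanbol.commons.indexedgraph.IndexedMGraph;

public class StagedOntologyLoader {

    public MGraph loadWithRealId(TcManager tcManager, Parser parser,
            InputStream content, String format) {
        // 1. Parse into an in-memory IndexedMGraph first, so no TDB-backed
        //    graph is created under a temporary "inputstream" name.
        MGraph staging = new IndexedMGraph();
        parser.parse(staging, content, format);

        // 2. Inspect the parsed triples for the real ontology ID
        //    (e.g. the subject typed as owl:Ontology) -- hypothetical helper.
        UriRef ontologyId = OntologyUtils.extractOntologyId(staging);

        // 3. Create the persistent graph under its proper name and copy.
        MGraph persistent = tcManager.createMGraph(ontologyId);
        persistent.addAll(staging);

        // 4. Release the staging copy; resource usage is only doubled briefly.
        staging.clear();
        return persistent;
    }
}
```

As noted above, this still holds the ontology twice (once in memory, once on disk) for the duration of the copy.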

Alessandro


On 3/16/12 10:56 AM, Alessandro Adamou wrote:
> Hi David,
>
> well, I guess that depends pretty much on how heavy the usage of 
> OntoNet is in your Stanbol installation.
>
> Those are graphs created when OntoNet has to load an ontology from its 
> content rather than from a Web URI, so it cannot know the ontology ID 
> earlier.
>
> This happens e.g. by POSTing the ontology as the payload or by passing 
> a GraphContentInputSource to the Java API.
>
> Now I do not know why these graphs are created (perhaps the refactor 
> engine could be loading some), but I do know that a Clerezza graph in 
> Jena TDB occupies a LOT of disk space.
>
> Suffice it to say that my bundle had stored nine graphs of <100 
> triples each. Their disk space was about 1.8 GB, but when I tried to 
> make a zipfile out of it, it came out at about 2 MB!
>
> Alessandro
>
>
> On 3/16/12 10:30 AM, David Riccitelli wrote:
>> Dears,
>>
>> As I ran into disk issues, I found that this folder:
>>   sling/felix/bundleXXX/data/tdb-data/mgraph
>>
>> where XXX is the bundle number of:
>>   Clerezza - SCB Jena TDB Storage Provider
>> org.apache.clerezza.rdf.jena.tdb.storage
>>
>> took almost 70 GB of disk space (until the disk space was
>> exhausted).
>>
>> These are some of the files I found inside:
>> 193M ./ontonet%3A%3Ainputstream%3Aontology889
>> 193M ./ontonet%3A%3Ainputstream%3Aontology1041
>> 193M ./ontonet%3A%3Ainputstream%3Aontology395
>> 193M ./ontonet%3A%3Ainputstream%3Aontology363
>> 193M ./ontonet%3A%3Ainputstream%3Aontology661
>> 193M ./ontonet%3A%3Ainputstream%3Aontology786
>> 193M ./ontonet%3A%3Ainputstream%3Aontology608
>> 193M ./ontonet%3A%3Ainputstream%3Aontology213
>> 193M ./ontonet%3A%3Ainputstream%3Aontology188
>> 193M ./ontonet%3A%3Ainputstream%3Aontology602
>>
>>
>> Any clues?
>>
>> Thanks,
>> David Riccitelli
>>
>> ******************************************************************************** 
>>
>> InsideOut10 s.r.l.
>> P.IVA: IT-11381771002
>> Fax: +39 0110708239
>> ---
>> LinkedIn: http://it.linkedin.com/in/riccitelli
>> Twitter: ziodave
>> ---
>> Layar Partner 
>> Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
>> ******************************************************************************** 
>>
>>
>
>


-- 
M.Sc. Alessandro Adamou

Alma Mater Studiorum - Università di Bologna
Department of Computer Science
Mura Anteo Zamboni 7, 40127 Bologna - Italy

Semantic Technology Laboratory (STLab)
Institute for Cognitive Science and Technology (ISTC)
National Research Council (CNR)
Via Nomentana 56, 00161 Rome - Italy


"I will give you everything, so long as you do not demand anything."
(Ettore Petrolini, 1930)

Not sent from my iSnobTechDevice


Re: clerezza.rdf.jena.tdb.storage filling up with ontonet files

Posted by Alessandro Adamou <ad...@cs.unibo.it>.
Hi David,

well, I guess that depends pretty much on how heavy the usage of OntoNet 
is in your Stanbol installation.

Those are graphs created when OntoNet has to load an ontology from its 
content rather than from a Web URI, so it cannot know the ontology ID 
earlier.

This happens e.g. by POSTing the ontology as the payload or by passing a 
GraphContentInputSource to the Java API.

Now I do not know why these graphs are created (perhaps the refactor 
engine could be loading some), but I do know that a Clerezza graph in 
Jena TDB occupies a LOT of disk space.

Suffice it to say that my bundle had stored nine graphs of <100 triples 
each. Their disk space was about 1.8 GB, but when I tried to make a 
zipfile out of it, it came out at about 2 MB!

Alessandro


On 3/16/12 10:30 AM, David Riccitelli wrote:
> Dears,
>
> As I ran into disk issues, I found that this folder:
>   sling/felix/bundleXXX/data/tdb-data/mgraph
>
> where XXX is the bundle number of:
>   Clerezza - SCB Jena TDB Storage Provider
> org.apache.clerezza.rdf.jena.tdb.storage
>
> took almost 70 GB of disk space (until the disk space was
> exhausted).
>
> These are some of the files I found inside:
> 193M ./ontonet%3A%3Ainputstream%3Aontology889
> 193M ./ontonet%3A%3Ainputstream%3Aontology1041
> 193M ./ontonet%3A%3Ainputstream%3Aontology395
> 193M ./ontonet%3A%3Ainputstream%3Aontology363
> 193M ./ontonet%3A%3Ainputstream%3Aontology661
> 193M ./ontonet%3A%3Ainputstream%3Aontology786
> 193M ./ontonet%3A%3Ainputstream%3Aontology608
> 193M ./ontonet%3A%3Ainputstream%3Aontology213
> 193M ./ontonet%3A%3Ainputstream%3Aontology188
> 193M ./ontonet%3A%3Ainputstream%3Aontology602
>
>
> Any clues?
>
> Thanks,
> David Riccitelli
>
> ********************************************************************************
> InsideOut10 s.r.l.
> P.IVA: IT-11381771002
> Fax: +39 0110708239
> ---
> LinkedIn: http://it.linkedin.com/in/riccitelli
> Twitter: ziodave
> ---
> Layar Partner Network<http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
> ********************************************************************************
>


-- 
M.Sc. Alessandro Adamou

Alma Mater Studiorum - Università di Bologna
Department of Computer Science
Mura Anteo Zamboni 7, 40127 Bologna - Italy

Semantic Technology Laboratory (STLab)
Institute for Cognitive Science and Technology (ISTC)
National Research Council (CNR)
Via Nomentana 56, 00161 Rome - Italy


"I will give you everything, so long as you do not demand anything."
(Ettore Petrolini, 1930)

Not sent from my iSnobTechDevice