Posted to users@jena.apache.org by Rob Hall <kv...@gmail.com> on 2012/12/04 21:24:11 UTC

DatasetImpl and Model retrieval

*tl;dr:* I had to extend DatasetImpl to bypass internal caching in order to
be able to retrieve the same models that I added to the Dataset. I believe
the only performance hit that I can see will be that the Models which wrap
the underlying graphs will not be garbage collected. Is this a correct
assumption regarding the effect of this change, and are my expectations
regarding the behavior of dataset in error?

*background:* I extended the OntModel interface (and created a
corresponding extension to OntModelImpl) to facilitate some domain-specific
entity representations and actions. I wished to execute queries over sets
of these Models. These queries were to include their sub-models, and
potentially inferred triples from any attached reasoner.

To allow this, I added instances of my new implementation to a Jena
in-memory Dataset.

So far, this has worked as I had expected (with fairly thorough integration
testing taking place). Recently, I discovered a crash whenever I had added
a sufficient number of models to the system. When I went to retrieve a
named model, I could not cast it back to its original type. The following
bit of code from DatasetImpl makes it clear why:

    private Model graph2model(Graph graph)
    {
        // Called from readers -- outer synchronization needed.
        Model model = cache.get(graph) ;
        if ( model == null )
        {
            model = ModelFactory.createModelForGraph(graph) ;
            cache.put(graph, model) ;
        }
        return model ;
    }

Inside DatasetImpl, a LRU cache is used to keep track of recently requested
(or inserted) models. After executing for some time, the original model has
been lost, and a subsequent request (and cast) results in a
ClassCastException.
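
The failure mode can be sketched outside Jena with plain Java (BaseModel, SpecialModel, and the LinkedHashMap-based LRU below are hypothetical stand-ins for illustration, not Jena types): once the LRU evicts the caller's subclass instance, the miss path recreates a plain base-class wrapper and the cast fails.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LruEvictionDemo {
    // Hypothetical stand-ins for Model and a custom subclass; not Jena types.
    static class BaseModel {}
    static class SpecialModel extends BaseModel {}

    static String demo() {
        final int maxSize = 2;
        // A minimal LRU map, analogous to the cache inside DatasetImpl.
        Map<String, BaseModel> cache =
            new LinkedHashMap<String, BaseModel>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, BaseModel> eldest) {
                    return size() > maxSize;
                }
            };

        cache.put("g1", new SpecialModel());  // the caller's subclass instance
        cache.put("g2", new SpecialModel());
        cache.put("g3", new SpecialModel());  // evicts "g1"

        // On a cache miss, graph2model builds a *plain* wrapper.
        BaseModel m = cache.get("g1");
        if (m == null) {
            m = new BaseModel();              // analogue of createModelForGraph
            cache.put("g1", m);
        }

        try {
            SpecialModel s = (SpecialModel) m;
            return "cast succeeded: " + s;
        } catch (ClassCastException e) {
            return "ClassCastException after eviction";
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```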

*workaround:* My workaround would have been to extend DatasetImpl and
override the 4 methods that interact with the cache to, instead, work with
a standard HashMap. As those methods are private, I am presently creating my
own implementation.
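
A minimal sketch of why a non-evicting map avoids the problem, again with hypothetical stand-in classes rather than the real DatasetImpl: the lookup logic is the same as graph2model, but because nothing is ever evicted, the instance (and runtime type) originally stored is the one returned.

```java
import java.util.HashMap;
import java.util.Map;

public class HashMapCacheDemo {
    // Hypothetical stand-ins; not Jena types.
    static class BaseModel {}
    static class SpecialModel extends BaseModel {}

    // Same lookup logic as graph2model, but over a map that never evicts.
    static BaseModel lookup(Map<String, BaseModel> cache, String key) {
        BaseModel model = cache.get(key);
        if (model == null) {
            model = new BaseModel();  // analogue of createModelForGraph
            cache.put(key, model);
        }
        return model;
    }

    static boolean demo() {
        Map<String, BaseModel> cache = new HashMap<>();
        SpecialModel original = new SpecialModel();
        cache.put("g1", original);

        // However many other models are added, "g1" is never evicted,
        // so later lookups return the original instance and its runtime type.
        for (int i = 0; i < 10_000; i++) {
            lookup(cache, "g" + i);
        }
        return lookup(cache, "g1") == original;
    }

    public static void main(String[] args) {
        System.out.println(demo());  // true
    }
}
```

The trade-off is exactly the one raised in the question: every Model wrapper is retained for the lifetime of the Dataset.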

*question:* My workaround makes several assumptions, and I wanted to be
certain that they made sense.

*1)* My use of Dataset is outside of its expected use. Multiple requests to
get the same model should not expect to retrieve the same implementation,
let alone the same instance, of the model that they saw on a previous
request. (i.e., *that this isn't a bug*)

*2)* The caching that occurs within DatasetImpl primarily exists to
minimize held references to objects in order to facilitate Java
garbage collection (the Model may be garbage collected, but its backing
graph, retained by the Dataset, will not). Therefore, *overriding this
behavior will have no effect on applications which are already retaining
references to those Models*.

Thanks in advance,
Rob

Re: DatasetImpl and Model retrieval

Posted by Andy Seaborne <an...@apache.org>.
I've made the 4 private methods protected and added a protected 
"createCache".

> By this rationale, I'd consider DatasetImpl creating models (in instances
> where the client code is not requesting a created model) a bug. There is no
> reason for an abstraction, whose only purpose is to relate a collection of
> models to the collection of their backing graphs, to create new models to
> then represent those graphs.

There is a cache for the case of databases with very large numbers of 
graphs persistently on disk.  If the DatasetImpl is churning through 
models, they will slowly fill up memory in a long-running program. 
Models aren't particularly cheap currently because of their own internal 
caches and state (EnhGraph).  And just creating a model on each call 
would not be good for code that looks like

loop:
     dataset.getNamedModel("fixed name"). .. model operations ...
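
A plain-Java sketch of the cost being weighed here (ModelWrapper is a hypothetical stand-in that counts its own allocations; the names are illustrative, not Jena's): without a cache, the loop above allocates a fresh wrapper on every pass, while a per-name cache allocates once.

```java
import java.util.HashMap;
import java.util.Map;

public class WrapperChurnDemo {
    static int constructed = 0;

    // Hypothetical wrapper standing in for a Model; counts its allocations.
    static class ModelWrapper {
        ModelWrapper() { constructed++; }
    }

    // Uncached: a fresh wrapper per call, as graph2model would do with no cache.
    static ModelWrapper uncached(String name) {
        return new ModelWrapper();
    }

    // Cached: at most one wrapper per name.
    static final Map<String, ModelWrapper> cache = new HashMap<>();
    static ModelWrapper cached(String name) {
        return cache.computeIfAbsent(name, k -> new ModelWrapper());
    }

    static String demo() {
        constructed = 0;
        for (int i = 0; i < 1000; i++) uncached("fixed name");
        int withoutCache = constructed;

        constructed = 0;
        for (int i = 0; i < 1000; i++) cached("fixed name");
        return withoutCache + " vs " + constructed;
    }

    public static void main(String[] args) {
        System.out.println(demo());  // 1000 vs 1
    }
}
```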


Architecturally, maybe a reworking where the usual Model works 
(statelessly) over a Graph and the EnhGraph machinery only appears for 
OntModels would be good.

	Andy


Re: DatasetImpl and Model retrieval

Posted by Rob Hall <kv...@gmail.com>.
My initial instinct for creating the new domain models was to use
composition (either through dynamic proxy objects, or hand writing the
entire interface). In the end, however, my design paralleled the one used
in Jena for the same reasons that OntModelImpl doesn't extend InfModel by
composition. In fact, this problem exists as much for OntModels and
InfModels as it does for my implementation. Dataset only produces Model
implementations, thus breaking basic use cases outside of my domain, as
well.

I suggest that a new model should never be created except when the caller
is requesting the creation of one that doesn't already exist in the dataset.
One reason for this is that, if I add a complex model (such as an
InfModel) to the Dataset, the DatasetImpl will pass the model's backing
graph (a union of a graph dynamically produced by inference, and a graph of
the contained triples) into the DatasetGraph. When the DatasetImpl then
creates a new model object to wrap that graph, it loses the representation
of sub-models, etc. that existed, while retaining those sub-graphs in the
data. DatasetImpl, as a consequence, fails to do anything except become a
thin wrapper around calling sdg.addNamedGraph.

Furthermore, these models are lightweight, as you mentioned; preparing them
for garbage collection or, alternatively, creating them every time, both
seem needlessly chatty for almost no gain and a provable loss. Meanwhile, the
API not only becomes more intuitive / useful for anyone working with
something more complicated than Model, but also remains consistent with its
treatment of the backing graphs.

By this rationale, I'd consider DatasetImpl creating models (in instances
where the client code is not requesting a created model) a bug. There is no
reason for an abstraction, whose only purpose is to relate a collection of
models to the collection of their backing graphs, to create new models to
then represent those graphs.



On Tue, Dec 4, 2012 at 4:15 PM, Stephen Allen <sa...@apache.org> wrote:

> On Tue, Dec 4, 2012 at 3:24 PM, Rob Hall <kv...@gmail.com> wrote:
> > *tl;dr:* I had to extend DatasetImpl to bypass internal caching in order
> to
> > be able to retrieve the same models that I added to the Dataset. I
> believe
> > the only performance hit that I can see will be that the Models which
> wrap
> > the underlying graphs will not be garbage collected. Is this a correct
> > assumption regarding the effect of this change, and are my expectations
> > regarding the behavior of dataset in error?
> >
> > *background:* I extended the OntModel interface (and created a
> > corresponding extension to OntModelImpl) to facilitate some
> domain-specific
> > entity representations and actions. I wished to execute queries over sets
> > of these Models. These queries were to include their sub-models, and
> > potentially inferred triples from any attached reasoner.
> >
>
> I think a better way to handle this is via composition rather than
> inheritance from the OntModelImpl class.  If your extended model class
> does not need direct access to the instance variables of OntModelImpl,
> then this way is a lot more flexible.  Especially if the
> implementation of OntModelImpl changes in the future.  Essentially,
> make your custom Model a stateless wrapper around another Model.
>
> Make your own interface that extends OntModel, but make your
> implementation class implement that new interface and delegate all
> calls to an OntModel that is passed into the constructor.  This way
> your class is very cheap to construct, and you can create a new one
> backed by any OntModel that you retrieve from a Dataset.
>
>
> > To allow this, I added instances of my new implementation to a Jena
> > in-memory Dataset.
> >
> > So far, this has worked as I had expected (with fairly thorough
> integration
> > testing taking place). Recently, I discovered a crash whenever I had
> added
> > a sufficient number of models to the system. When I went to retrieve a
> > named model, I could not cast it back to its original type. The following
> > bit of code from DatasetImpl makes it clear why:
> >
> >     private Model graph2model(Graph graph)
> >     {
> >         // Called from readers -- outer synchronization needed.
> >         Model model = cache.get(graph) ;
> >         if ( model == null )
> >         {
> >             model = ModelFactory.createModelForGraph(graph) ;
> >             cache.put(graph, model) ;
> >         }
> >         return model ;
> >     }
> >
> > Inside DatasetImpl, a LRU cache is used to keep track of recently
> requested
> > (or inserted) models. After executing for some time, the original model
> has
> > been lost, and a subsequent request (and cast) results in a
> > ClassCastException.
> >
>
> I think that cache needs to be eliminated.  It is probably a relic
> from the days of expensive object creation.
>
>
> > *workaround:* My workaround would have been to extend DatasetImpl and
> > override the 4 methods that interact with the cache to, instead, work
> with
> > a standard HashMap. As those methods are private, I am presently creating
> > my
> > own implementation.
> >
> > *question:* My workaround makes several assumptions, and I wanted to be
> > certain that they made sense.
> >
> > *1)* My use of Dataset is outside of its expected use. Multiple requests
> > to get the same model should not expect to retrieve the same
> > implementation, let alone the same instance, of the model that they saw
> > on a previous request. (i.e., *that this isn't a bug*)
>
> Model is meant to be a stateless wrapper around an underlying Graph,
> and Dataset is similarly a wrapper around a DatasetGraph.  Therefore
> it is expected that you would get different instances of Model back from a
> Dataset than you added.
>
>
> > *2)* The caching that occurs within DatasetImpl primarily exists to
> > minimize held references to objects in order to facilitate Java
> > garbage collection (the Model may be garbage collected, but its backing
> > graph, retained by the Dataset, will not). Therefore, *overriding this
> > behavior will have no effect on applications which are already retaining
> > references to those Models*.
>
> The caching that is there now probably isn't needed, and it should be
> returning a new Model instance every time.
>
> -Stephen
>



-- 
Robert Trevor Hall
Phone: (315) 719-5039

Re: DatasetImpl and Model retrieval

Posted by Stephen Allen <sa...@apache.org>.
On Tue, Dec 4, 2012 at 3:24 PM, Rob Hall <kv...@gmail.com> wrote:
> *tl;dr:* I had to extend DatasetImpl to bypass internal caching in order to
> be able to retrieve the same models that I added to the Dataset. I believe
> the only performance hit that I can see will be that the Models which wrap
> the underlying graphs will not be garbage collected. Is this a correct
> assumption regarding the effect of this change, and are my expectations
> regarding the behavior of dataset in error?
>
> *background:* I extended the OntModel interface (and created a
> corresponding extension to OntModelImpl) to facilitate some domain-specific
> entity representations and actions. I wished to execute queries over sets
> of these Models. These queries were to include their sub-models, and
> potentially inferred triples from any attached reasoner.
>

I think a better way to handle this is via composition rather than
inheritance from the OntModelImpl class.  If your extended model class
does not need direct access to the instance variables of OntModelImpl,
then this way is a lot more flexible.  Especially if the
implementation of OntModelImpl changes in the future.  Essentially,
make your custom Model a stateless wrapper around another Model.

Make your own interface that extends OntModel, but make your
implementation class implement that new interface and delegate all
calls to an OntModel that is passed into the constructor.  This way
your class is very cheap to construct, and you can create a new one
backed by any OntModel that you retrieve from a Dataset.
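
A minimal sketch of that delegation pattern, using a simplified stand-in interface (Named, DomainNamed, and DomainNamedImpl below are hypothetical illustration names; in practice the delegate would be an OntModel retrieved from the Dataset):

```java
public class DelegationDemo {
    // Simplified stand-in for OntModel; hypothetical, for illustration only.
    interface Named {
        String getName();
    }

    // The custom interface extends the base one with domain-specific calls.
    interface DomainNamed extends Named {
        String greeting();
    }

    // Stateless wrapper: base-interface calls are delegated, so the class is
    // cheap to construct around whatever instance a Dataset hands back.
    static class DomainNamedImpl implements DomainNamed {
        private final Named delegate;

        DomainNamedImpl(Named delegate) { this.delegate = delegate; }

        @Override public String getName() { return delegate.getName(); }
        @Override public String greeting() { return "hello, " + delegate.getName(); }
    }

    public static void main(String[] args) {
        Named fromDataset = () -> "g1";  // whatever instance comes back
        DomainNamed mine = new DomainNamedImpl(fromDataset);
        System.out.println(mine.greeting());  // hello, g1
    }
}
```

Because the wrapper holds no state of its own, wrapping a freshly retrieved model on every use is cheap, which sidesteps the instance-identity problem entirely.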


> To allow this, I added instances of my new implementation to a Jena
> in-memory Dataset.
>
> So far, this has worked as I had expected (with fairly thorough integration
> testing taking place). Recently, I discovered a crash whenever I had added
> a sufficient number of models to the system. When I went to retrieve a
> named model, I could not cast it back to its original type. The following
> bit of code from DatasetImpl makes it clear why:
>
>     private Model graph2model(Graph graph)
>     {
>         // Called from readers -- outer synchronization needed.
>         Model model = cache.get(graph) ;
>         if ( model == null )
>         {
>             model = ModelFactory.createModelForGraph(graph) ;
>             cache.put(graph, model) ;
>         }
>         return model ;
>     }
>
> Inside DatasetImpl, a LRU cache is used to keep track of recently requested
> (or inserted) models. After executing for some time, the original model has
> been lost, and a subsequent request (and cast) results in a
> ClassCastException.
>

I think that cache needs to be eliminated.  It is probably a relic
from the days of expensive object creation.


> *workaround:* My workaround would have been to extend DatasetImpl and
> override the 4 methods that interact with the cache to, instead, work with
> a standard HashMap. As those methods are private, I am presently creating my
> own implementation.
>
> *question:* My workaround makes several assumptions, and I wanted to be
> certain that they made sense.
>
> *1)* My use of Dataset is outside of its expected use. Multiple requests to
> get the same model should not expect to retrieve the same implementation,
> let alone the same instance, of the model that they saw on a previous
> request. (i.e., *that this isn't a bug*)

Model is meant to be a stateless wrapper around an underlying Graph,
and Dataset is similarly a wrapper around a DatasetGraph.  Therefore
it is expected that you would get different instances of Model back from a
Dataset than you added.


> *2)* The caching that occurs within DatasetImpl primarily exists to
> minimize held references to objects in order to facilitate Java
> garbage collection (the Model may be garbage collected, but its backing
> graph, retained by the Dataset, will not). Therefore, *overriding this
> behavior will have no effect on applications which are already retaining
> references to those Models*.

The caching that is there now probably isn't needed, and it should be
returning a new Model instance every time.

-Stephen