Posted to dev@clerezza.apache.org by Daniel Spicar <ds...@apache.org> on 2011/10/26 15:36:33 UTC

RE: Weak Performance of "application/json+rdf" serializer on big TripleCollections (CLEREZZA-643)

Rupert provided a patch to improve serialization performance (thanks for the
effort!). I reviewed his patch and have written my comments on the JIRA
page. But I think we need to discuss the issues I raise there. In summary:

- neither the patch nor the current implementation works reliably with very
large graphs (larger than memory)
- the patch is significantly faster than the current implementation
- the current implementation is easier to quick-fix for very large graphs
(but also very slow)

There is a sketch of a better solution that should allow us to be faster and
not limited by memory size. It is based on sorted iterators. However these
iterators need to be supplied by the underlying TripleCollections and that
will require more changes to the core of Clerezza.
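
As a rough illustration of why sorted iterators would remove the memory limit
(this is not the sketch from the JIRA issue, only a minimal example under
stated assumptions): if triples arrive sorted by subject and then by
predicate, the serializer can write each subject/predicate group as it goes
and only needs to remember the previous subject and predicate. The Triple
class below is a hypothetical stand-in with plain string terms, not Clerezza's
Triple interface, and the output is a simplified subject -> predicate ->
[objects] JSON shape rather than the full "application/json+rdf" object format.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Sketch only: streams a subject/predicate-sorted triple iterator to JSON. */
class SortedStreamSketch {

    /** Hypothetical simplified triple; all three terms are plain strings. */
    static final class Triple {
        final String subject, predicate, object;
        Triple(String s, String p, String o) { subject = s; predicate = p; object = o; }
    }

    /**
     * Assumed precondition: the iterator yields triples sorted by subject,
     * then by predicate. Memory use is independent of the number of triples,
     * because only the previous subject and predicate are kept.
     */
    static void write(Iterator<Triple> sorted, Writer out) throws IOException {
        String prevSubject = null;
        String prevPredicate = null;
        out.write("{");
        while (sorted.hasNext()) {
            Triple t = sorted.next();
            boolean newSubject = !t.subject.equals(prevSubject);
            boolean newPredicate = newSubject || !t.predicate.equals(prevPredicate);
            if (newSubject) {
                // Close the previous subject's last array and object, if any.
                if (prevSubject != null) out.write("]},");
                out.write(quote(t.subject) + ":{");
            } else if (newPredicate) {
                // Same subject, next predicate: close the previous array.
                out.write("],");
            }
            if (newPredicate) {
                out.write(quote(t.predicate) + ":[");
            } else {
                out.write(",");
            }
            out.write(quote(t.object));
            prevSubject = t.subject;
            prevPredicate = t.predicate;
        }
        if (prevSubject != null) out.write("]}");
        out.write("}");
    }

    /** Minimal escaping (backslash and double quote only); not a full JSON escaper. */
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    public static void main(String[] args) throws IOException {
        List<Triple> sorted = Arrays.asList(
                new Triple("s1", "p1", "a"),
                new Triple("s1", "p1", "b"),
                new Triple("s1", "p2", "c"),
                new Triple("s2", "p1", "d"));
        StringWriter out = new StringWriter();
        write(sorted.iterator(), out);
        System.out.println(out);
    }
}
```

The catch is the precondition: the example only works if the underlying
collection can hand out its triples in sorted order, which is where the
changes to the core would come in.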

Because neither the current implementation nor the patch really works
on "big" TripleCollections (when big means really, really big), the question
we should discuss is:
a) keep everything as it is and solve the problem properly (possibly as
described in the issue)
b) quick-fix the current implementation (slow performance) + schedule a
proper solution
c) apply the patch (fast but graphs limited to available memory size) +
schedule a proper solution

My favorite is c.

What do you think?

Re: Weak Performance of "application/json+rdf" serializer on big TripleCollections (CLEREZZA-643)

Posted by Hasan Hasan <ha...@trialox.org>.

Since the reported performance gain is quite big, and there seem to be no
other implications besides increased code complexity, I would also go for
option c.
I am not sure whether we need to call a vote here.

regards
hasan

On Wed, Oct 26, 2011 at 5:30 PM, Tsuyoshi Ito <ts...@trialox.org> wrote:

> I also prefer option C
>
> Cheers
> Tsuy

Re: Weak Performance of "application/json+rdf" serializer on big TripleCollections (CLEREZZA-643)

Posted by Tsuyoshi Ito <ts...@trialox.org>.

I also prefer option C

Cheers
Tsuy

On Oct 26, 2011, at 5:14 PM, Tommaso Teofili wrote:

> same here; I'd go with option C :)
> Tommaso

--trialox ag-------------------------------------
  tsuyoshi ito
  hardturmstrasse 101 
  8005 zuerich

Re: Weak Performance of "application/json+rdf" serializer on big TripleCollections (CLEREZZA-643)

Posted by Tommaso Teofili <to...@gmail.com>.

same here; I'd go with option C :)
Tommaso

Re: Weak Performance of "application/json+rdf" serializer on big TripleCollections (CLEREZZA-643)

Posted by Daniel Spicar <ds...@apache.org>.

The JIRA issue can be found here:
https://issues.apache.org/jira/browse/CLEREZZA-643
