Posted to dev@clerezza.apache.org by "Daniel Spicar (Commented) (JIRA)" <ji...@apache.org> on 2011/10/26 14:35:32 UTC

[jira] [Commented] (CLEREZZA-643) Weak Performance of "application/json+rdf" serializer on big TripleCollections and Serializer/Parser using Platform encoding instead of UTF-8

    [ https://issues.apache.org/jira/browse/CLEREZZA-643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13135900#comment-13135900 ] 

Daniel Spicar commented on CLEREZZA-643:
----------------------------------------

Thank you for your contribution, Rupert.

I inspected the patch you submitted and the current version of the code. An improved RDF/JSON serializer is something we would really like. In the real-world applications we use Clerezza for, one major problem is poor performance and/or excessive memory consumption, and we are dealing with huge graphs there. Your contribution therefore sounds really promising. But as I am not the author of the original code, I reviewed the original code as well with respect to the above scenario. My focus was determining whether the implementations scale with large graphs.

Comments on the Serializer:
- As I understand the RDF/JSON specification, sorted output is not required, only grouping by subject and predicate (see the JSON sample after this list). Therefore I don't think the more expensive subject-predicate sort (which you commented out but still included) is necessary. Or am I missing something? Can this part be safely removed?

- The original (unpatched) code does NOT properly stream the serialization. This is a concern when the source graph contains many unique subjects/predicates/objects, because all the generated JSONObjects/JSONArrays are held in memory before being written to the output stream. It is especially problematic when many BLOBs are stored in the graph.

- The patch does correctly stream the serialization, but it loads the entire source graph into memory for sorting (the toArray call at line 99). Again, this may easily exceed available memory. The original code does not load the entire source graph into memory because it uses filter (when the underlying graph is backed by a triple store); the iterators returned by filter only access the data in the graph one triple at a time, upon each call to next().
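
To illustrate the grouping point from the first comment: as I read the specification, output like the following is valid regardless of the order in which subjects appear, as long as each subject and each predicate occurs only once. This is a hand-written sample, not actual serializer output:

    {
      "http://example.org/about": {
        "http://purl.org/dc/elements/1.1/title": [
          { "type": "literal", "value": "Example", "lang": "en" }
        ],
        "http://purl.org/dc/elements/1.1/creator": [
          { "type": "uri", "value": "http://example.org/people/anna" }
        ]
      }
    }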

Conclusion:
I think neither solution can support graphs that exceed available memory. I assume the unpatched version can deal with slightly larger graphs than your solution, but that is irrelevant: we need a solution that works reliably with graphs larger than memory. As you mentioned, an optimal solution would exploit a sorted (or at least grouped) iterator provided by the underlying TripleCollection. I think that is the approach we need to take to solve this issue in a scalable manner; a rough sketch follows.
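
A minimal sketch of what the serializer core could then look like. Note that groupedIterator() is hypothetical, nothing like it exists in the current TripleCollection API, and the JSON-writing steps are only indicated by comments:

    // Hypothetical single-pass serialization over an iterator that is
    // guaranteed to return triples grouped by subject.
    Iterator<Triple> triples = tc.groupedIterator(); // assumed capability
    NonLiteral currentSubject = null;
    while (triples.hasNext()) {
        Triple triple = triples.next();
        if (!triple.getSubject().equals(currentSubject)) {
            // write out and discard the JSON object of the previous
            // subject (if any), then start a new one
            currentSubject = triple.getSubject();
        }
        // append triple.getPredicate()/triple.getObject() to the current
        // subject's JSON object; only one subject is ever held in memory
    }
    // finally, write out the JSON object of the last subject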

Now there is the question whether to accept your patch for Clerezza until we implement a better solution. I am not sure. Your solution is a significant improvement in terms of serialization speed, but the original code is easier to quick-fix so that the results are streamed properly to the output stream (I think exploiting the json-simple streaming interface may do the trick; see the sketch below). So the question seems to be what is more important: a solution that, while possibly very slow, will not exceed available memory, or a solution that significantly improves serialization performance.
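
To be concrete about the quick-fix idea, something along these lines should keep memory bounded to one subject at a time. This is only a sketch assuming json-simple; subjects, toKey and buildPredicateMap are hypothetical placeholders (the last one would collect the predicates/objects of a single subject, e.g. via filter):

    Writer out = new OutputStreamWriter(outputStream, "UTF-8");
    out.write("{");
    boolean first = true;
    for (NonLiteral subject : subjects) { // unique subjects, however obtained
        if (!first) {
            out.write(",");
        }
        first = false;
        JSONObject predicates = buildPredicateMap(subject); // hypothetical helper
        out.write(JSONValue.toJSONString(toKey(subject)));  // writes the quoted, escaped key
        out.write(":");
        predicates.writeJSONString(out); // JSONStreamAware: writes directly to the stream
    }
    out.write("}");
    out.flush();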

My opinion is that, since we have so far lived with a solution that cannot deal with very large graphs anyway, the speed improvement may be more valuable. However, we need to start working on a better solution as described above.

I think we should raise this issue on the mailing list for discussion.
                
> Weak Performance of "application/json+rdf" serializer on big TripleCollections and Serializer/Parser using Platform encoding instead of UTF-8
> --------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CLEREZZA-643
>                 URL: https://issues.apache.org/jira/browse/CLEREZZA-643
>             Project: Clerezza
>          Issue Type: Improvement
>            Reporter: Rupert Westenthaler
>            Assignee: Daniel Spicar
>         Attachments: rdf.rdfjson-arrays.sort_based_serializer_and_UTF-8.patch
>
>
> Both the "application/json+rdf" serializer and parser use platform-specific encodings instead of UTF-8.
> In addition, the serializer suffers from very poor performance on big graphs (at least when using SimpleMGraph).
> After some digging in the code I came to the conclusion that this is caused by the use of multiple TripleCollection.filter(..) calls: first to filter all predicates for a subject, and then all objects for each subject/predicate combination. Trying to serialize a graph with 50k triples took several minutes at 100% CPU.
> With the next comment I will provide a patch with an implementation based on a sorted array of the triples. With this method one can serialize graphs with 100k triples in about 1 sec. This patch also changes the encoding to UTF-8.
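
For reference, the encoding part of the fix is essentially a one-line change wherever the serializer/parser wraps the raw streams; a sketch of the before/after:

    // before: uses the platform default encoding
    Writer writer = new OutputStreamWriter(outputStream);
    // after: explicit UTF-8
    Writer writer = new OutputStreamWriter(outputStream, "UTF-8");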

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira