Posted to dev@clerezza.apache.org by "Rupert Westenthaler (Updated) (JIRA)" <ji...@apache.org> on 2011/10/13 23:29:11 UTC

[jira] [Updated] (CLEREZZA-643) Weak Performance of "application/json+rdf" serializer on big TripleCollections and Serializer/Parser using Platform encoding instead of UTF-8

     [ https://issues.apache.org/jira/browse/CLEREZZA-643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rupert Westenthaler updated CLEREZZA-643:
-----------------------------------------

    Attachment: rdf.rdfjson-arrays.sort_based_serializer_and_UTF-8.patch

To solve this I created an alternative implementation that

* copies the triples to an array
* uses Arrays.sort with a comparator based on the subject to sort the triples
* iterates over the triples until the subject changes, while storing predicate/object values in an intermediate map
* directly writes the JSON data for each subject to a buffered writer. It does NOT create the JSON objects for all subjects of the serialized TripleCollection
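
The steps above can be sketched roughly as follows. This is not the code from the attached patch: SimpleTriple, SortedSerializerSketch and the flat output shape (subject -> predicate -> list of object strings) are hypothetical simplifications, and real "application/json+rdf" output describes each object with type/value information. The sketch only shows the sort, group and stream-out pattern together with an explicit UTF-8 writer.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SortedSerializerSketch {

    /** Hypothetical stand-in for a triple; all parts are plain strings here. */
    static final class SimpleTriple {
        final String subject, predicate, object;
        SimpleTriple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    static void serialize(Collection<SimpleTriple> triples, OutputStream out) throws IOException {
        // 1. copy the triples to an array and 2. sort it by subject
        SimpleTriple[] sorted = triples.toArray(new SimpleTriple[0]);
        Arrays.sort(sorted, Comparator.comparing((SimpleTriple t) -> t.subject));

        // explicit UTF-8 instead of the platform default encoding
        Writer w = new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8));
        w.write("{");
        Map<String, List<String>> values = new LinkedHashMap<>();
        String current = null;
        boolean first = true;
        for (SimpleTriple t : sorted) {
            // 3. when the subject changes, 4. write the finished subject and reset the map
            if (current != null && !current.equals(t.subject)) {
                writeSubject(w, current, values, first);
                first = false;
                values.clear();
            }
            current = t.subject;
            values.computeIfAbsent(t.predicate, k -> new ArrayList<>()).add(t.object);
        }
        if (current != null) {
            writeSubject(w, current, values, first); // last subject
        }
        w.write("}");
        w.flush();
    }

    private static void writeSubject(Writer w, String subject,
            Map<String, List<String>> values, boolean first) throws IOException {
        if (!first) w.write(",");
        w.write(quote(subject) + ":{");
        boolean firstPredicate = true;
        for (Map.Entry<String, List<String>> e : values.entrySet()) {
            if (!firstPredicate) w.write(",");
            firstPredicate = false;
            w.write(quote(e.getKey()) + ":[");
            for (int i = 0; i < e.getValue().size(); i++) {
                if (i > 0) w.write(",");
                w.write(quote(e.getValue().get(i)));
            }
            w.write("]");
        }
        w.write("}");
    }

    /** Minimal escaping (backslash and double quote only); not a full JSON escaper. */
    private static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Convenience wrapper for experiments: serializes into a String. */
    static String toJsonString(Collection<SimpleTriple> triples) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            serialize(triples, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Only the predicate/object values of one subject are held in memory at a time, so the intermediate map stays small no matter how many subjects the collection contains. The cost is dominated by the single sort.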

This implementation serializes a graph with 100k triples in about 1 second on my machine.
The source also includes a lot of comments about different approaches. I kept such comments mainly to document the different approaches I tried during testing and optimizing.

I also implemented a method (RdfJsonSerializerProviderTest#testBigGraph()) that creates an RDF graph (a mix of URIs, bNodes, TypedLiterals and PlainLiterals) that can be used for testing. Currently the generated graph is serialized 10 times to get rid of JIT compilation side effects.
The @Test annotation of this test is currently disabled because the test is intended to measure the performance implications of different implementations rather than to validate the generated json+rdf.
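
The idea of repeating the serialization to reduce JIT compilation effects can be sketched as below. WarmupTimingSketch and timeLastRunNanos are hypothetical names, not part of the patch or of RdfJsonSerializerProviderTest. The helper runs a task several times and reports only the duration of the final run.

```java
class WarmupTimingSketch {

    /**
     * Runs the task 'runs' times and returns the duration of the last run
     * in nanoseconds. Earlier runs serve only as warm-up for the JIT compiler.
     */
    static long timeLastRunNanos(Runnable task, int runs) {
        if (runs < 1) {
            throw new IllegalArgumentException("runs must be >= 1");
        }
        long elapsed = 0;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            elapsed = System.nanoTime() - start; // overwritten until the last run
        }
        return elapsed;
    }
}
```

A call such as timeLastRunNanos(() -> serializeOnce(), 10), where serializeOnce() is a placeholder for whatever serialization is being measured, mirrors the 10 repetitions described above. For reliable numbers a dedicated benchmarking harness would be preferable to this hand-rolled loop.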

Two final notes: 

* Sorting the triples of the parsed collection is only the second-best option. It would be even better if one could get a sorted iterator directly from a triple collection. For example, Jena TDB by default provides an iterator based on the SPO index that happens to be sorted by subject.
* The Apache Stanbol JSON-LD serializer referenced by CLEREZZA-642 also suffers from problems similar to those of the current JSON+RDF serializer.

                
> Weak Performance of "application/json+rdf" serializer on big TripleCollections and Serializer/Parser using Platform encoding instead of UTF-8
> --------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CLEREZZA-643
>                 URL: https://issues.apache.org/jira/browse/CLEREZZA-643
>             Project: Clerezza
>          Issue Type: Improvement
>            Reporter: Rupert Westenthaler
>         Attachments: rdf.rdfjson-arrays.sort_based_serializer_and_UTF-8.patch
>
>
> Both the "application/json+rdf" serializer and parser use platform-specific encodings instead of UTF-8.
> In addition, the serializer suffers from very poor performance on big graphs (at least when using SimpleMGraph).
> After some digging in the code I came to the conclusion that this is caused by the use of multiple TripleCollection.filter(..) calls: first to filter all predicates for a subject, and then to filter all objects for each subject/predicate combination. An attempt to serialize a graph with 50k triples resulted in several minutes of 100% CPU usage.
> With the next comment I will provide a patch with an implementation based on a sorted array of the triples. With this method one can serialize graphs with 100k triples in about 1 second. This patch also changes the encoding to UTF-8.
