
Posted to users@jena.apache.org by Scarlet Remilia <kn...@outlook.com> on 2019/06/05 16:59:05 UTC

Jena Model is serializable in Java?

Hello everyone,
I have a problem with Jena and Spark.
I use Jena Model to handle some RDF models in my Spark executors, but I get an error:

java.io.NotSerializableException: org.apache.jena.rdf.model.impl.ModelCom

Serialization stack:
        - object not serializable (class: org.apache.jena.rdf.model.impl.ModelCom)
        - field (class: org.nari.r2rml.entities.Template, name: model, type: interface org.apache.jena.rdf.model.Model)
        - object (class org.nari.r2rml.entities.Template, org.nari.r2rml.entities.Template@23dc70c1)
        - field (class: org.nari.r2rml.entities.PredicateObjectMap, name: objectTemplate, type: class org.nari.r2rml.entities.Template)
        - object (class org.nari.r2rml.entities.PredicateObjectMap, org.nari.r2rml.entities.PredicateObjectMap@2de96eba)
        - writeObject data (class: java.util.ArrayList)
        - object (class java.util.ArrayList, [org.nari.r2rml.entities.PredicateObjectMap@2de96eba])
        - field (class: org.nari.r2rml.entities.LogicalTableMapping, name: predicateObjectMaps, type: class java.util.ArrayList)
        - object (class org.nari.r2rml.entities.LogicalTableMapping, org.nari.r2rml.entities.LogicalTableMapping@8e00c02)
        - field (class: org.nari.r2rml.beans.Impl.EachPartitonFunction, name: logicalTableMapping, type: class org.nari.r2rml.entities.LogicalTableMapping)
        - object (class org.nari.r2rml.beans.Impl.EachPartitonFunction, org.nari.r2rml.beans.Impl.EachPartitonFunction@1e14b269)
        - field (class: org.apache.spark.sql.Dataset$$anonfun$foreachPartition$2, name: func$4, type: interface org.apache.spark.api.java.function.ForeachPartitionFunction)
        - object (class org.apache.spark.sql.Dataset$$anonfun$foreachPartition$2, <function1>)
        at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
        at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
        at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
        at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:400)
        ... 33 more

All of these classes implement the Serializable interface.
So how can I serialize a Jena Model object in Java?

Thanks very much!


Jason
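
A minimal sketch of one common workaround for this kind of error, using only the JDK: keep the non-serializable object in a transient field, ship a serializable payload instead, and rebuild the object on the receiving side. FakeModel, BrokenTemplate and FixedTemplate are hypothetical stand-ins, not classes from Jena, Spark or the code in the stack trace. With Jena the payload might be the model's triples written out as text, but that is an assumption and is not shown here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class TransientModelSketch {

    /** Hypothetical stand-in for a non-serializable container such as a model. */
    static class FakeModel {
        final List<String> triples = new ArrayList<>();
    }

    /** Serializable class with a non-serializable field: fails like the trace above. */
    static class BrokenTemplate implements Serializable {
        private static final long serialVersionUID = 1L;
        final FakeModel model = new FakeModel();
    }

    /** Ships plain strings and rebuilds the container lazily after deserialization. */
    static class FixedTemplate implements Serializable {
        private static final long serialVersionUID = 1L;
        private final ArrayList<String> lines;   // serializable payload
        private transient FakeModel model;       // skipped by Java serialization

        FixedTemplate(List<String> lines) {
            this.lines = new ArrayList<>(lines);
        }

        FakeModel model() {
            if (model == null) {                 // null again on the receiving side
                model = new FakeModel();
                model.triples.addAll(lines);
            }
            return model;
        }
    }

    /** Serializes and deserializes an object in memory with plain Java serialization. */
    static Object roundTrip(Object o) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(o);
            }
            try (ObjectInputStream in =
                         new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns true if Java serialization rejects the object as not serializable. */
    static boolean failsToSerialize(Object o) {
        try {
            roundTrip(o);
            return false;
        } catch (IllegalStateException e) {
            return e.getCause() instanceof NotSerializableException;
        }
    }

    public static void main(String[] args) {
        System.out.println("broken fails: " + failsToSerialize(new BrokenTemplate()));
        FixedTemplate copy =
                (FixedTemplate) roundTrip(new FixedTemplate(Arrays.asList("<s> <p> <o> .")));
        System.out.println("rebuilt triples: " + copy.model().triples.size());
    }
}
```

The design choice is that only the strings cross the wire; the container is rebuilt on first use at the destination, so its cost is paid where it is needed.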


Re: Jena Model is serializable in Java?

Posted by Andy Seaborne <an...@apache.org>.


On 06/06/2019 02:12, Dan Davis wrote:
> Jason,
> 
> I would argue that you should exchange a Set of triples, so you can take
> advantage of Spark's distributed nature.  Your logic can materialize that
> list into a Graph or Model when needed to operate on it.   Andy is right
> about being careful about the size - you may want to build a specialized
> set that throws if the set is too large, and you may want to experiment
> with it.
> 
> Andy,
> 
> Does Jena Riot (or contrib) provide a binary syntax for RDF that is optimal
> for fast parse? 

https://jena.apache.org/documentation/io/rdf-binary.html

It's about 2x faster than N-Triples to parse, and takes about the same time to 
write.

     Andy

Re: Jena Model is serializable in Java?

Posted by Siddhesh Rane <ki...@gmail.com>.

On Wed, Jul 3, 2019 at 12:27 PM Lorenz B.
<bu...@informatik.uni-leipzig.de> wrote:
>
> This code can only work if the Turtle data isn't distributed across
> partitions in SPARK as Turtle isn't a splittable format like N-Triples
> would be. I'm wondering if you did consider this in your application?

Yes, this aspect was considered.
To begin with, every partition contained a list of article URIs. Each
partition was then mapped to a Model by fetching all triples whose
subject was one of those article URIs.
This guaranteed that each Model lived on a single machine.
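
A rough sketch of that selection step, using only the JDK and N-Triples-style lines instead of Jena or Spark. The class name, method names and example URIs are hypothetical, and taking the first whitespace-delimited token as the subject is an assumption that only holds for line-based formats such as N-Triples.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class PartitionBySubjectSketch {

    /** Returns the subject term of an N-Triples line (its first whitespace-delimited token). */
    static String subjectOf(String ntLine) {
        return ntLine.trim().split("\\s+", 2)[0];
    }

    /** Collects every triple whose subject is one of the partition's article URIs. */
    static List<String> triplesFor(Set<String> articleUris, List<String> ntLines) {
        List<String> selected = new ArrayList<>();
        for (String line : ntLines) {
            if (articleUris.contains(subjectOf(line))) {
                selected.add(line);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> all = Arrays.asList(
                "<http://example.org/a1> <http://example.org/title> \"First\" .",
                "<http://example.org/a2> <http://example.org/title> \"Second\" .",
                "<http://example.org/a1> <http://example.org/tag> <http://example.org/rdf> .");
        Set<String> partition = new HashSet<>(Arrays.asList("<http://example.org/a1>"));
        // Both a1 triples are selected together, so a model can be built from them locally.
        triplesFor(partition, all).forEach(System.out::println);
    }
}
```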

> > Serializing Model worked for me using Spark Dataset API. Model was not
> > associated with any database.
> > Models were constructed from turtle string and then transformed to
> > another serializable form.
> >
> > ...
> >     private final Encoder<Model> MODEL_ENCODER = Encoders.bean(Model.class);
> > ...
> >             .mapPartitions((Iterator<String> it) -> {...}, Encoders.STRING())
> >             .map(ttl -> {
> >                 Model m =
> > ModelFactory.createDefaultModel().read(IOUtils.toInputStream(ttl,
> > "UTF-8"), null, "TTL");
> >                 m = getProperties(m);
> >                 m = populateLabels(m);
> >                 return m;
> >             }, MODEL_ENCODER)
> >             .flatMap(this::extractRelationalSentences, RELATION_ENCODER)
> > ...
> >
> > Full code here [1]
> >
> > Siddhesh
> >
> > [1] https://github.com/SiddheshRane/sparkl/blob/0bd5b267ffaffdc2dc2d7e59a5b07f09706be8d2/src/main/java/siddhesh/sparkl/TagArticlesFromFile.java#L149
> >
> > On Sat, Jun 8, 2019 at 5:05 PM Andy Seaborne <an...@apache.org> wrote:
> >> Hi Jason
> >>
> >> On 08/06/2019 11:49, Scarlet Remilia wrote:
> >>> Hello everyone,
> >>>
> >>>
> >>>
> >>> I changed model to triples in RDD/Dataset, but there is a question.
> >>>
> >>> I have triples in Dataset of Spark now, and I need to put them into a Model or something else ,then output them into a file or TDB or somewhere else.
> >>>
> >>> As Dan mentioned before, is there any binary syntax for RDF?
> >> https://jena.apache.org/documentation/io/rdf-binary.html
> >> org.apache.jena.riot.thrift.*
> >>
> >>> Or Is Jena supported distributed model to handling billions triples?(supporting parsing triples into a RDF file is OK).TDB’s MRSW is a quite problem for me.
> >> (It's MR+SW - multiple reader AND single writer)
> >>
> >> Are you wanting to load smallish units of triples from multiple sources?
> >>
> >> Maybe you want to have all the streams send their output to a queue (in
> >> blocks, not triple by triple) and have TDB load from that queue.
> >> Multiple StreamRDF to a single StreamRDF, load the StreamTDB.
> >>
> >> There is the TDB2 parallel loader - that is, loading from a single
> >> source using internal parallelism, not loading from parallel inputs.
> >> (It's 5 threads for triples, more for quads). It loads from a StreamRDF.
> >>
> >> NB - it can consume all the server's I/O bandwidth and a lot of CPU to
> >> make the machine unusable for anything else. It is quite hardware dependent.
> >>
> >>      Andy
> >>
> >>>
> >>>
> >>> Thank you very much!
> >>>
> >>> Jason
> >>>
> >>>
> >>>
> >>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10
> >>>
> >>>
> >>>
> >>> ________________________________
> >>> From: Andy Seaborne <an...@apache.org>
> >>> Sent: Thursday, June 6, 2019 6:35:41 PM
> >>> To: users@jena.apache.org
> >>> Subject: Re: Jena Model is serializable in Java?
> >>>
> >>>
> >>>
> >>> On 06/06/2019 08:57, Scarlet Remilia wrote:
> >>>> Hello everyone,
> >>>>
> >>>> My use case is a r2rml implementation, which could support millions or billions rows from RDBMS and distributed parse them into RDF.
> >>>> For now, We try to setup some small models in different spark executors to parse individually, and finally union them all.
> >>> That sounds more like a stream usage.
> >>>
> >>> Use Jena's StreamRDF and collect to a set (a model or graph doesn't sound
> >>> like it does anything for your application; it sounds like you are just
> >>> using them as containers of triples to move around).
> >>>
> >>>> I think RDD[Triple] is a good idea, but I need to review exist code to change model into triples.
> >>>>
> >>>> an RDF syntax and write-then-read the RDF is also a resolution but is too loose. It’s very hard to manage these files, especially there are too many small models mentioned above.
> >>>>
> >>>> Thanks,
> >>>> Jason
> >>>>
> >>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10
> >>>>
> >>>> From: Lorenz B.<ma...@informatik.uni-leipzig.de>
> >>>> Sent: Thursday, June 6, 2019 15:32
> >>>> To: users@jena.apache.org<ma...@jena.apache.org>
> >>>> Subject: Re: Jena Model is serializable in Java?
> >>>>
> >>>> I don't see why one would want to share Model instances via Spark. I
> >>>> mean, it's possible via wrapping it inside an object which is
> >>>> serializable or some other wrapper method:
> >>>>
> >>>> object ModelWrapper extends Serializable {
> >>>> lazy val model = ...
> >>>> }
> >>>>
> >>>> rdd.map(s => ModelWrapper.model. ... )
> >>>>
> >>>>
> >>>> This makes the model being attached to some static code that can't be
> >>>> changed during runtime and that's what Spark needs.
> >>>>
> >>>> Ideally, you'd use some broadcast variable, but indeed those are just
> >>>> use to share smaller entities among the different Spark workers. For
> >>>> smaller models like a schema this would work and is supposed to be more
> >>>> efficient than having joins etc. (yes, there are also broadcast joins in
> >>>> Spark, still data would be distributed during processing) - but it
> >>>> depends ...
> >>>>
> >>>> I don't know your use-case nor why you need a Model, but what we did
> >>>> when using Jena on Spark was to use RDD (or Dataset) of Triple objects,
> >>>> i.e. RDD[Triple]. RDD is the fundamental shared datastructure of Spark
> >>>> and this is the only way to scale when using very large datasets.
> >>>> Parsing RDF triples from e.g. N-Triples directly into RDD[Triple] is
> >>>> pretty easy. For Dataset you have to define a custom encoder (Kryo
> >>>> encoder works though).
> >>>>
> >>>> But as already mentioned, your use-case or application would be needed
> >>>> to give further advice if necessary.
> >>>>
> >>>>> Jason,
> >>>>>
> >>>>> I would argue that you should exchange a Set of triples, so you can take
> >>>>> advantage of Spark's distributed nature.  Your logic can materialize that
> >>>>> list into a Graph or Model when needed to operate on it.   Andy is right
> >>>>> about being careful about the size - you may want to build a specialized
> >>>>> set that throws if the set is too large, and you may want to experiment
> >>>>> with it.
> >>>>>
> >>>>> Andy,
> >>>>>
> >>>>> Does Jena Riot (or contrib) provide a binary syntax for RDF that is optimal
> >>>>> for fast parse?  I'm recalling Michael Stonebraker's response to the
> >>>>> BigTable paper -
> >>>>> https://pdfs.semanticscholar.org/08d1/2e771d811bcd0d4bc81fa3993563efbaeadb.pdf,
> >>>>> and also gSOAP and other binary XML formats.  To this paper, the Google
> >>>>> BigTable authors then responded that they don't use loose serializations
> >>>>> such as provided by HDFS, but instead use structured data.
> >>>>>
> >>>>> This is hugely important to Jason's question because this is one of the
> >>>>> benefits of using Spark instead of HDFS - Spark will handle distributing a
> >>>>> huge dataset to multiple systems so that algorithm authors can operate on a
> >>>>> vector (of Jena models?) far too large to fit in one machine.
> >>>>>
> >>>>> On Wed, Jun 5, 2019 at 4:40 PM Andy Seaborne <an...@apache.org> wrote:
> >>>>>
> >>>>>> Hi Jason,
> >>>>>>
> >>>>>> Models aren't serializable, nor are Graphs (the more system-oriented
> >>>>>> view of RDF), though Triples, Quads and Nodes are serializable.  You can
> >>>>>> send a list of triples.
> >>>>>>
> >>>>>> Or use an RDF syntax and write-then-read the RDF.
> >>>>>>
> >>>>>> But are the models small? RDF graph aren't always small so moving them
> >>>>>> around may be expensive.
> >>>>>>
> >>>>>>        Andy
> >>>>>>
> >>>>>> On 05/06/2019 17:59, Scarlet Remilia wrote:
> >>>>>>> Hello everyone,
> >>>>>>> I get a problem about Jena and Spark.
> >>>>>>> I use Jena Model to handle some RDF models in my spark executor, but I
> >>>>>> get a error:
> >>>>>>> java.io.NotSerializableException:
> >>>>>> org.apache.jena.rdf.model.impl.ModelCom
> >>>>>>> Serialization stack:
> >>>>>>>            - object not serializable (class:
> >>>>>> org.apache.jena.rdf.model.impl.ModelCom)
> >>>>>>>            - field (class: org.nari.r2rml.entities.Template, name: model,
> >>>>>> type: interface org.apache.jena.rdf.model.Model)
> >>>>>>>            - object (class org.nari.r2rml.entities.Template,
> >>>>>> org.nari.r2rml.entities.Template@23dc70c1)
> >>>>>>>            - field (class: org.nari.r2rml.entities.PredicateObjectMap,
> >>>>>> name: objectTemplate, type: class org.nari.r2rml.entities.Template)
> >>>>>>>            - object (class org.nari.r2rml.entities.PredicateObjectMap,
> >>>>>> org.nari.r2rml.entities.PredicateObjectMap@2de96eba)
> >>>>>>>            - writeObject data (class: java.util.ArrayList)
> >>>>>>>            - object (class java.util.ArrayList,
> >>>>>> [org.nari.r2rml.entities.PredicateObjectMap@2de96eba])
> >>>>>>>            - field (class: org.nari.r2rml.entities.LogicalTableMapping,
> >>>>>> name: predicateObjectMaps, type: class java.util.ArrayList)
> >>>>>>>            - object (class org.nari.r2rml.entities.LogicalTableMapping,
> >>>>>> org.nari.r2rml.entities.LogicalTableMapping@8e00c02)
> >>>>>>>            - field (class: org.nari.r2rml.beans.Impl.EachPartitonFunction,
> >>>>>> name: logicalTableMapping, type: class
> >>>>>> org.nari.r2rml.entities.LogicalTableMapping)
> >>>>>>>            - object (class org.nari.r2rml.beans.Impl.EachPartitonFunction,
> >>>>>> org.nari.r2rml.beans.Impl.EachPartitonFunction@1e14b269)
> >>>>>>>            - field (class:
> >>>>>> org.apache.spark.sql.Dataset$$anonfun$foreachPartition$2, name: func$4,
> >>>>>> type: interface org.apache.spark.api.java.function.ForeachPartitionFunction)
> >>>>>>>            - object (class
> >>>>>> org.apache.spark.sql.Dataset$$anonfun$foreachPartition$2, <function1>)
> >>>>>>>            at
> >>>>>> org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
> >>>>>>>            at
> >>>>>> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
> >>>>>>>            at
> >>>>>> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
> >>>>>>>            at
> >>>>>> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:400)
> >>>>>>>            ... 33 more
> >>>>>>>
> >>>>>>> All these classes implement serializable interface.
> >>>>>>> So how could I serialize Jena model java object?
> >>>>>>>
> >>>>>>> Thanks very much!
> >>>>>>>
> >>>>>>>
> >>>>>>> Jason
> >>>>>>>
> >>>>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
> >>>>>> Windows 10
> >>>> --
> >>>> Lorenz Bühmann
> >>>> AKSW group, University of Leipzig
> >>>> Group: http://aksw.org - semantic web research center
> >>>>
> >>>>
> >
> >
> > --
> > Your greatest regret is the email ID you choose in 8th grade
> >
> >
> --
> Lorenz Bühmann
> AKSW group, University of Leipzig
> Group: http://aksw.org - semantic web research center
>


-- 
Your greatest regret is the email ID you choose in 8th grade

Re: Jena Model is serializable in Java?

Posted by "Lorenz B." <bu...@informatik.uni-leipzig.de>.

This code can only work if the Turtle data isn't distributed across
partitions in Spark, because Turtle isn't a splittable format the way
N-Triples is. Did you consider this in your application?
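
A rough illustration of the difference, using only the JDK. This is a line-shape heuristic and not a parser: it only checks whether every line looks like a complete statement that starts with an IRI or blank node and ends with a dot. The class name, method names and sample data are hypothetical. Turtle written in N-Triples style would also pass, and a line that uses a prefixed name after an IRI subject would slip through, so treat it as a sketch of the idea only.

```java
import java.util.Arrays;

final class SplittableSketch {

    static final String N_TRIPLES =
            "<http://example.org/a> <http://example.org/p> \"x\" .\n"
          + "<http://example.org/a> <http://example.org/q> \"y\" .\n";

    static final String TURTLE =
            "@prefix ex: <http://example.org/> .\n"
          + "ex:a ex:p \"x\" ;\n"
          + "     ex:q \"y\" .\n";

    /** True if the line looks like one complete N-Triples statement (rough heuristic). */
    static boolean isStandaloneStatement(String line) {
        String t = line.trim();
        boolean startsWithSubject = t.startsWith("<") || t.startsWith("_:");
        return startsWithSubject && t.endsWith(".");
    }

    /** True if every non-blank, non-comment line could be handled without its neighbours. */
    static boolean canSplitAtAnyLine(String document) {
        return Arrays.stream(document.split("\\R"))
                .map(String::trim)
                .filter(l -> !l.isEmpty() && !l.startsWith("#"))
                .allMatch(SplittableSketch::isStandaloneStatement);
    }

    public static void main(String[] args) {
        // The Turtle sample depends on its @prefix line and on the ';' continuation,
        // so a chunk cut between lines cannot be interpreted on its own.
        System.out.println("N-Triples splittable: " + canSplitAtAnyLine(N_TRIPLES));
        System.out.println("Turtle splittable:    " + canSplitAtAnyLine(TURTLE));
    }
}
```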

> Serializing Model worked for me using Spark Dataset API. Model was not
> associated with any database.
> Models were constructed from turtle string and then transformed to
> another serializable form.
>
> ...
>     private final Encoder<Model> MODEL_ENCODER = Encoders.bean(Model.class);
> ...
>             .mapPartitions((Iterator<String> it) -> {...}, Encoders.STRING())
>             .map(ttl -> {
>                 Model m =
> ModelFactory.createDefaultModel().read(IOUtils.toInputStream(ttl,
> "UTF-8"), null, "TTL");
>                 m = getProperties(m);
>                 m = populateLabels(m);
>                 return m;
>             }, MODEL_ENCODER)
>             .flatMap(this::extractRelationalSentences, RELATION_ENCODER)
> ...
>
> Full code here [1]
>
> Siddhesh
>
> [1] https://github.com/SiddheshRane/sparkl/blob/0bd5b267ffaffdc2dc2d7e59a5b07f09706be8d2/src/main/java/siddhesh/sparkl/TagArticlesFromFile.java#L149
>
-- 
Lorenz Bühmann
AKSW group, University of Leipzig
Group: http://aksw.org - semantic web research center

Re: Jena Model is serializable in Java?

Posted by Siddhesh Rane <ki...@gmail.com>.

Serializing Model worked for me using Spark Dataset API. Model was not
associated with any database.
Models were constructed from Turtle strings and then transformed into
another, serializable form.

...
    private final Encoder<Model> MODEL_ENCODER = Encoders.bean(Model.class);
...
            .mapPartitions((Iterator<String> it) -> {...}, Encoders.STRING())
            .map(ttl -> {
                Model m = ModelFactory.createDefaultModel()
                        .read(IOUtils.toInputStream(ttl, "UTF-8"), null, "TTL");
                m = getProperties(m);
                m = populateLabels(m);
                return m;
            }, MODEL_ENCODER)
            .flatMap(this::extractRelationalSentences, RELATION_ENCODER)
...

Full code here [1]

Siddhesh

[1] https://github.com/SiddheshRane/sparkl/blob/0bd5b267ffaffdc2dc2d7e59a5b07f09706be8d2/src/main/java/siddhesh/sparkl/TagArticlesFromFile.java#L149

On Sat, Jun 8, 2019 at 5:05 PM Andy Seaborne <an...@apache.org> wrote:
>
> Hi Jason
>
> On 08/06/2019 11:49, Scarlet Remilia wrote:
> > Hello everyone,
> >
> >
> >
> > I changed model to triples in RDD/Dataset, but there is a question.
> >
> > I have triples in Dataset of Spark now, and I need to put them into a Model or something else ,then output them into a file or TDB or somewhere else.
> >
> > As Dan mentioned before, is there any binary syntax for RDF?
>
> https://jena.apache.org/documentation/io/rdf-binary.html
> org.apache.jena.riot.thrift.*
>
> > Or Is Jena supported distributed model to handling billions triples?(supporting parsing triples into a RDF file is OK).TDB’s MRSW is a quite problem for me.
>
> (It's MR+SW - multiple reader AND single writer)
>
> Are you wanting to load smallish units of triples from multiple sources?
>
> Maybe you want to have all the streams send their output to a queue (in
> blocks, not triple by triple) and have TDB load from that queue.
> Multiple StreamRDF to a single StreamRDF, load the StreamTDB.
>
> There is the TDB2 parallel loader - that is, loading from a single
> source using internal parallelism, not loading from parallel inputs.
> (It's 5 threads for triples, more for quads). It load from a StreamRDF.
>
> NB - it can consume all the server's I/O bandwidth and a lot of CPU to
> make the machine unusable for anything else. It is quite hardware dependent.
>
>      Andy
>
> >
> >
> >
> > Thank you very much!
> >
> > Jason
> >
> >
> >
> > Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10
> >
> >
> >
> > ________________________________
> > From: Andy Seaborne <an...@apache.org>
> > Sent: Thursday, June 6, 2019 6:35:41 PM
> > To: users@jena.apache.org
> > Subject: Re: Jena Model is serializable in Java?
> >
> >
> >
> > On 06/06/2019 08:57, Scarlet Remilia wrote:
> >> Hello everyone,
> >>
> >> My use case is a r2rml implementation, which could support millions or billions rows from RDBMS and distributed parse them into RDF.
> >> For now, We try to setup some small models in different spark executors to parse individually, and finally union them all.
> >
> > That sounds more like a stream usage.
> >
> > Jena's StreamRDF and collect to a set (model or graph don't sound like
> > they do anything for your application - sound like you are just using
> > them as container of triples to move around.
> >
> >> I think RDD[Triple] is a good idea, but I need to review exist code to change model into triples.
> >>
> >> an RDF syntax and write-then-read the RDF is also a resolution but is too loose. It’s very hard to manage these files, especially there are too many small models mentioned above.
> >>
> >> Thanks,
> >> Jason
> >>
> >> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10
> >>
> >> From: Lorenz B.<ma...@informatik.uni-leipzig.de>
> >> Sent: Thursday, June 6, 2019 15:32
> >> To: users@jena.apache.org<ma...@jena.apache.org>
> >> Subject: Re: Jena Model is serializable in Java?
> >>
> >> I don't see why one would want to share Model instances via Spark. I
> >> mean, it's possible via wrapping it inside an object which is
> >> serializable or some other wrapper method:
> >>
> >> object ModelWrapper extends Serializable {
> >> lazy val model = ...
> >> }
> >>
> >> rdd.map(s => ModelWrapper.model. ... )
> >>
> >>
> >> This makes the model being attached to some static code that can't be
> >> changed during runtime and that's what Spark needs.
> >>
> >> Ideally, you'd use some broadcast variable, but indeed those are just
> >> use to share smaller entities among the different Spark workers. For
> >> smaller models like a schema this would work and is supposed to be more
> >> efficient than having joins etc. (yes, there are also broadcast joins in
> >> Spark, still data would be distributed during processing) - but it
> >> depends ...
> >>
> >> I don't know your use-case nor why you need a Model, but what we did
> >> when using Jena on Spark was to use RDD (or Dataset) of Triple objects,
> >> i.e. RDD[Triple]. RDD is the fundamental shared datastructure of Spark
> >> and this is the only way to scale when using very large datasets.
> >> Parsing RDF triples from e.g. N-Triples directly into RDD[Triple] is
> >> pretty easy. For Dataset you have to define a custom encoder (Kryo
> >> encoder works though).
> >>
> >> But as already mentioned, your use-case or application would be needed
> >> to give further advice if necessary.
> >>
> >>> Jason,
> >>>
> >>> I would argue that you should exchange a Set of triples, so you can take
> >>> advantage of Spark's distributed nature.  Your logic can materialize that
> >>> list into a Graph or Model when needed to operate on it.   Andy is right
> >>> about being careful about the size - you may want to build a specialized
> >>> set that throws if the set is too large, and you may want to experiment
> >>> with it.
> >>>
> >>> Andy,
> >>>
> >>> Does Jena Riot (or contrib) provide a binary syntax for RDF that is optimal
> >>> for fast parse?  I'm recalling Michael Stonebraker's response to the
> >>> BigTable paper -
> >>> https://pdfs.semanticscholar.org/08d1/2e771d811bcd0d4bc81fa3993563efbaeadb.pdf,
> >>> and also gSOAP and other binary XML formats.  To this paper, the Google
> >>> BigTable authors then responded that they don't use loose serializations
> >>> such as provided by HDFS, but instead use structured data.
> >>>
> >>> This is hugely important to Jason's question because this is one of the
> >>> benefits of using Spark instead of HDFS - Spark will handle distributing a
> >>> huge dataset to multiple systems so that algorithm authors can operate on a
> >>> vector (of Jena models?) far too large to fit in one machine.
> >>>
> >>> On Wed, Jun 5, 2019 at 4:40 PM Andy Seaborne <an...@apache.org> wrote:
> >>>
> >>>> Hi Jason,
> >>>>
> >>>> Models aren't serializable, nor are Graphs (the more system oriented
> >>>> view of RDF), though Triples, Quads and Nodes are serializable.  You can
> >>>> send a list of triples.
> >>>>
> >>>> Or use an RDF syntax and write-then-read the RDF.
> >>>>
> >>>> But are the models small? RDF graph aren't always small so moving them
> >>>> around may be expensive.
> >>>>
> >>>>        Andy
> >>>>
> >>>> On 05/06/2019 17:59, Scarlet Remilia wrote:
> >>>>> Hello everyone,
> >>>>> I get a problem about Jena and Spark.
> >>>>> I use Jena Model to handle some RDF models in my spark executor, but I
> >>>> get a error:
> >>>>> java.io.NotSerializableException:
> >>>> org.apache.jena.rdf.model.impl.ModelCom
> >>>>> Serialization stack:
> >>>>>            - object not serializable (class:
> >>>> org.apache.jena.rdf.model.impl.ModelCom)
> >>>>>            - field (class: org.nari.r2rml.entities.Template, name: model,
> >>>> type: interface org.apache.jena.rdf.model.Model)
> >>>>>            - object (class org.nari.r2rml.entities.Template,
> >>>> org.nari.r2rml.entities.Template@23dc70c1)
> >>>>>            - field (class: org.nari.r2rml.entities.PredicateObjectMap,
> >>>> name: objectTemplate, type: class org.nari.r2rml.entities.Template)
> >>>>>            - object (class org.nari.r2rml.entities.PredicateObjectMap,
> >>>> org.nari.r2rml.entities.PredicateObjectMap@2de96eba)
> >>>>>            - writeObject data (class: java.util.ArrayList)
> >>>>>            - object (class java.util.ArrayList,
> >>>> [org.nari.r2rml.entities.PredicateObjectMap@2de96eba])
> >>>>>            - field (class: org.nari.r2rml.entities.LogicalTableMapping,
> >>>> name: predicateObjectMaps, type: class java.util.ArrayList)
> >>>>>            - object (class org.nari.r2rml.entities.LogicalTableMapping,
> >>>> org.nari.r2rml.entities.LogicalTableMapping@8e00c02)
> >>>>>            - field (class: org.nari.r2rml.beans.Impl.EachPartitonFunction,
> >>>> name: logicalTableMapping, type: class
> >>>> org.nari.r2rml.entities.LogicalTableMapping)
> >>>>>            - object (class org.nari.r2rml.beans.Impl.EachPartitonFunction,
> >>>> org.nari.r2rml.beans.Impl.EachPartitonFunction@1e14b269)
> >>>>>            - field (class:
> >>>> org.apache.spark.sql.Dataset$$anonfun$foreachPartition$2, name: func$4,
> >>>> type: interface org.apache.spark.api.java.function.ForeachPartitionFunction)
> >>>>>            - object (class
> >>>> org.apache.spark.sql.Dataset$$anonfun$foreachPartition$2, <function1>)
> >>>>>            at
> >>>> org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
> >>>>>            at
> >>>> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
> >>>>>            at
> >>>> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
> >>>>>            at
> >>>> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:400)
> >>>>>            ... 33 more
> >>>>>
> >>>>> All these classes implement serializable interface.
> >>>>> So how could I serialize Jena model java object?
> >>>>>
> >>>>> Thanks very much!
> >>>>>
> >>>>>
> >>>>> Jason
> >>>>>
> >>>>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
> >>>> Windows 10
> >>>>>
> >> --
> >> Lorenz Bühmann
> >> AKSW group, University of Leipzig
> >> Group: http://aksw.org - semantic web research center
> >>
> >>
> >



--
Your greatest regret is the email ID you choose in 8th grade

Re: Jena Model is serializable in Java?

Posted by Andy Seaborne <an...@apache.org>.

Hi Jason

On 08/06/2019 11:49, Scarlet Remilia wrote:
> Hello everyone,
> 
> 
> 
> I changed the model to triples in an RDD/Dataset, but I have a question.
> 
> I have the triples in a Spark Dataset now, and I need to put them into a Model or something else, then output them to a file, TDB, or somewhere else.
> 
> As Dan mentioned before, is there any binary syntax for RDF?

https://jena.apache.org/documentation/io/rdf-binary.html
org.apache.jena.riot.thrift.*

> Or does Jena support a distributed model for handling billions of triples? (Support for writing the triples out to an RDF file would be OK.) TDB’s MRSW is quite a problem for me.

(It's MR+SW - multiple reader AND single writer)

Are you wanting to load smallish units of triples from multiple sources?

Maybe you want to have all the streams send their output to a queue (in 
blocks, not triple by triple) and have TDB load from that queue. 
Multiple StreamRDF to a single StreamRDF, load the StreamTDB.
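
A minimal sketch of the queue idea above, using only the JDK. It is not Jena code: triples are plain N-Triples-like strings, the "loader" is just a list, and the names BlockQueueLoaderSketch, run and the queue capacity of 16 are made up for illustration. In a real setup the single consumer would presumably hand each block to one TDB write transaction.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class BlockQueueLoaderSketch {
    // End-of-stream marker, sent once per producer and compared by identity.
    private static final List<String> POISON = new ArrayList<>();

    // Starts `producers` producer threads, each emitting `perProducer` triples in
    // blocks of `blockSize`, plus one consumer thread standing in for the single
    // writer. Returns everything the consumer received.
    static List<String> run(int producers, int perProducer, int blockSize) {
        // Bounded queue: producers block when the single writer falls behind.
        BlockingQueue<List<String>> queue = new LinkedBlockingQueue<>(16);
        List<String> loaded = new ArrayList<>(); // touched only by the consumer thread

        Thread consumer = new Thread(() -> {
            int finished = 0;
            try {
                while (finished < producers) {
                    List<String> block = queue.take();
                    if (block == POISON) { finished++; continue; }
                    loaded.addAll(block); // a real loader would write the whole block here
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        List<Thread> threads = new ArrayList<>();
        for (int p = 0; p < producers; p++) {
            final int id = p;
            Thread t = new Thread(() -> {
                try {
                    List<String> block = new ArrayList<>(blockSize);
                    for (int i = 0; i < perProducer; i++) {
                        block.add("<http://example/s" + id + "> <http://example/p> \"" + i + "\" .");
                        if (block.size() == blockSize) {   // send in blocks, not triple by triple
                            queue.put(block);
                            block = new ArrayList<>(blockSize);
                        }
                    }
                    if (!block.isEmpty()) queue.put(block); // flush the last partial block
                    queue.put(POISON);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads.add(t);
            t.start();
        }

        try {
            for (Thread t : threads) t.join();
            consumer.join(); // join makes the consumer's writes to `loaded` visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the loader", e);
        }
        return loaded;
    }

    public static void main(String[] args) {
        System.out.println(run(4, 1000, 100).size()); // prints 4000
    }
}
```

The bounded queue is the design point here: with many fast producers and one writer, an unbounded queue would just move the memory problem into the queue.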

There is the TDB2 parallel loader - that is, loading from a single 
source using internal parallelism, not loading from parallel inputs. 
(It's 5 threads for triples, more for quads). It loads from a StreamRDF.

NB - it can consume all the server's I/O bandwidth and a lot of CPU, enough to 
make the machine unusable for anything else. It is quite hardware dependent.

     Andy


RE: Jena Model is serializable in Java?

Posted by Scarlet Remilia <kn...@outlook.com>.

Hello everyone,



I changed the model to triples in an RDD/Dataset, but I have a question.

I have the triples in a Spark Dataset now, and I need to put them into a Model or something else, then output them to a file, TDB, or somewhere else.

As Dan mentioned before, is there any binary syntax for RDF? Or does Jena support a distributed model for handling billions of triples? (Support for writing the triples out to an RDF file would be OK.) TDB’s MRSW is quite a problem for me.



Thank you very much!

Jason




Re: Jena Model is serializable in Java?

Posted by Andy Seaborne <an...@apache.org>.


On 06/06/2019 08:57, Scarlet Remilia wrote:
> Hello everyone,
> 
> My use case is an R2RML implementation that should support millions or billions of rows from an RDBMS and convert them to RDF in a distributed way.
> For now, we try to set up some small models in different Spark executors, parse individually, and finally union them all.

That sounds more like a streaming use case.

Use Jena's StreamRDF and collect to a set (a model or graph doesn't sound like 
it does anything for your application; it sounds like you are just using 
them as containers of triples to move around).

> I think RDD[Triple] is a good idea, but I need to review the existing code to change models into triples.
> 
> Using an RDF syntax and write-then-read is also a solution, but it is too loose. It’s very hard to manage these files, especially since there are so many small models, as mentioned above.

RE: Jena Model is serializable in Java?

Posted by Scarlet Remilia <kn...@outlook.com>.

Hello everyone,

My use case is an R2RML implementation that should support millions or billions of rows from an RDBMS and convert them to RDF in a distributed way.
For now, we try to set up some small models in different Spark executors, parse individually, and finally union them all.

I think RDD[Triple] is a good idea, but I need to review the existing code to change models into triples.

Using an RDF syntax and write-then-read is also a solution, but it is too loose. It’s very hard to manage these files, especially since there are so many small models, as mentioned above.

Thanks,
Jason


From: Lorenz B.<ma...@informatik.uni-leipzig.de>
Sent: Thursday, June 6, 2019 15:32
To: users@jena.apache.org<ma...@jena.apache.org>
Subject: Re: Jena Model is serializable in Java?

I don't see why one would want to share Model instances via Spark. I
mean, it's possible via wrapping it inside an object which is
serializable or some other wrapper method:

object ModelWrapper extends Serializable {
  lazy val model = ...
}

rdd.map(s => ModelWrapper.model. ... )


This attaches the model to static code that can't change at runtime,
which is what Spark needs.
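
For Java users, roughly the same effect as the Scala object above can be sketched with a static holder. This is only an illustration under assumptions: FakeModel stands in for a Jena Model, ModelHolder and SerFunction are hypothetical names, and plain Java serialization is used instead of Spark to show the difference between capturing a model instance and reaching it through static code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Function;

// Hypothetical stand-in for a Jena Model: deliberately NOT Serializable.
class FakeModel {
    String describe(String s) { return "model saw " + s; }
}

// Static holder: the model is created lazily, once per JVM, and is never
// part of a serialized closure.
final class ModelHolder {
    private ModelHolder() {}
    private static class Lazy { static final FakeModel MODEL = new FakeModel(); }
    static FakeModel model() { return Lazy.MODEL; }
}

// A serializable function type, similar in spirit to Spark's Java function interfaces.
interface SerFunction<T, R> extends Function<T, R>, Serializable {}

class StaticHolderDemo {
    // Returns true if Java serialization accepts the object.
    static boolean canSerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // The lambda captures a non-serializable instance, like a Model field in a function object.
    static SerFunction<String, String> capturing() {
        FakeModel local = new FakeModel();
        return s -> local.describe(s);
    }

    // The lambda captures nothing; the model is looked up through static code when it runs.
    static SerFunction<String, String> viaHolder() {
        return s -> ModelHolder.model().describe(s);
    }

    public static void main(String[] args) {
        System.out.println(canSerialize(capturing())); // prints false
        System.out.println(canSerialize(viaHolder())); // prints true
    }
}
```

Note that with this pattern every JVM builds its own copy of the model, so it only fits data that each executor can construct or load by itself.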

Ideally, you'd use some broadcast variable, but indeed those are just
used to share smaller entities among the different Spark workers. For
smaller models like a schema this would work and is supposed to be more
efficient than having joins etc. (yes, there are also broadcast joins in
Spark, still data would be distributed during processing) - but it
depends ...

I don't know your use-case nor why you need a Model, but what we did
when using Jena on Spark was to use RDD (or Dataset) of Triple objects,
i.e. RDD[Triple]. RDD is the fundamental shared data structure of Spark
and this is the only way to scale when using very large datasets.
Parsing RDF triples from e.g. N-Triples directly into RDD[Triple] is
pretty easy. For Dataset you have to define a custom encoder (Kryo
encoder works though).

But as already mentioned, your use-case or application would be needed
to give further advice if necessary.

> Jason,
>
> I would argue that you should exchange a Set of triples, so you can take
> advantage of Spark's distributed nature.  Your logic can materialize that
> list into a Graph or Model when needed to operate on it.   Andy is right
> about being careful about the size - you may want to build a specialized
> set that throws if the set is too large, and you may want to experiment
> with it.
>
> Andy,
>
> Does Jena Riot (or contrib) provide a binary syntax for RDF that is optimal
> for fast parse?  I'm recalling Michael Stonebraker's response to the
> BigTable paper -
> https://pdfs.semanticscholar.org/08d1/2e771d811bcd0d4bc81fa3993563efbaeadb.pdf,
> and also gSOAP and other binary XML formats.  To this paper, the Google
> BigTable authors then responded that they don't use loose serializations
> such as provided by HDFS, but instead use structured data.
>
> This is hugely important to Jason's question because this is one of the
> benefits of using Spark instead of HDFS - Spark will handle distributing a
> huge dataset to multiple systems so that algorithm authors can operate on a
> vector (of Jena models?) far too large to fit in one machine.
>
> On Wed, Jun 5, 2019 at 4:40 PM Andy Seaborne <an...@apache.org> wrote:
>
>> Hi Jason,
>>
>> Models aren't serializable, nor are Graphs (the more system oriented
>> view of RDF), though Triples, Quads and Nodes are serializable.  You can
>> send a list of triples.
>>
>> Or use an RDF syntax and write-then-read the RDF.
>>
>> But are the models small? RDF graphs aren't always small so moving them
>> around may be expensive.
>>
>>      Andy
--
Lorenz Bühmann
AKSW group, University of Leipzig
Group: http://aksw.org - semantic web research center

Re: Jena Model is serializable in Java?

Posted by "Lorenz B." <bu...@informatik.uni-leipzig.de>.

I don't see why one would want to share Model instances via Spark. I
mean, it's possible by wrapping the model inside a serializable object,
or by some other wrapper method:

object ModelWrapper extends Serializable {
  // lazy: built on first access in each JVM, not shipped with the closure
  lazy val model = ...
}

rdd.map(s => ModelWrapper.model. ... )


This attaches the model to static code that isn't changed at runtime,
which is what Spark needs: the closure refers to the object by name
instead of capturing the model, so the model itself is never serialized.
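
A JDK-only Java sketch of why the static route helps. FakeModel,
ModelHolder and SerFunction are hypothetical stand-ins invented for
this illustration (no Jena or Spark on the classpath); the point is
only that a serializable lambda which reaches the model through static
code captures nothing, while one that captures the model drags it into
serialization:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Function;

class StaticHolderSketch {
    /** Hypothetical stand-in for a Jena Model: deliberately NOT Serializable. */
    static final class FakeModel {
        String label(String s) { return "model:" + s; }
    }

    /** Holder idiom: the JVM creates MODEL once, when ModelHolder is first used. */
    static final class ModelHolder {
        private static final FakeModel MODEL = new FakeModel();
        static FakeModel model() { return MODEL; }
    }

    /** A function type that Java serialization accepts (assumed name). */
    interface SerFunction<T, R> extends Function<T, R>, Serializable {}

    /** Reaches the model through static code, so the lambda captures nothing. */
    static SerFunction<String, String> viaStaticHolder() {
        return s -> ModelHolder.model().label(s);
    }

    /** Captures a local model, so the model becomes part of the closure. */
    static SerFunction<String, String> viaCapturedModel() {
        FakeModel local = new FakeModel();
        return s -> local.label(s);
    }

    /** Returns true if Java serialization can write the object. */
    static boolean canSerialize(Object o) {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (IOException e) { // NotSerializableException is an IOException
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("static holder: " + canSerialize(viaStaticHolder()));
        System.out.println("captured model: " + canSerialize(viaCapturedModel()));
    }
}
```

The trade-off is that every JVM builds its own copy, so whatever goes
into the holder has to be reconstructible locally on each executor.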

Ideally, you'd use a broadcast variable, but those are only meant for
sharing smaller entities among the different Spark workers. For a
smaller model such as a schema this would work, and it is supposed to be
more efficient than joins etc. (yes, Spark also has broadcast joins, but
the data would still be distributed during processing) - but it
depends ...

I don't know your use-case or why you need a Model, but what we did
when using Jena on Spark was to use an RDD (or Dataset) of Triple
objects, i.e. RDD[Triple]. RDD is the fundamental shared data structure
of Spark, and this is the only way to scale to very large datasets.
Parsing RDF triples from e.g. N-Triples directly into an RDD[Triple] is
pretty easy. For Dataset you have to define a custom encoder (the Kryo
encoder works, though).
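
To illustrate why N-Triples suits this, here is a simplified JDK-only
Java sketch. N-Triples puts one triple on each line, so every line can
be parsed on its own, which is the shape of work a per-element map
distributes well. SimpleTriple and the regular expression are
assumptions made for this sketch and cover only a subset of the syntax
(for example, no comment after the final dot); a real job would hand
the lines to Jena's own parser and produce Jena Triple objects.

```java
import java.io.Serializable;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class NTriplesLinesSketch {
    /** Hypothetical stand-in for a Jena Triple; terms stay as raw text. */
    static final class SimpleTriple implements Serializable {
        private static final long serialVersionUID = 1L;
        final String subject;
        final String predicate;
        final String object;

        SimpleTriple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    // subject: IRI or blank node; predicate: IRI; object: anything up to
    // the final dot. A deliberately loose approximation of the grammar.
    private static final Pattern LINE = Pattern.compile(
        "(<[^>]*>|_:\\S+)\\s+(<[^>]*>)\\s+(.+?)\\s*\\.\\s*");

    /** Parses one line; blank lines and comment lines yield an empty result. */
    static Optional<SimpleTriple> parseLine(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty() || trimmed.startsWith("#")) {
            return Optional.empty();
        }
        Matcher m = LINE.matcher(trimmed);
        if (!m.matches()) {
            throw new IllegalArgumentException("not a triple: " + line);
        }
        return Optional.of(new SimpleTriple(m.group(1), m.group(2), m.group(3)));
    }

    /** Each line is independent, so this map step could run per partition. */
    static List<SimpleTriple> parseAll(List<String> lines) {
        return lines.stream()
            .map(NTriplesLinesSketch::parseLine)
            .filter(Optional::isPresent)
            .map(Optional::get)
            .collect(Collectors.toList());
    }
}
```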

But as already mentioned, we would need to know your use-case or
application to give further advice.

> Jason,
>
> I would argue that you should exchange a Set of triples, so you can take
> advantage of Spark's distributed nature.  Your logic can materialize that
> set into a Graph or Model when needed to operate on it.  Andy is right
> about being careful about the size - you may want to build a specialized
> set that throws if the set is too large, and you may want to experiment
> with it.
>
> Andy,
>
> Does Jena Riot (or contrib) provide a binary syntax for RDF that is optimal
> for fast parse?  I'm recalling Michael Stonebraker's response to the
> BigTable paper -
> https://pdfs.semanticscholar.org/08d1/2e771d811bcd0d4bc81fa3993563efbaeadb.pdf,
> and also gSOAP and other binary XML formats.  To this paper, the Google
> BigTable authors then responded that they don't use loose serializations
> such as provided by HDFS, but instead use structured data.
>
> This is hugely important to Jason's question because this is one of the
> benefits of using Spark instead of HDFS - Spark will handle distributing a
> huge dataset to multiple systems so that algorithm authors can operate on a
> vector (of Jena models?) far too large to fit in one machine.
>
> On Wed, Jun 5, 2019 at 4:40 PM Andy Seaborne <an...@apache.org> wrote:
>
>> Hi Jason,
>>
>> Models aren't serializable, nor are Graphs (the more system-oriented
>> view of RDF), though Triples, Quads and Nodes are serializable.  You can
>> send a list of triples.
>>
>> Or use an RDF syntax and write-then-read the RDF.
>>
>> But are the models small? RDF graphs aren't always small, so moving them
>> around may be expensive.
>>
>>      Andy
-- 
Lorenz Bühmann
AKSW group, University of Leipzig
Group: http://aksw.org - semantic web research center

Re: Jena Model is serializable in Java?

Posted by Dan Davis <da...@gmail.com>.

Jason,

I would argue that you should exchange a Set of triples, so you can take
advantage of Spark's distributed nature.  Your logic can materialize that
set into a Graph or Model when needed to operate on it.  Andy is right
about being careful about the size - you may want to build a specialized
set that throws if the set is too large, and you may want to experiment
with it.
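
One way such a size-guarded set might look, as a JDK-only Java sketch.
BoundedSet and its limit are names made up for this illustration, and
String elements stand in for triples in the usage; nothing here is a
Jena or Spark API.

```java
import java.util.HashSet;

class BoundedSetSketch {
    /** A HashSet that refuses to grow beyond maxSize elements. */
    static final class BoundedSet<E> extends HashSet<E> {
        private static final long serialVersionUID = 1L;
        private final int maxSize;

        BoundedSet(int maxSize) {
            this.maxSize = maxSize;
        }

        @Override
        public boolean add(E e) {
            // Re-adding an existing element does not grow the set, so allow it.
            if (size() >= maxSize && !contains(e)) {
                throw new IllegalStateException(
                    "set would exceed " + maxSize + " elements");
            }
            return super.add(e);
        }
        // addAll is inherited from AbstractCollection and calls add(),
        // so bulk inserts hit the same check.
    }

    public static void main(String[] args) {
        BoundedSet<String> triples = new BoundedSet<>(2);
        triples.add("t1");
        triples.add("t2");
        try {
            triples.add("t3");
        } catch (IllegalStateException ex) {
            System.out.println("rejected: " + ex.getMessage());
        }
    }
}
```

Because HashSet is Serializable, the subclass is too, provided its
elements are; failing fast on insert keeps an oversized graph from
being shipped by accident.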

Andy,

Does Jena Riot (or contrib) provide a binary syntax for RDF that is optimal
for fast parse?  I'm recalling Michael Stonebraker's response to the
BigTable paper -
https://pdfs.semanticscholar.org/08d1/2e771d811bcd0d4bc81fa3993563efbaeadb.pdf,
and also gSOAP and other binary XML formats.  To this paper, the Google
BigTable authors then responded that they don't use loose serializations
such as provided by HDFS, but instead use structured data.

This is hugely important to Jason's question because this is one of the
benefits of using Spark instead of HDFS - Spark will handle distributing a
huge dataset to multiple systems so that algorithm authors can operate on a
vector (of Jena models?) far too large to fit in one machine.

On Wed, Jun 5, 2019 at 4:40 PM Andy Seaborne <an...@apache.org> wrote:

> Hi Jason,
>
> Models aren't serializable, nor are Graphs (the more system-oriented
> view of RDF), though Triples, Quads and Nodes are serializable.  You can
> send a list of triples.
>
> Or use an RDF syntax and write-then-read the RDF.
>
> But are the models small? RDF graphs aren't always small, so moving them
> around may be expensive.
>
>      Andy

Re: Jena Model is serializable in Java?

Posted by Andy Seaborne <an...@apache.org>.

Hi Jason,

Models aren't serializable, nor are Graphs (the more system-oriented
view of RDF), though Triples, Quads and Nodes are serializable.  You can
send a list of triples.

Or use an RDF syntax and write-then-read the RDF.
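
A JDK-only Java sketch of the write-then-read idea. FakeModel and
ModelCarrier are hypothetical names for this illustration, and the
newline-joined text stands in for a real RDF syntax; with Jena the
writeObject/readObject bodies would write and parse the model in a
syntax such as N-Triples instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Set;
import java.util.TreeSet;

class WriteThenReadSketch {
    /** Hypothetical stand-in for a Jena Model: NOT Serializable. */
    static final class FakeModel {
        final Set<String> statements = new TreeSet<>();
    }

    /** Serializable carrier: ships the model as text, re-parses on arrival. */
    static final class ModelCarrier implements Serializable {
        private static final long serialVersionUID = 1L;
        private transient FakeModel model; // skipped by default serialization

        ModelCarrier(FakeModel model) { this.model = model; }

        FakeModel model() { return model; }

        private void writeObject(ObjectOutputStream out) throws IOException {
            out.defaultWriteObject();
            // "Write": one statement per line, standing in for an RDF syntax.
            out.writeObject(String.join("\n", model.statements));
        }

        private void readObject(ObjectInputStream in)
                throws IOException, ClassNotFoundException {
            in.defaultReadObject();
            // "Read": rebuild a fresh model from the text.
            String text = (String) in.readObject();
            model = new FakeModel();
            for (String line : text.split("\n")) {
                if (!line.isEmpty()) {
                    model.statements.add(line);
                }
            }
        }
    }

    /** Serializes and deserializes a carrier, as a transfer between JVMs would. */
    static ModelCarrier roundTrip(ModelCarrier carrier) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(carrier);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                     new ByteArrayInputStream(bytes.toByteArray()))) {
                return (ModelCarrier) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The receiving side gets an equal but separate copy, so changes made
there do not flow back to the sender.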

But are the models small? RDF graphs aren't always small, so moving them
around may be expensive.

     Andy
