Posted to dev@jena.apache.org by Paul Houle <on...@gmail.com> on 2015/05/11 18:36:39 UTC

Jena: Spark vs. Drools

I just want to share a few of my desiderata for working with RDF data.
Some of them are contradictory in nature. These touch on the Graph/Model
split and similar things.

One of them is stream processing with tools like Spark, where the real
point is raw speed, and that comes down to getting as close to "zero
copy" as possible.

Sometimes I am looking at a stream of triples and I want to filter out
anywhere from 50% to 90% to 99.99% of them, often doing some kind of map
or reduce that works a triple at a time. The elephant in the room is
parsing time and memory consumption, so something insanely fast (like a
Hadoop Writable) and highly mutable is desirable.
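
Concretely, Jena's RIOT layer already supports this push style of
processing. Here is a minimal sketch (the file name and predicate are
just placeholders) of a filter that sees each triple exactly once and
accumulates nothing:

    import org.apache.jena.graph.Triple;
    import org.apache.jena.riot.RDFDataMgr;
    import org.apache.jena.riot.system.StreamRDFBase;

    // Count rdfs:label triples while streaming; no Graph or Model is built.
    public class StreamFilter {
        static final String RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label";

        public static void main(String[] args) {
            final long[] kept = {0};
            RDFDataMgr.parse(new StreamRDFBase() {
                @Override
                public void triple(Triple t) {   // called once per parsed triple
                    if (t.getPredicate().hasURI(RDFS_LABEL))
                        kept[0]++;
                }
            }, "data.nt");                       // placeholder input file
            System.out.println(kept[0] + " triples kept");
        }
    }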

Now I want it to be optional in a pipeline to shove facts into an
in-memory model, because sometimes that is a great way to get things
done, and it would be nice not to have to change my filtering code while
still having confidence that what is happening under the hood is
efficient, without a lot of mindless copying.
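
Making the in-memory model an optional destination is then mostly a
matter of parameterising the sink. A sketch along those lines, using
StreamRDFLib.graph to adapt a Jena graph into a sink (the URI and file
name are again placeholders):

    import org.apache.jena.graph.Triple;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.riot.RDFDataMgr;
    import org.apache.jena.riot.system.StreamRDF;
    import org.apache.jena.riot.system.StreamRDFBase;
    import org.apache.jena.riot.system.StreamRDFLib;

    public class FilterIntoModel {
        /** Forwards only triples with the given predicate to the destination. */
        static StreamRDF filtering(final String predicateUri, final StreamRDF dest) {
            return new StreamRDFBase() {
                @Override
                public void triple(Triple t) {
                    if (t.getPredicate().hasURI(predicateUri))
                        dest.triple(t);
                }
            };
        }

        public static void main(String[] args) {
            Model model = ModelFactory.createDefaultModel();
            // Swap the destination: same filter code, survivors land in a model.
            StreamRDF intoModel = StreamRDFLib.graph(model.getGraph());
            RDFDataMgr.parse(filtering("http://example.org/p", intoModel), "data.nt");
            System.out.println(model.size() + " triples kept in memory");
        }
    }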

On the other hand, I am also doing things where immutable data
structures are the way to go; in particular I am using Jena classes with
production rules engines such as Drools. From my current viewpoint, RDFS
and OWL are just "logical theories" which sit on the shelf together with
logical theories on other topics such as invoices and postal addresses.
In this model there is

(i) a small rule base,
(ii) a fair-sized "T-Box"-like knowledge base (say 1-1M triples), and
(iii) a small "A-Box" knowledge base which streams past the system, in
the sense that the system runs a 'consultation' which may involve a
number of decisions, after which we toss the A-Box out.
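
To make that lifecycle concrete, here is roughly how I picture it
against the Drools 6 KIE API. This is only a sketch: the session name
and file names are invented, a kmodule.xml defining the session is
assumed to be on the classpath, and using Jena Statement objects
directly as facts is my assumption, not established practice:

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.riot.RDFDataMgr;
    import org.kie.api.KieServices;
    import org.kie.api.runtime.KieContainer;
    import org.kie.api.runtime.KieSession;

    public class Consultation {
        public static void main(String[] args) {
            KieContainer kie = KieServices.Factory.get().getKieClasspathContainer();

            // (ii) the T-Box: loaded once, asserted into every consultation.
            Model tbox = RDFDataMgr.loadModel("tbox.ttl");

            // (iii) the A-Box: one short-lived session per consultation.
            Model abox = RDFDataMgr.loadModel("consultation-request.ttl");
            KieSession session = kie.newKieSession("consultationSession"); // (i) rules
            try {
                tbox.listStatements().forEachRemaining(session::insert);
                abox.listStatements().forEachRemaining(session::insert);
                session.fireAllRules();   // make the decisions
            } finally {
                session.dispose();        // then toss the A-Box out
            }
        }
    }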

I like the feature set of Drools but may end up using something
Clojure-based for a rules engine, basically because the source code of
OPS5 in LISP is about 3k LOC while Drools core is orders of magnitude
bigger. When I look at the data modelling problems people run into with
"business rules engines", it is clear that RDF is the right answer for
many such conundrums.



-- 
Paul Houle

*Applying Schemas for Natural Language Processing, Distributed Systems,
Classification and Text Mining and Data Lakes*

(607) 539 6254    paul.houle on Skype   ontology2@gmail.com
https://legalentityidentifier.info/lei/lookup

Re: Jena: Spark vs. Drools

Posted by Rob Vesse <rv...@dotnetrdf.org>.
Comments inline:

On 11/05/2015 23:23, "Paul Houle" <on...@gmail.com> wrote:

>I've processed cumulative terabytes of data with
>
>https://github.com/paulhoule/infovore
>
>which was developed pre-Elephas.  One issue I have is that at this scale
>the reader has to be 100% bombproof.  Every triple I read is on a single
>line,  but it is guaranteed that there will be a bad triple in there
>somewhere,  and the system needs to reject it and move on to the next
>triple.  I looked at the code a while ago and found that Elephas does the
>same thing I did,  which is to create a new parser for each line,  and
>that is an awful solution in terms of speed.

Well, that's one behaviour that is available.

For some formats, like NTriples, we provide the ability to process in a
line-based, batch-based, or whole-file style.

See the following for a list of which formats support which processing styles:

http://jena.apache.org/documentation/hadoop/io.html#input
http://jena.apache.org/documentation/hadoop/io.html#input_1

Obviously, in the batch/whole-file styles you are trading off error
tolerance for performance, as you suggest, because with those processing
styles we only create a single parser for the block or the file. In
those styles an error aborts further processing of the batch because the
parser is not recoverable.

In the future perhaps we could improve this so that, in the case of an
error, you read to the next newline and then restart the parser.
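
For reference, the bombproof-but-slow line-at-a-time approach looks
roughly like this with the RDFParser builder (a sketch only; the file
name is a placeholder and the null sink stands in for real downstream
work):

    import java.io.BufferedReader;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.jena.riot.Lang;
    import org.apache.jena.riot.RDFParser;
    import org.apache.jena.riot.RiotException;
    import org.apache.jena.riot.system.StreamRDF;
    import org.apache.jena.riot.system.StreamRDFLib;

    public class TolerantNTriples {
        public static void main(String[] args) throws Exception {
            StreamRDF sink = StreamRDFLib.sinkNull();  // placeholder destination
            long bad = 0;
            try (BufferedReader in = Files.newBufferedReader(Paths.get("dirty.nt"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    try {
                        // One fresh parser per line: bombproof, but slow.
                        RDFParser.create().fromString(line).lang(Lang.NTRIPLES).parse(sink);
                    } catch (RiotException e) {
                        bad++;  // reject the bad triple and move on
                    }
                }
            }
            System.err.println(bad + " bad lines skipped");
        }
    }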

>
>For me it is not so much a matter of speed as it is cost,  as I spin these
>clusters up in AWS and try to spin up the right number of machines so the
>job finishes in a bit less than an hour.
>
>So far I haven't done much Spark + RDF yet but I can say if I have to deal
>with data sets that I can't process easily on one machine that will be the
>way I go.  What I do know is that Spark works with Hadoop Writables and
>other I/O stuff from Hadoop,  although I don't know if this is the optimal
>solution.  A lot of my attraction to Spark is that it scales from the
>application domain of Java parallel streams up to "huge data" and that is
>important to me.

The Elephas Writables all use Thrift as the underlying serialisation, so
there is minimal serialisation/deserialisation cost involved, and they
should work nicely with Spark and any other Hadoop ecosystem framework
that supports Writables.
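
For example, something along these lines should hand you the triples as
an RDD. This is a sketch only: the path is a placeholder, and the input
format and writable class names are the Elephas NTriples ones as I
recall them:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.jena.hadoop.rdf.io.input.ntriples.NTriplesInputFormat;
    import org.apache.jena.hadoop.rdf.types.TripleWritable;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class ElephasOnSpark {
        public static void main(String[] args) {
            JavaSparkContext sc =
                new JavaSparkContext(new SparkConf().setAppName("elephas-demo"));

            // Read NTriples splits via Elephas; values wrap Jena Triples.
            JavaPairRDD<LongWritable, TripleWritable> triples =
                sc.newAPIHadoopFile("hdfs:///data/dump.nt",
                                    NTriplesInputFormat.class,
                                    LongWritable.class,
                                    TripleWritable.class,
                                    new Configuration());

            // A trivial filter-and-count over the wrapped triples.
            long labels = triples.filter(pair ->
                pair._2().get().getPredicate()
                    .hasURI("http://www.w3.org/2000/01/rdf-schema#label")
            ).count();

            System.out.println(labels + " rdfs:label triples");
            sc.stop();
        }
    }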

For stream processing I would take a look at Apache Flink. It has very
similar aims to Spark but is designed as a streaming engine from the
ground up, so it provides true streaming, unlike Spark's micro-batching
approach. Overall it is not as mature in some areas, but there are other
areas where Flink is miles ahead of Spark, especially memory management
(the memory management improvements that Databricks recently announced
they were planning to work on for Spark, as Project Tungsten, already
exist and are mature in Flink).

>
>Probably the issue of "restartable parser for N-Triples" is separate from
>the "parser that doesn't allocate anything it doesn't need to allocate".

Yes. Note that the Jena parsers already maintain the minimal state
necessary. As Andy recently noted on another email thread, the NTriples
parser only holds state for the current line, and the Turtle parser only
holds state from the start of a block of triples to the terminating '.'

>So far as restartable Turtle,  I would look to
>
>https://www.tbray.org/ongoing/When/201x/2015/02/26/JSON-Text-Sequences

Nice

We did some work internally at Cray where we designed a parallel-friendly
serialisation for RDF tuples that achieves high compression while
allowing for both parallel compression and decompression. We use some
similar tricks to separate blocks, and records within blocks, where
necessary.

Maybe someday we will be able to publish this as open source; I haven't
bugged my manager about it lately...

Rob


Re: Jena: Spark vs. Drools

Posted by Paul Houle <on...@gmail.com>.
I've processed cumulative terabytes of data with

https://github.com/paulhoule/infovore

which was developed pre-Elephas. One issue I have is that at this scale
the reader has to be 100% bombproof. Every triple I read is on a single
line, but it is guaranteed that there will be a bad triple in there
somewhere, and the system needs to reject it and move on to the next
triple. I looked at the code a while ago and found that Elephas does the
same thing I did, which is to create a new parser for each line, and
that is an awful solution in terms of speed.

For me it is not so much a matter of speed as it is cost, as I spin
these clusters up in AWS and try to provision the right number of
machines so the job finishes in a bit less than an hour.

So far I haven't done much Spark + RDF yet, but I can say that if I have
to deal with data sets that I can't process easily on one machine, that
will be the way I go. What I do know is that Spark works with Hadoop
Writables and other I/O machinery from Hadoop, although I don't know if
this is the optimal solution. A lot of my attraction to Spark is that it
scales from the application domain of Java parallel streams up to "huge
data", and that is important to me.

Probably the issue of a "restartable parser for N-Triples" is separate
from that of a "parser that doesn't allocate anything it doesn't need to
allocate". So far as restartable Turtle goes, I would look to

https://www.tbray.org/ongoing/When/201x/2015/02/26/JSON-Text-Sequences

I have almost no interest in DL reasoners, except for cases where they
help do something I want, such as using "rdfs:subPropertyOf" to get
terms under a common vocabulary. I think DL has held back the semantic
web more than anything else; i.e. people like Kendall Clark can do
things I wouldn't think can be done in OWL, but the people I talk to
want to express government regulations and business policies, and ask
questions like "Is Bank X adequately capitalized?" Certainly I need to
do things like convert Fahrenheit and Centigrade to Kelvin (to extend
the rdfs:subPropertyOf concept), and that is trivial to do with
production rules but impossible with OWL/RDFS.
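
To make that concrete, even Jena's own general-purpose rule engine can
express the Centigrade-to-Kelvin move that rdfs:subPropertyOf cannot. A
sketch (not Drools; the example.org properties and the input file are
invented, and it leans on the rule language's numeric literals and its
sum builtin):

    import java.util.List;

    import org.apache.jena.rdf.model.InfModel;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.reasoner.rulesys.GenericRuleReasoner;
    import org.apache.jena.reasoner.rulesys.Rule;
    import org.apache.jena.riot.Lang;
    import org.apache.jena.riot.RDFDataMgr;

    public class Kelvinize {
        public static void main(String[] args) {
            // ex:tempC behaves like a subproperty of ex:tempK, except the
            // value is shifted by 273.15 on the way through.
            String rules =
                "[celsiusToKelvin: (?x <http://example.org/tempC> ?c) " +
                "                  sum(?c, 273.15, ?k) " +
                "               -> (?x <http://example.org/tempK> ?k)]";
            Model data = RDFDataMgr.loadModel("readings.ttl");
            List<Rule> parsed = Rule.parseRules(rules);
            InfModel inf =
                ModelFactory.createInfModel(new GenericRuleReasoner(parsed), data);
            RDFDataMgr.write(System.out, inf, Lang.TURTLE);  // base + derived
        }
    }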

Now the SWRL idea, where you extend production rules with RDFS/OWL, is a
good one, and I also think SPIN is a good idea, but for many of the data
transformation and classification tasks I do, RETE network execution is
close to ideal. Also, when it comes to things like mixed-initiative
interaction, complex event processing, complex decisions (think chess
playing, where you have to consider the effects of many moves, or route
optimization) and the asynchronous I/O morass that people are heading
into without helmets, I think that kind of system has a lot to offer.

So far as "business rules engines" go,  a definite theme I see is that the
near-term state of the art is a lot better than people think because there
are so many communities that aren't talking.  I have a recent book on KR
that stops with MYCIN and doesn't say your bank probably uses ILOG or that
a program written in ILOG made the final decisions for IBM Watson.

Now Drools does suck, for the simple reason that Drools doesn't really
know Java, so it can't give you error messages that make sense, and that
is just compounded by the decision tables and DSL stuff. I think Drools
has made some of the same mistakes other BRMS have, in terms of building
a system where there are enough different ways to do things that
everybody from the execs to the devs is driven crazy. (At least Drools
did have enough sense to use Git for version control.) A modern system
probably involves:

- a rules language
- a brilliant DSL system that does a large amount of reasoning
- an IDE that lets you render text and text+annotation documents in
different ways

but underlying it all is the idea that complexity can be part of the
problem as much as part of the solution, and that what makes life hard
for devs makes it hard for execs, and vice versa.

Re: Jena: Spark vs. Drools

Posted by "Bruno P. Kinoshita" <ki...@apache.org>.
Hi Paul,

I worked with Jena in a Hadoop/Hive cluster, but without Spark. There
was only one job that took too long to work on my dataset, but I suspect
that was due to something in my custom code - which could now be
replaced in parts by Elephas - or due to a lack of optimization in the
storage format or job parameters.

In my case, I was doing some NLP with OpenNLP and creating triples that
would later be loaded into a Jena graph. Since I didn't need to work on
the graph/model in the cluster, I never had a case similar to yours.

A few questions:
- Have you looked at Giraph and other graph solutions for Hadoop too?
Maybe one of them provides an abstraction layer that could be used in
conjunction with Jena graphs.
- Did you have to use some special configuration for persisting your
datasets to disk too? Did you find some good examples or literature
online that you could share with devs that don't have much experience
with Spark (like me :-) ?
- Would it make sense to try existing reasoners like HermiT and Pellet,
instead of using Drools?
- Have you used Elephas too? Anything that would be useful to Spark and
could maybe be added?
- Are you writing/blogging about this?

In this same project, one of the third-party libraries used Drools rules
to extract content from PDFs. While I found it really powerful, it was
hard to debug and to adjust the parameters, as it had some custom code
to manipulate Excel spreadsheets and generate the rules.

Thanks!
Bruno
 