
Posted to user@spark.apache.org by filipus <fl...@gmail.com> on 2014/06/10 13:52:32 UTC

pmml with augustus

Hello guys,

Does anybody have experience with the Augustus library as a serializer for
scoring models?

It looks very promising, and I even found a hint about a connection between
Augustus and Spark.

All the best



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pmml-with-augustus-tp7313.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: pmml with augustus

Posted by filipus <fl...@gmail.com>.

@Paco: Do I understand correctly that, for deploying models in the Spark
environment, Augustus and Zementis would be the most promising options to
invest effort in?

Actually, as you mention, I would need both directions of deployment. I
already have models that I could transform into PMML, and I also plan to
build more models over time using Spark, or other model engines in the
Hadoop field.

When I read about MLlib and MLbase I got very interested, because they
seem to handle some aspects of my actual challenge (building around 1000
models, administering 1000 models, calculating around 2 billion scores each
week). About the administration part, though, I am not so sure. I also find
that the field (Spark, MLlib, MLbase, ...) needs to put some effort into the
transparency of the models.

As long as you just build a recommender system you probably don't need
something like that, but as you mention, there are a lot of departments where
analysts build the models, because the risk of spending millions in the
wrong place because of a model that wasn't validated carefully is simply too
high for the managers.

....

Is there actually any direction for the administration of scores in the
Spark/MLlib/MLbase field? I mean something like:

a) description of the score model: training data set, target variable,
intended purpose
b) quality check: actual performance in comparison with other models
c) version control system
d) indicator of whether the score is active or not
e) the specific action it applies to (for instance which website, which
customer group, which country, ...)

A commercial product that is comparable in a way would be the Model Manager
from SAS.
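
To make points a) to e) a bit more concrete, here is a minimal sketch of what one registry entry and a lookup could look like. Everything in it is an assumption for illustration: the class names ScoreModelEntry and ScoreModelRegistry, the field names, and the sample values are invented, and none of it is an existing Spark, MLlib, or MLbase API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical registry entry; all names are illustrative, not a Spark or MLbase API.
final class ScoreModelEntry {
    final String name;
    final String description;     // (a) what the score is for
    final String trainingDataSet; // (a) where the training data came from
    final String targetVariable;  // (a) what the model predicts
    final double quality;         // (b) a validation metric, higher is better
    final int version;            // (c) version of this model
    final boolean active;         // (d) whether the score is currently in use
    final List<String> scopes;    // (e) websites, customer groups, countries, ...

    ScoreModelEntry(String name, String description, String trainingDataSet,
                    String targetVariable, double quality, int version,
                    boolean active, List<String> scopes) {
        this.name = name;
        this.description = description;
        this.trainingDataSet = trainingDataSet;
        this.targetVariable = targetVariable;
        this.quality = quality;
        this.version = version;
        this.active = active;
        this.scopes = new ArrayList<>(scopes);
    }
}

final class ScoreModelRegistry {
    private final List<ScoreModelEntry> entries = new ArrayList<>();

    void register(ScoreModelEntry entry) {
        entries.add(entry);
    }

    // Returns the active entries that apply to the given scope, best quality first.
    List<ScoreModelEntry> activeFor(String scope) {
        List<ScoreModelEntry> result = new ArrayList<>();
        for (ScoreModelEntry entry : entries) {
            if (entry.active && entry.scopes.contains(scope)) {
                result.add(entry);
            }
        }
        result.sort((x, y) -> Double.compare(y.quality, x.quality));
        return result;
    }

    public static void main(String[] args) {
        // Sample values are made up for the example.
        ScoreModelRegistry registry = new ScoreModelRegistry();
        registry.register(new ScoreModelEntry("churn", "churn risk per customer",
                "crm_2014_q1", "churned", 0.81, 3, true, Arrays.asList("DE", "FR")));
        registry.register(new ScoreModelEntry("churn-old", "previous churn model",
                "crm_2013_q4", "churned", 0.74, 2, false, Arrays.asList("DE")));
        for (ScoreModelEntry entry : registry.activeFor("DE")) {
            System.out.println(entry.name + " v" + entry.version);
        }
    }
}
```

A real solution would of course keep this in a database and track history per version; the sketch only shows which attributes would have to live somewhere.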

Hey guys, in any case I will get involved in this field. It looks so
promising.

PS: Think about integrating a MIP solver! You cannot handle everything
with a statistical model. In business you quite often have discrete
optimization problems when you try to manage your business with prediction
models :-)


Re: pmml with augustus

Posted by Paco Nathan <ce...@gmail.com>.

That's a good point about polyglot. Given that Spark is incorporating a
range of languages (Scala, Java, Py, R, SQL) it becomes a trade-off whether
or not to centralize support or integrate with native options. Going with
the latter implies more standardization and less tech debt.

The big win with PMML however is migration, e.g., regulated industries may
have a strong requirement to train in one place that is auditable (e.g.,
SAS) but then score at scale (e.g., Spark). Migration in the opposite
direction is also much in demand, e.g., to leverage training at scale
through Spark.

It's worth noting that there is a PMML community. Open Data Group
(Augustus) and Zementis do much work to help organize and promote that.
Opinion: both of those projects seem more likely as best ref impls than
JPMML -- at least more actively cooperating within the PMML open standard
community. YMMV.

If you're interested in PMML then I'd encourage you to get involved. There
are workshops, e.g., generally at KDD, ACM gatherings, etc.

FWIW, I was the original lead on Cascading's PMML support -- first rev that
other firms used in production, not the rewrite on Concurrent's site that
added Cascading deep dependencies.

Re: pmml with augustus

Posted by "Evan R. Sparks" <ev...@gmail.com>.

I should point out that if you don't want to take a polyglot approach to
languages and would rather reside solely in the JVM, then you can just use
plain old Java serialization on the Model objects that come out of MLlib's
APIs from Java or Scala, load them up in another process, and call the
relevant .predict() method when it comes time to serve. The same approach
would probably also work for models trained via MLlib's Python APIs, but I
haven't tried that.
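
A minimal sketch of that round trip, assuming the model class implements java.io.Serializable. ToyLinearModel below is a hypothetical stand-in written for this example, not an MLlib class, and the byte array stands in for whatever file or store would carry the model between the training process and the serving process.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for a trained model; not an MLlib class.
final class ToyLinearModel implements Serializable {
    private static final long serialVersionUID = 1L;
    private final double[] weights;
    private final double intercept;

    ToyLinearModel(double[] weights, double intercept) {
        this.weights = weights.clone();
        this.intercept = intercept;
    }

    // Dot product of weights and features, plus the intercept.
    double predict(double[] features) {
        double sum = intercept;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i] * features[i];
        }
        return sum;
    }
}

final class ModelSerializationSketch {
    // "Training side": turn the model object into bytes.
    static byte[] toBytes(Serializable model) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(model);
        }
        return buffer.toByteArray();
    }

    // "Serving side": rebuild the object; its class must be on the classpath.
    static Object fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        ToyLinearModel trained = new ToyLinearModel(new double[] {0.5, -2.0}, 1.0);
        byte[] stored = toBytes(trained);
        ToyLinearModel served = (ToyLinearModel) fromBytes(stored);
        System.out.println(served.predict(new double[] {2.0, 1.0}));
    }
}
```

The usual caveat with this approach is that both processes need compatible versions of the model class, which is one reason a portable format is attractive for moving models between environments.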

Native PMML serialization would be a nice feature to add to MLlib as a
mechanism to transfer models to other environments for further
analysis/serving. There's a JIRA discussion about this here:
https://issues.apache.org/jira/browse/SPARK-1406

Re: pmml with augustus

Posted by filipus <fl...@gmail.com>.

Thank you very much.

I hadn't noticed the Cascading project at all until now.

This project is very interesting.

I also get the idea of using Scala as the language for Spark now, because
I can integrate JVM-based libraries very easily and naturally, if I
understand it right.

Hmm... but I could also use Spark as the model engine, Augustus as the
serializer, and a third-party product such as JPMML as the prediction
engine.

Hmm... I get the feeling that I need to do Java, Scala and Python at the
same time...

First things first -> Augustus for PMML output from Spark :-)






Re: pmml with augustus

Posted by Sean Owen <so...@cloudera.com>.

On Tue, Jun 10, 2014 at 7:59 AM, Sean Owen <so...@cloudera.com> wrote:
> It's worth mentioning that Augustus is a Python-based library. On a
> related note, in Java-land, I have had good experiences with jpmml's
> projects:

https://github.com/jpmml

in particular

https://github.com/jpmml/jpmml-model
https://github.com/jpmml/jpmml-evaluator

I have not used OpenScoring yet.

Re: pmml with augustus

Posted by Sean Owen <so...@cloudera.com>.

It's worth mentioning that Augustus is a Python-based library. On a
related note, in Java-land, I have had good experiences with jpmml's
projects:


Re: pmml with augustus

Posted by filipus <fl...@gmail.com>.

@villu: Thank you for your help. I promise I'm going to try it! That's cool :-)
Do you also know a way to go in the other direction, from PMML to a model
object in Spark?




Re: pmml with augustus

Posted by Villu Ruusmann <vi...@gmail.com>.

Hello Spark/PMML enthusiasts,

It's pretty trivial to integrate the JPMML-Evaluator library with Spark. In
brief, take the following steps in your Spark application code:
1) Create a Java Map ("arguments") that represents the input data record.
You need to specify a key-value mapping for every active MiningField. The
key type is org.jpmml.evaluator.FieldName. The value type could be String or
any Java primitive data type that can be converted to the requested PMML
type.
2) Obtain an instance of org.jpmml.evaluator.Evaluator. Invoke its
#evaluate(Map<FieldName, ?>) method using the argument map created in step
1.
3) Process the Java Map ("results") that represents the output data record.

Putting it all together:
JavaRDD<Map<FieldName, String>> arguments = ...
// See the JPMML-Evaluator documentation
final ModelEvaluator<?> modelEvaluator = (ModelEvaluator<?>)pmmlManager.getModelManager(null, ModelEvaluatorFactory.getInstance());
JavaRDD<Map<FieldName, ?>> results = arguments.flatMap(new FlatMapFunction<Map<FieldName, String>, Map<FieldName, ?>>(){

	@Override
	public Iterable<Map<FieldName, ?>> call(Map<FieldName, String> arguments){
		Map<FieldName, ?> result = modelEvaluator.evaluate(arguments);
		return Collections.<Map<FieldName, ?>>singletonList(result);
	}
});

Of course, it's not very elegant to be using JavaRDD<Map<K, V>> here.
Maybe someone can give me a hint about making it look and feel more Spark-y?
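
For anyone who wants to try the three steps above without Spark or JPMML on the classpath, here is a self-contained sketch under stated assumptions. RecordScorer is a hypothetical stand-in for org.jpmml.evaluator.Evaluator, plain String keys stand in for FieldName, a List stands in for the JavaRDD, and the field names "age" and "income" as well as the scoring formula are invented for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for org.jpmml.evaluator.Evaluator; not the JPMML API.
interface RecordScorer {
    Map<String, ?> evaluate(Map<String, ?> arguments);
}

final class ScoringStepsSketch {
    // Toy scorer: a fixed linear formula over two made-up input fields.
    static final RecordScorer TOY_SCORER = arguments -> {
        double age = Double.parseDouble(String.valueOf(arguments.get("age")));
        double income = Double.parseDouble(String.valueOf(arguments.get("income")));
        Map<String, Object> result = new LinkedHashMap<>();
        result.put("score", 2.0 * age + 0.5 * income);
        return result;
    };

    // Step 1: build the argument map for one input record.
    static Map<String, String> arguments(String age, String income) {
        Map<String, String> arguments = new LinkedHashMap<>();
        arguments.put("age", age);
        arguments.put("income", income);
        return arguments;
    }

    // Steps 2 and 3: evaluate each record and collect the result maps.
    static List<Map<String, ?>> scoreAll(List<Map<String, String>> records,
                                         RecordScorer scorer) {
        List<Map<String, ?>> results = new ArrayList<>();
        for (Map<String, String> record : records) {
            results.add(scorer.evaluate(record));
        }
        return results;
    }

    public static void main(String[] args) {
        List<Map<String, String>> records = new ArrayList<>();
        records.add(arguments("30", "100"));
        records.add(arguments("45", "200"));
        for (Map<String, ?> result : scoreAll(records, TOY_SCORER)) {
            System.out.println(result);
        }
    }
}
```

Since every input record yields exactly one result map here, the loop is a one-to-one transformation. On an RDD that would presumably correspond to JavaRDD.map with a Function rather than flatMap with a singleton list, though that variant is not shown or tested here.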

Also, I would like to refute the earlier comment by @pacoid that
JPMML-Evaluator compares poorly against the Augustus and Zementis products.
First, JPMML-Evaluator fully supports PMML specification versions 3.0
through 4.2. I would specifically stress the support for PMML 4.2, which was
released just a few months ago. Second, JPMML is open source. Perhaps its
licensing terms could be more liberal, but it's nevertheless the most open
and approachable way of bringing Java and PMML together.


VR


