Posted to issues@spark.apache.org by "Nick Pentreath (JIRA)" <ji...@apache.org> on 2016/04/25 11:28:13 UTC

[jira] [Comment Edited] (SPARK-14891) ALS in ML never validates input schema

    [ https://issues.apache.org/jira/browse/SPARK-14891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15256140#comment-15256140 ] 

Nick Pentreath edited comment on SPARK-14891 at 4/25/16 9:27 AM:
-----------------------------------------------------------------

Currently the only doc is the Scaladoc on the developer API:
{code}
/**
 * :: DeveloperApi ::
 * An implementation of ALS that supports generic ID types, specialized for Int and Long. This is
 * exposed as a developer API for users who do need other ID types. But it is not recommended
 * because it increases the shuffle size and memory requirement during training. For simplicity,
 * users and items must have the same type. The number of distinct users/items should be smaller
 * than 2 billion.
 */
@DeveloperApi
object ALS ... 
{code}

The user-facing ML API casts the user and item columns to {{IntegerType}} without issuing any warning or enforcing any schema validation. So it implicitly "supports" any numeric input type but silently casts it to Int. That could cause some irritating issues when, say, saving the model and trying to use it in production, or when making recommendations for a set of user ids and saving them to a datastore (the ids were Long, say, and have now been mangled to Ints).
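To make the "mangled" ids concrete, here is a minimal sketch in plain Python, with no Spark involved. It assumes the cast behaves like a JVM long-to-int narrowing conversion (keep the low 32 bits, reinterpret as signed); {{narrow_to_int32}} is a hypothetical helper written for this illustration, not a Spark function.

```python
def narrow_to_int32(value: int) -> int:
    """Keep the low 32 bits and reinterpret them as a signed 32-bit integer.

    Assumption: this mirrors a JVM long-to-int narrowing conversion.
    """
    low = value & 0xFFFFFFFF
    return low - 0x100000000 if low >= 0x80000000 else low


# Hypothetical user ids stored as Long in the input data.
user_ids = [42, 2_147_483_647, 2_147_483_648, 4_294_967_338]
narrowed = [narrow_to_int32(u) for u in user_ids]

for original, mangled in zip(user_ids, narrowed):
    print(f"{original} -> {mangled}")
```

Under that assumption, ids above the Int range either turn negative or collide with an unrelated smaller id, so recommendations written back to a datastore may no longer line up with the original Long keys.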


> ALS in ML never validates input schema
> --------------------------------------
>
>                 Key: SPARK-14891
>                 URL: https://issues.apache.org/jira/browse/SPARK-14891
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>            Reporter: Nick Pentreath
>
> Currently, {{ALS.fit}} never validates the input schema. There is a {{transformSchema}} impl that calls {{validateAndTransformSchema}}, but it is never called in either {{ALS.fit}} or {{ALSModel.transform}}.
> This was highlighted in SPARK-13857 (and failing PySpark tests [here|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56849/consoleFull]) when adding a call to {{transformSchema}} in {{ALSModel.transform}} that actually validates the input schema. The PySpark docstring tests result in Long inputs by default, which fail validation as Int is required.
> Currently, the inputs for user and item ids are cast to Int, with no input type validation (or warning message). So users could pass in Long, Float, Double, etc. It's also not made clear anywhere in the docs that only Int types for user and item are supported.
> Enforcing validation seems the best option, but it might break user code that previously "just worked", especially in PySpark.
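
One possible middle ground between strict schema validation and a silent cast is to accept any numeric id but reject values that would not survive the conversion to Int. The sketch below shows that idea in plain Python; {{checked_int_id}} and its error messages are hypothetical and are not part of the Spark API.

```python
import math

# Bounds of a signed 32-bit integer (the range of Int).
INT_MIN, INT_MAX = -2**31, 2**31 - 1


def checked_int_id(value, column):
    """Return value as an int id, or raise if converting to Int would change it.

    Hypothetical helper for illustration only.
    """
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError(f"{column}: expected a numeric id, got {type(value).__name__}")
    # Reject NaN, infinities, and floats with a fractional part.
    if isinstance(value, float) and (not math.isfinite(value) or not value.is_integer()):
        raise ValueError(f"{column}: id {value!r} is not a whole number")
    as_int = int(value)
    if not INT_MIN <= as_int <= INT_MAX:
        raise ValueError(f"{column}: id {as_int} is outside the Int range")
    return as_int
```

With a check of this kind, Long inputs that happen to fit in an Int (as in the PySpark docstring tests) would keep working, while ids that would be mangled fail loudly instead of silently.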


