Posted to reviews@spark.apache.org by srowen <gi...@git.apache.org> on 2015/02/08 23:14:32 UTC

[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

GitHub user srowen opened a pull request:

    https://github.com/apache/spark/pull/4460

    SPARK-4588 [MLLIB] [WIP] Add API for feature attributes

    @mengxr
    
    This is an early check-in to see whether this is headed in the right direction at all.
    
    - Should this have a counterpart "builder" class, or setters that make a new copy of the object?
    - In the JIRA you mention feature dimension. Dumb question, but is that just the cardinality of categorical features?
    - I structured it all with maps, and I wonder if that's OK

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/srowen/spark SPARK-4588

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4460.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4460
    
----
commit aa3361be703ae08dd10525330d6471aedb294f59
Author: Sean Owen <so...@cloudera.com>
Date:   2015-02-08T22:11:22Z

    Initial draft of FeatureAttributes class

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/4460#issuecomment-75352674
  
    +1 for the feature assembler or some other algorithm handling munging and indexing as needed.
    * Note that the behavior of the assembler may depend on the algorithm being used.  E.g., an assembler may want to use 1-hot encoding for Strings for linear regression, but use simple indexing for trees.  That makes it awkward for the user, and we may eventually want each algorithm to handle its own feature assembly if needed.
    
    About categorical types for decision trees: There should ideally be a distinction between categorical types with arbitrary values and categorical types known to be in a range {0, 1, ..., numCategories-1}.
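
    To make the encoding distinction concrete, here is a minimal, purely hypothetical sketch (these helper names are not part of the PR): the same categorical string values can become one index per value for a tree, or a 1-hot vector per value for a linear model.

```scala
// Hypothetical helpers, not the PR's API: two ways an assembler might
// materialize a categorical string feature.
def indexEncode(values: Seq[String]): (Map[String, Int], Seq[Int]) = {
  // Assign indices 0..numCategories-1 in order of first appearance.
  val index = values.distinct.zipWithIndex.toMap
  (index, values.map(index))
}

def oneHotEncode(values: Seq[String]): Seq[Array[Double]] = {
  val (index, encoded) = indexEncode(values)
  // Each value becomes a vector with a single 1.0 at its category index.
  encoded.map { i =>
    val v = Array.fill(index.size)(0.0)
    v(i) = 1.0
    v
  }
}
```

    For example, `indexEncode(Seq("a", "b", "a"))` yields indices `Seq(0, 1, 0)`, while `oneHotEncode` turns each value into a 2-element vector.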




[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4460#issuecomment-75370525
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27813/
    Test FAILed.




[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/4460#issuecomment-75225746
  
    Don't you always have to look at the data to determine how many unique values a column has, regardless of type? String and int are encodings, but attribute types like categorical and continuous are interpretations. Those seem orthogonal to me, and I thought `Attribute` was only metadata representing the attribute type, whereas the RDD schema already knows the actual column data types. That is, there's no "string" attribute feature type, right?

    From a user perspective, I'd be surprised if I had to encode categorical string values, as it seems like something a framework can easily do for me. There's nothing inherently strange about providing a string-valued column that is (necessarily) categorical, and in fact categorical features often arrive as strings in input. But there's no reason the framework couldn't encode the categorical values internally in a different way if needed. Are you just referring to the latter? Sure, anything can be done within the framework.

    So why can't a vector-valued column contain a string -- is the issue that internally we can't have a single array-valued column with different data types? That makes sense. Vector-valued columns are usually used for things like a large number of indicator variables, or related counts, which are all of the same data type, and not typically to encode complex nested schemas containing different data types. But if you really wanted a vector-valued column containing continuous age and categorical gender, then yes, it seems like the data representation limitations would demand they're of the same data type, and must be doubles, and that's fine. But does that mean you can't ever have a non-vector string-valued categorical feature?
    
    I agree that discrete and ordinal don't come up. I don't think they're required as types, but they may allow optimizations. For example, there's no point in checking decision rules like >= 3.4, >= 3.5, >= 3.6 for a discrete feature; they're all the same rule. That optimization doesn't exist yet. I can't actually think of a realistic optimization that would depend on knowing a value is ordinal (1, 2, 3, ...), so I'd maybe drop that.
    
    OK I will get to work on changes discussed so far.




[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/4460#issuecomment-75370475
  
    - (Any support for making Metadata .get methods return `Option`?)
    - I created a `FeatureType` hierarchy. It's a little tricky and involves `trait`s because `Binary` has to have two types, but works. 
    - I did not yet make a `BinaryAttribute` or `DiscreteAttribute`, pending views on whether they're worth it (neither is hard). The same sort of extra work to inject a `CategoricalTypeAttribute` trait or something would be necessary to let `DiscreteAttribute` share attributes of a `CategoricalAttribute` and `ContinuousAttribute`. Worth it?
    - `CategoricalAttribute` now accepts an optional, explicit cardinality (and that replaces `numCategories`). I think it should only be defined for categorical types, no? I added an example in the scaladoc.
    - I did not rename `FeatureAttributes` or try to make it extend `Attribute`, since I understand that vector-valued features do not have heterogeneous types. That is, a vector-valued feature is simply a many-valued continuous or categorical or binary column, not a complete sub-schema.
    - I went to add `AttributeGroup`, but then I can't figure out how this isn't already covered by `Attribute`'s dimension? It's 1 for a scalar, >1 for a vector-valued feature. That's all that's different from a metadata perspective, right?
    
    @mengxr Please feel free at any time to take this over if it's easier. It will become more efficient at some point for you to just finish it as you think best. I personally believe your vision sounds good and am more concerned with making it match the vision of the other code you're creating, and letting you get on with that.




[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/4460#issuecomment-74609769
  
    I like the current sketch but also want to think about it more.  A few thoughts:
    
    I'm not quite clear on how the Array of Attributes in FeatureAttributes corresponds to the columns of the DataFrame.  Is it one-to-one, or will Attributes be nested?  (I'm basically thinking about groups of features, especially individual features grouped into vectors.)
    
    How will propagation of feature names work?  Will we try to impose a standard, such as Transformers maintaining the same (or a modified) feature name whenever possible?
    
    By the way, do we want to call this "FeatureAttributes," or should we name it something like "ColumnAttributes" so it more obviously applies to other types of columns like labels, users, products, etc.?
    
    +1 for moving FeatureType from mllib.tree to attribute.  It should be more general.




[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4460#issuecomment-75370501
  
      [Test build #27813 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27813/consoleFull) for   PR 4460 at commit [`7c944da`](https://github.com/apache/spark/commit/7c944da8b2a1aa9ec9d70fabe77b17d54c08b291).
     * This patch merges cleanly.




[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4460#issuecomment-73439415
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27053/
    Test PASSed.




[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4460#issuecomment-73927817
  
      [Test build #27294 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27294/consoleFull) for   PR 4460 at commit [`b3a1000`](https://github.com/apache/spark/commit/b3a10005e753c0e106dc73c74b6161bdae8d8a22).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `abstract class Attribute(val index: Int,`





[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/4460#issuecomment-74648170
  
    So is the idea that `FeatureAttributes` becomes `AttributeGroup`, and that it continues to contain many `Attribute`s? I didn't realize that we intended the vector-valued features to be whole schemas within themselves. So they may be `AttributeGroup`s too and so an `AttributeGroup` is an `Attribute` too. Makes sense.
    
    Rename `FeatureType`? And what's its value for `AttributeGroup`: `GROUP` or `null`?
    
    You could imagine a more elaborate hierarchy of types: _discrete_ is a special case of _continuous_, _ordinal_ is a special case of _discrete_. It's nice to have that expressiveness; it adds somewhat to the complexity for the caller and the code. Maybe you could argue that the schema should force an interpretation for the algorithm. But I kind of like it. The type objects would have methods like `isContinuous`, `isCategorical`. Should I make a fuller hierarchy or stick to adding `BINARY`?
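
    As a rough illustration of that hierarchy (hypothetical names and structure, sketched with `trait`s so that binary can carry both interpretations):

```scala
// Hypothetical sketch of a FeatureType hierarchy; not the PR's code.
sealed trait FeatureType {
  def isContinuous: Boolean = false
  def isCategorical: Boolean = false
}
trait ContinuousLike extends FeatureType { override def isContinuous: Boolean = true }
trait CategoricalLike extends FeatureType { override def isCategorical: Boolean = true }

case object Continuous extends ContinuousLike
case object Discrete extends ContinuousLike    // special case of continuous
case object Ordinal extends ContinuousLike     // special case of discrete
case object Categorical extends CategoricalLike
case object Binary extends ContinuousLike with CategoricalLike // both at once
```

    Callers would then dispatch on `isContinuous`/`isCategorical` rather than matching on concrete types.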




[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/4460#issuecomment-74623712
  
    Btw, for the feature type, besides continuous and categorical, do we want to make binary special? It could be treated as both continuous and categorical.




[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/4460#issuecomment-75223695
  
    @srowen If we mark a string column categorical, it may be hard to answer how many categories it has without looking at the data. If a column is marked categorical, it should store the category information in the metadata and store only integers/doubles in the column. The difference is that we cannot merge a string column with a double column into a vector column without first turning the string column into an integer/double column. So I feel that we should treat string columns and categorical columns separately.
    
    For the type hierarchy, if we add a type, let's discuss what algorithms would actually use it:
    
    1. numeric (slightly broader than continuous): linear algorithms can take them, while decision trees need to bin them first (if the number of distinct values is large)
    2. categorical: decision trees can take them, while linear algorithms need dummy coding.
    3. binary: both linear algorithms and decision trees can take them directly.
    4. discrete/ordinal: ?
    
    Maybe for the base `Attribute` trait, we can have `cardinality: Int`. For continuous data, we put `inf` or `-1`; for ordinal data, we put the number of distinct values. Let's try to write down the API and see how it looks.
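
    Writing that down as a rough sketch (hypothetical API, using `-1` for unbounded cardinality as suggested):

```scala
// Hypothetical sketch: a single base trait with cardinality, where -1
// marks an unbounded (continuous) attribute. Names are illustrative only.
sealed trait SketchAttribute {
  def cardinality: Int // -1 for continuous; number of distinct values otherwise
}
case class ContinuousAttr(name: Option[String]) extends SketchAttribute {
  val cardinality: Int = -1
}
case class OrdinalAttr(name: Option[String], numDistinct: Int) extends SketchAttribute {
  val cardinality: Int = numDistinct
}
case object BinaryAttr extends SketchAttribute {
  val cardinality: Int = 2 // usable by linear models and trees alike
}
```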
    
    Not a concern at this stage: for `AttributeGroup` we might need to handle the storage efficiently to avoid GC pressure. It may come with millions of features, and we will hit GC problems if we use too many objects to store the attributes. But this is maybe too early to worry about.




[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/4460#issuecomment-75384291
  
    @jkbradley The issue with algorithms handling munging and indexing is the increased complexity. For example, if `DecisionTree` takes string columns, there will be some code inside `DecisionTree` that maps them into indices, and we also need to insert the same logic into the model it produces, and then into save/load for this model.




[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4460#issuecomment-73435580
  
      [Test build #27053 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27053/consoleFull) for   PR 4460 at commit [`aa3361b`](https://github.com/apache/spark/commit/aa3361be703ae08dd10525330d6471aedb294f59).
     * This patch merges cleanly.




[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/4460#issuecomment-75366539
  
    @mengxr Understood about being told the number of distinct values for a column by the caller/schema. I thought you were saying this was a difference between string / integer columns. Yes, the point of this value is that it can be fed in as metadata.
    
    I think we are saying the same thing regarding strings. Yes, there's no point in every algorithm handling string types. It's fine if they only consume numeric types, and some separate optional transformation stage re-encodes strings if needed. That's a transformation stage the framework could provide, for users to drop in. Algorithms could indeed refuse to operate on string types.
    
    To me, that means that the idea of a string-valued categorical column still has a place in the representation since it exists at some stage of a pipeline. It's just that such a thing would never reach an algorithm as-is. Is that aligned with what you guys think?
    
    Let me get down to the code to see if this actually matters to the metadata. Right now nothing about this PR has to do with the underlying data type. As long as you aren't saying "string" is a type mutually exclusive with "categorical" then I think we're saying the same thing.
    
    Yes, I think my tree example wasn't a good one; never mind. The algorithm is already going to choose only actual data values as split points. Hm. It might matter to algorithms that want to restrict themselves to discrete inputs, like multinomial naive Bayes? But even there I don't think it's an example of requiring the distinction. I'm neutral about adding 'discrete' as a type, I suppose.

    Interesting point: are numeric categorical values always assumed to be in {0, 1, 2, ..., n-1}? That strikes me as something that is often true, not always. A particular algorithm may require it, and may provide an optional transformation to put a feature into this form. It doesn't seem like a conceptually different type so much as an additional restriction on a categorical feature's representation. That is, is this maybe an additional bit of metadata? Hm.
    
    Let me make some code changes to reflect the discussion so far.




[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4460#issuecomment-75370524
  
      [Test build #27813 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27813/consoleFull) for   PR 4460 at commit [`7c944da`](https://github.com/apache/spark/commit/7c944da8b2a1aa9ec9d70fabe77b17d54c08b291).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `abstract class Attribute(val index: Int,`





[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4460#issuecomment-73927632
  
      [Test build #27294 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27294/consoleFull) for   PR 4460 at commit [`b3a1000`](https://github.com/apache/spark/commit/b3a10005e753c0e106dc73c74b6161bdae8d8a22).
     * This patch merges cleanly.




[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/4460#issuecomment-75304366
  
    > Don't you always have to look at the data to determine how many unique values a column has, regardless of type?
    
    Not if we already have ML attributes saved together with the data or defined by users.
    
    > String and int are encodings, but attribute types like categorical and continuous are interpretations. Those seem orthogonal to me and I thought Attribute was only metadata representing the attribute type, whereas the RDD schema already knows the actual column data types.
    
    Conceptually, this is true. But adding restrictions would simplify our implementation. The restriction I proposed is that data stored in columns with ML attributes are values (Float/Double/Int/Long/Boolean/Vector), so algorithms and transformers don't need to handle special types. Consider a vector assembler that merges multiple columns into a vector column. If it needs to handle string columns, it needs to call some indexer to turn strings into indices before merging them. This piece of code would probably appear in every algorithm and its unit tests. If we force users to turn string columns into numeric ones first, the implementation of the rest of the pipeline can be simplified.
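
    A toy illustration of that simplification (hypothetical code, not MLlib's): with the restriction, assembling numeric columns into a vector is trivial, whereas allowing strings drags indexing logic into every such component.

```scala
// Hypothetical: under the numeric-only restriction, assembly is concatenation.
def assembleNumeric(columns: Seq[Double]): Array[Double] = columns.toArray

// Without the restriction, every assembler (and its tests) needs an indexer:
def assembleMixed(columns: Seq[Any], index: Map[String, Int]): Array[Double] =
  columns.map {
    case d: Double => d
    case s: String => index(s).toDouble // indexing logic leaks into the assembler
  }.toArray
```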
    
    > From a user perspective, I'd be surprised if I had to encode categorical string values as it seems like something a framework can easily do for me. There's nothing inherently strange about providing a string-valued column that is (necessarily) categorical and in fact they often are strings in input. But there's no reason it couldn't encode the categorical values internally in a different way if needed. Are you just referring to the latter? sure, anything can be done within the framework.
    
    scikit-learn has a clear separation of string values and numeric values. All string values must be encoded into categorical columns through transformers before calling ML algorithms, and all ML algorithms take a matrix `X` and a vector `y`. That didn't surprise users much (hopefully). In MLlib, we will provide transformers that turn strings into categorical columns in various ways.
    
    > I agree that discrete and ordinal don't come up. I don't think they're required as types, but may allow optimizations. For example, there's no point in checking decision rules like >= 3.4, >= 3.5, >= 3.6 for a discrete feature. They're all the same rule. That optimization doesn't exist yet. I can't actually think of a realistic optimization that would depend on knowing a value is ordinal (1,2,3,...) I'd drop that maybe.
    
    For trees, if the features are integers and there is a split `> 3.4`, then there won't be more splits between 3 and 4, because they would all separate the points identically. It looks okay to me that we have a split `> 3.4` while all values are integers. We can definitely add this attribute back if it becomes necessary.




[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/4460#issuecomment-75461553
  
    > I went to add AttributeGroup, but then I can't figure out how this isn't already covered by Attribute's dimension? It's 1 for a scalar, >1 for a vector-valued feature. That's all that's different from a metadata perspective, right?
    
    If we want to have attributes parallel to arbitrary nestings of SQL types, then it will need to be changed.  I'm not clear right now what the consensus is on supporting nested schema.




[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4460#issuecomment-73927823
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27294/
    Test FAILed.




[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/4460#issuecomment-73599916
  
    @srowen I'm thinking of connecting `FeatureAttributes` and SQL's `Metadata` through factory methods but not via constructors.
    
    ~~~
    val attributes = FeatureAttributes.fromMetadata(metadata)
    val metadata = attributes.toMetadata
    ~~~
    
    And we can build `FeatureAttributes` using a builder. This would be easier for us to write unit tests and manipulate feature attributes.
    
    By `feature dimension`, I mean the vector size for a vector-typed feature column.
    
    I see that you made feature names fundamental. It might be hard to provide every feature a name, especially after a series of feature transformations. Essentially, we want to maintain a bi-map between feature indices and names, while some names might be missing. It would be better if we can use the feature index as the feature identifier.
    
    Given a `FeatureAttributes` instance, the following methods seem to be necessary:
    
    1. `size(): Int`: 1 for a scalar column, and vectorSize for a vector column
    2. `producer: String`: logs who produced those attributes
    3. `getFeatureAttribute(index: Int): Attribute`: gets a feature's attribute from its index
    4. `getFeatureIndex(name: String): Int`: gets a feature's index from its name
    5. `categoricalFeatureIndices(): Array[Int]`: returns the indices of categorical features
    
    The constructor will look like
    
    ~~~
    class FeatureAttributes(val attributes: Array[Attribute], val producer: String)
    ~~~
    
    And we maintain a map from feature name to feature indices internally. For each `Attribute`, we can put the following fields:
    
    1. `featureType: Int`: continuous, categorical, etc
    2. `name: String`
    
    Then we can have `ContinuousAttribute` and `CategoricalAttribute` as two subclasses of `Attribute` and each implements its own methods and `toMetadata`/`fromMetadata` methods. For a continuous feature, we might want to leave slots for min, max, support, etc, and for a categorical attribute, we want to at least have `categories: Array[String]` and `numCategories`.
    
    This is basically my brain dump ... we definitely need to revise them when we get into the details.
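    Pulling those pieces together, a minimal sketch might look like the following. This is purely illustrative: all names and signatures here are assumptions drawn from the list above, not the eventual Spark API.

```scala
// Illustrative sketch only -- not the final Spark API.
sealed trait Attribute {
  def name: String
}

// Continuous feature, with optional slots for summary statistics.
case class ContinuousAttribute(
    name: String,
    min: Option[Double] = None,
    max: Option[Double] = None) extends Attribute

// Categorical feature, carrying its category labels.
case class CategoricalAttribute(
    name: String,
    categories: Array[String]) extends Attribute {
  def numCategories: Int = categories.length
}

class FeatureAttributes(val attributes: Array[Attribute], val producer: String) {
  // Internal name -> index map; in the real design some names could be missing.
  private val nameToIndex: Map[String, Int] =
    attributes.zipWithIndex.map { case (a, i) => (a.name, i) }.toMap

  def size: Int = attributes.length
  def getFeatureAttribute(index: Int): Attribute = attributes(index)
  def getFeatureIndex(name: String): Int = nameToIndex(name)
  def categoricalFeatureIndices: Array[Int] =
    attributes.zipWithIndex.collect { case (_: CategoricalAttribute, i) => i }
}
```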




[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/4460#issuecomment-74614834
  
    This is perhaps contained in @jkbradley 's question, but how does this work with features that are represented with multiple entries in the feature vector, e.g. when we're doing a one-hot encoding?  With a one-hot encoding, is each category its own feature, or can a feature span multiple indices in the vector?




[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

Posted by srowen <gi...@git.apache.org>.
Github user srowen closed the pull request at:

    https://github.com/apache/spark/pull/4460




[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/4460#issuecomment-74623603
  
    There are two types of `Attribute(s)`: describing a feature group (a vector column) or describing a single feature (a scalar column). For a feature group, the column name becomes the group name and individual features inside this group may have their own names. For example, we have a vector column called `user` and inside this feature group we can have features named `age` and `gender`. When we merge multiple groups into a single feature vector, e.g., in a feature vector assembler, the names are flattened like `user:age` and `user:gender`. This answers @sryza 's question about one-hot-encoding. Assume that the input column is a scalar column called "country" with categories stored in the attribute. Then OneHotEncoder will output a vector column and generate feature attributes with names like `country:US`, `country:CA`, etc.
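    As an illustration of the flattening described above (the helper name here is hypothetical, not part of any Spark API), merging a group's member names could look like:

```scala
// Illustrative only: flatten feature names when merging a feature group
// (vector column) into a single assembled vector.
def flattenNames(groupName: String, featureNames: Seq[String]): Seq[String] =
  featureNames.map(f => s"$groupName:$f")

// A "user" group with "age" and "gender" members:
val userNames = flattenNames("user", Seq("age", "gender"))
// One-hot encoding a categorical "country" column with categories US, CA:
val countryNames = flattenNames("country", Seq("US", "CA"))
```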
    
    +1 on @jkbradley 's suggestion about not calling it `FeatureAttribute`. `Attribute` should be okay to describe a scalar column, but we also need a name to describe a vector column, where `Attributes` may sound a little confusing. I suggest `AttributeGroup`.
    
    We don't need to care about the `FeatureType` in `mllib.tree` in this PR. Once we have this PR merged, we can migrate the decision tree code.




[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/4460#issuecomment-75170753
  
    Call it `AttributeType` maybe?
    
    So an `AttributeGroup` could contain both `Attribute`s and vector-valued columns, which sound like `AttributeGroup`s themselves. That's why it seemed like `AttributeGroup` should be an `Attribute`, or at least share a common superclass? Then I didn't know what to call it and it seemed like overkill. That was the logic behind `AttributeGroup extends Attribute` -- WDYT?
    
    As for hierarchy that's all I can think of. Ordinal extends discrete extends continuous; binary extends, well, discrete and categorical I suppose.
    
    Hm, I'd imagine most categorical features come in as strings. This feels like just the kind of thing a framework can accommodate if it has the type information. I don't think it's more or less complex to say that a string column can be categorical? It would take some work to inject a translation to integers where that's needed but that's great if the framework can do that.
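    The `AttributeGroup extends Attribute` idea above could be sketched roughly as follows; this is a hypothetical shape for the discussion, not proposed code from the PR.

```scala
// Hypothetical sketch: a group that is itself an Attribute, so that
// vector-valued columns can nest inside other groups.
sealed trait Attribute {
  def name: String
}

// Attribute of a scalar column.
case class ScalarAttribute(name: String) extends Attribute

// Attribute of a vector-valued column: a named group of member attributes,
// which may themselves be groups.
case class AttributeGroup(name: String, members: Seq[Attribute]) extends Attribute {
  def size: Int = members.size
}
```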




[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/4460#issuecomment-73927224
  
    @mengxr Great, that helps me. I took another shot at implementing the above ideas.
    
    - Is package `org.apache.spark.ml.attribute` reasonable?
    - `FeatureType` duplicates `org.apache.spark.mllib.tree.configuration.FeatureType`; OK as this will be the 'new' version?
    - `Metadata` operates like a `Map`, and that means keys don't necessarily have a value. Does it need to return `Option[...]` from its methods then?
    - You can see there's some special handling of optional values in the caller when building too. Should the builder accept `Option[...]` too?
    - I'm not sure I implemented the patterns quite in the way you had in mind, like regarding `producer`. Please comment if you prefer a different design




[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4460#issuecomment-73439410
  
      [Test build #27053 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27053/consoleFull) for   PR 4460 at commit [`aa3361b`](https://github.com/apache/spark/commit/aa3361be703ae08dd10525330d6471aedb294f59).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class FeatureAttributes(val metadata: Metadata) `





[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/4460#issuecomment-75106841
  
    > Rename FeatureType? and what's its value for AttributeGroup? GROUP or null?
    
    I wish we could use `type`, but it is already taken by Scala. `DataType` is taken by SQL. So `DatumType` or `MLDataType`? ... I don't really have good suggestions. I'm not sure whether we should make `AttributeGroup` an `Attribute`. What is the benefit of making it an `Attribute`?
    
    > You could imagine a more elaborate hierarchy of types: discrete is a special case of continuous, ordinal is a special case of discrete. It's nice to have that expressiveness; it adds somewhat to the complexity for the caller and the code. Maybe you could argue that the schema should force an interpretation for the algorithm. But I kind of like it. The type objects would have methods like isContinuous, isCategorical. Should I make a fuller hierarchy or stick to adding BINARY?
    
    I think having a full hierarchy is a good idea. Could you list all of the types you want to include? Then we can check the complexity. Btw, I don't know whether we should have ML attributes attached to string columns. It seems to me that a string column should be mapped to an integer column first to become an ML column with attribute. Hopefully that reduces the complexity.
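    One possible shape for the hierarchy being discussed, with the `isContinuous`/`isCategorical` predicates mentioned earlier in the thread; everything here is hypothetical, just to make the complexity concrete.

```scala
// Hypothetical type hierarchy sketch for the discussion above.
sealed trait AttributeType {
  def isContinuous: Boolean = false
  def isCategorical: Boolean = false
}
case object Continuous extends AttributeType {
  override def isContinuous: Boolean = true
}
// Discrete as a special case of continuous.
case object Discrete extends AttributeType {
  override def isContinuous: Boolean = true
}
// Ordinal as a special case of discrete.
case object Ordinal extends AttributeType {
  override def isContinuous: Boolean = true
}
case object Categorical extends AttributeType {
  override def isCategorical: Boolean = true
}
// Binary behaves as both a discrete and a categorical type.
case object Binary extends AttributeType {
  override def isContinuous: Boolean = true
  override def isCategorical: Boolean = true
}
```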




[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/4460#issuecomment-75129379
  
    I'm OK with a type hierarchy as long as it stays simple (and doesn't turn into a type system parallel to the DataFrame system).
    
    To support any type of DataFrame (with Structs and Arrays), we'll need to support nesting for sure.




[GitHub] spark pull request: SPARK-4588 [MLLIB] [WIP] Add API for feature a...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/4460#issuecomment-75458264
  
    > To me, that means that the idea of a string-valued categorical column still has a place in the representation since it exists at some stage of a pipeline. It's just that such a thing would never reach an algorithm as-is. Is that aligned with what you guys think?
    
    I agree: ML types should be applicable to any SQL type, as long as it makes sense.
    
    > As long as you aren't saying "string" is a type mutually exclusive with "categorical" then I think we're saying the same thing.
    
    I think we are saying the same thing.  (There will be some mutually exclusive types, such as Strings not being continuous.)
    
    > Interesting point: are numeric categorical values always assumed to be in {0, 1, 2, ..., n-1}? 
    
    Algorithms which want 0-based indices for categories (for efficient vector indexing) could handle the re-indexing themselves, but it would be nice to encode it in metadata for the benefit of ensemble algorithms (where you would only want to do the re-indexing once).
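    The one-time re-indexing described above could be sketched like this (a hypothetical helper, not Spark code): map arbitrary numeric category values onto {0, 1, ..., n-1} once and keep the mapping, so ensemble members can reuse it instead of re-deriving it.

```scala
// Hypothetical illustration: re-index numeric category values to 0-based
// indices and return the mapping for reuse.
def reindexCategories(values: Seq[Double]): (Seq[Int], Map[Double, Int]) = {
  // Assign indices to the distinct values in sorted order.
  val mapping = values.distinct.sorted.zipWithIndex.toMap
  (values.map(mapping), mapping)
}
```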
    
    > @jkbradley The issue with algorithms handling munging and indexing is the increased complexity. 
    
    @mengxr  It won't really increase complexity of the code much since the same code could be re-used for all algorithms (with a few options for the feature types the algorithm can handle).  The main issue is the API:
    * Do users want to be able to call an algorithm on any dataset they load, without thinking about the feature types?
      * If so, does the implicit featurization belong in the algorithm (which knows what types it can take) or in a featurizer PipelineStage before the algorithm (where the user would have to specify feature types based on the algorithm being used)?
    * Or do we want to force users to examine the features and select types by hand before running an algorithm?
    
    I've argued for algorithms handling featurization before, but I can see reason in forcing users to know what they are doing.  This discussion may not belong in this PR anyways, since this functionality could be added later on.

