
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/10/24 11:14:41 UTC

[GitHub] [spark] zhengruifeng commented on issue #25983: [SPARK-29327][MLLIB]Support specifying features via multiple columns

zhengruifeng commented on issue #25983: [SPARK-29327][MLLIB]Support specifying features via multiple columns
URL: https://github.com/apache/spark/pull/25983#issuecomment-545870132
 
 
   > VectorAssembler has to make a pass over the data and merge multiple columns.
   `VectorAssembler` only triggers a `first()` job to get the sizes of the input vectors.
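
To make that point concrete, here is a toy model in plain Python. It is not Spark's implementation, and the helper names `vector_sizes_from_first_row` and `assemble` are hypothetical. It assumes the vector sizes can be read from a single row, and that the merge itself happens lazily, row by row, inside whatever job later consumes the output:

```python
from typing import Dict, Iterable, Iterator, List, Sequence, Union

Value = Union[float, Sequence[float]]
Row = Dict[str, Value]


def vector_sizes_from_first_row(first_row: Row, input_cols: List[str]) -> Dict[str, int]:
    """Read the width each input column contributes from one row only.

    This stands in for the single `first()` job: no full pass over the data.
    """
    sizes = {}
    for col in input_cols:
        value = first_row[col]
        # Scalars contribute one slot; vectors contribute their length.
        sizes[col] = len(value) if isinstance(value, (list, tuple)) else 1
    return sizes


def assemble(rows: Iterable[Row], input_cols: List[str]) -> Iterator[List[float]]:
    """Lazily yield one flat feature list per row (hypothetical helper).

    Nothing is computed until the caller iterates, so the merge rides along
    with the downstream consumer instead of needing a pass of its own.
    """
    for row in rows:
        features: List[float] = []
        for col in input_cols:
            value = row[col]
            if isinstance(value, (list, tuple)):
                features.extend(float(v) for v in value)
            else:
                features.append(float(value))
        yield features


rows = [
    {"age": 30, "vec": [1.0, 2.0]},
    {"age": 40, "vec": [3.0, 4.0]},
]
sizes = vector_sizes_from_first_row(rows[0], ["age", "vec"])
assembled = list(assemble(rows, ["age", "vec"]))
```

In this sketch only `rows[0]` is inspected eagerly. The remaining rows are touched only when `assemble` is iterated.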
   
   > Many ML algorithms prefer columnar data and this allows the algorithm to determine what it wants to do with the columns.
   Do you mean the column-based parallelism used in distributed tree building? That functionality is not exposed to end users; all you need to do is set params like `(..., updater=distcol)`.
   If some algorithm would benefit from column-based parallelism, I think it is better to split the features internally. No algorithm in MLlib is currently designed to fit/transform on column-based datasets, so I would prefer not to add this feature.
   
   > It is being used with XGBoost.
   I cannot find any related docs in [XGBoost Parameters](https://xgboost.readthedocs.io/en/latest/parameter.html#xgboost-parameters). Could you please provide a link for this?
