Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2019/07/05 18:35:35 UTC

[GitHub] [incubator-mxnet] yifeim commented on issue #15428: Dataloader does not support sparse data

URL: https://github.com/apache/incubator-mxnet/issues/15428#issuecomment-508835732
 
 
   The vanilla sparse format does not carry enough information for some applications, e.g., recommendation. Many extensions exist to cover group-wise ranking losses, field identifiers, and extra delimiters such as pipes. Here are some examples:
   
   1. Group-wise ranking loss
   
   vw allows auxiliary labels and [shared information among groups](https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Contextual-Bandit-algorithms)
   ```
   shared | s_1 s_2
   0:1.0:0.5 | a:1 b:1 c:1
   | a:0.5 b:2 c:1
   ```
   
   xgboost allows a [`.group` file](https://xgboost.readthedocs.io/en/latest/tutorials/input_format.html#group-input-format) that lists, one number per line, how many consecutive rows belong to each ranking group
   ```
   2
   3
   ```
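
As a rough illustration of what a loader would have to do with such a file, the sketch below turns the per-group row counts into row ranges. The helper name `group_slices` is hypothetical and is not part of xgboost or MXNet.

```python
from itertools import accumulate


def group_slices(group_sizes):
    """Turn per-group row counts into (start, stop) row ranges.

    Hypothetical helper for illustration only.
    """
    stops = list(accumulate(group_sizes))
    starts = [0] + stops[:-1]
    return list(zip(starts, stops))


# Contents of the two-line .group example: 2 rows, then 3 rows.
sizes = [int(line) for line in "2\n3\n".split()]
slices = group_slices(sizes)
```

Each range can then be used to slice the rows of the accompanying data file into one ranking group.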
   
   2. Multi-field features
   
   libffm uses [multiple columns](https://github.com/ycjuan/libffm/blob/master/README#L116)
   ```
   <label> <field1>:<feature1>:<value1> <field2>:<feature2>:<value2> ...
   ```
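
A minimal sketch of parsing one such line, assuming fields and features are integer indices and values are floats. The function name `parse_libffm_line` and the sample line are made up for illustration.

```python
def parse_libffm_line(line):
    """Parse '<label> <field>:<feature>:<value> ...' into (label, triples).

    Assumes integer field/feature indices; illustrative only.
    """
    tokens = line.split()
    label = float(tokens[0])
    triples = []
    for tok in tokens[1:]:
        field, feature, value = tok.split(":")
        triples.append((int(field), int(feature), float(value)))
    return label, triples


# Made-up sample line with two field:feature:value entries.
label, triples = parse_libffm_line("1 0:3:1.0 1:7:0.5")
```

The extra field index is exactly the information a plain `index:value` sparse reader has no place to store.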
   
   vw uses [multiple pipes](https://github.com/VowpalWabbit/vowpal_wabbit/wiki/input-format)
   ```
   1 1.0 |MetricFeatures:3.28 height:1.5 length:2.0 |Says black with white stripes |OtherFeatures NumberOfLegs:4.0 HasStripes
   1 1.0 zebra|MetricFeatures:3.28 height:1.5 length:2.0 |Says black with white stripes |OtherFeatures NumberOfLegs:4.0 HasStripes
   ```
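
To show how the pipes partition a line, here is a simplified sketch that splits a vw-style line into a header and per-namespace feature dictionaries. It is not the vw parser: it drops the namespace scale (the `3.28` after `MetricFeatures`), treats a missing feature value as 1.0, and does not interpret the label, importance, or tag. The name `parse_vw_line` is hypothetical.

```python
def parse_vw_line(line):
    """Split a vw-style line into (header tokens, {namespace: {feature: value}}).

    Simplified illustration; many vw input-format rules are not handled.
    """
    head, *segments = line.split("|")
    namespaces = {}
    for seg in segments:
        tokens = seg.split()
        # A namespace name must directly follow the pipe; a segment that
        # starts with whitespace (or is empty) is treated as anonymous.
        if seg[:1].isspace() or not tokens:
            name, feats = "", tokens
        else:
            name, feats = tokens[0].split(":")[0], tokens[1:]
        bucket = namespaces.setdefault(name, {})
        for feat in feats:
            key, _, val = feat.partition(":")
            bucket[key] = float(val) if val else 1.0
    return head.split(), namespaces


line = ("1 1.0 |MetricFeatures:3.28 height:1.5 length:2.0 "
        "|Says black with white stripes "
        "|OtherFeatures NumberOfLegs:4.0 HasStripes")
header, namespaces = parse_vw_line(line)
```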
   
   3. Other delimiters in open-source datasets, e.g., the [Criteo counterfactual analysis challenge](https://arxiv.org/abs/1612.00367) format is similar to the vw format but uses spaces as delimiters.
   ```
   example ${exID}: ${hashID} ${wasAdClicked} ${propensity} ${nbSlots} ${nbCandidates} ${displayFeat1}:${v_1} ...
   ${wasProduct1Clicked} exid:${exID} ${productFeat1_1}:${v1_1} ...
   ```
   
   It is rather difficult to enumerate all the cases, so I would recommend allowing more flexibility, e.g., by letting the parser accept a regex that describes the format.
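
One way such a regex-driven parser could look is sketched below. This is only an illustration of the idea and not an existing or proposed MXNet API: the factory `make_line_parser` and the convention of named groups `key` and `value` are assumptions, and label handling is left out.

```python
import re


def make_line_parser(token_pattern):
    """Build a line parser from a regex with named groups 'key' and 'value'.

    Hypothetical interface for illustration only.
    """
    token_re = re.compile(token_pattern)

    def parse(line):
        # Every regex match on the line becomes one sparse feature entry.
        return {m.group("key"): float(m.group("value"))
                for m in token_re.finditer(line)}

    return parse


NUMBER = r"[-+]?\d+(?:\.\d+)?"

# Plain index:value tokens.
parse_pairs = make_line_parser(rf"(?P<key>\d+):(?P<value>{NUMBER})")
# field:feature:value tokens, keyed by "field:feature".
parse_ffm = make_line_parser(rf"(?P<key>\d+:\d+):(?P<value>{NUMBER})")

pairs = parse_pairs("1 3:0.5 7:2")
ffm = parse_ffm("1 0:3:1.0 1:7:0.5")
```

With this shape, supporting a new format means supplying a new pattern rather than a new reader.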
