Posted to commits@echarts.apache.org by GitBox <gi...@apache.org> on 2021/08/19 20:38:28 UTC

[GitHub] [echarts] 100pah edited a comment on pull request #15355: dataset: fix dataset performance drops significantly on high dimensions data.

100pah edited a comment on pull request #15355:
URL: https://github.com/apache/echarts/pull/15355#issuecomment-902228124


   Some changes have been made in dffeb3a9a.
   
   
   ## Issues that existed before this commit:
   1. If no dimensions are specified on the dataset, series cannot really share one storage instance, because each series creates its own dimension names (like ['x', 'y', 'value'] or ['x', 'value', 'y']), and the storage hash is based on those names, so the hashes cannot match.
   2. Each time `setOption` updates a series (but does not change `dataset`), new data stack dimensions (and their corresponding chunks) keep being added to the shared data storage, so the chunks grow more and more.
   3. When "unused dimension omit" happens, the index into SeriesData['dimensions'] and SeriesData['getDimension'] is not the dimensionIndex that users know. But some places still use dimensionIndex to visit them (especially in visualMap and similar cases where the user can input a dimension index via option).
   4. If the user only specifies type but no name in dimensions, there are bugs when "unused dimension omit" happens, because unused dimensions do not get an auto-generated dimension name from `createDimensions`, so they have no dimension name in storage and cannot be queried by dimension name.
   5. If different series options specify their own `dimensions` but share one `dataset`, the `source` obtained from `sourceManager` is a different `source` instance in each series. Those `source` instances contain different `dimensionDefine`s but reference the same data. A data storage created based on that shared data can be shared between series, but the `dimensionDefine` should not be.
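   To make issue 1 concrete, here is a hypothetical sketch of the hash mismatch (the function name `makeNameBasedHash` and the data are illustrative, not the actual echarts internals):
   ```js
   // Both series read the same raw dataset by column index.
   const rawData = [[12, 33, 5], [55, 99, 7]];

   // But each series auto-generates its own dimension name order ...
   const dimsOfSeriesA = ['x', 'y', 'value'];
   const dimsOfSeriesB = ['x', 'value', 'y'];

   // ... and a name-based storage hash therefore differs, so the two
   // series fail to share one storage instance for identical data.
   function makeNameBasedHash(dimNames) {
       return dimNames.join('$');
   }

   console.log(makeNameBasedHash(dimsOfSeriesA)); // 'x$y$value'
   console.log(makeNameBasedHash(dimsOfSeriesB)); // 'x$value$y'
   console.log(
       makeNameBasedHash(dimsOfSeriesA) === makeNameBasedHash(dimsOfSeriesB)
   ); // false, although rawData is identical
   ```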
   
   ## How this commit resolves those issues:
   1. Do not save the "dimName->dimIndex map" in data storage any more, because:
       1. In fact the data storage does not need this map to read/write data.
       2. dimNames are usually created based on each series' info (like ['x', 'y', 'value'], ['x', 'value', 'y'], ...) if not specified by the user, so they differ between series. But even though those series have different generated dimension names, they can still share one storage, because essentially they visit the same dataset source by dimIndex.
   2. Make `SeriesDimensionDefine` (that is, each item of `SeriesData['dimensionInfos']`) contain a `storageDimensionIndex` to indicate its corresponding data store dimension index, and always use `storageDimensionIndex` rather than dimName to visit data storage. `storageDimensionIndex` is created in `createDimension`.
   3. Create a new structure `SeriesDimensionRequest` for each series. It contains the info generated by `createDimension` (like the dimensionDefineList, whether dimensions are omitted, and the source for this series).
       3.1. `sourceManager` uses `seriesDimensionRequest` to find the shared storage by generating storage dimensions and a hash based on the `dimCount` and `dimensionDefineList` (which are created by `createDimension` and saved in `seriesDimensionRequest`).
       3.2. `dataStack` adds "data stack dimensions" to the `dimensionDefineList` in `seriesDimensionRequest`.
       3.3. `seriesData` uses `seriesDimensionRequest` to init its dimensions, and uses `seriesDimensionRequest` to query a dimName by dimIndex from the source, or a dimIndex by dimName from the source for "omitted dimensions". If different series options specify their own `dimensions` but share one `dataset`, the `source` obtained from `sourceManager` is a different `source` instance in each series. Those `source` instances contain different `dimensionDefine`s but reference the same "raw data". The data storage generated based on the "raw data" can be shared between series, but the `dimensionDefine` should not be shared. `SeriesDimensionRequest` encapsulates this mess and works with each series to query a dimension name or index.
       3.4. `seriesDimensionRequest` avoids creating new data structures as much as possible, and instead references shared data structures (like the `source` instance), so it will not cause memory issues.
   4. Change the previous `storage.appendDimension` to `storage.ensureCalculationDimension` for data stack; that is, if the dimension has already been created, reuse it.
   5. Remove the previous `canUse` method. Whether a storage can be shared by series is entirely determined by the hash, which is generated in two ways:
       5.1. For the source format "arrayRows" (i.e., [[12, 33], [55, 99], ...]), dimension names do not need to be added to the hash, because this kind of data is actually visited by index. If two series have different dimension names (like 'x', 'y') for a single index, they can still share the storage.
       5.2. For the source format "objectRows" (i.e., [{a: 12, b: 33}, {b: 55, a: 99}, ...]), the property names 'a', 'b' will be added to the hash, because this kind of data is actually visited by property name.
       5.3. And, as before, the dimension type, ordinal meta id, source header, and series layout will also be added to the hash.
   6. Make `DataStorage` methods immutable (they return a new storage rather than modifying in place):
       `DataStorage['filterSelf']` -> `DataStorage['filter']`
       `DataStorage['selectRange']`
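
   The hash rules in point 5 can be sketched as follows (the function `makeStorageHash` and its shape are illustrative assumptions, not the actual echarts implementation; the real hash also includes ordinal meta id, source header, and series layout per point 5.3):
   ```js
   function makeStorageHash(sourceFormat, dimensions) {
       const parts = dimensions.map(function (dim) {
           // Dimension type always participates in the hash (point 5.3).
           let part = dim.type || '';
           // For 'objectRows' data like [{a: 12, b: 33}, ...], rows are
           // visited by property name, so the name joins the hash (5.2).
           // For 'arrayRows' data like [[12, 33], ...], rows are visited
           // by index, so per-series names stay out of the hash (5.1).
           if (sourceFormat === 'objectRows') {
               part += ':' + dim.name;
           }
           return part;
       });
       return sourceFormat + '|' + parts.join('$');
   }

   // Two series over the same arrayRows data with different generated
   // dimension names now produce the same hash and can share storage:
   const hashA = makeStorageHash('arrayRows', [
       { name: 'x', type: 'number' },
       { name: 'y', type: 'number' }
   ]);
   const hashB = makeStorageHash('arrayRows', [
       { name: 'value', type: 'number' },
       { name: 'x', type: 'number' }
   ]);
   console.log(hashA === hashB); // true
   ```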
   
   ## PENDING:
   1. We should deprecate `dimName = seriesData.getDimension(dimLoose)` and `series.get(dimName)`, and always use `dimIdx = seriesData.getDimensionIndex(dimLoose)` and `dataStorage.get(dimIdx)` instead. For example:
   ```js
   // Previously
   const val = data.get(data.getDimension(dim), dataIdx);
   // Now
   const val = data.getStorage().get(data.getDimensionIndex(dim), dataIdx);
   ```
   `seriesData.getDimension(dimLoose)` has a feature that converts a dimIdx to a dimName, which is not essentially necessary (because the dimIdx can be used to visit data directly), but this feature requires a "dimIdx->dimName map" in `SeriesData`. (Why? Because when some dimensions are omitted, we cannot use a dimIdx on `SeriesData['dimensions']` directly.)
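
   A hypothetical sketch of why omitted dimensions break direct dimIdx lookup (the data shapes are illustrative):
   ```js
   // Suppose the user declared several dimensions, but only the dims at
   // user-facing indices 0 and 2 are used by this series, so the array
   // in SeriesData['dimensions'] keeps only:
   const usedDimensions = ['x', 'value'];

   // The user-facing dimensionIndex 2 no longer matches the array index:
   console.log(usedDimensions[2]); // undefined — cannot index directly

   // So converting a dimIdx to a dimName needs an explicit map:
   const dimIdxToName = { 0: 'x', 2: 'value' };
   console.log(dimIdxToName[2]); // 'value'
   ```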
   
   2. Radar has a bug when using `series.encode`. This commit does not fix that bug; it is kept as-is.
   
   ## Test case
   <test/dataset-case.html>
   <test/dataset-performance.html>
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@echarts.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


