Posted to issues@flink.apache.org by fhueske <gi...@git.apache.org> on 2014/10/09 17:13:41 UTC

[GitHub] incubator-flink pull request: [FLINK-1151] Adds BaseStatistics for...

GitHub user fhueske opened a pull request:

    https://github.com/apache/incubator-flink/pull/147

    [FLINK-1151] Adds BaseStatistics for CollectionDataSources

    Adds statistics for collection data sources based on collection size and serializer information.
    Adds getMinimumLength() to TypeSerializer to get a lower bound for size estimates.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/fhueske/incubator-flink collectionStats

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-flink/pull/147.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #147
    
----
commit 924b02afd7b7ae519dbadfbff88354689ef728ce
Author: Fabian Hueske <fh...@apache.org>
Date:   2014-10-09T13:04:20Z

    [FLINK-1151] Adds statistics for collection data sources based on collection size and serializer info.
    Adds getMinimumLength() to TypeSerializer to get a lower bound for size estimations.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-flink pull request: [FLINK-1151] Adds BaseStatistics for...

Posted by StephanEwen <gi...@git.apache.org>.
GitHub user StephanEwen commented on the pull request:

    https://github.com/apache/incubator-flink/pull/147#issuecomment-59583563
  
    I think the `getMinimumLength()` method would fit better into the type information. That would be a better separation of roles, because the serializer's task is not to provide statistics for the types. The `getLength()` method exists so that the runtime can handle fixed-length data types more efficiently.
    
    Because the method currently delegates mostly to `getLength()`, it will return `-1` for variable-length data types, which leads to meaningless size estimates.
    
    For better size estimates, we might allow type information to sample a collection of elements to figure out the size.
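
The sampling idea suggested above can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the pull request: `SampledSizeEstimate`, `ElementWriter`, and `estimateTotalBytes` are hypothetical names, and the sketch uses plain `java.io` streams instead of Flink's type information or serializers. It also samples the first few elements rather than a random subset, so the estimate is biased if the collection is ordered by size.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.Collection;
import java.util.Iterator;
import java.util.List;

final class SampledSizeEstimate {

    /** Hypothetical stand-in for a serializer; this is not Flink's TypeSerializer. */
    interface ElementWriter<T> {
        void write(T element, DataOutputStream out) throws IOException;
    }

    /**
     * Serializes up to maxSamples elements (the first ones in iteration order),
     * computes their average serialized size, and scales it to the whole collection.
     */
    static <T> long estimateTotalBytes(Collection<T> data, ElementWriter<T> writer, int maxSamples) {
        if (maxSamples <= 0) {
            throw new IllegalArgumentException("maxSamples must be positive");
        }
        if (data.isEmpty()) {
            return 0L;
        }

        // DataOutputStream counts the bytes written so far; size() returns that count.
        DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream());
        int sampled = 0;
        try {
            Iterator<T> it = data.iterator();
            while (it.hasNext() && sampled < maxSamples) {
                writer.write(it.next(), out);
                sampled++;
            }
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked to keep callers simple.
            throw new UncheckedIOException(e);
        }

        double averageBytes = (double) out.size() / sampled;
        return Math.round(averageBytes * data.size());
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "bb", "ccc", "dddd");
        // writeUTF emits a 2-byte length prefix followed by the encoded characters.
        long estimate = estimateTotalBytes(words, (s, o) -> o.writeUTF(s), 2);
        System.out.println("estimated bytes: " + estimate);
    }
}
```

With a sample of two elements, the example averages 3 and 4 bytes and scales that to four elements, giving an estimate of 14 bytes against an actual serialized size of 18 bytes. The gap shows how strongly a small, non-random sample can skew the result.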



[GitHub] incubator-flink pull request: [FLINK-1151] Adds BaseStatistics for...

Posted by fhueske <gi...@git.apache.org>.
GitHub user fhueske commented on the pull request:

    https://github.com/apache/incubator-flink/pull/147#issuecomment-59699366
  
    I also thought about adding `getMinimumLength()` to the type info but decided on the serializers because they define how much data is actually written out (e.g., length info for strings or size info for arrays). On the other hand, these few bytes are probably negligible compared to the actual size of variable-length data types.
    
    In fact, `getMinimumLength()` does not delegate to `getLength()` for variable-length data types (unless there is a bug). Forwarding `-1` wouldn't make any sense.
    
    You're right, for CollectionDataSource, sampling some elements would give better estimates, but I thought `getMinimumLength()` could also be used for size estimation during optimization.
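
The lower-bound approach discussed in this thread might look roughly like the sketch below. It is an illustration only: `SizedSerializer`, `FixedLength`, `LengthPrefixed`, and `minimumTotalBytes` are hypothetical names rather than Flink classes, and the one-byte prefix is an assumed value, not one taken from the Flink code. The point it shows is that a variable-length type reports its unavoidable overhead as the minimum instead of forwarding `-1`.

```java
final class MinimumLengthSketch {

    /** Hypothetical stand-in for the two serializer methods discussed in the thread. */
    interface SizedSerializer {
        /** Exact serialized length, or -1 if the type has variable length. */
        int getLength();

        /** Lower bound on the serialized length; never negative. */
        int getMinimumLength();
    }

    /** Fixed-length type: the minimum is simply the exact length. */
    static final class FixedLength implements SizedSerializer {
        private final int length;

        FixedLength(int length) {
            this.length = length;
        }

        @Override
        public int getLength() {
            return length;
        }

        @Override
        public int getMinimumLength() {
            return length;
        }
    }

    /** Variable-length type: only the length prefix is guaranteed to be written. */
    static final class LengthPrefixed implements SizedSerializer {
        private final int prefixBytes; // assumed prefix size, for illustration only

        LengthPrefixed(int prefixBytes) {
            this.prefixBytes = prefixBytes;
        }

        @Override
        public int getLength() {
            return -1;
        }

        @Override
        public int getMinimumLength() {
            return prefixBytes;
        }
    }

    /** Lower bound for the total size of a collection with the given element count. */
    static long minimumTotalBytes(long numElements, SizedSerializer serializer) {
        return numElements * serializer.getMinimumLength();
    }

    public static void main(String[] args) {
        System.out.println(minimumTotalBytes(1000, new FixedLength(8)));
        System.out.println(minimumTotalBytes(1000, new LengthPrefixed(1)));
    }
}
```

For fixed-length types the bound is exact. For variable-length types it is valid but loose, which is why sampling tends to give the better estimate for collection sources.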

