You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "jbapple (via GitHub)" <gi...@apache.org> on 2023/02/03 15:58:34 UTC

[GitHub] [arrow] jbapple opened a new pull request, #34032: GH-29745: [C++][Python] Approximate count distinct kernel

jbapple opened a new pull request, #34032:
URL: https://github.com/apache/arrow/pull/34032

   This patch provides "hll", an abbreviation for HyperLogLog, which can approximately count the number of distinct elements in a stream (or set of streams) with less memory than exact count distinct.
   
   To do this, we pull in Apache DataSketches - a vendored version, as the most recent released version has some issues that casue our CI to fail.
   
   
   <!--
   Thanks for opening a pull request!
   If this is your first pull request you can find detailed information on how 
   to contribute here:
     * [New Contributor's Guide](https://arrow.apache.org/docs/dev/developers/guide/step_by_step/pr_lifecycle.html#reviews-and-merge-of-the-pull-request)
     * [Contributing Overview](https://arrow.apache.org/docs/dev/developers/overview.html)
   
   
   If this is not a [minor PR](https://github.com/apache/arrow/blob/master/CONTRIBUTING.md#Minor-Fixes). Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose
   
   Opening GitHub issues ahead of time contributes to the [Openness](http://theapacheway.com/open/#:~:text=Openness%20allows%20new%20users%20the,must%20happen%20in%20the%20open.) of the Apache Arrow project.
   
   Then could you also rename the pull request title in the following format?
   
       GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
   
   or
   
       MINOR: [${COMPONENT}] ${SUMMARY}
   
   In the case of PARQUET issues on JIRA the title also supports:
   
       PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
   
   -->
   
   
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   ### What changes are included in this PR?
   
   Changes to the cpp and python directories, as well as a full copy of https://github.com/apache/datasketches-cpp
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   ### Are these changes tested?
   
   Yes
   
   <!--
   We typically require tests for all PRs in order to:
   1. Prevent the code from being accidentally broken by subsequent changes
   2. Serve as another way to document the expected behavior of the code
   
   If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
   -->
   
   ### Are there any user-facing changes?
   
   Yes - the addition of the hll function. This PR contains an update to the docs.
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please uncomment the line below and explain which changes are breaking.
   -->
   <!-- **This PR includes breaking changes to public APIs.** -->
   
   <!--
   Please uncomment the line below (and provide explanation) if the changes fix either (a) a security vulnerability, (b) a bug that caused incorrect or invalid data to be produced, or (c) a bug that causes a crash (even when the API contract is upheld). We use this to highlight fixes to issues that may affect users without their knowledge. For this reason, fixing bugs that cause errors don't count, since those are usually obvious.
   -->
   <!-- **This PR contains a "Critical Fix".** -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #34032: GH-29746: [C++][Python] Approximate count distinct kernel

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on PR #34032:
URL: https://github.com/apache/arrow/pull/34032#issuecomment-1423529306

   Alternatively, maybe it's time to start pioneering the ability to write an extension that adds new compute functions?  E.g. DuckDb is heading in this direction (https://duckdb.org/docs/extensions/overview.html) and it seems like it would be a way to help keep the compute kernels more maintainable (both kernel maintainers not wanting to be affected by engine changes and vice versa).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #34032: GH-29745: [C++][Python] Approximate count distinct kernel

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #34032:
URL: https://github.com/apache/arrow/pull/34032#issuecomment-1416068121

   * Closes: #29745


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] cpcloud commented on pull request #34032: GH-29746: [C++][Python] Approximate count distinct kernel

Posted by "cpcloud (via GitHub)" <gi...@apache.org>.
cpcloud commented on PR #34032:
URL: https://github.com/apache/arrow/pull/34032#issuecomment-1490549221

   @jbapple Can you close this out if there's no sign of it getting merged? It's been open for nearly two months without activity.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jbapple closed pull request #34032: GH-29746: [C++][Python] Approximate count distinct kernel

Posted by "jbapple (via GitHub)" <gi...@apache.org>.
jbapple closed pull request #34032: GH-29746: [C++][Python] Approximate count distinct kernel
URL: https://github.com/apache/arrow/pull/34032


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #34032: GH-29746: [C++][Python] Approximate count distinct kernel

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on PR #34032:
URL: https://github.com/apache/arrow/pull/34032#issuecomment-1423528132

   I agree with @assignUser .  Plus, new repos bring in the question of code size, security, etc.  Probably we will also want a new `ARROW_WITH_DATASKETCHES` cmake option that controls inclusion (and if not included we just don't get those kernels and that's ok). That being said, the datasketches project looks pretty neat.  Do you have much experience with it?  Do you know if there would be other kernels beyond hll that are available?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #34032: GH-29745: [C++][Python] Approximate count distinct kernel

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #34032:
URL: https://github.com/apache/arrow/pull/34032#issuecomment-1416068166

   :warning: GitHub issue #29745 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #34032: GH-29746: [C++][Python] Approximate count distinct kernel

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #34032:
URL: https://github.com/apache/arrow/pull/34032#issuecomment-1416106025

   * Closes: #29746


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] assignUser commented on pull request #34032: GH-29746: [C++][Python] Approximate count distinct kernel

Posted by "assignUser (via GitHub)" <gi...@apache.org>.
assignUser commented on PR #34032:
URL: https://github.com/apache/arrow/pull/34032#issuecomment-1423081463

   Thanks for the PR, sounds useful
   
   . I personally am not a fan of vendoring entire repos if there is no need for it. In this case you seem to know a specific rev that works, so I would prefer to get the repo via FetchContent (as we are bumping cmake to 3.16 for 12.0.0)/EP and use the provided cmake code & config to integrate it into the build process. 
   
   But I am no cpp maintainer so I will defer to other opinions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jbapple commented on pull request #34032: GH-29746: [C++][Python] Approximate count distinct kernel

Posted by "jbapple (via GitHub)" <gi...@apache.org>.
jbapple commented on PR #34032:
URL: https://github.com/apache/arrow/pull/34032#issuecomment-1423535877

   > That being said, the datasketches project looks pretty neat. Do you have much experience with it? Do you know if there would be other kernels beyond hll that are available?
   
   There are other useful kernels, including frequent items / heavy hitters, reservoir sampling, quantiles for non-numeric entities, and exterme-value quantiles (like p99.9).
   
   https://datasketches.apache.org/docs/Architecture/MajorSketchFamilies.html


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org