Posted to issues@beam.apache.org by "Brachi Packter (JIRA)" <ji...@apache.org> on 2019/04/04 12:40:00 UTC

[jira] [Commented] (BEAM-2728) Extension for sketch-based statistics

    [ https://issues.apache.org/jira/browse/BEAM-2728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16809803#comment-16809803 ] 

Brachi Packter commented on BEAM-2728:
--------------------------------------

I want to save the sketch itself to BigQuery, so that I can merge sketches there later with the HLL functions [https://cloud.google.com/bigquery/docs/reference/standard-sql/hll_functions]

I am using the sketching extension [https://github.com/apache/beam/tree/master/sdks/java/extensions/sketching]

and this is my code:
{code:java}
// Build one HyperLogLogPlus sketch per key.
.apply("hll-count", Combine.perKey(ApproximateDistinct.ApproximateDistinctFn.create(StringUtf8Coder.of())))
// Convert each (key, sketch) pair into a BigQuery TableRow.
.apply("to-table-row", ParDo.of(new DoFn<ValueInSingleWindow<KV<GroupByData, HyperLogLogPlus>>, TableRow>() {
   @ProcessElement
   public void processElement(ProcessContext processContext) throws IOException {
     ValueInSingleWindow<KV<GroupByData, HyperLogLogPlus>> windowed = processContext.element();
     KV<GroupByData, HyperLogLogPlus> keyData = windowed.getValue();
     GroupByData key = keyData.getKey();
     HyperLogLogPlus hllSketch = keyData.getValue();
     TableRow tableRow = new TableRow();
     tableRow.set("country_code", key.countryCode);
     tableRow.set("event", key.event);
     tableRow.set("profile", key.profile);
     // How can I get the HLL in a form that BigQuery can merge?
     // getBytes() declares IOException, hence the throws clause above.
     tableRow.set("hll", hllSketch.getBytes());
     processContext.output(tableRow);
   }
 }))
{code}
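For illustration only, here is a minimal plain-Java sketch of the last step, writing serialized sketch bytes into a row. It rests on several assumptions that are not confirmed in this thread: that the "hll" column is of type BYTES, that a JSON-based row expects such a value as a Base64 string, and that a plain Map can stand in for TableRow. The class and method names (SketchRowExample, encodeSketch, toRow) are hypothetical. Whether BigQuery's HLL functions can read the byte format produced by HyperLogLogPlus is a separate question that this example does not settle.

```java
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of Beam or the sketching extension. */
final class SketchRowExample {

    // Encodes serialized sketch bytes as a Base64 string
    // (assumption: this is the form a BYTES column expects in a JSON row).
    static String encodeSketch(byte[] sketchBytes) {
        return Base64.getEncoder().encodeToString(sketchBytes);
    }

    // Builds a row using a plain Map as a stand-in for TableRow.
    static Map<String, Object> toRow(String countryCode, String event,
                                     String profile, byte[] sketchBytes) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("country_code", countryCode);
        row.put("event", event);
        row.put("profile", profile);
        row.put("hll", encodeSketch(sketchBytes));
        return row;
    }

    public static void main(String[] args) {
        // Placeholder bytes; in the pipeline these would come from hllSketch.getBytes().
        byte[] placeholder = {1, 2, 3};
        System.out.println(toRow("US", "click", "p1", placeholder));
    }
}
```

In the DoFn above, the equivalent change would be to pass the Base64 string to tableRow.set("hll", ...) instead of the raw byte array.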

> Extension for sketch-based statistics
> -------------------------------------
>
>                 Key: BEAM-2728
>                 URL: https://issues.apache.org/jira/browse/BEAM-2728
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-sketching
>            Reporter: Arnaud Fournier
>            Assignee: Arnaud Fournier
>            Priority: Minor
>          Time Spent: 12h 40m
>  Remaining Estimate: 0h
>
> Goal: Provide an extension library to compute approximate statistics on streams.
> Interest: Probabilistic data structures build an approximation (sketch) of the current state of a stream without storing every element. Each observation is processed quickly to update the summary, from which useful statistical insights can be derived.
> Implementation: https://github.com/ArnaudFnr/beam/tree/sketching/sdks/java/extensions/sketching
> More info: https://docs.google.com/document/d/1Xy6g5RPBYX_HadpIr_2WrUeusiwL0Jo2ACI5PEOP1kc/edit
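
The idea in the issue description, summarizing a stream without storing every element, can be illustrated with a toy HyperLogLog-style sketch in plain Java. This is a simplified teaching example, not the HyperLogLogPlus implementation used by the extension: the class name ToySketch, the register count and the hash function are all assumptions chosen for brevity. It shows why keeping the sketch rather than the final count is useful: two sketches merge by taking the register-wise maximum, and the result is the same as a sketch built over the combined input.

```java
import java.nio.charset.StandardCharsets;

/** Toy HyperLogLog-style sketch for illustration only. */
final class ToySketch {
    private static final int P = 10;       // number of index bits
    private static final int M = 1 << P;   // 1024 registers
    final byte[] registers = new byte[M];

    // 64-bit FNV-1a hash followed by a mixing step to spread the bits.
    private static long hash(String s) {
        long h = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        h ^= h >>> 30; h *= 0xbf58476d1ce4e5b9L;
        h ^= h >>> 27; h *= 0x94d049bb133111ebL;
        h ^= h >>> 31;
        return h;
    }

    // Each element only updates one small register; the element itself is not stored.
    void add(String value) {
        long h = hash(value);
        int idx = (int) (h >>> (64 - P));   // top P bits select a register
        long rest = h << P;                 // remaining bits, left-aligned
        int rank = Math.min(Long.numberOfLeadingZeros(rest), 64 - P) + 1;
        if (rank > registers[idx]) {
            registers[idx] = (byte) rank;
        }
    }

    // Merging keeps the larger value of each pair of registers.
    ToySketch merge(ToySketch other) {
        ToySketch out = new ToySketch();
        for (int i = 0; i < M; i++) {
            out.registers[i] = (byte) Math.max(registers[i], other.registers[i]);
        }
        return out;
    }

    // Standard HyperLogLog estimate with the small-range (linear counting) correction.
    double estimate() {
        double sum = 0;
        int zeros = 0;
        for (byte r : registers) {
            sum += Math.pow(2, -r);
            if (r == 0) zeros++;
        }
        double alpha = 0.7213 / (1 + 1.079 / M);
        double raw = alpha * M * M / sum;
        if (raw <= 2.5 * M && zeros > 0) {
            return M * Math.log((double) M / zeros);
        }
        return raw;
    }

    public static void main(String[] args) {
        ToySketch a = new ToySketch();
        ToySketch b = new ToySketch();
        for (int i = 0; i < 600; i++) a.add("user-" + i);
        for (int i = 400; i < 1000; i++) b.add("user-" + i);
        // 1000 distinct users overall; the printed value is an approximation.
        System.out.println("merged estimate: " + a.merge(b).estimate());
    }
}
```

The memory used is fixed by the register count regardless of how many elements are added, which is the trade-off the description refers to: bounded state in exchange for an approximate answer.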


