Posted to commits@datasketches.apache.org by GitBox <gi...@apache.org> on 2020/01/12 17:04:51 UTC

[GitHub] [incubator-datasketches-java] priyamgupta01 commented on issue #288: getPMF of UpdateDoublesSketch is giving different results for same data

URL: https://github.com/apache/incubator-datasketches-java/issues/288#issuecomment-573435627
 
 
   Hi Rhodes/Alexander
   
   Thanks for your findings and suggestions.
   You may have had trouble compiling the code because I shared only the
   relevant parts, taken from different modules.
   
   Here is a brief explanation of why I am using both a tuple sketch and a
   quantiles sketch.
   I have a time-series list of users in different windows. I need to find the
   distribution of users who appeared in multiple windows.
   I assigned the value 1 to each user in a tuple sketch and then merged the
   sketches to find how many windows each user appeared in. The merged sketch
   is then passed to a quantiles sketch to get the distribution.
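   
   For comparison with the sketch output, an exact baseline on a small data
   set can show what the true distribution looks like. The following is a
   minimal sketch in plain Java that does not use DataSketches at all; the
   class name, method names, and sample user IDs are hypothetical, and it
   assumes each window is available as a list of user IDs.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper: exact (non-sketch) baseline for small inputs.
class WindowCountBaseline {

    /** Returns, for each user ID, the number of windows in which that user appears. */
    static Map<String, Integer> windowsPerUser(List<List<String>> windows) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> window : windows) {
            // Deduplicate within a window so a user counts at most once per window.
            Set<String> distinct = new HashSet<>(window);
            for (String user : distinct) {
                counts.merge(user, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns a sorted map from "number of windows" to "number of users with that count". */
    static Map<Integer, Long> distribution(Map<String, Integer> windowsPerUser) {
        Map<Integer, Long> dist = new TreeMap<>();
        for (int n : windowsPerUser.values()) {
            dist.merge(n, 1L, Long::sum);
        }
        return dist;
    }

    public static void main(String[] args) {
        // Made-up sample data: three windows of user IDs.
        List<List<String>> windows = List.of(
            List.of("u1", "u2", "u3"),
            List.of("u1", "u2", "u2"),
            List.of("u1", "u4"));
        // Prints {1=2, 2=1, 3=1}: two users in one window, one in two, one in three.
        System.out.println(distribution(windowsPerUser(windows)));
    }
}
```

   An exact count like this needs memory proportional to the number of
   distinct users, so it is only practical as a reference for checking how
   far the sketch-based estimate is from the truth.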
   
   Your suggestion in point 4 helped me solve the problem. I increased
   the value of k to 4096.
   
   Here is a brief description of the issue I faced:
   
   When I ran the code multiple times on the same data set, I observed that
   the distribution changed every time. For example:
   1st run
   user count appearing in 3 windows=3446
   user count appearing in 4 windows=1000
   
   2nd run
   user count appearing in 3 windows=3646
   user count appearing in 4 windows=800
   
   
   The total count was the same, but the distribution was rearranged between
   runs.
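   
   The behaviour described above (same total, shifting buckets) is typical of
   any estimate built from a random sample. The following is a minimal
   illustration using plain uniform reservoir sampling as a stand-in; it is
   not the algorithm used by the DataSketches quantiles sketch, and the class
   name, method names, and population data are made up for the example.

```java
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

// Hypothetical demo: run-to-run variation of a histogram estimated from a random sample.
class SamplingVariationDemo {

    /** Uniform random sample of up to k values (reservoir sampling, Algorithm R). */
    static int[] reservoirSample(int[] values, int k, Random rnd) {
        int size = Math.min(k, values.length);
        int[] reservoir = new int[size];
        for (int i = 0; i < values.length; i++) {
            if (i < size) {
                reservoir[i] = values[i];          // fill the reservoir first
            } else {
                int j = rnd.nextInt(i + 1);        // uniform in [0, i]
                if (j < size) {
                    reservoir[j] = values[i];      // replace a random slot
                }
            }
        }
        return reservoir;
    }

    /** Scales the sample histogram up to a population of n items. */
    static Map<Integer, Double> estimateHistogram(int[] sample, int n) {
        Map<Integer, Double> hist = new TreeMap<>();
        double weight = (double) n / sample.length;
        for (int v : sample) {
            hist.merge(v, weight, Double::sum);
        }
        return hist;
    }

    public static void main(String[] args) {
        // Made-up population: 10,000 users, each appearing in 1 to 4 windows.
        int n = 10_000;
        int[] population = new int[n];
        for (int i = 0; i < n; i++) {
            population[i] = 1 + (i % 4);
        }
        // Compare a small and a large sample size under two different seeds.
        for (int k : new int[] {128, 4096}) {
            for (long seed = 1; seed <= 2; seed++) {
                int[] sample = reservoirSample(population, k, new Random(seed));
                System.out.println("k=" + k + " seed=" + seed
                    + " -> " + estimateHistogram(sample, n));
            }
        }
    }
}
```

   In this stand-in, every estimated histogram sums to the population size,
   but the per-bucket values depend on which items happened to be sampled. A
   larger sample size generally narrows that spread, which is consistent with
   the improvement seen after raising k.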
   
   
   
   
   
   Thanks,
   Priyam
   
   On Thu, Jan 9, 2020, 7:10 AM Lee Rhodes <no...@github.com> wrote:
   
   > Priyam,
   >
   > There are a number of problems with the code you provided:
   > 1. First, it is filled with errors. It took a while for me to figure
   > out what you *might* be trying to do.
   > 2. As Alex points out, you leave out so much information that we are left
   > guessing what you are up to.
   > 3. Why are you using a tuple sketch at all? If you have a stream of
   > double values that you want to understand the distribution of, why not just
   > send them directly to one of the quantiles sketches?
   > 4. Your code is effectively doing double sampling of your data, first by
   > the tuple sketch (which by default keeps 4096 samples), and then sampling
   > that with the quantiles sketch (which by default keeps only 128 values).
   > This will make the error bounds on the quantiles sketch meaningless, and
   > the actual error will be much worse. You should at least increase the K
   > value of the quantiles sketch to be much larger, preferably at least as
   > large as the configured size of the tuple sketch.
   > 5. The ArrayOfDoublesSketch is an aggregating sketch. This means that if
   > there are any duplicate keys in your stream, the value retained in the
   > sketch will be the sum. Is this what you want? Only if you want to obtain
   > the "distribution of sums" will this make any sense.
   >
   > If you can be much more clear about what you are trying to do and the
   > nature of your input stream, we could be more helpful.
   >
   > Cheers,
   >
   > Lee.
   >
   >
   >
   > On Wed, Jan 8, 2020 at 10:20 AM Alexander Saydakov <notifications@github.com> wrote:
   >
   > > You are using approximate algorithms, so the results can be different
   > > every time. The question is how different are they? What accuracy do you
   > > expect? To answer this question you need to be more specific. What do you
   > > mean by "oscillating between 2-3 values"? What is the true distribution of
   > > your input data? What approximation are you getting? Why do you think it is
   > > too far off?
   > >
   
