Posted to users@datasketches.apache.org by Karl Matthias <ka...@community.com> on 2021/08/25 18:14:29 UTC

Theta Serialize/Deserialize and then update?

Hey folks,

I am working with both the Java library and the C++ library and the Theta
sketch.

What I would like to do is update a sketch, save it somewhere (e.g. on
disk), then reload it later and possibly update it again. When an
UpdateSketch is serialized and loaded back, the result is a CompactSketch,
which is read-only and doesn't support updates.

From looking at the Java code it seems like it would be possible to create
an UpdateSketch from the contents of a CompactSketch but there doesn't
appear to be an existing method that does this. Am I missing something that
already does this? Or is it not possible?

Many thanks
Karl

Re: [E] Theta Serialize/Deserialize and then update?

Posted by Karl Matthias <ka...@community.com>.

Hi Lee,

Thanks very much for this. I had missed that the Union supported updates. I
had thought I needed to get the result from it first, but that also returns
a CompactSketch, which your reasoning explains well. Really appreciate both
of you guys helping me out.

Cheers,
Karl

On Fri, Aug 27, 2021 at 1:04 AM leerho <le...@gmail.com> wrote:

> Hi Karl,
>   I just want to explain the reasons you cannot create an UpdateSketch
> directly from a CompactSketch:
>
> The CompactSketch is by definition immutable and has the smallest
> footprint and simplest structure.  It is produced as the result of all of
> the set operations because the set operations enable "merging" of sketches
> with different values of "K".  Thus the CompactSketch has no concept of
> "K".  It is just a list of hashes and a value of Theta. You can perform all
> the operations with a CompactSketch that you can with an UpdateSketch,
> except updating it with more input data.  Merging CompactSketches is faster
> than merging UpdateSketches because of the simpler structure, and, if you
> specify "ordered" (the default) when retrieving your CompactSketch, merging
> becomes extremely fast.
>
> Note that the theta Union provides toByteArray(), union(Memory), and
> update(raw datums) operations. So you can always use the Union operator
> instead of the UpdateSketch for all updating and merging operations.  If
> you need to serialize (e.g., for transport or storage), you can:
>
>    - byteArray = union.toByteArray()
>    - <transport>
>    - mem = Memory.wrap(byteArray)
>    - union2 = //create new Union with SetOperationBuilder...
>    - union2.union(mem)
>    - //now you can continue to update(datums) with the union2, and/or
>    perform more union operations.
>
> Lee.

Re: [E] Theta Serialize/Deserialize and then update?

Posted by leerho <le...@gmail.com>.

Hi Karl,
  I just want to explain the reasons you cannot create an UpdateSketch
directly from a CompactSketch:

The CompactSketch is by definition immutable and has the smallest footprint
and simplest structure.  It is produced as the result of all of the set
operations because the set operations enable "merging" of sketches with
different values of "K".  Thus the CompactSketch has no concept of "K".  It
is just a list of hashes and a value of Theta. You can perform all the
operations with a CompactSketch that you can with an UpdateSketch, except
updating it with more input data.  Merging CompactSketches is faster than
merging UpdateSketches because of the simpler structure, and, if you
specify "ordered" (the default) when retrieving your CompactSketch, merging
becomes extremely fast.

Note that the theta Union provides toByteArray(), union(Memory), and
update(raw datums) operations. So you can always use the Union operator
instead of the UpdateSketch for all updating and merging operations.  If
you need to serialize (e.g., for transport or storage), you can:

   - byteArray = union.toByteArray()
   - <transport>
   - mem = Memory.wrap(byteArray)
   - union2 = //create new Union with SetOperationBuilder...
   - union2.union(mem)
   - //now you can continue to update(datums) with the union2, and/or
   perform more union operations.

Lee.
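
The round trip in the list above can be mimicked with a small toy model in
plain Java. This is only a sketch of the idea and not the DataSketches API:
the class ToyThetaUnion, its methods, and the byte layout are invented for
illustration, and the hash is a generic 64-bit mixer rather than the one the
library uses. It shows a union-like object that accepts raw updates, writes
a compact image (theta plus hashes, with no K in it), and a second object
with its own K that absorbs the image and keeps updating.

```java
import java.nio.ByteBuffer;
import java.util.TreeSet;

/** Toy theta union. Hypothetical names; NOT the DataSketches API. */
final class ToyThetaUnion {
    private final int k;      // nominal entries of THIS object only
    private long theta = -1L; // unsigned threshold; all ones means "keep everything"
    private final TreeSet<Long> hashes = new TreeSet<Long>(Long::compareUnsigned);

    ToyThetaUnion(int k) { this.k = k; }

    /** Bijective 64-bit mixer, so distinct inputs give distinct hashes. */
    static long hash(long x) {
        x ^= x >>> 30; x *= 0xBF58476D1CE4E5B9L;
        x ^= x >>> 27; x *= 0x94D049BB133111EBL;
        return x ^ (x >>> 31);
    }

    /** Feed one raw item, as one would with an updatable sketch. */
    void update(long item) { offer(hash(item)); }

    private void offer(long h) {
        if (Long.compareUnsigned(h, theta) >= 0) return;   // at or above theta: ignore
        hashes.add(h);
        if (hashes.size() > k) theta = hashes.pollLast();  // evict largest, lower theta
    }

    /** Compact image: theta, count, hashes. Note that k is not written. */
    byte[] toByteArray() {
        ByteBuffer buf = ByteBuffer.allocate(8 + 4 + 8 * hashes.size());
        buf.putLong(theta).putInt(hashes.size());
        for (long h : hashes) buf.putLong(h);
        return buf.array();
    }

    /** Merge a serialized image into this object, whatever k produced it. */
    void union(byte[] image) {
        ByteBuffer buf = ByteBuffer.wrap(image);
        long otherTheta = buf.getLong();
        int n = buf.getInt();
        if (Long.compareUnsigned(otherTheta, theta) < 0) {
            theta = otherTheta;
            hashes.tailSet(theta, true).clear();           // drop hashes >= new theta
        }
        for (int i = 0; i < n; i++) offer(buf.getLong());
    }

    int retained() { return hashes.size(); }

    /** Retained count divided by theta expressed as a fraction of the hash space. */
    double estimate() {
        double fraction = (theta >>> 11) * 0x1.0p-53;
        return hashes.size() / fraction;
    }

    public static void main(String[] args) {
        ToyThetaUnion first = new ToyThetaUnion(4096);
        for (long i = 0; i < 1000; i++) first.update(i);
        byte[] stored = first.toByteArray();                // stands in for <transport>

        ToyThetaUnion second = new ToyThetaUnion(4096);     // a new object with its own k
        second.union(stored);                               // restore the stored state
        for (long i = 750; i < 1250; i++) second.update(i); // keep updating afterwards
        System.out.println(second.retained() + " retained, estimate " + second.estimate());
    }
}
```

In this run the 1250 distinct items stay below k = 4096, so the toy never
lowers theta and retains every hash. With a smaller k it would switch to
estimating, which is where the theta fraction in estimate() matters.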

On Thu, Aug 26, 2021 at 10:39 AM Karl Matthias <ka...@community.com> wrote:

> Thanks for that. I figured out how to manage it in the Java lib. You need
> to use a WritableMemory to wrap the byte array and then explicitly
> instantiate an UpdateSketch with the WritableMemory. This is now working
> and I'm doing some prototyping. Ideally I could use this from the C++
> library as well, but I will work with the Java lib for now while
> investigating.
>
> I will spend some time seeing if I can simplify a series model to do what
> I want.

Re: [E] Theta Serialize/Deserialize and then update?

Posted by Karl Matthias <ka...@community.com>.

Thanks for that. I figured out how to manage it in the Java lib. You need
to use a WritableMemory to wrap the byte array and then explicitly
instantiate an UpdateSketch with the WritableMemory. This is now working
and I'm doing some prototyping. Ideally I could use this from the C++
library as well, but I will work with the Java lib for now while
investigating.

I will spend some time seeing if I can simplify a series model to do what I
want.

On Thu, Aug 26, 2021 at 12:07 AM Alexander Saydakov <
saydakov@verizonmedia.com> wrote:

> I believe that Java code still has the functionality to serialize and
> deserialize updatable Theta sketches. You point to a "wrap" operation,
> which is one of two ways to deserialize: heapify (instantiate an object on
> the heap from a given chunk of bytes, which involves copying data) and wrap
> (operate directly on a given chunk of bytes, often off-heap).
>
> Perhaps you could explain your use case a little more? What would the life
> cycle of your sketches be? When would you serialize them? When deserialize?
> How many do you anticipate keeping overall? How many would you like to
> update? What is the reason for serializing? And so on.

Re: [E] Theta Serialize/Deserialize and then update?

Posted by Alexander Saydakov <sa...@verizonmedia.com>.

I believe that Java code still has the functionality to serialize and
deserialize updatable Theta sketches. You point to a "wrap" operation,
which is one of two ways to deserialize: heapify (instantiate an object on
the heap from a given chunk of bytes, which involves copying data) and wrap
(operate directly on a given chunk of bytes, often off-heap).

Perhaps you could explain your use case a little more? What would the life
cycle of your sketches be? When would you serialize them? When deserialize?
How many do you anticipate keeping overall? How many would you like to
update? What is the reason for serializing? And so on.
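
The difference between the two can be illustrated with java.nio.ByteBuffer.
The following is a toy under stated assumptions, not the library's Memory or
sketch classes: ToyBufferSketch and its layout (a 4-byte count followed by
8-byte values) are hypothetical. Here heapify copies the bytes into a new
heap buffer, while wrap operates on the caller's array, so updates write
through to it.

```java
import java.nio.ByteBuffer;

/** Toy "sketch": a count followed by up to CAPACITY longs. Hypothetical, not the library. */
final class ToyBufferSketch {
    static final int CAPACITY = 8;
    static final int BYTES = 4 + 8 * CAPACITY;

    private final ByteBuffer buf;

    private ToyBufferSketch(ByteBuffer buf) { this.buf = buf; }

    /** Start empty on a fresh heap buffer. */
    static ToyBufferSketch create() {
        return new ToyBufferSketch(ByteBuffer.allocate(BYTES));
    }

    /** heapify: copy the image; later updates never touch the caller's bytes. */
    static ToyBufferSketch heapify(byte[] image) {
        return new ToyBufferSketch(ByteBuffer.wrap(image.clone()));
    }

    /** wrap: operate directly on the caller's bytes; updates write through. */
    static ToyBufferSketch wrap(byte[] image) {
        return new ToyBufferSketch(ByteBuffer.wrap(image));
    }

    void update(long value) {
        int n = count();
        if (n == CAPACITY) throw new IllegalStateException("toy sketch is full");
        buf.putLong(4 + 8 * n, value); // absolute write into slot n
        buf.putInt(0, n + 1);          // bump the stored count
    }

    int count() { return buf.getInt(0); }

    /** Serialize by copying out the backing bytes. */
    byte[] toByteArray() { return buf.array().clone(); }

    public static void main(String[] args) {
        ToyBufferSketch original = create();
        original.update(10);
        original.update(20);
        byte[] stored = original.toByteArray();

        ToyBufferSketch copy = heapify(stored);
        copy.update(30);   // changes only the copy
        System.out.println("count in stored bytes after heapify+update: "
                + ByteBuffer.wrap(stored).getInt(0));

        ToyBufferSketch view = wrap(stored);
        view.update(40);   // changes the stored bytes in place
        System.out.println("count in stored bytes after wrap+update: "
                + ByteBuffer.wrap(stored).getInt(0));
    }
}
```

The trade-off is the one described above: the copy costs an allocation but
is isolated from the source bytes, while the wrapped view avoids copying and
shares state with whoever else holds the array.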

On Wed, Aug 25, 2021 at 2:26 PM Karl Matthias <ka...@community.com> wrote:

> Thank you, I will dig around the old source and see if I can find it.
> AFAICT it was already removed from the Java implementation as well [1]. You
> can serialize an UpdateSketch, but when deserialized it is read-only.
>
> I do deeply understand time series data (I was on the team that designed
> the second generation metrics pipeline at New Relic) but the problem I'm
> trying to solve is not nicely modeled as a time series. Of course that is
> possible, but doing it that way will require much more data and many more
> calculations than I want at reporting time. The reported data will always
> be for all time. So modeling as a time series will require an increasingly
> large number of sketches, and possibly thus also a periodic
> roll-up/compaction phase. None of which is necessary if I can simply update
> the same sketch—really a set of them representing various dimensions—until
> I rebuild it/them from the source events on a periodic basis. It is also
> too much cardinality across too many dimensions to use the sketches simply
> as a roll-up tool for distinct counting on the original data.
>
> I was hoping a private fork wasn't necessary to do it, but I can
> understand that you folks intentionally chose not to support it. I will
> have a go at it and see what I can make work.
>
> Thanks for the replies!
>
> [1]
> https://github.com/apache/datasketches-java/blob/27ecce938555d731f29df97f12f4744a0efb663d/src/main/java/org/apache/datasketches/theta/Sketch.java#L139

Re: [E] Theta Serialize/Deserialize and then update?

Posted by Karl Matthias <ka...@community.com>.

Thank you, I will dig around the old source and see if I can find it.
AFAICT it was already removed from the Java implementation as well [1]. You
can serialize an UpdateSketch, but when deserialized it is read-only.

I do deeply understand time series data (I was on the team that designed
the second generation metrics pipeline at New Relic) but the problem I'm
trying to solve is not nicely modeled as a time series. Of course that is
possible, but doing it that way will require much more data and many more
calculations than I want at reporting time. The reported data will always
be for all time. So modeling as a time series will require an increasingly
large number of sketches, and possibly thus also a periodic
roll-up/compaction phase. None of which is necessary if I can simply update
the same sketch—really a set of them representing various dimensions—until
I rebuild it/them from the source events on a periodic basis. It is also
too much cardinality across too many dimensions to use the sketches simply
as a roll-up tool for distinct counting on the original data.

I was hoping a private fork wasn't necessary to do it, but I can understand
that you folks intentionally chose not to support it. I will have a go at
it and see what I can make work.

Thanks for the replies!

[1]
https://github.com/apache/datasketches-java/blob/27ecce938555d731f29df97f12f4744a0efb663d/src/main/java/org/apache/datasketches/theta/Sketch.java#L139

On Wed, Aug 25, 2021 at 9:46 PM Alexander Saydakov <
saydakov@verizonmedia.com> wrote:

> It is possible, and we used to have serialization and deserialization of
> updatable Theta sketches. At some point we decided that it is more
> confusing than useful and might encourage anti-patterns in big systems
> (such as deserialize-update-serialize sequences on every update). So we
> removed this functionality from the C++ code, but not from Java (yet).
> Again, I would suggest treating serialization as finalizing a sketch. If
> you want to update it, create a fresh one for this new time frame or
> whatever classifier makes sense (batch, session, transaction). Hopefully
> this new sketch can be kept for updating for a while (until some
> close-of-books for a period of time or until the whole batch is processed
> or something). Finalized sketches can be easily merged as needed. Say, you
> create a new sketch every minute and serialize the previous one. Later you
> can have your report to show the last 60-min rolling window or a calendar
> day or something like that by aggregating the appropriate set of sketches
> for that report.
>
>
> On Wed, Aug 25, 2021 at 1:20 PM Karl Matthias <ka...@community.com> wrote:
>
>> Thanks for the reply. Yes I could do time series sketches, but what I
>> want actually is a summary representation of the current set, which I
>> update over time and eventually replace entirely. It's an evented system
>> and I want to use Theta sketches as a sort of summary. I can rebuild them
>> entirely at any time, but if maintained live they would be a fast
>> approximation that is combinable with other Theta sketches. Ideally I would
>> not have to keep them all in memory to do that and could serialize and
>> deserialize at will.
>>
>> It sounds like it's not currently implemented. But if I can manage the
>> code to do it, it is possible?
>>
>> On Wed, Aug 25, 2021 at 8:09 PM Alexander Saydakov <
>> saydakov@verizonmedia.com> wrote:
>>
>>> Is there a good reason to necessarily update the same sketch you decided
>>> to serialize?
>>> I would suggest considering that sketch finalized. Perhaps, in your
>>> system these sketches would represent different time periods or different
>>> categories or something like that. Later on you may want to merge (union)
>>> some of them to obtain an estimate for a longer time frame or a total
>>> across categories and so on.
>>>
>>> On Wed, Aug 25, 2021 at 11:14 AM Karl Matthias <ka...@community.com>
>>> wrote:
>>>
>>>> Hey folks,
>>>>
>>>> I am working with both the Java library and the C++ library and the
>>>> Theta sketch.
>>>>
>>>> What I would like to do is update a sketch, save it somewhere (i.e.
>>>> disk, etc), then reload it later and possibly update it then. The
>>>> CompactSketch doesn't support updates when an UpdateSketch is serialized
>>>> and loaded, it is read-only.
>>>>
>>>> From looking at the Java code it seems like it would be possible to
>>>> create an UpdateSketch from the contents of a CompactSketch but there
>>>> doesn't appear to be an existing method that does this. Am I missing
>>>> something that already does this? Or is it not possible?
>>>>
>>>> Many thanks
>>>> Karl
>>>>
>>>>

Re: [E] Theta Serialize/Deserialize and then update?

Posted by Alexander Saydakov <sa...@verizonmedia.com>.

It is possible, and we used to have serialization and deserialization of
updatable Theta sketches. At some point we decided that it is more
confusing than useful and might encourage anti-patterns in big systems
(such as deserialize-update-serialize sequences on every update). So we
removed this functionality from the C++ code, but not from Java (yet).
Again, I would suggest treating serialization as finalizing a sketch. If
you want to update it, create a fresh one for this new time frame or
whatever classifier makes sense (batch, session, transaction). Hopefully
this new sketch can be kept for updating for a while (until some
close-of-books for a period of time or until the whole batch is processed
or something). Finalized sketches can be easily merged as needed. Say, you
create a new sketch every minute and serialize the previous one. Later you
can have your report to show the last 60-min rolling window or a calendar
day or something like that by aggregating the appropriate set of sketches
for that report.
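
To illustrate the pattern, here is a minimal, self-contained Python sketch.
It is a toy stand-in, not the DataSketches API: ToySketch, hash63, finalize
and merge are hypothetical names, SHA-256 replaces the library's hash
function, and the byte layout is invented for the example. It only shows
the idea of finalizing one sketch per minute and merging the finalized
sketches when a report is needed.

```python
import hashlib
import struct

MAX_HASH = 2**63  # toy hashes live in [0, 2**63); theta starts at the top


def hash63(item):
    """Map a string to a 63-bit integer (stand-in for the library's hash)."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") >> 1


class ToySketch:
    """Keeps at most k hashes, all of them below the threshold theta."""

    def __init__(self, k):
        self.k = k
        self.theta = MAX_HASH
        self.hashes = set()

    def insert_hash(self, h):
        if h < self.theta:
            self.hashes.add(h)
            if len(self.hashes) > self.k:
                # Over capacity: the largest hash becomes the new theta.
                self.theta = max(self.hashes)
                self.hashes.remove(self.theta)

    def update(self, item):
        self.insert_hash(hash63(item))

    def estimate(self):
        return len(self.hashes) * MAX_HASH / self.theta


def finalize(sketch):
    """Serialize to an immutable form: theta plus the sorted hashes (no k)."""
    ordered = sorted(sketch.hashes)
    header = struct.pack("<QI", sketch.theta, len(ordered))
    return header + struct.pack("<%dQ" % len(ordered), *ordered)


def merge(blobs, k):
    """Union any number of finalized sketches into one result of size k."""
    parts = []
    for blob in blobs:
        theta, n = struct.unpack_from("<QI", blob, 0)
        parts.append((theta, struct.unpack_from("<%dQ" % n, blob, 12)))
    result = ToySketch(k)
    # The union can only trust hashes below the smallest input theta.
    result.theta = min([MAX_HASH] + [theta for theta, _ in parts])
    for _, hashes in parts:
        for h in hashes:
            result.insert_hash(h)
    return result


# One sketch per minute, finalized when the minute closes.
minute_blobs = []
for minute in range(3):
    sketch = ToySketch(k=1024)
    for user in range(minute * 50, minute * 50 + 100):
        sketch.update("user-%d" % user)
    minute_blobs.append(finalize(sketch))

# The report merges whichever finalized sketches fall in its window.
report = merge(minute_blobs, k=1024)
print(report.estimate())  # 200.0 (exact: fewer than k distinct users)
```

In this toy version, merging the finalized per-minute sketches gives the
same result as feeding every item into one sketch of the same k, which is
why nothing is lost by never reopening a serialized sketch.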

On Wed, Aug 25, 2021 at 1:20 PM Karl Matthias <ka...@community.com> wrote:

> Thanks for the reply. Yes I could do time series sketches, but what I want
> actually is a summary representation of the current set, which I update
> over time and eventually replace entirely. It's an evented system and I
> want to use Theta sketches as a sort of summary. I can rebuild them
> entirely at any time, but if maintained live they would be a fast
> approximation that is combinable with other Theta sketches. Ideally I would
> not have to keep them all in memory to do that and could serialize and
> deserialize at will.
>
> It sounds like it's not currently implemented. But if I can manage the
> code to do it, it is possible?
>
> On Wed, Aug 25, 2021 at 8:09 PM Alexander Saydakov <
> saydakov@verizonmedia.com> wrote:
>
>> Is there a good reason to necessarily update the same sketch you decided
>> to serialize?
>> I would suggest considering that sketch finalized. Perhaps, in your
>> system these sketches would represent different time periods or different
>> categories or something like that. Later on you may want to merge (union)
>> some of them to obtain an estimate for a longer time frame or a total
>> across categories and so on.
>>
>> On Wed, Aug 25, 2021 at 11:14 AM Karl Matthias <ka...@community.com>
>> wrote:
>>
>>> Hey folks,
>>>
>>> I am working with both the Java library and the C++ library and the
>>> Theta sketch.
>>>
>>> What I would like to do is update a sketch, save it somewhere (i.e.
>>> disk, etc), then reload it later and possibly update it then. The
>>> CompactSketch doesn't support updates when an UpdateSketch is serialized
>>> and loaded, it is read-only.
>>>
>>> From looking at the Java code it seems like it would be possible to
>>> create an UpdateSketch from the contents of a CompactSketch but there
>>> doesn't appear to be an existing method that does this. Am I missing
>>> something that already does this? Or is it not possible?
>>>
>>> Many thanks
>>> Karl
>>>
>>>

Re: [E] Theta Serialize/Deserialize and then update?

Posted by Karl Matthias <ka...@community.com>.

Thanks for the reply. Yes I could do time series sketches, but what I want
actually is a summary representation of the current set, which I update
over time and eventually replace entirely. It's an evented system and I
want to use Theta sketches as a sort of summary. I can rebuild them
entirely at any time, but if maintained live they would be a fast
approximation that is combinable with other Theta sketches. Ideally I would
not have to keep them all in memory to do that and could serialize and
deserialize at will.

It sounds like it's not currently implemented. But if I can manage the code
to do it, it is possible?
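
To make the question concrete, here is a rough, self-contained Python toy
of what I mean. It is not the DataSketches code or its serialized format:
ToyUpdatableSketch, hash63, to_compact_bytes and resume are hypothetical
names, and SHA-256 stands in for the real hash. The assumption is that the
stored form holds only theta and the retained hashes, so k has to be
supplied again when resuming.

```python
import hashlib
import struct

MAX_HASH = 2**63  # toy hashes live in [0, 2**63)


def hash63(item):
    """Map a string to a 63-bit integer (stand-in for the library's hash)."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") >> 1


class ToyUpdatableSketch:
    def __init__(self, k, theta=MAX_HASH, hashes=()):
        self.k = k
        self.theta = theta
        self.hashes = set(hashes)

    def update(self, item):
        h = hash63(item)
        if h < self.theta:
            self.hashes.add(h)
            if len(self.hashes) > self.k:
                # Over capacity: the largest hash becomes the new theta.
                self.theta = max(self.hashes)
                self.hashes.remove(self.theta)

    def estimate(self):
        return len(self.hashes) * MAX_HASH / self.theta

    def to_compact_bytes(self):
        """Store theta and the sorted hashes only; k is not written."""
        ordered = sorted(self.hashes)
        header = struct.pack("<QI", self.theta, len(ordered))
        return header + struct.pack("<%dQ" % len(ordered), *ordered)

    @classmethod
    def resume(cls, data, k):
        """Rebuild an updatable sketch; the caller must supply k again."""
        theta, n = struct.unpack_from("<QI", data, 0)
        if n > k:
            raise ValueError("stored sketch holds more than k hashes")
        return cls(k, theta, struct.unpack_from("<%dQ" % n, data, 12))


# Update, save, reload later, keep updating.
first = ToyUpdatableSketch(k=64)
for i in range(1000):
    first.update("event-%d" % i)
saved = first.to_compact_bytes()

resumed = ToyUpdatableSketch.resume(saved, k=64)
for i in range(500, 2000):
    resumed.update("event-%d" % i)
print(resumed.estimate())
```

In the toy, a resumed sketch ends up in the same state as one that was
never serialized, as long as the same k is passed back in.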

On Wed, Aug 25, 2021 at 8:09 PM Alexander Saydakov <
saydakov@verizonmedia.com> wrote:

> Is there a good reason to necessarily update the same sketch you decided
> to serialize?
> I would suggest considering that sketch finalized. Perhaps, in your system
> these sketches would represent different time periods or different
> categories or something like that. Later on you may want to merge (union)
> some of them to obtain an estimate for a longer time frame or a total
> across categories and so on.
>
> On Wed, Aug 25, 2021 at 11:14 AM Karl Matthias <ka...@community.com> wrote:
>
>> Hey folks,
>>
>> I am working with both the Java library and the C++ library and the Theta
>> sketch.
>>
>> What I would like to do is update a sketch, save it somewhere (i.e. disk,
>> etc), then reload it later and possibly update it then. The CompactSketch
>> doesn't support updates when an UpdateSketch is serialized and loaded, it
>> is read-only.
>>
>> From looking at the Java code it seems like it would be possible to
>> create an UpdateSketch from the contents of a CompactSketch but there
>> doesn't appear to be an existing method that does this. Am I missing
>> something that already does this? Or is it not possible?
>>
>> Many thanks
>> Karl
>>
>>

Re: [E] Theta Serialize/Deserialize and then update?

Posted by Alexander Saydakov <sa...@verizonmedia.com>.

Is there a good reason to necessarily update the same sketch you decided to
serialize?
I would suggest considering that sketch finalized. Perhaps, in your system
these sketches would represent different time periods or different
categories or something like that. Later on you may want to merge (union)
some of them to obtain an estimate for a longer time frame or a total
across categories and so on.

On Wed, Aug 25, 2021 at 11:14 AM Karl Matthias <ka...@community.com> wrote:

> Hey folks,
>
> I am working with both the Java library and the C++ library and the Theta
> sketch.
>
> What I would like to do is update a sketch, save it somewhere (i.e. disk,
> etc), then reload it later and possibly update it then. The CompactSketch
> doesn't support updates when an UpdateSketch is serialized and loaded, it
> is read-only.
>
> From looking at the Java code it seems like it would be possible to create
> an UpdateSketch from the contents of a CompactSketch but there doesn't
> appear to be an existing method that does this. Am I missing something that
> already does this? Or is it not possible?
>
> Many thanks
> Karl
>
>