You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@apex.apache.org by AJAY GUPTA <aj...@gmail.com> on 2017/03/08 16:05:02 UTC

Sort Accumulation

Hi,

I would like to propose the Sort Accumulation. The accumulation will be
responsible for sorting the input POJO stream. The accumulation will
require a comparator to compare and sort the input tuples. Another boolean
parameter "sortDesc" will be used to decide sorting order.

Let me know your views.

Thanks,
Ajay

Re: Sort Accumulation

Posted by AJAY GUPTA <aj...@gmail.com>.

Hi

I am currently going ahead with the implementation of Sort Accumulation
using List as mentioned in previous mail. Sorted insertion of elements will
be ensured via binary search.

Ajay


On Wed, Mar 15, 2017 at 4:49 PM, AJAY GUPTA <aj...@gmail.com> wrote:

> Hi Bright,
>
> Its true that with these data structures, it could go out of memory.
> However, as an initial version, I believe its best to make use of in
> memory data structures for now. Disk based approach will require us to
> implement a B+ tree based data structure on hdfs.
>
> Also, I was studying more about TreeMultiset and have come to conclude
> that TreeMultiSet as the Accumulation type will not work due to the way
> TreeMultiSet handles duplicates (maintains count based on key, not the
> actual record itself).
> Hence, its best to make use of a list and use binary search to insert the
> record in sorted order.
>
>
> Ajay
>
>
>
> On Tue, Mar 14, 2017 at 9:39 PM, Bright Chen <br...@datatorrent.com>
> wrote:
>
>> Hi Ajay,
>> My feeling is you assume handle all tuples in memory.
>> How to handle the case if memory is not enough to hold all tuples?
>>
>> thanks
>> -Bright
>>
>> On Mon, Mar 13, 2017 at 11:00 PM, AJAY GUPTA <aj...@gmail.com>
>> wrote:
>>
>> > Hi Apex Dev community,
>> >
>> > Kindly let me know if implementing this accumulation using TreeMultiSet
>> is
>> > fine.
>> >
>> >
>> > Ajay
>> >
>> > On Thu, Mar 9, 2017 at 12:24 PM, AJAY GUPTA <aj...@gmail.com>
>> wrote:
>> >
>> > > Hi Bright,
>> > >
>> > > I couldnot completely understand the bucketing approach you mentioned.
>> > How
>> > > would we bucket the data considering we have no idea what the data
>> will
>> > be?
>> > >
>> > > How about using a TreeMultiSet?
>> > >
>> > >
>> > > Thanks,
>> > > Ajay
>> > >
>> > > On Wed, Mar 8, 2017 at 11:24 PM, Bright Chen <br...@datatorrent.com>
>> > > wrote:
>> > >
>> > >> Hi Ajay,
>> > >> I think sort at getOutput() probably will get this method stuck due
>> to
>> > >> very
>> > >> high volume of computation.
>> > >> And as we still need to persistent the data, it will not very
>> helpful to
>> > >> increase the performance of processing tuple. Probably we can bucket
>> the
>> > >> data with range of value. Such as following:
>> > >> - process tuple in one window: sort data of current window in memory
>> > >> - end window: merge the sorted memory data into buckets.
>> > >>
>> > >> thanks
>> > >> Bright
>> > >>
>> > >> On Wed, Mar 8, 2017 at 8:51 AM, AJAY GUPTA <aj...@gmail.com>
>> > wrote:
>> > >>
>> > >> > Hi Thomas,
>> > >> >
>> > >> > I looked at TopN. The accumulate() of TopN is an O(n*k). Using
>> similar
>> > >> > approach for Sort will lead to an O(n^2) complexity.
>> > >> > Since we have to sort all elements, we can do it in a single sort
>> call
>> > >> in
>> > >> > getOutput().
>> > >> >
>> > >> >
>> > >> > On Wed, Mar 8, 2017 at 10:09 PM, Thomas Weise <th...@apache.org>
>> wrote:
>> > >> >
>> > >> > > Look at the existing topN accumulation. It should be a
>> > generalization,
>> > >> > > where you don't have a limit.
>> > >> > >
>> > >> > >
>> > >> > > On Wed, Mar 8, 2017 at 8:05 AM, AJAY GUPTA <ajaygit158@gmail.com
>> >
>> > >> wrote:
>> > >> > >
>> > >> > > > Hi,
>> > >> > > >
>> > >> > > > I would like to propose the Sort Accumulation. The accumulation
>> > >> will be
>> > >> > > > responsible for sorting the input POJO stream. The accumulation
>> > will
>> > >> > > > require a comparator to compare and sort the input tuples.
>> Another
>> > >> > > boolean
>> > >> > > > parameter "sortDesc" will be used to decide sorting order.
>> > >> > > >
>> > >> > > > Let me know your views.
>> > >> > > >
>> > >> > > > Thanks,
>> > >> > > > Ajay
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> > >
>> > >
>> >
>>
>
>

Re: Sort Accumulation

Posted by AJAY GUPTA <aj...@gmail.com>.

Hi Bright,

Its true that with these data structures, it could go out of memory.
However, as an initial version, I believe its best to make use of in memory
data structures for now. Disk based approach will require us to implement a
B+ tree based data structure on hdfs.

Also, I was studying more about TreeMultiset and have come to conclude that
TreeMultiSet as the Accumulation type will not work due to the way
TreeMultiSet handles duplicates (maintains count based on key, not the
actual record itself).
Hence, its best to make use of a list and use binary search to insert the
record in sorted order.


Ajay



On Tue, Mar 14, 2017 at 9:39 PM, Bright Chen <br...@datatorrent.com> wrote:

> Hi Ajay,
> My feeling is you assume handle all tuples in memory.
> How to handle the case if memory is not enough to hold all tuples?
>
> thanks
> -Bright
>
> On Mon, Mar 13, 2017 at 11:00 PM, AJAY GUPTA <aj...@gmail.com> wrote:
>
> > Hi Apex Dev community,
> >
> > Kindly let me know if implementing this accumulation using TreeMultiSet
> is
> > fine.
> >
> >
> > Ajay
> >
> > On Thu, Mar 9, 2017 at 12:24 PM, AJAY GUPTA <aj...@gmail.com>
> wrote:
> >
> > > Hi Bright,
> > >
> > > I couldnot completely understand the bucketing approach you mentioned.
> > How
> > > would we bucket the data considering we have no idea what the data will
> > be?
> > >
> > > How about using a TreeMultiSet?
> > >
> > >
> > > Thanks,
> > > Ajay
> > >
> > > On Wed, Mar 8, 2017 at 11:24 PM, Bright Chen <br...@datatorrent.com>
> > > wrote:
> > >
> > >> Hi Ajay,
> > >> I think sort at getOutput() probably will get this method stuck due to
> > >> very
> > >> high volume of computation.
> > >> And as we still need to persistent the data, it will not very helpful
> to
> > >> increase the performance of processing tuple. Probably we can bucket
> the
> > >> data with range of value. Such as following:
> > >> - process tuple in one window: sort data of current window in memory
> > >> - end window: merge the sorted memory data into buckets.
> > >>
> > >> thanks
> > >> Bright
> > >>
> > >> On Wed, Mar 8, 2017 at 8:51 AM, AJAY GUPTA <aj...@gmail.com>
> > wrote:
> > >>
> > >> > Hi Thomas,
> > >> >
> > >> > I looked at TopN. The accumulate() of TopN is an O(n*k). Using
> similar
> > >> > approach for Sort will lead to an O(n^2) complexity.
> > >> > Since we have to sort all elements, we can do it in a single sort
> call
> > >> in
> > >> > getOutput().
> > >> >
> > >> >
> > >> > On Wed, Mar 8, 2017 at 10:09 PM, Thomas Weise <th...@apache.org>
> wrote:
> > >> >
> > >> > > Look at the existing topN accumulation. It should be a
> > generalization,
> > >> > > where you don't have a limit.
> > >> > >
> > >> > >
> > >> > > On Wed, Mar 8, 2017 at 8:05 AM, AJAY GUPTA <aj...@gmail.com>
> > >> wrote:
> > >> > >
> > >> > > > Hi,
> > >> > > >
> > >> > > > I would like to propose the Sort Accumulation. The accumulation
> > >> will be
> > >> > > > responsible for sorting the input POJO stream. The accumulation
> > will
> > >> > > > require a comparator to compare and sort the input tuples.
> Another
> > >> > > boolean
> > >> > > > parameter "sortDesc" will be used to decide sorting order.
> > >> > > >
> > >> > > > Let me know your views.
> > >> > > >
> > >> > > > Thanks,
> > >> > > > Ajay
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>

Re: Sort Accumulation

Posted by Bright Chen <br...@datatorrent.com>.

Hi Ajay,
My feeling is you assume handle all tuples in memory.
How to handle the case if memory is not enough to hold all tuples?

thanks
-Bright

On Mon, Mar 13, 2017 at 11:00 PM, AJAY GUPTA <aj...@gmail.com> wrote:

> Hi Apex Dev community,
>
> Kindly let me know if implementing this accumulation using TreeMultiSet is
> fine.
>
>
> Ajay
>
> On Thu, Mar 9, 2017 at 12:24 PM, AJAY GUPTA <aj...@gmail.com> wrote:
>
> > Hi Bright,
> >
> > I couldnot completely understand the bucketing approach you mentioned.
> How
> > would we bucket the data considering we have no idea what the data will
> be?
> >
> > How about using a TreeMultiSet?
> >
> >
> > Thanks,
> > Ajay
> >
> > On Wed, Mar 8, 2017 at 11:24 PM, Bright Chen <br...@datatorrent.com>
> > wrote:
> >
> >> Hi Ajay,
> >> I think sort at getOutput() probably will get this method stuck due to
> >> very
> >> high volume of computation.
> >> And as we still need to persistent the data, it will not very helpful to
> >> increase the performance of processing tuple. Probably we can bucket the
> >> data with range of value. Such as following:
> >> - process tuple in one window: sort data of current window in memory
> >> - end window: merge the sorted memory data into buckets.
> >>
> >> thanks
> >> Bright
> >>
> >> On Wed, Mar 8, 2017 at 8:51 AM, AJAY GUPTA <aj...@gmail.com>
> wrote:
> >>
> >> > Hi Thomas,
> >> >
> >> > I looked at TopN. The accumulate() of TopN is an O(n*k). Using similar
> >> > approach for Sort will lead to an O(n^2) complexity.
> >> > Since we have to sort all elements, we can do it in a single sort call
> >> in
> >> > getOutput().
> >> >
> >> >
> >> > On Wed, Mar 8, 2017 at 10:09 PM, Thomas Weise <th...@apache.org> wrote:
> >> >
> >> > > Look at the existing topN accumulation. It should be a
> generalization,
> >> > > where you don't have a limit.
> >> > >
> >> > >
> >> > > On Wed, Mar 8, 2017 at 8:05 AM, AJAY GUPTA <aj...@gmail.com>
> >> wrote:
> >> > >
> >> > > > Hi,
> >> > > >
> >> > > > I would like to propose the Sort Accumulation. The accumulation
> >> will be
> >> > > > responsible for sorting the input POJO stream. The accumulation
> will
> >> > > > require a comparator to compare and sort the input tuples. Another
> >> > > boolean
> >> > > > parameter "sortDesc" will be used to decide sorting order.
> >> > > >
> >> > > > Let me know your views.
> >> > > >
> >> > > > Thanks,
> >> > > > Ajay
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>

Re: Sort Accumulation

Posted by AJAY GUPTA <aj...@gmail.com>.

Hi Apex Dev community,

Kindly let me know if implementing this accumulation using TreeMultiSet is
fine.


Ajay

On Thu, Mar 9, 2017 at 12:24 PM, AJAY GUPTA <aj...@gmail.com> wrote:

> Hi Bright,
>
> I couldnot completely understand the bucketing approach you mentioned. How
> would we bucket the data considering we have no idea what the data will be?
>
> How about using a TreeMultiSet?
>
>
> Thanks,
> Ajay
>
> On Wed, Mar 8, 2017 at 11:24 PM, Bright Chen <br...@datatorrent.com>
> wrote:
>
>> Hi Ajay,
>> I think sort at getOutput() probably will get this method stuck due to
>> very
>> high volume of computation.
>> And as we still need to persistent the data, it will not very helpful to
>> increase the performance of processing tuple. Probably we can bucket the
>> data with range of value. Such as following:
>> - process tuple in one window: sort data of current window in memory
>> - end window: merge the sorted memory data into buckets.
>>
>> thanks
>> Bright
>>
>> On Wed, Mar 8, 2017 at 8:51 AM, AJAY GUPTA <aj...@gmail.com> wrote:
>>
>> > Hi Thomas,
>> >
>> > I looked at TopN. The accumulate() of TopN is an O(n*k). Using similar
>> > approach for Sort will lead to an O(n^2) complexity.
>> > Since we have to sort all elements, we can do it in a single sort call
>> in
>> > getOutput().
>> >
>> >
>> > On Wed, Mar 8, 2017 at 10:09 PM, Thomas Weise <th...@apache.org> wrote:
>> >
>> > > Look at the existing topN accumulation. It should be a generalization,
>> > > where you don't have a limit.
>> > >
>> > >
>> > > On Wed, Mar 8, 2017 at 8:05 AM, AJAY GUPTA <aj...@gmail.com>
>> wrote:
>> > >
>> > > > Hi,
>> > > >
>> > > > I would like to propose the Sort Accumulation. The accumulation
>> will be
>> > > > responsible for sorting the input POJO stream. The accumulation will
>> > > > require a comparator to compare and sort the input tuples. Another
>> > > boolean
>> > > > parameter "sortDesc" will be used to decide sorting order.
>> > > >
>> > > > Let me know your views.
>> > > >
>> > > > Thanks,
>> > > > Ajay
>> > > >
>> > >
>> >
>>
>
>

Re: Sort Accumulation

Posted by AJAY GUPTA <aj...@gmail.com>.

Hi Bright,

I couldnot completely understand the bucketing approach you mentioned. How
would we bucket the data considering we have no idea what the data will be?

How about using a TreeMultiSet?


Thanks,
Ajay

On Wed, Mar 8, 2017 at 11:24 PM, Bright Chen <br...@datatorrent.com> wrote:

> Hi Ajay,
> I think sort at getOutput() probably will get this method stuck due to very
> high volume of computation.
> And as we still need to persistent the data, it will not very helpful to
> increase the performance of processing tuple. Probably we can bucket the
> data with range of value. Such as following:
> - process tuple in one window: sort data of current window in memory
> - end window: merge the sorted memory data into buckets.
>
> thanks
> Bright
>
> On Wed, Mar 8, 2017 at 8:51 AM, AJAY GUPTA <aj...@gmail.com> wrote:
>
> > Hi Thomas,
> >
> > I looked at TopN. The accumulate() of TopN is an O(n*k). Using similar
> > approach for Sort will lead to an O(n^2) complexity.
> > Since we have to sort all elements, we can do it in a single sort call in
> > getOutput().
> >
> >
> > On Wed, Mar 8, 2017 at 10:09 PM, Thomas Weise <th...@apache.org> wrote:
> >
> > > Look at the existing topN accumulation. It should be a generalization,
> > > where you don't have a limit.
> > >
> > >
> > > On Wed, Mar 8, 2017 at 8:05 AM, AJAY GUPTA <aj...@gmail.com>
> wrote:
> > >
> > > > Hi,
> > > >
> > > > I would like to propose the Sort Accumulation. The accumulation will
> be
> > > > responsible for sorting the input POJO stream. The accumulation will
> > > > require a comparator to compare and sort the input tuples. Another
> > > boolean
> > > > parameter "sortDesc" will be used to decide sorting order.
> > > >
> > > > Let me know your views.
> > > >
> > > > Thanks,
> > > > Ajay
> > > >
> > >
> >
>

Re: Sort Accumulation

Posted by Bright Chen <br...@datatorrent.com>.

Hi Ajay,
I think sort at getOutput() probably will get this method stuck due to very
high volume of computation.
And as we still need to persistent the data, it will not very helpful to
increase the performance of processing tuple. Probably we can bucket the
data with range of value. Such as following:
- process tuple in one window: sort data of current window in memory
- end window: merge the sorted memory data into buckets.

thanks
Bright

On Wed, Mar 8, 2017 at 8:51 AM, AJAY GUPTA <aj...@gmail.com> wrote:

> Hi Thomas,
>
> I looked at TopN. The accumulate() of TopN is an O(n*k). Using similar
> approach for Sort will lead to an O(n^2) complexity.
> Since we have to sort all elements, we can do it in a single sort call in
> getOutput().
>
>
> On Wed, Mar 8, 2017 at 10:09 PM, Thomas Weise <th...@apache.org> wrote:
>
> > Look at the existing topN accumulation. It should be a generalization,
> > where you don't have a limit.
> >
> >
> > On Wed, Mar 8, 2017 at 8:05 AM, AJAY GUPTA <aj...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I would like to propose the Sort Accumulation. The accumulation will be
> > > responsible for sorting the input POJO stream. The accumulation will
> > > require a comparator to compare and sort the input tuples. Another
> > boolean
> > > parameter "sortDesc" will be used to decide sorting order.
> > >
> > > Let me know your views.
> > >
> > > Thanks,
> > > Ajay
> > >
> >
>

Re: Sort Accumulation

Posted by AJAY GUPTA <aj...@gmail.com>.

Hi Thomas,

I looked at TopN. The accumulate() of TopN is an O(n*k). Using similar
approach for Sort will lead to an O(n^2) complexity.
Since we have to sort all elements, we can do it in a single sort call in
getOutput().

On Wed, Mar 8, 2017 at 10:09 PM, Thomas Weise <th...@apache.org> wrote:

> Look at the existing topN accumulation. It should be a generalization,
> where you don't have a limit.
>
>
> On Wed, Mar 8, 2017 at 8:05 AM, AJAY GUPTA <aj...@gmail.com> wrote:
>
> > Hi,
> >
> > I would like to propose the Sort Accumulation. The accumulation will be
> > responsible for sorting the input POJO stream. The accumulation will
> > require a comparator to compare and sort the input tuples. Another
> boolean
> > parameter "sortDesc" will be used to decide sorting order.
> >
> > Let me know your views.
> >
> > Thanks,
> > Ajay
> >
>

Re: Sort Accumulation

Posted by Thomas Weise <th...@apache.org>.

Look at the existing topN accumulation. It should be a generalization,
where you don't have a limit.

On Wed, Mar 8, 2017 at 8:05 AM, AJAY GUPTA <aj...@gmail.com> wrote:

> Hi,
>
> I would like to propose the Sort Accumulation. The accumulation will be
> responsible for sorting the input POJO stream. The accumulation will
> require a comparator to compare and sort the input tuples. Another boolean
> parameter "sortDesc" will be used to decide sorting order.
>
> Let me know your views.
>
> Thanks,
> Ajay
>