You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@ignite.apache.org by Atri Sharma <at...@gmail.com> on 2015/07/09 19:19:51 UTC

PRAM Distributed Sorting

Folks,

I am beta testing a PRAM model based parallel sorting algorithm and will
integrate it with Ignite soon.

My idea is to be able to use Ignite as an in memory sorting engine.

Does anybody have any ideas around this especially around existing sorting
functionalities?

Re: PRAM Distributed Sorting

Posted by Atri Sharma <at...@gmail.com>.
While I do agree with you in principle, I am not sure about the startup
costs and node transfer costs.

This is pretty experimental so I might be re inventing the wheel :)
On 14 Jul 2015 19:18, "Gianfranco Murador" <mu...@gmail.com>
wrote:

> I believe that an "reduce" function is is appropriate for this type of task
> and is generic enough to sort by any criteria.
> Maybe I'm wrong, but that's just my opinion.
> Regards,
>   Gianfranco
>
> 2015-07-14 15:11 GMT+02:00 Atri Sharma <at...@gmail.com>:
>
> > So, consider a relational database, like postgres. A major component of
> > sorting performance comes from the in memory sorting that happens for
> this
> > case. Normally, something like an external sort would be used in
> > conjugation with the disk files. However, a big data analytical
> production
> > use case has this requirement that the available memory to postgres for
> > sorting is pretty huge *but* so is the data and the response time has to
> be
> > really fast and oh, the data has to be streamed from the database given
> > certain events.
> >
> > So what I was thinking was on these lines:
> >
> > 1) Add a sorting module to the engine.
> > 2) Allow the sorting module to get the data streamed through data
> > streamers.
> > 3) Give sorting module access to the cache.
> > 4) Make a sort API which can be used by an external engine to chunk sort
> > into ignite, using streamers to stream data and distribute sort across
> > multiple threads, and give sorted results back.
> >
> > Note : This is actually more of a use case for Ignite. The reasons I
> > proposed adding it to core were: 1) Since direct interaction with data
> > streamer and cache is needed. 2) It would be a good use case demo. 3) It
> > might allow Ignite to be used as a pure play sorting engine thus allowing
> > existing databases to work with it.
> >
> > Thoughts?
> >
> > On Tue, Jul 14, 2015 at 4:49 PM, Gianfranco Murador <
> > murador.gianfranco@gmail.com> wrote:
> >
> > > I would  say that in case of a distributed algorithm complexity lies
> not
> > > only in the number of input data, but also and, more, in the number of
> > > messages exchanged between nodes to achieve the result.
> > > I  agree to maintain a certain principle of locality for related data,
> or
> > > leave this task  to a system that already has a data model suitable to
> > > scale sorting ( RDBMS  ? ).
> > > Regards,
> > > Gianfranco
> > >
> > >
> > > 2015-07-14 12:14 GMT+02:00 Atri Sharma <at...@gmail.com>:
> > >
> > > > Hi Roman,
> > > >
> > > > On Tue, Jul 14, 2015 at 12:32 AM, Roman Shaposhnik <
> > roman@shaposhnik.org
> > > >
> > > > wrote:
> > > >
> > > > > On Sun, Jul 12, 2015 at 11:41 PM, Atri Sharma <atri.jiit@gmail.com
> >
> > > > wrote:
> > > > >
> > > > >
> > > > > What's the interconnect for this system?
> > > > >
> > > >
> > > > Not sure I got what you meant here.
> > > >
> > > >
> > > > --
> > > > Regards,
> > > >
> > > > Atri
> > > > *l'apprenant*
> > > >
> > >
> >
> >
> >
> > --
> > Regards,
> >
> > Atri
> > *l'apprenant*
> >
>

Re: PRAM Distributed Sorting

Posted by Gianfranco Murador <mu...@gmail.com>.
I believe that an "reduce" function is is appropriate for this type of task
and is generic enough to sort by any criteria.
Maybe I'm wrong, but that's just my opinion.
Regards,
  Gianfranco

2015-07-14 15:11 GMT+02:00 Atri Sharma <at...@gmail.com>:

> So, consider a relational database, like postgres. A major component of
> sorting performance comes from the in memory sorting that happens for this
> case. Normally, something like an external sort would be used in
> conjugation with the disk files. However, a big data analytical production
> use case has this requirement that the available memory to postgres for
> sorting is pretty huge *but* so is the data and the response time has to be
> really fast and oh, the data has to be streamed from the database given
> certain events.
>
> So what I was thinking was on these lines:
>
> 1) Add a sorting module to the engine.
> 2) Allow the sorting module to get the data streamed through data
> streamers.
> 3) Give sorting module access to the cache.
> 4) Make a sort API which can be used by an external engine to chunk sort
> into ignite, using streamers to stream data and distribute sort across
> multiple threads, and give sorted results back.
>
> Note : This is actually more of a use case for Ignite. The reasons I
> proposed adding it to core were: 1) Since direct interaction with data
> streamer and cache is needed. 2) It would be a good use case demo. 3) It
> might allow Ignite to be used as a pure play sorting engine thus allowing
> existing databases to work with it.
>
> Thoughts?
>
> On Tue, Jul 14, 2015 at 4:49 PM, Gianfranco Murador <
> murador.gianfranco@gmail.com> wrote:
>
> > I would  say that in case of a distributed algorithm complexity lies not
> > only in the number of input data, but also and, more, in the number of
> > messages exchanged between nodes to achieve the result.
> > I  agree to maintain a certain principle of locality for related data, or
> > leave this task  to a system that already has a data model suitable to
> > scale sorting ( RDBMS  ? ).
> > Regards,
> > Gianfranco
> >
> >
> > 2015-07-14 12:14 GMT+02:00 Atri Sharma <at...@gmail.com>:
> >
> > > Hi Roman,
> > >
> > > On Tue, Jul 14, 2015 at 12:32 AM, Roman Shaposhnik <
> roman@shaposhnik.org
> > >
> > > wrote:
> > >
> > > > On Sun, Jul 12, 2015 at 11:41 PM, Atri Sharma <at...@gmail.com>
> > > wrote:
> > > >
> > > >
> > > > What's the interconnect for this system?
> > > >
> > >
> > > Not sure I got what you meant here.
> > >
> > >
> > > --
> > > Regards,
> > >
> > > Atri
> > > *l'apprenant*
> > >
> >
>
>
>
> --
> Regards,
>
> Atri
> *l'apprenant*
>

Re: PRAM Distributed Sorting

Posted by Atri Sharma <at...@gmail.com>.
So, consider a relational database, like postgres. A major component of
sorting performance comes from the in memory sorting that happens for this
case. Normally, something like an external sort would be used in
conjugation with the disk files. However, a big data analytical production
use case has this requirement that the available memory to postgres for
sorting is pretty huge *but* so is the data and the response time has to be
really fast and oh, the data has to be streamed from the database given
certain events.

So what I was thinking was on these lines:

1) Add a sorting module to the engine.
2) Allow the sorting module to get the data streamed through data streamers.
3) Give sorting module access to the cache.
4) Make a sort API which can be used by an external engine to chunk sort
into ignite, using streamers to stream data and distribute sort across
multiple threads, and give sorted results back.

Note : This is actually more of a use case for Ignite. The reasons I
proposed adding it to core were: 1) Since direct interaction with data
streamer and cache is needed. 2) It would be a good use case demo. 3) It
might allow Ignite to be used as a pure play sorting engine thus allowing
existing databases to work with it.

Thoughts?

On Tue, Jul 14, 2015 at 4:49 PM, Gianfranco Murador <
murador.gianfranco@gmail.com> wrote:

> I would  say that in case of a distributed algorithm complexity lies not
> only in the number of input data, but also and, more, in the number of
> messages exchanged between nodes to achieve the result.
> I  agree to maintain a certain principle of locality for related data, or
> leave this task  to a system that already has a data model suitable to
> scale sorting ( RDBMS  ? ).
> Regards,
> Gianfranco
>
>
> 2015-07-14 12:14 GMT+02:00 Atri Sharma <at...@gmail.com>:
>
> > Hi Roman,
> >
> > On Tue, Jul 14, 2015 at 12:32 AM, Roman Shaposhnik <roman@shaposhnik.org
> >
> > wrote:
> >
> > > On Sun, Jul 12, 2015 at 11:41 PM, Atri Sharma <at...@gmail.com>
> > wrote:
> > >
> > >
> > > What's the interconnect for this system?
> > >
> >
> > Not sure I got what you meant here.
> >
> >
> > --
> > Regards,
> >
> > Atri
> > *l'apprenant*
> >
>



-- 
Regards,

Atri
*l'apprenant*

Re: PRAM Distributed Sorting

Posted by Gianfranco Murador <mu...@gmail.com>.
I would  say that in case of a distributed algorithm complexity lies not
only in the number of input data, but also and, more, in the number of
messages exchanged between nodes to achieve the result.
I  agree to maintain a certain principle of locality for related data, or
leave this task  to a system that already has a data model suitable to
scale sorting ( RDBMS  ? ).
Regards,
Gianfranco


2015-07-14 12:14 GMT+02:00 Atri Sharma <at...@gmail.com>:

> Hi Roman,
>
> On Tue, Jul 14, 2015 at 12:32 AM, Roman Shaposhnik <ro...@shaposhnik.org>
> wrote:
>
> > On Sun, Jul 12, 2015 at 11:41 PM, Atri Sharma <at...@gmail.com>
> wrote:
> >
> >
> > What's the interconnect for this system?
> >
>
> Not sure I got what you meant here.
>
>
> --
> Regards,
>
> Atri
> *l'apprenant*
>

Re: PRAM Distributed Sorting

Posted by Roman Shaposhnik <ro...@shaposhnik.org>.
On Tue, Jul 14, 2015 at 3:14 AM, Atri Sharma <at...@gmail.com> wrote:
> Hi Roman,
>
> On Tue, Jul 14, 2015 at 12:32 AM, Roman Shaposhnik <ro...@shaposhnik.org>
> wrote:
>
>> On Sun, Jul 12, 2015 at 11:41 PM, Atri Sharma <at...@gmail.com> wrote:
>>
>>
>> What's the interconnect for this system?
>>
>
> Not sure I got what you meant here.

How are the nodes connected physically (Ethernet, IB, etc.)
and logically (TCP/IP, UDP, etc.) ?

Thanks,
Roman.

Re: PRAM Distributed Sorting

Posted by Atri Sharma <at...@gmail.com>.
Hi Roman,

On Tue, Jul 14, 2015 at 12:32 AM, Roman Shaposhnik <ro...@shaposhnik.org>
wrote:

> On Sun, Jul 12, 2015 at 11:41 PM, Atri Sharma <at...@gmail.com> wrote:
>
>
> What's the interconnect for this system?
>

Not sure I got what you meant here.


-- 
Regards,

Atri
*l'apprenant*

Re: PRAM Distributed Sorting

Posted by Roman Shaposhnik <ro...@shaposhnik.org>.
On Sun, Jul 12, 2015 at 11:41 PM, Atri Sharma <at...@gmail.com> wrote:
> Hi Cos,
>
> Sorry I missed your email earlier.
>
> The use case around this is to have high speed sorting by maintaining in
> memory sorting for production servers. So I have a 512 GB RAM system which
> has to be able to sort efficiently but maintaining the stability and
> failover systems of the data.
>
> Do you see anything that I am missing here,please?

What's the interconnect for this system?

Thanks,
Roman.

Re: PRAM Distributed Sorting

Posted by Atri Sharma <at...@gmail.com>.
Sergi,

I totally understand and appreciate your point. You are totally correct in
your point that something not having definite use case should not be done.
However, in this case, I think many use cases exist for in memory sorting.
For eg, I am hacking on a side application which requires ordering multi
regex as fast as possible in memory. In other case, streaming extracting
data from HDFS and calculating file CDC for large data using multi sort
method is something I had in mind as well.

I think I did not come across clearly. The module I am talking about is
currently a use case on top of Ignite and I was soliciting feedback if it
is a good idea to integrate it with Ignite as a core module which can be
offered as a functionality or used internally in case we plan to support
features later on which require sorting (sort based aggregates for eg).

Sorry for the ambiguity early on. I hope my idea is more clear now.

On Mon, Jul 13, 2015 at 3:04 PM, Sergi Vladykin <se...@gmail.com>
wrote:

> Atri,
>
> Is there any real world demand for this functionality?
> You know, throwing code in is easy but then this code needs to be
> maintained, its bad if this code is useful only for imaginary use cases.
> And to be honest currently I don't understand a practical purpose of what
> you are doing. Having said that you'd better clearly define on dev list
> goals and design of subsystem you are willing to implement so that
> committers can provide feedback as early as possible. Otherwise it may
> appear that you will waste your time on something that will not be
> accepted.
>
> Sergi
>
> 2015-07-13 9:41 GMT+03:00 Atri Sharma <at...@gmail.com>:
>
> > Hi Cos,
> >
> > Sorry I missed your email earlier.
> >
> > The use case around this is to have high speed sorting by maintaining in
> > memory sorting for production servers. So I have a 512 GB RAM system
> which
> > has to be able to sort efficiently but maintaining the stability and
> > failover systems of the data.
> >
> > Do you see anything that I am missing here,please?
> >
> > On Fri, Jul 10, 2015 at 1:28 AM, Konstantin Boudnik <co...@apache.org>
> > wrote:
> >
> > > On Thu, Jul 09, 2015 at 10:49PM, Atri Sharma wrote:
> > > > Folks,
> > > >
> > > > I am beta testing a PRAM model based parallel sorting algorithm and
> > will
> > > > integrate it with Ignite soon.
> > > >
> > > > My idea is to be able to use Ignite as an in memory sorting engine.
> > >
> > > For my own education: what'd be the use case for such functionality?
> > >
> > > Thanks,
> > >   Cos
> > >
> > > > Does anybody have any ideas around this especially around existing
> > > sorting
> > > > functionalities?
> > >
> >
> >
> >
> > --
> > Regards,
> >
> > Atri
> > *l'apprenant*
> >
>



-- 
Regards,

Atri
*l'apprenant*

Re: PRAM Distributed Sorting

Posted by Sergi Vladykin <se...@gmail.com>.
Atri,

Is there any real world demand for this functionality?
You know, throwing code in is easy but then this code needs to be
maintained, its bad if this code is useful only for imaginary use cases.
And to be honest currently I don't understand a practical purpose of what
you are doing. Having said that you'd better clearly define on dev list
goals and design of subsystem you are willing to implement so that
committers can provide feedback as early as possible. Otherwise it may
appear that you will waste your time on something that will not be
accepted.

Sergi

2015-07-13 9:41 GMT+03:00 Atri Sharma <at...@gmail.com>:

> Hi Cos,
>
> Sorry I missed your email earlier.
>
> The use case around this is to have high speed sorting by maintaining in
> memory sorting for production servers. So I have a 512 GB RAM system which
> has to be able to sort efficiently but maintaining the stability and
> failover systems of the data.
>
> Do you see anything that I am missing here,please?
>
> On Fri, Jul 10, 2015 at 1:28 AM, Konstantin Boudnik <co...@apache.org>
> wrote:
>
> > On Thu, Jul 09, 2015 at 10:49PM, Atri Sharma wrote:
> > > Folks,
> > >
> > > I am beta testing a PRAM model based parallel sorting algorithm and
> will
> > > integrate it with Ignite soon.
> > >
> > > My idea is to be able to use Ignite as an in memory sorting engine.
> >
> > For my own education: what'd be the use case for such functionality?
> >
> > Thanks,
> >   Cos
> >
> > > Does anybody have any ideas around this especially around existing
> > sorting
> > > functionalities?
> >
>
>
>
> --
> Regards,
>
> Atri
> *l'apprenant*
>

Re: PRAM Distributed Sorting

Posted by Atri Sharma <at...@gmail.com>.
Hi Cos,

Sorry I missed your email earlier.

The use case around this is to have high speed sorting by maintaining in
memory sorting for production servers. So I have a 512 GB RAM system which
has to be able to sort efficiently but maintaining the stability and
failover systems of the data.

Do you see anything that I am missing here,please?

On Fri, Jul 10, 2015 at 1:28 AM, Konstantin Boudnik <co...@apache.org> wrote:

> On Thu, Jul 09, 2015 at 10:49PM, Atri Sharma wrote:
> > Folks,
> >
> > I am beta testing a PRAM model based parallel sorting algorithm and will
> > integrate it with Ignite soon.
> >
> > My idea is to be able to use Ignite as an in memory sorting engine.
>
> For my own education: what'd be the use case for such functionality?
>
> Thanks,
>   Cos
>
> > Does anybody have any ideas around this especially around existing
> sorting
> > functionalities?
>



-- 
Regards,

Atri
*l'apprenant*

Re: PRAM Distributed Sorting

Posted by Konstantin Boudnik <co...@apache.org>.
On Thu, Jul 09, 2015 at 10:49PM, Atri Sharma wrote:
> Folks,
> 
> I am beta testing a PRAM model based parallel sorting algorithm and will
> integrate it with Ignite soon.
> 
> My idea is to be able to use Ignite as an in memory sorting engine.

For my own education: what'd be the use case for such functionality?

Thanks,
  Cos

> Does anybody have any ideas around this especially around existing sorting
> functionalities?