Apache Ignite Developers - Legacy Mail Archive

PRAM Distributed Sorting

Atri Sharma

PRAM Distributed Sorting

Folks,

I am beta testing a parallel sorting algorithm based on the PRAM model and
will integrate it with Ignite soon.

My idea is to be able to use Ignite as an in-memory sorting engine.

Does anybody have any ideas around this, especially around existing sorting
functionality?

Konstantin Boudnik-2

Re: PRAM Distributed Sorting

On Thu, Jul 09, 2015 at 10:49PM, Atri Sharma wrote:
> Folks,
>
> I am beta testing a PRAM model based parallel sorting algorithm and will
> integrate it with Ignite soon.
>
> My idea is to be able to use Ignite as an in memory sorting engine.

For my own education: what'd be the use case for such functionality?

Thanks,
Cos

> Does anybody have any ideas around this especially around existing sorting
> functionalities?

Atri Sharma

Re: PRAM Distributed Sorting

Hi Cos,

Sorry I missed your email earlier.

The use case is high-speed, in-memory sorting on production servers. I have
a system with 512 GB of RAM that has to sort efficiently while maintaining
the stability and failover of the data.

Do you see anything that I am missing here, please?

On Fri, Jul 10, 2015 at 1:28 AM, Konstantin Boudnik <[hidden email]> wrote:

--
Regards,

Atri
*l'apprenant*

Sergi

Re: PRAM Distributed Sorting

Atri,

Is there any real-world demand for this functionality?
You know, throwing code in is easy, but then this code needs to be
maintained; it's bad if the code is useful only for imaginary use cases.
And to be honest, I currently don't understand the practical purpose of what
you are doing. Having said that, you'd better clearly define on the dev list
the goals and design of the subsystem you are willing to implement, so that
committers can provide feedback as early as possible. Otherwise you may end
up wasting your time on something that will not be accepted.

Sergi

2015-07-13 9:41 GMT+03:00 Atri Sharma <[hidden email]>:

Atri Sharma

Re: PRAM Distributed Sorting

Sergi,

I totally understand and appreciate your point. You are correct that
something without a definite use case should not be done. However, in this
case, I think many use cases exist for in-memory sorting. For example, I am
hacking on a side application which requires ordering multi regex as fast
as possible in memory. In another case, streaming data extracted from HDFS
and calculating file CDC for large data using a multi sort method is
something I had in mind as well.

I think I did not come across clearly. The module I am talking about is
currently a use case on top of Ignite, and I was soliciting feedback on
whether it is a good idea to integrate it with Ignite as a core module,
which can be offered as functionality or used internally in case we later
plan to support features that require sorting (sort-based aggregates, for
example).

Sorry for the ambiguity early on. I hope my idea is clearer now.

On Mon, Jul 13, 2015 at 3:04 PM, Sergi Vladykin <[hidden email]>
wrote:

--
Regards,

Atri
*l'apprenant*

Roman Shaposhnik

Re: PRAM Distributed Sorting

In reply to this post by Atri Sharma

On Sun, Jul 12, 2015 at 11:41 PM, Atri Sharma <[hidden email]> wrote:

What's the interconnect for this system?

Thanks,
Roman.

Atri Sharma

Re: PRAM Distributed Sorting

Hi Roman,

On Tue, Jul 14, 2015 at 12:32 AM, Roman Shaposhnik <[hidden email]>
wrote:

> On Sun, Jul 12, 2015 at 11:41 PM, Atri Sharma <[hidden email]> wrote:
>
>
> What's the interconnect for this system?
>

Not sure I got what you meant here.

--
Regards,

Atri
*l'apprenant*

Gianfranco Murador

Re: PRAM Distributed Sorting

I would say that for a distributed algorithm the complexity lies not only
in the size of the input, but also, and more so, in the number of messages
exchanged between nodes to achieve the result.
I would suggest either maintaining a certain principle of locality for
related data, or leaving this task to a system that already has a data
model suitable for scaling sorting (an RDBMS?).
Regards,
Gianfranco
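
The point about message counts and locality can be made concrete with a
small simulation. This is only a sketch under stated assumptions, not
anything proposed in the thread: it assumes a range-partitioned sort in
which each node owns one key range, and the names owner_of and count_moves
and the splitter values are hypothetical. It counts how many elements must
leave the node they start on before each node can sort its own range.

```python
from bisect import bisect_left
from typing import Sequence


def owner_of(value: int, splitters: Sequence[int]) -> int:
    """Return the node that owns `value` under range partitioning.

    Node 0 owns keys <= splitters[0], node i owns keys in
    (splitters[i-1], splitters[i]], and the last node owns the rest.
    """
    return bisect_left(splitters, value)


def count_moves(nodes: Sequence[Sequence[int]], splitters: Sequence[int]) -> int:
    """Count the elements that have to leave the node they start on."""
    return sum(
        1
        for node_id, values in enumerate(nodes)
        for value in values
        if owner_of(value, splitters) != node_id
    )


splitters = [20, 60]  # hypothetical key ranges for three nodes
clustered = [[1, 2, 3], [30, 40, 50], [80, 90]]
scattered = [[90, 80], [1, 2], [30, 40]]
print(count_moves(clustered, splitters))  # 0: every key is already on its owner
print(count_moves(scattered, splitters))  # 6: every key crosses the network
```

Both inputs hold the same eight keys, so any local sorting cost is
comparable; only the placement of the data changes how much has to be
sent between nodes.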

2015-07-14 12:14 GMT+02:00 Atri Sharma <[hidden email]>:

Atri Sharma

Re: PRAM Distributed Sorting

So, consider a relational database, like Postgres. A major component of
sorting performance comes from the in-memory sorting that happens in this
case. Normally, something like an external sort would be used in
conjunction with the disk files. However, a big data analytical production
use case has the requirement that the memory available to Postgres for
sorting is pretty huge *but* so is the data, the response time has to be
really fast, and, oh, the data has to be streamed from the database given
certain events.

So what I was thinking was along these lines:

1) Add a sorting module to the engine.
2) Allow the sorting module to get the data streamed through data streamers.
3) Give the sorting module access to the cache.
4) Make a sort API which an external engine can use to push a sort into
Ignite in chunks, using streamers to stream the data and distributing the
sort across multiple threads, and which gives the sorted results back.
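
The four steps above can be sketched in miniature. The following Python is
a minimal single-process illustration, not Ignite code and not the PRAM
algorithm under test: a plain iterable stands in for the data streamer, a
thread pool stands in for the distributed workers, and the names chunked
and chunk_sort are hypothetical. It cuts the incoming stream into chunks,
sorts each chunk on a worker thread, and merges the sorted runs with a
k-way merge. In CPython the threads show the structure rather than a real
speedup.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, Iterator, List


def chunked(stream: Iterable[int], chunk_size: int) -> Iterator[List[int]]:
    """Group an incoming stream into chunks of at most `chunk_size` items."""
    chunk: List[int] = []
    for item in stream:
        chunk.append(item)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # final, possibly shorter, chunk


def chunk_sort(stream: Iterable[int], chunk_size: int = 4, workers: int = 4) -> List[int]:
    """Sort each chunk on a worker thread, then k-way merge the sorted runs."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        runs = list(pool.map(sorted, chunked(stream, chunk_size)))
    return list(heapq.merge(*runs))


print(chunk_sort([5, 3, 9, 1, 7, 2, 8], chunk_size=3))
```

The merge step is where a distributed version would pay for moving data
between nodes, which this single-process sketch does not model.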

Note: This is actually more of a use case for Ignite. The reasons I
proposed adding it to core were: 1) Direct interaction with the data
streamer and cache is needed. 2) It would be a good use case demo. 3) It
might allow Ignite to be used as a pure-play sorting engine, thus allowing
existing databases to work with it.

Thoughts?

On Tue, Jul 14, 2015 at 4:49 PM, Gianfranco Murador <
[hidden email]> wrote:

--
Regards,

Atri
*l'apprenant*

Gianfranco Murador

Re: PRAM Distributed Sorting

I believe that a "reduce" function is appropriate for this type of task
and is generic enough to sort by any criteria.
Maybe I'm wrong, but that's just my opinion.
Regards,
Gianfranco
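
One way to read the "reduce" suggestion is as a map step that sorts each
partition locally, followed by a reduce step that merges the sorted
partitions pairwise. The sketch below is an assumption about what such a
function could look like, not code from the thread and not an Ignite API;
merge_two and reduce_sort are hypothetical names. The key parameter is
what makes it generic over the sort criterion.

```python
import heapq
from functools import reduce
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def merge_two(left: List[T], right: List[T], key: Callable[[T], object]) -> List[T]:
    """Merge two lists that are already sorted by `key`."""
    return list(heapq.merge(left, right, key=key))


def reduce_sort(
    partitions: Sequence[Sequence[T]],
    key: Callable[[T], object] = lambda x: x,
) -> List[T]:
    """Sort every partition locally (map), then fold them together (reduce)."""
    sorted_parts = [sorted(part, key=key) for part in partitions]
    return reduce(lambda acc, part: merge_two(acc, part, key), sorted_parts, [])


print(reduce_sort([[3, 1], [2], [5, 4]]))
print(reduce_sort([["pear", "fig"], ["banana"]], key=len))
```

A left fold like this re-merges the accumulated result at every step, so
a real implementation would more likely merge in a tree or use a single
k-way merge; the sketch only shows the shape of the idea.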

2015-07-14 15:11 GMT+02:00 Atri Sharma <[hidden email]>:

Atri Sharma

Re: PRAM Distributed Sorting

While I do agree with you in principle, I am not sure about the startup
costs and node transfer costs.

This is pretty experimental, so I might be reinventing the wheel :)
On 14 Jul 2015 19:18, "Gianfranco Murador" <[hidden email]>
wrote:

Roman Shaposhnik

Re: PRAM Distributed Sorting

In reply to this post by Atri Sharma

On Tue, Jul 14, 2015 at 3:14 AM, Atri Sharma <[hidden email]> wrote:

> Hi Roman,
>
> On Tue, Jul 14, 2015 at 12:32 AM, Roman Shaposhnik <[hidden email]>
> wrote:
>
>> On Sun, Jul 12, 2015 at 11:41 PM, Atri Sharma <[hidden email]> wrote:
>>
>>
>> What's the interconnect for this system?
>>
>
> Not sure I got what you meant here.

How are the nodes connected physically (Ethernet, InfiniBand, etc.)
and logically (TCP/IP, UDP, etc.)?

Thanks,
Roman.