Posted to dev@madlib.apache.org by Atri Sharma <at...@gmail.com> on 2015/12/22 16:53:57 UTC

MADLib and Greenplum Large Objects

Hi All,

We are currently working on making Greenplum Large Objects better and
more useful.

We were thinking of seeing if MADlib can benefit from Large Objects and use
them in a helpful manner. In particular, MADlib could use Large Objects as
backing storage for intermediate states that are large.

The Large Objects API can be seen at
http://www.postgresql.org/docs/9.2/static/largeobjects.html

Large Objects will eventually scale out in Greenplum. They will be
distributed across the cluster, and queries against them will be performant.
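As a rough illustration of the access pattern the linked API provides
(open, seek, chunked read/write), here is a sketch in Python against an
in-memory buffer. This is only a stand-in: the real API operates on
server-side objects through libpq (lo_open, lo_lseek, loread, lowrite),
and the helper names below are hypothetical.

```python
import io

CHUNK = 8192  # stream in pages, mirroring the chunked LO interface


def lo_write_all(lo, data):
    """Append data chunk by chunk, as a client would with lowrite()."""
    lo.seek(0, io.SEEK_END)
    for off in range(0, len(data), CHUNK):
        lo.write(data[off:off + CHUNK])


def lo_read_all(lo):
    """Stream the object back chunk by chunk, as with loread()."""
    lo.seek(0)
    out = bytearray()
    while True:
        chunk = lo.read(CHUNK)
        if not chunk:
            break
        out.extend(chunk)
    return bytes(out)


# Stand-in for a server-side large object: a seekable byte stream.
# Because access is chunked, a multi-gigabyte value never has to be
# materialized as a single in-memory datum.
lo = io.BytesIO()
payload = b"x" * 100_000
lo_write_all(lo, payload)
assert lo_read_all(lo) == payload
```

The point of the pattern is that the client touches at most one chunk at a
time, which is what makes values past the 1 GB varlena limit tractable.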

Regards,

Atri

Re: MADLib and Greenplum Large Objects

Posted by Frank McQuillan <fm...@pivotal.io>.
Thanks Atri.

Kindly keep this mailing list apprised of your findings as you go.  Looking
forward to this effort.

Frank

On Wed, Dec 23, 2015 at 11:20 PM, Atri Sharma <at...@pivotal.io> wrote:

> I am currently working on getting it working in GPDB.
>
> HAWQ can be on the later roadmap.
>
> Regards,
>
> Atri
>
> On Thu, Dec 24, 2015 at 7:23 AM, Ivan Novick <in...@pivotal.io> wrote:
>
>> 3) GPDB and HAWQ
>>> * Ideally we would want both to have the LO capability, but if it is
>>> just GPDB initially, we could put the equivalent of #ifdefs in the code.
>>>
>> This may not happen on the same day, but if it's the way to get this
>> important feature in MADlib working, we can get it on both roadmaps, as
>> this is an upstream PostgreSQL feature.
>>
>> Cheers,
>> Ivan
>>
>>
>

Re: MADLib and Greenplum Large Objects

Posted by Atri Sharma <at...@pivotal.io>.
I am currently working on getting it working in GPDB.

HAWQ can be on the later roadmap.

Regards,

Atri

On Thu, Dec 24, 2015 at 7:23 AM, Ivan Novick <in...@pivotal.io> wrote:

> 3) GPDB and HAWQ
>> * Ideally we would want both to have the LO capability, but if it is just
>> GPDB initially, we could put the equivalent of #ifdefs in the code.
>>
> This may not happen on the same day, but if it's the way to get this
> important feature in MADlib working, we can get it on both roadmaps, as
> this is an upstream PostgreSQL feature.
>
> Cheers,
> Ivan
>
>

Re: MADLib and Greenplum Large Objects

Posted by Ivan Novick <in...@pivotal.io>.
>
> 3) GPDB and HAWQ
> * Ideally we would want both to have the LO capability, but if it is just
> GPDB initially, we could put the equivalent of #ifdefs in the code.
>
This may not happen on the same day, but if it's the way to get this
important feature in MADlib working, we can get it on both roadmaps, as
this is an upstream PostgreSQL feature.

Cheers,
Ivan

Re: MADLib and Greenplum Large Objects

Posted by Frank McQuillan <fm...@pivotal.io>.
Atri, thanks for the note on LOs.

http://www.postgresql.org/docs/current/static/lo-intro.html
mentions the LO facility supporting values up to 4 TB in size.  Internal
aggregate states much larger than 1 GB would be very attractive to MADlib.

Do you have a code example where an LO is used as an internal aggregate
state?  That would give us an idea of how to implement it.

Our main questions are:

1) Performance
* Does use of the LO facility result in acceptable performance?  Related to
Caleb's question on memory management.
* If use of LO means disk I/O operations where in the past we used
in-memory operations, then performance will degrade.

2) Ease of implementation
* Need to learn more about this one

3) GPDB and HAWQ
* Ideally we would want both to have the LO capability, but if it is just
GPDB initially, we could put the equivalent of #ifdefs in the code.
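Absent a real example, the general shape of an LO-backed aggregate can be
sketched: the transition function appends each row's contribution to
out-of-line storage instead of growing an in-memory state, and the final
function streams the storage back to compute the result. The following is a
hypothetical Python mock, using a temp file in place of a server-side large
object; an actual implementation would be a C transition function calling
the LO routines, and the names here are illustrative only.

```python
import os
import struct
import tempfile


class LOState:
    """Hypothetical aggregate state: a handle to out-of-line storage
    (a temp file standing in for a server-side large object)."""
    def __init__(self):
        self.fd, self.path = tempfile.mkstemp()


def transfn(state, value):
    """Transition step: append one value; the in-memory state stays O(1)
    no matter how many rows are accumulated."""
    if state is None:
        state = LOState()
    os.write(state.fd, struct.pack("d", value))   # analogous to lo_write()
    return state


def finalfn(state):
    """Final step: stream the stored values back and reduce them
    (here, to an average)."""
    os.lseek(state.fd, 0, os.SEEK_SET)            # analogous to lo_lseek()
    total = n = 0
    while True:
        buf = os.read(state.fd, 8 * 1024)         # analogous to lo_read()
        if not buf:
            break
        for (v,) in struct.iter_unpack("d", buf):
            total += v
            n += 1
    os.close(state.fd)
    os.remove(state.path)
    return total / n if n else None


state = None
for v in [1.0, 2.0, 3.0, 4.0]:
    state = transfn(state, v)
assert finalfn(state) == 2.5
```

This is also where the performance question above bites: every transition
step becomes an I/O operation, so whether the trade is acceptable depends
on buffering and on how often the state is touched per row.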

Rgds,
Frank

On Wed, Dec 23, 2015 at 1:59 PM, Roman Shaposhnik <ro...@shaposhnik.org>
wrote:

> On Wed, Dec 23, 2015 at 1:49 PM, Ivan Novick <in...@pivotal.io> wrote:
> > It's currently functioning on PostgreSQL, so maybe that's the place to
> > try it first before worrying about porting to GPDB and HAWQ, which
> > should be doable.
>
> That's a great point!
>
> Thanks,
> Roman.
>

Re: MADLib and Greenplum Large Objects

Posted by Roman Shaposhnik <ro...@shaposhnik.org>.
On Wed, Dec 23, 2015 at 1:49 PM, Ivan Novick <in...@pivotal.io> wrote:
> It's currently functioning on PostgreSQL, so maybe that's the place to try
> it first before worrying about porting to GPDB and HAWQ, which should be
> doable.

That's a great point!

Thanks,
Roman.

Re: MADLib and Greenplum Large Objects

Posted by Ivan Novick <in...@pivotal.io>.
It seems increasing the memory allocation limit is a much harder option,
and the LO API is the way to do these things in PostgreSQL, so I am
wondering if MADlib can take advantage of it.

It's currently functioning on PostgreSQL, so maybe that's the place to try
it first before worrying about porting to GPDB and HAWQ, which should be
doable.

Cheers,
Ivan



On Wed, Dec 23, 2015 at 1:45 PM, Caleb Welton <cw...@pivotal.io> wrote:

> This is one place, however I'd have to look at the LO API to understand if
> it gets past the memory allocation limitation, and then we'd have to
> discuss the design of the implementation and whether it would be
> implemented both in GPDB and HAWQ - which would be a requirement for MADlib.
>
> Sent from my iPhone
>
> > On Dec 23, 2015, at 1:39 PM, Ivan Novick <in...@pivotal.io> wrote:
> >
> > Hi Roman,
> >
> > There are requests for bigger intermediate data on madlib.
> >
> > Here is an extract from a request:
> >
> > """
> > Currently 1 GB is the max field size for any data in a column in a row.
> > We want to increase this in GPDB to 100 GB. This will also be used by
> > data science to address the issue below, and also to store in a column a
> > bigger thing like an XML or JSON doc that is larger than 1 GB.
> >
> > As a developer, I want to maintain a larger internal aggregate state in
> > memory > 1 GB, so that I can operate on larger data sets.
> >
> > Notes
> > 1) Many MADlib algorithms need to maintain large internal aggregates. One
> > example is the LDA algorithm, which is limited to number of topics X
> > vocabulary size < ~250M due to the 1 GB limit. For text analytics, this
> > is quite restrictive.
> > References
> > [1] http://www.postgresql.org/docs/9.4/static/sql-createaggregate.html
> > """
> >
> > On Wed, Dec 23, 2015 at 1:17 PM, Roman Shaposhnik <ro...@shaposhnik.org>
> > wrote:
> >
> >> Atri,
> >>
> >> I'm curious: what usage do you see for LOs when
> >> it comes to MADlib?
> >>
> >> Thanks,
> >> Roman.
> >>
> >>> On Tue, Dec 22, 2015 at 7:53 AM, Atri Sharma <at...@gmail.com>
> wrote:
> >>> Hi All,
> >>>
> >>> We are currently working on making Greenplum Large Objects better and
> >>> more useful.
> >>>
> >>> We were thinking of seeing if MADlib can benefit from Large Objects and
> >>> use them in a helpful manner. In particular, MADlib could use Large
> >>> Objects as backing storage for intermediate states that are large.
> >>>
> >>> The Large Objects API can be seen at
> >>> http://www.postgresql.org/docs/9.2/static/largeobjects.html
> >>>
> >>> Large Objects will eventually scale out in Greenplum. They will be
> >>> distributed across the cluster, and queries against them will be
> >>> performant.
> >>>
> >>> Regards,
> >>>
> >>> Atri
> >>
>

Re: MADLib and Greenplum Large Objects

Posted by Caleb Welton <cw...@pivotal.io>.
This is one place, however I'd have to look at the LO API to understand if it gets past the memory allocation limitation, and then we'd have to discuss the design of the implementation and whether it would be implemented both in GPDB and HAWQ - which would be a requirement for MADlib.

Sent from my iPhone

> On Dec 23, 2015, at 1:39 PM, Ivan Novick <in...@pivotal.io> wrote:
> 
> Hi Roman,
> 
> There are requests for bigger intermediate data on madlib.
> 
> Here is an extract from a request:
> 
> """
> Currently 1 GB is the max field size for any data in a column in a row. We
> want to increase this in GPDB to 100 GB. This will also be used by data
> science to address the issue below, and also to store in a column a bigger
> thing like an XML or JSON doc that is larger than 1 GB.
> 
> As a developer, I want to maintain a larger internal aggregate state in
> memory > 1 GB, so that I can operate on larger data sets.
> 
> Notes
> 1) Many MADlib algorithms need to maintain large internal aggregates. One
> example is the LDA algorithm, which is limited to number of topics X
> vocabulary size < ~250M due to the 1 GB limit. For text analytics, this is
> quite restrictive.
> References
> [1] http://www.postgresql.org/docs/9.4/static/sql-createaggregate.html
> """
> 
> On Wed, Dec 23, 2015 at 1:17 PM, Roman Shaposhnik <ro...@shaposhnik.org>
> wrote:
> 
>> Atri,
>> 
>> I'm curious: what usage do you see for LOs when
>> it comes to MADlib?
>> 
>> Thanks,
>> Roman.
>> 
>>> On Tue, Dec 22, 2015 at 7:53 AM, Atri Sharma <at...@gmail.com> wrote:
>>> Hi All,
>>> 
>>> We are currently working on making Greenplum Large Objects better and
>>> more useful.
>>>
>>> We were thinking of seeing if MADlib can benefit from Large Objects and
>>> use them in a helpful manner. In particular, MADlib could use Large
>>> Objects as backing storage for intermediate states that are large.
>>>
>>> The Large Objects API can be seen at
>>> http://www.postgresql.org/docs/9.2/static/largeobjects.html
>>>
>>> Large Objects will eventually scale out in Greenplum. They will be
>>> distributed across the cluster, and queries against them will be
>>> performant.
>>> 
>>> Regards,
>>> 
>>> Atri
>> 

Re: MADLib and Greenplum Large Objects

Posted by Ivan Novick <in...@pivotal.io>.
Hi Roman,

There are requests for bigger intermediate data on madlib.

Here is an extract from a request:

"""
Currently 1 GB is the max field size for any data in a column in a row. We
want to increase this in GPDB to 100 GB. This will also be used by data
science to address the issue below, and also to store in a column a bigger
thing like an XML or JSON doc that is larger than 1 GB.

As a developer, I want to maintain a larger internal aggregate state in
memory > 1 GB, so that I can operate on larger data sets.

Notes
1) Many MADlib algorithms need to maintain large internal aggregates. One
example is the LDA algorithm, which is limited to number of topics X
vocabulary size < ~250M due to the 1 GB limit. For text analytics, this is
quite restrictive.
References
[1] http://www.postgresql.org/docs/9.4/static/sql-createaggregate.html
"""

On Wed, Dec 23, 2015 at 1:17 PM, Roman Shaposhnik <ro...@shaposhnik.org>
wrote:

> Atri,
>
> I'm curious: what usage do you see for LOs when
> it comes to MADlib?
>
> Thanks,
> Roman.
>
> On Tue, Dec 22, 2015 at 7:53 AM, Atri Sharma <at...@gmail.com> wrote:
> > Hi All,
> >
> > We are currently working on making Greenplum Large Objects better and
> > more useful.
> >
> > We were thinking of seeing if MADlib can benefit from Large Objects and
> > use them in a helpful manner. In particular, MADlib could use Large
> > Objects as backing storage for intermediate states that are large.
> >
> > The Large Objects API can be seen at
> > http://www.postgresql.org/docs/9.2/static/largeobjects.html
> >
> > Large Objects will eventually scale out in Greenplum. They will be
> > distributed across the cluster, and queries against them will be
> > performant.
> >
> > Regards,
> >
> > Atri
>

Re: MADLib and Greenplum Large Objects

Posted by Roman Shaposhnik <ro...@shaposhnik.org>.
Atri,

I'm curious: what usage do you see for LOs when
it comes to MADlib?

Thanks,
Roman.

On Tue, Dec 22, 2015 at 7:53 AM, Atri Sharma <at...@gmail.com> wrote:
> Hi All,
>
> We are currently working on making Greenplum Large Objects better and
> more useful.
>
> We were thinking of seeing if MADlib can benefit from Large Objects and
> use them in a helpful manner. In particular, MADlib could use Large
> Objects as backing storage for intermediate states that are large.
>
> The Large Objects API can be seen at
> http://www.postgresql.org/docs/9.2/static/largeobjects.html
>
> Large Objects will eventually scale out in Greenplum. They will be
> distributed across the cluster, and queries against them will be
> performant.
>
> Regards,
>
> Atri