Posted to user@cayenne.apache.org by MG...@escholar.com on 2010/05/21 23:27:07 UTC

Blobs in the DataContext

Hi,

        I'm using Cayenne to store large files in BLOBs as a process runs. 
The first step of the process stores large input files (~ 600MB), and they 
end up in the DB just fine. We then run some tasks, get some output files, 
and store those large output files (~ 500MB) to the DB.  The output files 
are not making it into the DB.  In fact, the whole program appears to just 
sit and wait (for what, I have no idea), and after it tries to spawn 
another thread it throws an out-of-memory exception.  I was trying to 
figure out why the larger input files got persisted fine while the large 
output files cause a problem, and the only thing I could think of was that 
when the BLOBs are created they are cached in the DataContext and are 
never cleared, eventually exhausting memory.  Is this possible?  Anything 
else anyone can think of?

Note: I'm also compressing the stream in memory as I'm adding it to the 
byte[], but still... it works for the input files.  Also, each phase of 
the process is followed by a commit, so all the input files are committed 
together and all the output files should be committed together as well, 
but the second commit never happens.

Thank you for any help you may be able to provide.
-Mike
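
For concreteness, here is a minimal sketch of the kind of loading loop 
described above, assuming Cayenne 3.0 and a hypothetical Cayenne-generated 
entity class FileArtifact with a byte[] "data" attribute mapped to a BLOB 
column (the names are illustrative, not from the original post). The 
relevant detail is that every blob committed through the same DataContext 
stays registered in that context after commitChanges() returns.

    import java.io.ByteArrayOutputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.List;
    import java.util.zip.GZIPOutputStream;

    import org.apache.cayenne.access.DataContext;

    public class BlobLoadPhase {

        // "FileArtifact" is a hypothetical Cayenne-generated DataObject with
        // setName(String) and setData(byte[]) mapped to a BLOB column.
        public void storeFiles(DataContext context, List<File> files) throws IOException {
            for (File file : files) {
                FileArtifact artifact = context.newObject(FileArtifact.class);
                artifact.setName(file.getName());
                artifact.setData(compress(file)); // whole compressed file held as a byte[]
            }

            // One commit per phase, as described above. The committed objects
            // remain registered in this same DataContext afterwards.
            context.commitChanges();
        }

        // The in-memory compression step mentioned in the post: gzip the file
        // into a byte[] before handing it to Cayenne.
        private byte[] compress(File file) throws IOException {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            GZIPOutputStream gzip = new GZIPOutputStream(bytes);
            FileInputStream in = new FileInputStream(file);
            try {
                byte[] buffer = new byte[64 * 1024];
                int read;
                while ((read = in.read(buffer)) != -1) {
                    gzip.write(buffer, 0, read);
                }
                gzip.finish();
            } finally {
                in.close();
                gzip.close();
            }
            return bytes.toByteArray();
        }
    }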

Re: Blobs in the DataContext

Posted by Aristedes Maniatis <ar...@maniatis.org>.
On 22/05/10 7:27 AM, MGargano@escholar.com wrote:
> I'm using cayenne to store large files in BLOBs as a process runs.
>   The first step of the process is storing large files (~ 600MB)

I'm sure others will chime in with similar thoughts: ORMs are not really designed for this type of workload. For that matter, nor are most databases. I'd even avoid ever turning these things into Java objects at any point along the way. You might like to explore other ways of storing these files on disk, keeping the metadata in the database controlled by Cayenne (filename, size, created date, user, etc.).
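
A rough sketch of that approach, assuming Cayenne 3.0 and a hypothetical 
"FileMetadata" entity with filename, size and created-date attributes 
(none of this is from Ari's actual setup): the file itself is streamed to 
a storage directory, and only a small metadata row goes through Cayenne.

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.Date;

    import org.apache.cayenne.access.DataContext;

    public class FileStore {

        private final File storageDir;

        public FileStore(File storageDir) {
            this.storageDir = storageDir;
        }

        // "FileMetadata" is a hypothetical Cayenne-generated entity holding
        // only small columns (filename, size, created date), never the bytes.
        public FileMetadata store(DataContext context, File source) throws IOException {
            File target = new File(storageDir, source.getName());
            copy(source, target);

            FileMetadata meta = context.newObject(FileMetadata.class);
            meta.setFilename(target.getName());
            meta.setSize(target.length());
            meta.setCreated(new Date());
            context.commitChanges();
            return meta;
        }

        // Copy in small buffers; nothing file-sized is ever held in memory
        // or pushed through the ORM.
        private void copy(File source, File target) throws IOException {
            FileInputStream in = new FileInputStream(source);
            FileOutputStream out = new FileOutputStream(target);
            try {
                byte[] buffer = new byte[64 * 1024];
                int read;
                while ((read = in.read(buffer)) != -1) {
                    out.write(buffer, 0, read);
                }
            } finally {
                in.close();
                out.close();
            }
        }
    }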

If you must continue along the path you've chosen, you'll want to spend quite some time with a Java profiler (like Yourkit or jprofiler) looking at all the places these objects end up, and exactly how they get garbage collected. That's not just about Cayenne but everywhere from the JDBC stack through to your import/export process.

Cheers
Ari

-- 
-------------------------->
Aristedes Maniatis
GPG fingerprint CBFB 84B4 738D 4E87 5E5C  5EFA EF6A 7D2E 3E49 102A

Re: Blobs in the DataContext

Posted by Andrus Adamchik <an...@objectstyle.org>.
> Some database products offer "streaming" to and from BLOBs which is  
> one get-around for these problems.  This means you can theoretically  
> get away with not having to hold the whole hunk of data in memory at  
> once.

Yeah, I was wondering the same thing. Blob is an interface that you  
can implement as a pointer to a file or a remote URL. How many drivers  
are smart enough to actually use 'getBinaryStream' instead of  
'getBytes' inside the prepared statement?
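
For illustration, a plain JDBC sketch of that distinction (table and 
column names are made up): whether setBinaryStream() really streams from 
the file, or the driver buffers the whole value exactly as setBytes() 
would, is driver-specific.

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class JdbcBlobInsert {

        // Hands the driver an InputStream instead of a byte[]; a driver that
        // honors it can push the data to the server without materializing the
        // whole blob in memory.
        public static void insert(Connection con, long id, File file)
                throws SQLException, IOException {
            PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO FILE_ARTIFACT (ID, DATA) VALUES (?, ?)");
            FileInputStream in = new FileInputStream(file);
            try {
                ps.setLong(1, id);
                ps.setBinaryStream(2, in, file.length()); // JDBC 4 overload with a long length
                ps.executeUpdate();
            } finally {
                in.close();
                ps.close();
            }
        }
    }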

Andrus


On May 22, 2010, at 9:42 AM, Andrew Lindesay wrote:

> Hi there Mike;
>
> I'd have to agree with Ari there; small BLOBs (usually in a sub- 
> table) work fine with an object-relational mapping system like  
> Cayenne, but trying to use an object-relational technology for big  
> BLOBs is generally troublesome owing to the cost of shifting those  
> big hunks of data around and also gobbling-up all that memory.
>
> Some database products offer "streaming" to and from BLOBs which is  
> one get-around for these problems.  This means you can theoretically  
> get away with not having to hold the whole hunk of data in memory at  
> once.
>
> Some time ago I was having to work cross-database with BLOBs of  
> arbitrary size and had some such troubles.  For this reason I wrote  
> a system which lays down a whole series of smaller BLOBs which are  
> then linked by a header-table holding some very basic meta-data such  
> as a "short unique code" to link into the object-relational world.   
> It's non-transactional across the whole data, but generally special  
> handling is required to deal with large data sets anyway.  In java,  
> I then have an input/output stream writing to and reading from this  
> data structure.  There are some other advantages to this system such  
> as being able to do "out of order" writes to the stream.
>
> That is actually part of my "lestuff" project which is open-source  
> so you are welcome to use that if you would like; drop me a note and  
> I'll give you some pointers.  Otherwise, maybe this gives you some  
> ideas.
>
> Regards;
>
>> I'm using cayenne to store large files in BLOBs as a process runs.
>> The first step of the process is storing large files (~ 600MB) and  
>> they
>> are ending up in the DB just fine, then we run some tasks and get  
>> some
>> output files, and then store the large output files (~ 500MB) to  
>> the DB.
> ...
>> note: I'm also compressing the stream in memory as I'm adding it to  
>> the
>> byte[], but still... it works for the input files.  also, each of  
>> these
>
> ___
> Andrew Lindesay
> www.silvereye.co.nz
>
>


Re: Blobs in the DataContext

Posted by Andrew Lindesay <ap...@lindesay.co.nz>.
Hello Mike;

> the architecture I inherited for this product bills the BLOBs in the DB as 
> a sort of "feature" that I don't see changing anytime soon.  Currently we 

Can you perhaps use the "raw rows" equivalent to get the data out so that it does not get cached?
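
For what it's worth, a sketch of that idea assuming Cayenne 3.0's 
SelectQuery API and an illustrative "FileArtifact" entity and "DATA" 
column name: data rows come back as plain maps that are never registered 
in the DataContext, so the blob bytes are only reachable through your own 
references.

    import java.util.List;
    import java.util.Map;

    import org.apache.cayenne.access.DataContext;
    import org.apache.cayenne.query.SelectQuery;

    public class RawRowFetch {

        @SuppressWarnings("unchecked")
        public static void export(DataContext context) {
            SelectQuery query = new SelectQuery("FileArtifact"); // illustrative entity name
            query.setFetchingDataRows(true); // return DataRow maps, not registered objects

            List<Map<String, Object>> rows = context.performQuery(query);
            for (Map<String, Object> row : rows) {
                byte[] data = (byte[]) row.get("DATA"); // illustrative column name
                // ... write the bytes out, then let "data" become unreachable
            }
        }
    }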

> btw.  Andrew, I remember you and your LEWOstuff from wowodc... very cool 
> "stuff"!  :)

Thanks -- there are some neat changes coming soon too, so watch this space.

cheers.

___
Andrew Lindesay
www.silvereye.co.nz


Re: Blobs in the DataContext

Posted by MG...@escholar.com.
Hi Andrew,

        These be flame war words.  :-p   Actually, I would normally 
totally agree with both your and Aristedes' recommendations; unfortunately 
the architecture I inherited for this product bills the BLOBs in the DB as 
a sort of "feature" that I don't see changing anytime soon.  Currently we 
are using Hibernate and it handles these situations without a problem 
(although it causes more problems with many-to-many relationships), and 
even Cayenne seems to handle this the first time for the largest file I'm 
uploading.  However, after multiple files the memory is not being 
released, it seems.  Now, I haven't thrown this under the profiler yet, 
which I will try, but the code isn't all that complicated, and I did a 
code review and don't see anyplace where I'm holding a reference to the 
data.  I also saw some other posts this weekend from others getting 
OutOfMemoryExceptions, and one of the recommendations was to create a new 
context.  If that is the case, then I'm assuming something is getting 
cached, or new objects created in the context are not released after a 
commit.  This is more what I'm trying to find out: what happens in the 
Cayenne internals when you create a new object in the context, commit it, 
and continue using the same context?
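
To make the "new context" suggestion concrete, here is a sketch of how it 
is usually applied, assuming Cayenne 3.0; the unregisterObjects() call is 
from memory of the DataContext API and worth double-checking.

    import java.util.Collection;

    import org.apache.cayenne.DataObject;
    import org.apache.cayenne.access.DataContext;

    public class PhaseScopedContexts {

        // Option 1: give each phase its own short-lived context. Once the
        // phase commits and the context goes out of scope, everything
        // registered in it becomes garbage together.
        public static void runPhase(PhaseWork work) {
            DataContext phaseContext = DataContext.createDataContext();
            work.register(phaseContext);   // create the new blob objects here
            phaseContext.commitChanges();
            // no further use of phaseContext -- drop it
        }

        // Option 2: keep one long-lived context, but detach the committed
        // objects so their byte[] payloads can be collected.
        public static void commitAndDetach(DataContext context,
                                           Collection<? extends DataObject> justCreated) {
            context.commitChanges();
            context.unregisterObjects(justCreated);
        }

        // Hypothetical callback so the sketch compiles on its own.
        public interface PhaseWork {
            void register(DataContext context);
        }
    }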

btw.  Andrew, I remember you and your LEWOstuff from wowodc... very cool 
"stuff"!  :)

Thanks.
-Mike




From:
Andrew Lindesay <ap...@lindesay.co.nz>
To:
user@cayenne.apache.org
Date:
05/22/2010 02:43 AM
Subject:
Re: Blobs in the DataContext



Hi there Mike;

I'd have to agree with Ari there; small BLOBs (usually in a sub-table) 
work fine with an object-relational mapping system like Cayenne, but 
trying to use an object-relational technology for big BLOBs is generally 
troublesome owing to the cost of shifting those big hunks of data around 
and also gobbling-up all that memory.

Some database products offer "streaming" to and from BLOBs which is one 
get-around for these problems.  This means you can theoretically get away 
with not having to hold the whole hunk of data in memory at once.

Some time ago I was having to work cross-database with BLOBs of arbitrary 
size and had some such troubles.  For this reason I wrote a system which 
lays down a whole series of smaller BLOBs which are then linked by a 
header-table holding some very basic meta-data such as a "short unique 
code" to link into the object-relational world.  It's non-transactional 
across the whole data, but generally special handling is required to deal 
with large data sets anyway.  In java, I then have an input/output stream 
writing to and reading from this data structure.  There are some other 
advantages to this system such as being able to do "out of order" writes 
to the stream.

That is actually part of my "lestuff" project which is open-source so you 
are welcome to use that if you would like; drop me a note and I'll give 
you some pointers.  Otherwise, maybe this gives you some ideas.

Regards;

> I'm using cayenne to store large files in BLOBs as a process runs. 
> The first step of the process is storing large files (~ 600MB) and they 
> are ending up in the DB just fine, then we run some tasks and get some 
> output files, and then store the large output files (~ 500MB) to the DB. 

...
> note: I'm also compressing the stream in memory as I'm adding it to the 
> byte[], but still... it works for the input files.  also, each of these 

___
Andrew Lindesay
www.silvereye.co.nz




Re: Blobs in the DataContext

Posted by Andrew Lindesay <ap...@lindesay.co.nz>.
Hi there Mike;

I'd have to agree with Ari there; small BLOBs (usually in a sub-table) work fine with an object-relational mapping system like Cayenne, but trying to use an object-relational technology for big BLOBs is generally troublesome owing to the cost of shifting those big hunks of data around and also gobbling-up all that memory.

Some database products offer "streaming" to and from BLOBs which is one get-around for these problems.  This means you can theoretically get away with not having to hold the whole hunk of data in memory at once.

Some time ago I was having to work cross-database with BLOBs of arbitrary size and had some such troubles.  For this reason I wrote a system which lays down a whole series of smaller BLOBs which are then linked by a header-table holding some very basic meta-data such as a "short unique code" to link into the object-relational world.  It's non-transactional across the whole data, but generally special handling is required to deal with large data sets anyway.  In java, I then have an input/output stream writing to and reading from this data structure.  There are some other advantages to this system such as being able to do "out of order" writes to the stream.
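
This is not Andrew's actual "lestuff" code, just a sketch of the chunking 
idea he describes: an OutputStream that buffers up to a fixed chunk size 
and hands each full chunk to a callback, which would store it as one small 
BLOB row linked to the header record.

    import java.io.IOException;
    import java.io.OutputStream;

    public class ChunkingOutputStream extends OutputStream {

        // Storage callback; an implementation would insert the chunk as a
        // small BLOB row referencing the header record.
        public interface ChunkSink {
            void storeChunk(int index, byte[] chunk) throws IOException;
        }

        private final ChunkSink sink;
        private final byte[] buffer;
        private int used;
        private int chunkIndex;

        public ChunkingOutputStream(ChunkSink sink, int chunkSize) {
            this.sink = sink;
            this.buffer = new byte[chunkSize];
        }

        @Override
        public void write(int b) throws IOException {
            buffer[used++] = (byte) b;
            if (used == buffer.length) {
                flushChunk();
            }
        }

        @Override
        public void close() throws IOException {
            if (used > 0) {
                flushChunk(); // write the final, possibly partial chunk
            }
        }

        private void flushChunk() throws IOException {
            byte[] chunk = new byte[used];
            System.arraycopy(buffer, 0, chunk, 0, used);
            sink.storeChunk(chunkIndex++, chunk);
            used = 0;
        }
    }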

That is actually part of my "lestuff" project which is open-source so you are welcome to use that if you would like; drop me a note and I'll give you some pointers.  Otherwise, maybe this gives you some ideas.

Regards;

> I'm using cayenne to store large files in BLOBs as a process runs. 
> The first step of the process is storing large files (~ 600MB) and they 
> are ending up in the DB just fine, then we run some tasks and get some 
> output files, and then store the large output files (~ 500MB) to the DB. 
...
> note: I'm also compressing the stream in memory as I'm adding it to the 
> byte[], but still... it works for the input files.  also, each of these 

___
Andrew Lindesay
www.silvereye.co.nz


Re: Blobs in the DataContext

Posted by Andrus Adamchik <an...@objectstyle.org>.
Yeah, fully implementing streaming may require some non-trivial effort.

But from your initial description it appeared that you have enough  
memory to process even the largest blobs in isolation. So maybe  
there's a memory leak somewhere that can be detected with a profiler  
and fixed by changing some cache settings / DataContext scope without  
rewriting Cayenne?

Andrus

On May 28, 2010, at 2:32 PM, MGargano@escholar.com wrote:
> Hi Tore,
>
>        I finally got around to looking at your code in the jira
> attachments.  It looks like it will need a little bit of updating to  
> get
> it 3.0 ready, but at the end of the day it still pulls the whole thing
> into memory at some point.  I might ping you over the next week, if  
> you
> don't mind, about this as I quickly (i.e. panic-mode) try to cobble
> something together.  I'm just learning cayenne so a lot of the  
> internals
> are still not very clear to me.
>
> Thanks.
> -Mike
>
>
>
>
> From:
> Tore Halset <ha...@pvv.ntnu.no>
> To:
> user@cayenne.apache.org
> Date:
> 05/25/2010 07:58 AM
> Subject:
> Re: Blobs in the DataContext
>
>
>
> Hello.
>
> I tried to implement support for streaming blobs back in 2006.  
> Sorry, but
> it never got completed. I still think it is a nice feature. If you  
> want to
> work on this issue, you might want to take a look at
> streaming_blob_try2.zip from https://issues.apache.org/jira/browse/CAY-316
>
> Regards,
> - Tore.
>
> On 21. mai 2010, at 23.27, MGargano@escholar.com wrote:
>
>> Hi,
>>
>>       I'm using cayenne to store large files in BLOBs as a process runs.
>> The first step of the process is storing large files (~ 600MB) and they
>> are ending up in the DB just fine, then we run some tasks and get some
>> output files, and then store the large output files (~ 500MB) to the DB.
>> The output files are not making it into the DB.  In fact it appears that
>> the whole program is just sitting and waiting, for what, i have no idea
>> and after you try and spawn another thread in the program it throws an out
>> of memory exception.  I was trying to figure out why the larger input
>> files got persisted fine, but the large output files cause a problem and
>> the only thing I could think of was that when the BLOBs are created they
>> are cached in the DataContext and are never cleared eventually just
>> causing the memory to be exhausted.  Is this possible?  Anything else
>> anyone can think of?
>>
>> note: I'm also compressing the stream in memory as I'm adding it to the
>> byte[], but still... it works for the input files.  also, each of these
>> phases of the process is followed by a commit, so all the input files are
>> committed together and all the output files should be committed together
>> as well, but this never happens.
>>
>> Thank you for any help you may be able to provide.
>> -Mike
>
>
>


Re: Blobs in the DataContext

Posted by MG...@escholar.com.
Hi Tore,

        I finally got around to looking at your code in the jira 
attachments.  It looks like it will need a little bit of updating to get 
it 3.0 ready, but at the end of the day it still pulls the whole thing 
into memory at some point.  I might ping you over the next week, if you 
don't mind, about this as I quickly (i.e. panic-mode) try to cobble 
something together.  I'm just learning cayenne so a lot of the internals 
are still not very clear to me.

Thanks.
-Mike




From:
Tore Halset <ha...@pvv.ntnu.no>
To:
user@cayenne.apache.org
Date:
05/25/2010 07:58 AM
Subject:
Re: Blobs in the DataContext



Hello.

I tried to implement support for streaming blobs back in 2006. Sorry, but 
it never got completed. I still think it is a nice feature. If you want to 
work on this issue, you might want to take a look at 
streaming_blob_try2.zip from https://issues.apache.org/jira/browse/CAY-316

Regards,
 - Tore.

On 21. mai 2010, at 23.27, MGargano@escholar.com wrote:

> Hi,
> 
>        I'm using cayenne to store large files in BLOBs as a process runs. 
> The first step of the process is storing large files (~ 600MB) and they 
> are ending up in the DB just fine, then we run some tasks and get some 
> output files, and then store the large output files (~ 500MB) to the DB. 
> The output files are not making it into the DB.  In fact it appears that 
> the whole program is just sitting and waiting, for what, i have no idea 
> and after you try and spawn another thread in the program it throws an out 
> of memory exception.  I was trying to figure out why the larger input 
> files got persisted fine, but the large output files cause a problem and 
> the only thing I could think of was that when the BLOBs are created they 
> are cached in the DataContext and are never cleared eventually just 
> causing the memory to be exhausted.  Is this possible?  Anything else 
> anyone can think of?
> 
> note: I'm also compressing the stream in memory as I'm adding it to the 
> byte[], but still... it works for the input files.  also, each of these 
> phases of the process is followed by a commit, so all the input files are 
> committed together and all the output files should be committed together 
> as well, but this never happens.
> 
> Thank you for any help you may be able to provide.
> -Mike




Re: Blobs in the DataContext

Posted by Tore Halset <ha...@pvv.ntnu.no>.
Hello.

I tried to implement support for streaming blobs back in 2006. Sorry, but it never got completed. I still think it is a nice feature. If you want to work on this issue, you might want to take a look at streaming_blob_try2.zip from https://issues.apache.org/jira/browse/CAY-316

Regards,
 - Tore.

On 21. mai 2010, at 23.27, MGargano@escholar.com wrote:

> Hi,
> 
>        I'm using cayenne to store large files in BLOBs as a process runs. 
> The first step of the process is storing large files (~ 600MB) and they 
> are ending up in the DB just fine, then we run some tasks and get some 
> output files, and then store the large output files (~ 500MB) to the DB. 
> The output files are not making it into the DB.  In fact it appears that 
> the whole program is just sitting and waiting, for what, i have no idea 
> and after you try and spawn another thread in the program it throws an out 
> of memory exception.  I was trying to figure out why the larger input 
> files got persisted fine, but the large output files cause a problem and 
> the only thing I could think of was that when the BLOBs are created they 
> are cached in the DataContext and are never cleared eventually just 
> causing the memory to be exhausted.  Is this possible?  Anything else 
> anyone can think of?
> 
> note: I'm also compressing the stream in memory as I'm adding it to the 
> byte[], but still... it works for the input files.  also, each of these 
> phases of the process is followed by a commit, so all the input files are 
> committed together and all the output files should be committed together 
> as well, but this never happens.
> 
> Thank you for any help you may be able to provide.
> -Mike