Posted to derby-dev@db.apache.org by mike matrigali <mi...@gmail.com> on 2013/06/06 00:40:38 UTC

looking for architectural advice for dealing with streams/LOBS as inputs to UPDATE operations

Recent investigation for DERBY-6214 uncovered a problem with
streams/LOBs as inputs to UPDATE statements.  The problem occurs when
the update affects more than one row: the system reads the input once
for each row and finds that the stream is already closed after it has
been drained the first time.

First, I am wondering whether anyone knows if the system is meant to do
something else in this case (for instance automatically instantiate
the object in memory), and the bug is just that we are not doing so.

This part of the system has always been brittle because we don't want
to instantiate anything in memory if we can help it.  For 2 GB BLOBs
that is just not going to work, but for smaller LOBs it may be better
to use a bunch of memory rather than fail the query.

It would also be nice, if possible, to make using a stream more than
once just work, by somehow "resetting" it in this case.  I am not sure
whether the JDBC stream input contract specifies anything that we
could count on here.

Another option is to recognize up front the case where we may have to
read the stream/LOB twice, write it to disk first, and then stream it
back multiple times.

Re: looking for architectural advice for dealing with streams/LOBS as inputs to UPDATE operations

Posted by Dag Wanvik <da...@oracle.com>.
It seems to me that if the compiler determines the update is not on a
unique row, it should save the stream while reading it, either to memory
or to disk if it gets too large. The latter may not be decidable up
front, so one can envisage a cut-off limit: the value is buffered in
memory up to that limit and spilled to disk once the limit is exceeded.
Updates 2..n would then read the data from memory or disk as the case
may be. This approach would be the most flexible one. At worst, it would
save a LOB twice instead of once if it turns out that only one row is
touched by the update. I don't think we can rely on a LOB stream being
resettable in the general case. If we could determine that it is in a
subset of the cases, that would be good, but to keep it simple, I'd be
OK with just doing the general case described above.
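
Roughly, I am thinking of something along these lines (the class name
and the cut-off value below are invented for illustration, not existing
code): the first read tees the bytes into a memory buffer, spills them
to a temporary file once the cut-off is exceeded, and row updates 2..n
reopen the saved copy.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

/** Illustrative only: keep a LOB value in memory up to a cut-off, then spill to disk. */
class SpillableLobBuffer {
    private static final int CUTOFF = 32 * 1024;        // example cut-off limit
    private ByteArrayOutputStream memory = new ByteArrayOutputStream();
    private Path spillFile;                              // non-null once spilled to disk

    /** Called while the first row update drains the user-supplied stream. */
    void write(byte[] buf, int off, int len) throws IOException {
        if (spillFile == null && memory.size() + len > CUTOFF) {
            spillFile = Files.createTempFile("lob-spill", ".tmp");
            Files.write(spillFile, memory.toByteArray()); // move what we have so far
            memory = null;
        }
        if (spillFile == null) {
            memory.write(buf, off, len);
        } else {
            Files.write(spillFile, Arrays.copyOfRange(buf, off, off + len),
                        StandardOpenOption.APPEND);
        }
    }

    /** Row updates 2..n read the saved copy from memory or disk as the case may be. */
    InputStream reopen() throws IOException {
        return (spillFile == null)
                ? new ByteArrayInputStream(memory.toByteArray())
                : Files.newInputStream(spillFile);
    }
}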


On 06.06.2013 08:40, mike matrigali wrote:
> Recent investigation for DERBY-6214 uncovered a problem with
> streams/LOBs as inputs to UPDATE statements.  The problem occurs when
> the update affects more than one row: the system reads the input once
> for each row and finds that the stream is already closed after it has
> been drained the first time.
>
> First, I am wondering whether anyone knows if the system is meant to do
> something else in this case (for instance automatically instantiate
> the object in memory), and the bug is just that we are not doing so.
>
> This part of the system has always been brittle because we don't want
> to instantiate anything in memory if we can help it.  For 2 GB BLOBs
> that is just not going to work, but for smaller LOBs it may be better
> to use a bunch of memory rather than fail the query.
>
> It would also be nice, if possible, to make using a stream more than
> once just work, by somehow "resetting" it in this case.  I am not sure
> whether the JDBC stream input contract specifies anything that we
> could count on here.
>
> Another option is to recognize up front the case where we may have to
> read the stream/LOB twice, write it to disk first, and then stream it
> back multiple times.


Re: looking for architectural advice for dealing with streams/LOBS as inputs to UPDATE operations

Posted by Kristian Waagan <kr...@apache.org>.
On Thursday, 6 June 2013, Knut Anders Hatlen wrote the following:

> mike matrigali <mikemapp1@gmail.com> writes:
>
> > Recent investigation for DERBY-6214 uncovered a problem with
> > streams/LOBs as inputs to UPDATE statements.  The problem occurs when
> > the update affects more than one row: the system reads the input once
> > for each row and finds that the stream is already closed after it has
> > been drained the first time.
> >
> > First, I am wondering whether anyone knows if the system is meant to do
> > something else in this case (for instance automatically instantiate
> > the object in memory), and the bug is just that we are not doing so.
> >
> > This part of the system has always been brittle because we don't want
> > to instantiate anything in memory if we can help it.  For 2 GB BLOBs
> > that is just not going to work, but for smaller LOBs it may be better
> > to use a bunch of memory rather than fail the query.
> >
> > It would also be nice, if possible, to make using a stream more than
> > once just work, by somehow "resetting" it in this case.  I am not sure
> > whether the JDBC stream input contract specifies anything that we
> > could count on here.
>
> java.io.InputStream has the methods markSupported(), mark(int) and
> reset() that could be used if the passed-in stream supports the
> mechanism. There is of course no guarantee that the user-supplied stream
> hasn't implemented mark/reset in a way that requires lots of memory for
> large values.
>
> > Another option is to recognize up front the case where we may have to
> > read the stream/LOB twice, write it to disk first, and then stream it
> > back multiple times.
>
> There is some functionality for this in LOBStreamControl. It stores a
> LOB in memory if it's smaller than 32KB and spills to a temporary file
> if it grows beyond that limit. Might be worth taking a look at.
>
> I don't know how to tell up front whether a statement is likely to be
> executed multiple times with the same value in the LOB parameter. I
> would suppose that it's more common not to reuse LOB values in multiple
> executions, and creating an extra copy on disk would probably add
> significantly to the cost of inserting/updating a large LOB.
>
>
At one point I thought about reading the value back from store the second
time around, but never really pursued the idea due to the assumed
complexity and lack of suitable API calls.

Besides obtaining the data itself (as outlined above), there's also
the issue of determining when it's correct to do so.


-- 
Kristian


>
> --
> Knut Anders
>

Re: looking for architectural advice for dealing with streams/LOBS as inputs to UPDATE operations

Posted by Knut Anders Hatlen <kn...@oracle.com>.
mike matrigali <mi...@gmail.com> writes:

> Recent investigation for DERBY-6214 uncovered a problem with
> streams/LOBs as inputs to UPDATE statements.  The problem occurs when
> the update affects more than one row: the system reads the input once
> for each row and finds that the stream is already closed after it has
> been drained the first time.
>
> First, I am wondering whether anyone knows if the system is meant to do
> something else in this case (for instance automatically instantiate
> the object in memory), and the bug is just that we are not doing so.
>
> This part of the system has always been brittle because we don't want
> to instantiate anything in memory if we can help it.  For 2 GB BLOBs
> that is just not going to work, but for smaller LOBs it may be better
> to use a bunch of memory rather than fail the query.
>
> It would also be nice, if possible, to make using a stream more than
> once just work, by somehow "resetting" it in this case.  I am not sure
> whether the JDBC stream input contract specifies anything that we
> could count on here.

java.io.InputStream has the methods markSupported(), mark(int) and
reset() that could be used if the passed-in stream supports the
mechanism. There is of course no guarantee that the user-supplied stream
hasn't implemented mark/reset in a way that requires lots of memory for
large values.
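
As a hedged sketch only (the helper names below are invented, and a
real implementation would need to know the value's length to pick a
safe read limit for mark()), the mark/reset dance would look roughly
like this:

import java.io.IOException;
import java.io.InputStream;

class RereadableParameter {
    /**
     * Mark the user-supplied stream before the first read so it can be
     * rewound for rows 2..n.  Only useful if the stream supports
     * mark/reset and the whole value fits within the read limit.
     */
    static void markIfPossible(InputStream in, int expectedLength) {
        if (in.markSupported()) {
            in.mark(expectedLength);   // read limit must cover the whole value
        }
    }

    /** Rewind before the next row update, or fail if the stream cannot be reset. */
    static void rewind(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IOException("stream does not support mark/reset");
        }
        in.reset();
    }
}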

> Another option is to recognize up front the case where we may have to
> read the stream/LOB twice, write it to disk first, and then stream it
> back multiple times.

There is some functionality for this in LOBStreamControl. It stores a
LOB in memory if it's smaller than 32KB and spills to a temporary file
if it grows beyond that limit. Might be worth taking a look at.

I don't know how to tell up front whether a statement is likely to be
executed multiple times with the same value in the LOB parameter. I
would suppose that it's more common not to reuse LOB values in multiple
executions, and creating an extra copy on disk would probably add
significantly to the cost of inserting/updating a large LOB.



-- 
Knut Anders