Posted to derby-dev@db.apache.org by Kristian Waagan <Kr...@Sun.COM> on 2007/05/22 20:37:24 UTC

Modified UTF-8 or UTF-16 for temporary Clobs?

Hello,

In my work on DERBY-2646, I have stumbled upon some issues that can 
greatly affect the performance of accessing Clobs, especially updating them.

Currently Clobs are stored on disk in the modified UTF-8 encoding. This 
uses one to three bytes to represent a single character. Since the 
number of bytes per character varies, there is no easy way to calculate 
the byte position from the character position, or vice versa. The naive, 
and maybe even the only feasible way, is to start decoding the bytes 
from the start of the Clob.
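To illustrate, here is a minimal sketch of such a naive scan (just an 
illustration, not the actual Derby code):

    // Map a 1-based character position to a byte offset in a buffer
    // holding modified UTF-8, by decoding group lengths from the start.
    // In modified UTF-8 every group is 1-3 bytes and corresponds to
    // exactly one Java char (surrogate halves are encoded separately).
    private static long charPosToBytePos(byte[] buf, long charPos) {
        long bytePos = 0;
        for (long c = 1; c < charPos; c++) {
            int b = buf[(int) bytePos] & 0xFF;
            if ((b & 0x80) == 0x00) {        // 0xxxxxxx -> one byte
                bytePos += 1;
            } else if ((b & 0xE0) == 0xC0) { // 110xxxxx -> two bytes
                bytePos += 2;
            } else {                         // 1110xxxx -> three bytes
                bytePos += 3;
            }
        }
        return bytePos;
    }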

Note that the storage I speak of is the temporary storage of Clob 
copies. This is initiated when the user attempts to modify the Clob. I 
am not considering the case where the Clob is stored in the database itself.

Obviously, reading the Clob from the start every time you need to 
reposition is not very efficient. One optimization is to keep track of 
the "current position", but it might not help that much (depending on 
access pattern). This requires full knowledge about update actions, 
including those on the various streams/writers.
Another option is storing the Clob in UTF-16. This would allow direct 
mapping between byte and character positions, as far as I have 
understood (I had brief contact with Dag and Bernt offline), even in the 
case of surrogate pairs.
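In other words, the lookup that requires decoding in UTF-8 becomes plain 
arithmetic in UTF-16. A one-line sketch, where charPos is the 1-based 
JDBC character position (counted in Java chars, so each half of a 
surrogate pair counts as one):

    long bytePos = 2 * (charPos - 1);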

However, using UTF-16 imposes space overhead when operating on Clobs 
with US-ASCII characters, in fact the overhead is 100% (each character 
represented by two bytes instead of one). For some other languages 
(and/or character sets), using UTF-16 reduces the space requirements 
(two versus three bytes).



To summarize my view on this...


Pros, UTF-8 : more space efficient for US-ASCII, same as used by store
Pros, UTF-16: direct mapping between char/byte pos (easier logic)

Cons, UTF-8 : requires "counting"/decoding to find byte position
Cons, UTF-16: space overhead for US-ASCII, must be converted when/if 
Clob goes back into the database

I'm sure there are other aspects, and I would like some opinions and 
feedback on what to do. My two current alternatives on the table are 
using the naive counting technique, or changing to UTF-16. The former 
requires the least code changes.


To bound the scope of potential changes, I do plan to get this done for 
10.3...



thanks,
-- 
Kristian

Re: Modified UTF-8 or UTF-16 for temporary Clobs?

Posted by Kristian Waagan <Kr...@Sun.COM>.
Kristian Waagan wrote:
> Knut Anders Hatlen wrote:
>> Kristian Waagan <Kr...@Sun.COM> writes:
>>
>>> Hello,
>>>
>>> In my work on DERBY-2646, I have stumbled upon some issues that can
>>> greatly affect the performance of accessing Clobs, especially updating
>>> them.
>>
>> [....]
>>
>>> To summarize my view on this...
>>>
>>>
>>> Pros, UTF-8 : more space efficient for US-ASCII, same as used by store
>>> Pros, UTF-16: direct mapping between char/byte pos (easier logic)
>>>
>>> Cons, UTF-8 : requires "counting"/decoding to find byte position
>>> Cons, UTF-16: space overhead for US-ASCII, must be converted when/if
>>> Clob goes back into the database
>>>
>>> I'm sure there are other aspects, and I would like some opinions and
>>> feedback on what to do. My two current alternatives on the table are
>>> using the naive counting technique, or changing to UTF-16. The former
>>> requires the least code changes.
>>
>> Please correct me if I got it wrong, but based on what you wrote above,
>> it seems like we now have the following situation:
>>
>> To allow updates of a Clob at random positions (that is, with
>> Clob.setString()), we create a copy of the Clob in a temporary
>> file. However, we need to read the temporary file sequentially from the
>> beginning for each operation in order to find the correct byte
>> position. So we only have sequential access to the file that is supposed
>> to give us random access to the Clob.
>>
> 
> Your description of the current situation is generally correct, but not 
> quite accurate. We don't start at the beginning of the temporary file 
> when trying to find the byte position for a character position.
> The method doing the lookup is supplied with a "position hint", which is 
> a byte position. However, this will not always work and it is a bug in 
> the current implementation.
> 
> Even though we are able to locate *the next* character after the hint 
> position, and thus its byte position, we do now know which character 
> position it has.

Doh, too late...

It should be "we do *not* know".


sorry for the typo/mindslip and the noise
-- 
Kristian

> 
> This bug was causing the UTFDataFormatException that has been observed 
> when using characters that are encoded with more than one byte in UTF-8.
> 
> 
> To me it seems like the method is not used the way it was intended to be 
> used. Looking more at it, it seems the charPos is relative to the 
> byte/hint position. I will be rewriting it anyway, and it is no secret 
> that working with UTF-8 is more complex than UTF-16 when it comes to 
> mapping character positions to byte positions.
> 
> 
> 


Re: Modified UTF-8 or UTF-16 for temporary Clobs?

Posted by Kristian Waagan <Kr...@Sun.COM>.
Knut Anders Hatlen wrote:
> Kristian Waagan <Kr...@Sun.COM> writes:
> 
>> Hello,
>>
>> In my work on DERBY-2646, I have stumbled upon some issues that can
>> greatly affect the performance of accessing Clobs, especially updating
>> them.
> 
> [....]
> 
>> To summarize my view on this...
>>
>>
>> Pros, UTF-8 : more space efficient for US-ASCII, same as used by store
>> Pros, UTF-16: direct mapping between char/byte pos (easier logic)
>>
>> Cons, UTF-8 : requires "counting"/decoding to find byte position
>> Cons, UTF-16: space overhead for US-ASCII, must be converted when/if
>> Clob goes back into the database
>>
>> I'm sure there are other aspects, and I would like some opinions and
>> feedback on what to do. My two current alternatives on the table are
>> using the naive counting technique, or changing to UTF-16. The former
>> requires the least code changes.
> 
> Please correct me if I got it wrong, but based on what you wrote above,
> it seems like we now have the following situation:
> 
> To allow updates of a Clob at random positions (that is, with
> Clob.setString()), we create a copy of the Clob in a temporary
> file. However, we need to read the temporary file sequentially from the
> beginning for each operation in order to find the correct byte
> position. So we only have sequential access to the file that is supposed
> to give us random access to the Clob.
> 

Your description of the current situation is generally correct, but not 
quite accurate. We don't start at the beginning of the temporary file 
when trying to find the byte position for a character position.
The method doing the lookup is supplied with a "position hint", which is 
a byte position. However, this will not always work and it is a bug in 
the current implementation.

Even though we are able to locate *the next* character after the hint 
position, and thus its byte position, we do now know which character 
position it has.

This bug was causing the UTFDataFormatException that has been observed 
when using characters that are encoded with more than one byte in UTF-8.


To me it seems like the method is not used the way it was intended to be 
used. Looking more at it, it seems the charPos is relative to the 
byte/hint position. I will be rewriting it anyway, and it is no secret 
that working with UTF-8 is more complex than UTF-16 when it comes to 
mapping character positions to byte positions.
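To make the problem concrete, the resynchronization itself is the easy 
part. A sketch, not the actual code:

    // From an arbitrary byte position, skip continuation bytes
    // (10xxxxxx) until we hit the lead byte of the next character.
    static long nextCharBoundary(byte[] buf, long hint) {
        long pos = hint;
        while (pos < buf.length && (buf[(int) pos] & 0xC0) == 0x80) {
            pos++;
        }
        // pos is now a valid byte position, but without a character
        // count from a known anchor we cannot tell which character
        // position it corresponds to.
        return pos;
    }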



-- 
Kristian

[ snip - argumentation for using UTF-16 ]

Re: Modified UTF-8 or UTF-16 for temporary Clobs?

Posted by Knut Anders Hatlen <Kn...@Sun.COM>.
Kristian Waagan <Kr...@Sun.COM> writes:

> Hello,
>
> In my work on DERBY-2646, I have stumbled upon some issues that can
> greatly affect the performance of accessing Clobs, especially updating
> them.

[....]

> To summarize my view on this...
>
>
> Pros, UTF-8 : more space efficient for US-ASCII, same as used by store
> Pros, UTF-16: direct mapping between char/byte pos (easier logic)
>
> Cons, UTF-8 : requires "counting"/decoding to find byte position
> Cons, UTF-16: space overhead for US-ASCII, must be converted when/if
> Clob goes back into the database
>
> I'm sure there are other aspects, and I would like some opinions and
> feedback on what to do. My two current alternatives on the table are
> using the naive counting technique, or changing to UTF-16. The former
> requires the least code changes.

Please correct me if I got it wrong, but based on what you wrote above,
it seems like we now have the following situation:

To allow updates of a Clob at random positions (that is, with
Clob.setString()), we create a copy of the Clob in a temporary
file. However, we need to read the temporary file sequentially from the
beginning for each operation in order to find the correct byte
position. So we only have sequential access to the file that is supposed
to give us random access to the Clob.

If the purpose of the temporary file is to give random access to the
Clob, then I definitely think UTF-16 is a better choice than UTF-8. I'm
not sure how important the space overhead for 7-bit ASCII is, as long as
there is zero or negative overhead for all non-ASCII characters.

Space and performance considerations aside, the simpler relation between
byte positions and character positions in UTF-16 would probably make it
easier to write bug-free code. Since all chars are treated equally, we
wouldn't have to come up with a great number of tests testing all
possible combinations of single-byte chars and multi-byte chars, and it
would therefore be easier to gain confidence in the correctness of the
code.

-- 
Knut Anders

Re: Modified UTF-8 or UTF-16 for temporary Clobs?

Posted by Oystein Grovlen - Sun Norway <Oy...@Sun.COM>.
Kristian Waagan wrote:
> However, using UTF-16 imposes space overhead when operating on Clobs 
> with US-ASCII characters, in fact the overhead is 100% (each character 
> represented by two bytes instead of one). For some other languages 
> (and/or character sets), using UTF-16 reduces the space requirements 
> (two versus three bytes).

As long as this is just used for active LOBs (i.e, LOBs updated by 
active transactions), I do not think the extra disk usage should be a 
major concern.  It would normally be just a small fraction of the total 
disk space used by the database.

-- 
Øystein

Re: Modified UTF-8 or UTF-16 for temporary Clobs?

Posted by Øystein Grøvlen <Oy...@Sun.COM>.
Mike Matrigali wrote:
> 
> 
> Oystein Grovlen - Sun Norway wrote:
>> Mike Matrigali wrote:
>>> What is the expected usage pattern for an update on a clob that uses
>>> these "temporary" clobs?  What is the usual input format, what is the
>>> usual output format?  Do you expect more than one update usually?  Does
>>> an update have to rewrite the end of the file on a shrink or expand of
>>> the middle of the clob?
>>
>>
>> In client/server, a single update may cause a sequence of updates 
>> against the embedded driver.  This is because the locator 
>> implementation uses a stored procedure to perform the update, and 
>> Derby does not support Clob as procedure parameters.  To overcome 
>> that, the Clob is split into segments of the maximum size for 
>> VARCHAR.  Hence, it is important that the overhead of positioning is 
>> not big in these cases.  One way to optimize for this would be to 
>> remember the end position from the last write.
>>
> How do these calls relate to calls to the store code for the update?  Is
> there still one call to store or one per segment?

The calls work on a temporary copy of the Clob. Store is affected when:

1) ResultSet.updateRow is called and the Clob is a column in the 
result set.

2) PreparedStatement.execute is called and the Clob is a parameter.

At that point the Clob is streamed into store the same way as before.
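For illustration, the two paths look roughly like this (a sketch; the 
table "t" and column "clobCol" are made up):

    import java.sql.*;

    static void pushBack(Connection conn) throws SQLException {
        // 1) Updatable result set; store is written on updateRow():
        Statement stmt = conn.createStatement(
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_UPDATABLE);
        ResultSet rs = stmt.executeQuery(
                "SELECT clobCol FROM t WHERE id = 1");
        rs.next();
        Clob clob = rs.getClob(1);
        clob.setString(1, "xyz");  // only touches the temporary copy
        rs.updateClob(1, clob);
        rs.updateRow();            // streamed into store here

        // 2) Parameter; store is written on execute():
        PreparedStatement ps = conn.prepareStatement(
                "UPDATE t SET clobCol = ? WHERE id = 2");
        ps.setClob(1, clob);
        ps.execute();              // streamed into store here
    }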

--
Øystein



Re: Modified UTF-8 or UTF-16 for temporary Clobs?

Posted by Mike Matrigali <mi...@sbcglobal.net>.

Oystein Grovlen - Sun Norway wrote:
> Mike Matrigali wrote:
>> What is the expected usage pattern for an update on a clob that uses
>> these "temporary" clobs?  What is the usual input format, what is the
>> usual output format?  Do you expect more than one update usually?  Does
>> an update have to rewrite the end of the file on a shrink or expand of
>> the middle of the clob?
> 
> 
> In client/server, a single update may cause a sequence of updates 
> against the embedded driver.  This is because the locator 
> implementation uses a stored procedure to perform the update, and 
> Derby does not support Clob as procedure parameters.  To overcome 
> that, the Clob is split into segments of the maximum size for VARCHAR. 
> Hence, it is important that the overhead of positioning is not big in 
> these cases.  One way to optimize for this would be to remember the 
> end position from the last write.
> 
How do these calls relate to calls to the store code for the update?  Is
there still one call to store or one per segment?


Re: Modified UTF-8 or UTF-16 for temporary Clobs?

Posted by Oystein Grovlen - Sun Norway <Oy...@Sun.COM>.
Mike Matrigali wrote:
> What is the expected usage pattern for an update on a clob that uses
> these "temporary" clobs?  What is the usual input format, what is the
> usual output format?  Do you expect more than one update usually?  Does
> an update have to rewrite the end of the file on a shrink or expand of
> the middle of the clob?

In client/server, a single update may cause a sequence of updates 
against the embedded driver.  This is because the locator implementation 
uses a stored procedure to perform the update, and Derby does not 
support Clob as procedure parameters.  To overcome that, the Clob is 
split into segments of the maximum size for VARCHAR.  Hence, it is 
important that the overhead of positioning is not big in these cases. 
One way to optimize for this would be to remember the end position from 
the last write.
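Conceptually, the splitting amounts to something like this (a sketch, 
not the actual driver code; 32672, the maximum VARCHAR length, is the 
assumed segment size):

    static void writeInSegments(java.sql.Clob clob, long pos, String val)
            throws java.sql.SQLException {
        final int MAX = 32672; // maximum VARCHAR length
        for (int off = 0; off < val.length(); off += MAX) {
            String seg = val.substring(off,
                    Math.min(off + MAX, val.length()));
            // Each call must position itself in the temporary copy again.
            clob.setString(pos + off, seg);
        }
    }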

-- 
Øystein

Re: Modified UTF-8 or UTF-16 for temporary Clobs?

Posted by Kristian Waagan <Kr...@Sun.COM>.
Mike Matrigali wrote:
> 
> 
> Kristian Waagan wrote:
>> Hello,
>>
>> In my work on DERBY-2646, I have stumbled upon some issues that can 
>> greatly affect the performance of accessing Clobs, especially updating 
>> them.
>>
>> Currently Clobs are stored on disk in the modified UTF-8 encoding. 
>> This uses one to three bytes to represent a single character. Since 
>> the number of bytes per character varies, there is no easy way to 
>> calculate the byte position from the character position, or vice 
>> versa. The naive, and maybe even the only feasible way, is to start 
>> decoding the bytes from the start of the Clob.
>>
>> Note that the storage I speak of is the temporary storage of Clob 
>> copies. This is initiated when the user attempts to modify the Clob. I 
>> am not considering the case where the Clob is stored in the database 
>> itself.
>>
>> Obviously, reading the Clob from the start every time you need to 
>> reposition is not very efficient. One optimization is to keep track of 
>> the "current position", but it might not help that much (depending on 
>> access pattern). This requires full knowledge about update actions, 
>> including those on the various streams/writers.
>> Another option is storing the Clob in UTF-16. This would allow direct 
>> mapping between byte and character positions, as far as I have 
>> understood (I had brief contact with Dag and Bernt offline), even in 
>> the case of surrogate pairs.
>>
>> However, using UTF-16 imposes space overhead when operating on Clobs 
>> with US-ASCII characters, in fact the overhead is 100% (each character 
>> represented by two bytes instead of one). For some other languages 
>> (and/or character sets), using UTF-16 reduces the space requirements 
>> (two versus three bytes).
>>
>>
>>
>> To summarize my view on this...
>>
>>
>> Pros, UTF-8 : more space efficient for US-ASCII, same as used by store
>> Pros, UTF-16: direct mapping between char/byte pos (easier logic)
>>
>> Cons, UTF-8 : requires "counting"/decoding to find byte position
>> Cons, UTF-16: space overhead for US-ASCII, must be converted when/if 
>> Clob goes back into the database
> Can you describe in more detail in which situations you are proposing
> to use UTF-16 vs. UTF-8?  I know that there is a lot of performance
> overhead in converting from one to the other, and I know that in the
> past Derby had often even converted back and forth through bad code.
> Are the changes you are proposing going to affect the non-update case?

Hi Mike,

Briefly stated, the encoding issue comes into play when the first update 
to the clob is issued. After this everything goes via the temporary 
copy, which is in memory or on disk depending on size.
Non-update cases are not affected.

If the user never issues a modification operation (setString, 
setCharacterStream, setAsciiStream, truncate), only streams from store 
will be used to fetch the required data.

> 
> It would be nice if the following happens:
> 1) INSERT
>    From whatever input (stream, clob, string ...) we convert it once
>    to the modified UTF-8 format that store uses on disk.  In the case
>    of stream we should read it only once and never flow it to object
>    or disk before getting it into store.

For insertions of new clobs (or other appropriate data types) through 
PreparedStatement, this does happen - although I haven't checked if 
using setAsciiStream causes a byte-char-byte conversion or not.

Note that there are two different types of setCharacterStream methods:
  A) PreparedStatement.setCharacterStream(column, Reader)
  B) Writer writer = Clob.setCharacterStream(pos)

Using B, a temporary clob will be created and the contents will spill to 
disk when the size threshold is reached.
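In code, the difference is roughly as follows (a sketch; conn, reader, 
length and clob are assumed to be set up already, and the table/column 
names are made up):

    // A) Parameter stream: read once and inserted into store on execute.
    PreparedStatement ps = conn.prepareStatement(
            "INSERT INTO t (clobCol) VALUES (?)");
    ps.setCharacterStream(1, reader, length);
    ps.executeUpdate();

    // B) Clob writer: writes go to the temporary clob, which may
    //    spill to disk once it grows past the threshold.
    java.io.Writer w = clob.setCharacterStream(1);
    w.write("replacement text");
    w.close();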

> 
> 2) SELECT
>    Should be converted once from modified utf-8 into whatever format
>    is requested by the select with no intermediate object or disk
>    copies.

I believe this is also the current state. Again, it is only after an 
update that a temporary clob is used.

> 
> 
> What is the expected usage pattern for an update on a clob that uses
> these "temporary" clobs?

Use of temporary clobs is triggered by updates to the clob. If clobs 
are updated, the number of updates per transaction is totally 
application dependent.

> What is the usual input format, what is the
> usual output format?

The input formats are those of String, Reader and InputStream.
The output formats are those of String, Writer and OutputStream.

> Do you expect more than one update usually?  

I don't know...

> Does an update have to rewrite the end of the file on a shrink or
> expand of the middle of the clob?

Expansion and shrinking are certainly necessary when using UTF-8.
When you do a setString(pos, str), you basically overwrite a range of 
existing characters with the characters in the insertion string. If the 
characters replaced are not represented with the same number of bytes as 
the inserted characters, the byte array on disk (or in memory) must be 
expanded or shrunk accordingly.

If we had used an encoding with a fixed number of bytes per char in 
store, we could have gotten away with skipping to the right position and 
then just streaming the new value into store.
As I see it, this is not possible when using UTF-8 in store.

There is no functionality for inserting a new string without overwriting 
existing content, but appending to the value is possible. You can also 
truncate the clob.
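A small worked example (DataOutputStream.writeUTF produces modified 
UTF-8 with a two-byte length header, which the helper subtracts):

    import java.io.*;

    static int modifiedUtf8Length(String s) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new DataOutputStream(bos).writeUTF(s);
        return bos.size() - 2; // strip writeUTF's length header
    }

    // modifiedUtf8Length("aaa") == 3, modifiedUtf8Length("æøå") == 6.
    // A setString replacing "aaa" with "æøå" keeps the character count
    // but doubles the byte count, so the tail of the temporary copy
    // must be shifted to make room.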


Last, calling Connection.createClob() will always result in a temporary 
clob being created.



-- 
Kristian

>>
>> I'm sure there are other aspects, and I would like some opinions and 
>> feedback on what to do. My two current alternatives on the table are 
>> using the naive counting technique, or changing to UTF-16. The former 
>> requires the least code changes.
>>
>>
>> To bound the scope of potential changes, I do plan to get this done 
>> for 10.3...
>>
>>
>>
>> thanks,
> 


Re: Modified UTF-8 or UTF-16 for temporary Clobs?

Posted by Mike Matrigali <mi...@sbcglobal.net>.

Kristian Waagan wrote:
> Hello,
> 
> In my work on DERBY-2646, I have stumbled upon some issues that can 
> greatly affect the performance of accessing Clobs, especially updating 
> them.
> 
> Currently Clobs are stored on disk in the modified UTF-8 encoding. This 
> uses one to three bytes to represent a single character. Since the 
> number of bytes per character varies, there is no easy way to calculate 
> the byte position from the character position, or vice versa. The naive, 
> and maybe even the only feasible way, is to start decoding the bytes 
> from the start of the Clob.
> 
> Note that the storage I speak of is the temporary storage of Clob 
> copies. This is initiated when the user attempts to modify the Clob. I 
> am not considering the case where the Clob is stored in the database 
> itself.
> 
> Obviously, reading the Clob from the start every time you need to 
> reposition is not very efficient. One optimization is to keep track of 
> the "current position", but it might not help that much (depending on 
> access pattern). This requires full knowledge about update actions, 
> including those on the various streams/writers.
> Another option is storing the Clob in UTF-16. This would allow direct 
> mapping between byte and character positions, as far as I have 
> understood (I had brief contact with Dag and Bernt offline), even in the 
> case of surrogate pairs.
> 
> However, using UTF-16 imposes space overhead when operating on Clobs 
> with US-ASCII characters, in fact the overhead is 100% (each character 
> represented by two bytes instead of one). For some other languages 
> (and/or character sets), using UTF-16 reduces the space requirements 
> (two versus three bytes).
> 
> 
> 
> To summarize my view on this...
> 
> 
> Pros, UTF-8 : more space efficient for US-ASCII, same as used by store
> Pros, UTF-16: direct mapping between char/byte pos (easier logic)
> 
> Cons, UTF-8 : requires "counting"/decoding to find byte position
> Cons, UTF-16: space overhead for US-ASCII, must be converted when/if 
> Clob goes back into the database
Can you describe in more detail in which situations you are proposing
to use UTF-16 vs. UTF-8?  I know that there is a lot of performance
overhead in converting from one to the other, and I know that in the
past Derby had often even converted back and forth through bad code.
Are the changes you are proposing going to affect the non-update case?

It would be nice if the following happens:
1) INSERT
    From whatever input (stream, clob, string ...) we convert it once
    to the modified UTF-8 format that store uses on disk.  In the case
    of stream we should read it only once and never flow it to object
    or disk before getting it into store.

2) SELECT
    Should be converted once from modified utf-8 into whatever format
    is requested by the select with no intermediate object or disk
    copies.


What is the expected usage pattern for an update on a clob that uses
these "temporary" clobs?  What is the usual input format, what is the
usual output format?  Do you expect more than one update usually?  Does
an update have to rewrite the end of the file on a shrink or expand of
the middle of the clob?
> 
> I'm sure there are other aspects, and I would like some opinions and 
> feedback on what to do. My two current alternatives on the table are 
> using the naive counting technique, or changing to UTF-16. The former 
> requires the least code changes.
> 
> 
> To bound the scope of potential changes, I do plan to get this done for 
> 10.3...
> 
> 
> 
> thanks,