Posted to derby-dev@db.apache.org by Suresh Thalamati <su...@gmail.com> on 2006/10/26 20:55:21 UTC

(DERBY-378) implementing import/export of large objects...

Since Jira e-mail is down, I am posting the following directly to 
derby-dev. It is not really that important, but I did not want it to 
be lost when the Jira system decides to send all the queued e-mail at 
once.


Currently Derby does not allow users to perform import/export using 
the system procedures if the table contains clob/blob data types. I 
think enhancing Derby to allow import/export of clob/blob data will 
be useful to users.

Some thoughts on how to implement import/export of large objects in 
Derby:

Currently Derby supports import/export using a delimited file format. 
I think the same format can be used to import/export large object 
data as well. Most of the issues are common to both import and 
export, for the obvious reason that one should be able to import data 
into a Derby database if it was exported from Derby.

Large objects can be written to the same export file given by the 
user, or they can be written to another (external) file, with a 
reference stored in the main export file.

The following sections discuss issues specific to each data type:

CLOB:

o Clob data is exported like the other character data types,
   except that it can also be written to an external file.

o Double delimiters are used when writing clob data to the export
file if the data contains the delimiter character. For example, if 
the data is 'He said "it is a nice day"', it will be written to the 
file as 'He said ""it is a nice day""'.

o If the clob data is written to an external file and only the 
reference is stored in the main export file, double delimiters are 
not required.
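The double-delimiter rule above can be sketched as follows. This is an illustrative Python sketch of the escaping scheme, not Derby's actual implementation; the function names are hypothetical:

```python
def escape_field(value, delim='"'):
    # Wrap the field in delimiters, doubling every delimiter
    # character that occurs inside the data.
    return delim + value.replace(delim, delim * 2) + delim

def unescape_field(field, delim='"'):
    # Reverse of escape_field: strip the enclosing delimiters and
    # collapse doubled delimiters back to single ones.
    return field[1:-1].replace(delim * 2, delim)
```

For character data the round trip is lossless, which is why the scheme is safe there but questionable for raw binary data.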


BLOB:

o Blob data is written to the export file exactly as it is stored in 
the database; it is not converted to hex or any other format, and no 
character code-set conversion is done for the binary data.

o If the data contains the delimiter characters, it will be a problem 
when stored in the same file as the other data types. It may not be 
such a good idea to attempt to interpret binary data to find the 
delimiters inside it and inject double delimiters. That leaves us 
with the following two options:


   1) Allow import/export of blob data only when it is stored in 
an external file.

   2) Write the blob data to the export file along with the other 
data types during export, assuming the blob data does not contain any 
delimiters, and throw an error on import if delimiters are found 
inside the data when it is interpreted using the same code-set as the 
other character data.

o When the blob data is written to an external file, there is no need 
to scan the data for delimiters during export/import.

Handling large objects in an external file:

The advantage of importing/exporting clob/blob data using an external 
file, separate from the main export file, is that the data need not 
be interpreted or modified even if there are delimiters inside the 
large object data. Import/export of large objects will also perform 
better because the data is not scanned. The main import/export file 
will contain references to where the lobs are located.

Large object reference File Format:

A large object location reference will contain the file name, the 
starting offset of the large object inside the file, and the length 
of the large object. The format will be something like 
fileName.ext.StartOffset.Length, stored as a string. For example: 
pictures.dat.100.999

The reference will be stored in the main export file. Data in the 
main export file will look like the following:

2,"pictures.dat.100.999","john"
3,"pictures.dat.999.9999","Robert"
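A reference in this format could be parsed and resolved roughly as follows. This is a hedged Python sketch of the proposed format only; parse_lob_ref and read_lob are hypothetical names, and splitting from the right lets the file name itself contain dots:

```python
import os

def parse_lob_ref(ref):
    # A reference has the form fileName.ext.StartOffset.Length,
    # e.g. "pictures.dat.100.999". Split from the right so the
    # file name may itself contain dots.
    name, offset, length = ref.rsplit(".", 2)
    return name, int(offset), int(length)

def read_lob(ref, data_dir="."):
    # Seek to the recorded offset in the lob file and read exactly
    # the recorded number of bytes for this large object.
    name, offset, length = parse_lob_ref(ref)
    with open(os.path.join(data_dir, name), "rb") as f:
        f.seek(offset)
        return f.read(length)
```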


What will the large object file name be during export?

There are two possible options:

1) Let Derby generate the file name to store lobs by appending a 
string like "_lobdata" to the export file name given by the user. 
For example, if the user specified "employees.del", the large object 
file name will be employees_lobdata.del.
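The file-name derivation in this option could look like the following sketch (illustrative only; lob_file_name is a hypothetical helper and the suffix is the one suggested above):

```python
import os

def lob_file_name(export_file, suffix="_lobdata"):
    # Insert the suffix between the base name and the extension of
    # the user-supplied export file name.
    base, ext = os.path.splitext(export_file)
    return base + suffix + ext
```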

The advantage of this approach is that there is no need to define new 
procedures. In this case, all the large object data will always be 
written to an external file by default.

The disadvantage is that if the user wants all the data for a table 
in one file for some reason (for example, if there are only a few 
clobs in the table), there is no option. Also, if the file already 
happens to exist, export will throw an error and the user has to move 
the existing file, etc.

2) Let the user specify the file name where the blob/clob data will 
be written. To support this, Derby needs to add four new 
import/export procedures, because the user needs to pass the external 
file name during export and to indicate that lobs are in an external 
file during import.

The new import/export procedure names will have "LOBS_IN_EXTFILE" 
appended, and the export procedures will have an extra argument that 
takes the file name where the blobs are written.

New Export Procedures:
SYSCS_UTIL.SYSCS_EXPORT_TABLE_LOBS_IN_EXTFILE(..,IN LOBSFILENAME 
VARCHAR(32672))
SYSCS_UTIL.SYSCS_EXPORT_QUERY_LOBS_IN_EXTFILE(..,IN LOBSFILENAME 
VARCHAR(32672))

New IMPORT Procedures:

SYSCS_UTIL.SYSCS_IMPORT_DATA_LOBS_IN_EXTFILE(..)
SYSCS_UTIL.SYSCS_IMPORT_TABLE_LOBS_IN_EXTFILE(..)


In this approach, the old procedures will either fail on large object 
data or write it along with the other column data. I am inclined 
towards the second approach, except for the fact that it adds another 
four new system procedures.

If maintaining backward compatibility for the import/export 
procedures is not necessary, then only new arguments for the existing 
procedures are needed. But if users are already calling them from 
applications, that will break. I am generally hesitant to change 
existing procedure signatures, though maybe it does not matter for 
import/export.


To summarize:

  1) Large objects are stored along with other data or in an external
     file.

  2) Binary data (the blob data type) is not modified during export, 
even if it contains delimiters.

  3) Import will fail gracefully, or get confused and throw strange 
errors, if it finds delimiters inside blob data that is stored along 
with other data in the import file.

  4) Four new procedures are required to allow users to specify where
     to read/write the large object data.


Any suggestions or comments?


Thanks
-suresh


Re: (DERBY-378) implementing import/export of large objects...

Posted by Suresh Thalamati <su...@gmail.com>.
Mike Matrigali wrote:
> 
> 
> Suresh Thalamati wrote:
> 
>>
>> Hi Mike ,
>>
>> After thinking bit more about the cases you mentioned, better solution 
>> might be to do write the binary data in HEX format when exporting the 
>> data to only one file, instead of modifying the binary data with 
>> double delimiters. This way main export file will be text file
>> as it is now.
>>
>> Only restriction with this approach is users can not use the hex 
>> number (A-F , a-f, 0-9) as the delimiter; I think this restriction is 
>> reasonable one.
>>
>> If users required raw format , they can export binary data types to a 
>> separate file using the new procedures.
> 
> I think the separate file approach is fine and resolves the most 
> important blob import/export need.  It seems simpler, and likely to
> be faster.
> 
> Doing hex for inline seems a reasonable alternative to messing with 
> escaped delimiters in raw data.  I think the rest of the inline data
> is always in some sort of "character" format.
> 
> If you only do one implementation I would think the separate file 
> approach is better.
> 

Thanks for your feedback, Mike. I will work on implementing the 
separate-file approach for now.

/suresh

Re: (DERBY-378) implementing import/export of large objects...

Posted by Mike Matrigali <mi...@sbcglobal.net>.

Suresh Thalamati wrote:

> 
> Hi Mike ,
> 
> After thinking bit more about the cases you mentioned, better solution 
> might be to do write the binary data in HEX format when exporting the 
> data to only one file, instead of modifying the binary data with double 
> delimiters. This way main export file will be text file
> as it is now.
> 
> Only restriction with this approach is users can not use the hex number 
> (A-F , a-f, 0-9) as the delimiter; I think this restriction is 
> reasonable one.
> 
> If users required raw format , they can export binary data types to a 
> separate file using the new procedures.
I think the separate file approach is fine and resolves the most 
important blob import/export need.  It seems simpler, and likely to
be faster.

Doing hex for inline seems a reasonable alternative to messing with 
escaped delimiters in raw data.  I think the rest of the inline data
is always in some sort of "character" format.

If you only do one implementation I would think the separate file 
approach is better.

> 
> 
> 
> Thanks
> -suresh
> 
> 
> 
> 
> 


Re: (DERBY-378) implementing import/export of large objects...

Posted by Suresh Thalamati <su...@gmail.com>.
Mike Matrigali wrote:
> 
> 
> Suresh Thalamati wrote:
> 
>> Mike Matrigali wrote:
>>
>>>
>>>
>>> Suresh Thalamati wrote:
>>>
>>>> Daniel John Debrunner wrote:
>>>>
>>>>> Suresh Thalamati wrote:
>>>>>
>>>>>> BLOB:
>>>>
>>>>
>>>>
>>>>
>>>> ....
>>>>
>>>>>>   1) Allow import/export of blob data only when they it is  stored 
>>>>>> in an external file.
>>>>>>
>>>>>>   2) Write the blob data to the export file along with other data 
>>>>>> types during export, assuming the blob data does not contain any 
>>>>>> delimiters and throw an error if on import if it finds delimiters 
>>>>>> inside the data, by interpreting it using the same code-set as the 
>>>>>> other character data.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> I say option 1) and I assume it's for all binary data, not just 
>>>>> BLOB, e.g. VARCHAR FOR BIT DATA etc. Seems like with binary data 
>>>>> the chance of having a problematic character in the text file is 
>>>>> close to 100%.
>>>>>
>>>>>
>>>>> Dan.
>>>>>
>>>>>
>>>>
>>>> Thanks for reading the proposal, Dan.  I agree with you , chance of 
>>>> finding  delimiter character inside a binary data is very high. I 
>>>> will go with the option 1. Your assumption is correct , it applies 
>>>> to all the binary data.
>>>
>>>
>>>
>>>
>>> I also agree, it seems like a reasonable first step to default binary 
>>> export/import to external files.
>>>
>>> I am probably missing something here though.  I thought for char data 
>>> there is an algorithm that handles delimiter data within it.  Why does
>>> that not work for binary data also?  Is it a codeset issue?
>>>
>>
>> Yes. For character data, double delimiters are used if there are 
>> delimiter characters inside the data. i.e  say if  a column contains
>> 'he said "derby is solid database" ' , then it is written to the 
>> export file as "he said ""derby is a solid database "" " . So on 
>> export of the data, data is modified before writing to the file.
>>
>> It may be possible to do the same thing by interpreting the binary 
>> data in specified code-set and  add an extra delimiter for every 
>> delimiter found on export and do the reverse on import.  But unlike 
>> the character data , if the binary data is changed and if user import 
>> it to some other application, the data may mean/look completely 
>> different if the added extra delimiter characters are not removed.
> 
> Again I think the separate file/no delimiter solution is a good first
> approach, I just wanted to understand the issue.  As you point out there
> are multiple usage scenario's here:
> 1) someone has a derby db and wants to export for use into another derby 
> db.
> 2) someone has a derby db and wants to export for use in another 
> application.
> 3) someone has some data from another app and wants to import into derby.
> 
> I think the separate file solution works for #1.  I don't know how well
> it works for option #2 and #3.  But at least for #2 it results in the
> raw data without need to process it.

Hi Mike,

After thinking a bit more about the cases you mentioned, a better 
solution might be to write the binary data in hex format when 
exporting to a single file, instead of modifying the binary data with 
double delimiters. This way the main export file will remain a text 
file, as it is now.

The only restriction with this approach is that users cannot use a 
hex character (A-F, a-f, 0-9) as the delimiter; I think this 
restriction is a reasonable one.

If users require the raw format, they can export binary data types to 
a separate file using the new procedures.
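A minimal sketch of the hex idea (illustrative only; the helper names are hypothetical, and Derby's actual encoding is not specified here): every byte maps to two characters in 0-9/A-F, so any non-hex delimiter is guaranteed never to collide with the encoded data.

```python
def export_binary(data):
    # Encode raw bytes as uppercase hex so the export file stays
    # plain text; the output uses only the characters 0-9 and A-F.
    return data.hex().upper()

def import_binary(field):
    # Decode the hex string back to the original bytes.
    return bytes.fromhex(field)
```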



Thanks
-suresh




Re: (DERBY-378) implementing import/export of large objects...

Posted by Suresh Thalamati <su...@gmail.com>.
Mike Matrigali wrote:
> 
> 
> Suresh Thalamati wrote:
> 
>> Mike Matrigali wrote:
>>
>>>
>>>
>>> Suresh Thalamati wrote:
>>>
>>>> Daniel John Debrunner wrote:
>>>>
>>>>> Suresh Thalamati wrote:
>>>>>
>>>>>> BLOB:
>>>>
>>>>
>>>>
>>>>
>>>> ....
>>>>
>>>>>>   1) Allow import/export of blob data only when they it is  stored 
>>>>>> in an external file.
>>>>>>
>>>>>>   2) Write the blob data to the export file along with other data 
>>>>>> types during export, assuming the blob data does not contain any 
>>>>>> delimiters and throw an error if on import if it finds delimiters 
>>>>>> inside the data, by interpreting it using the same code-set as the 
>>>>>> other character data.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> I say option 1) and I assume it's for all binary data, not just 
>>>>> BLOB, e.g. VARCHAR FOR BIT DATA etc. Seems like with binary data 
>>>>> the chance of having a problematic character in the text file is 
>>>>> close to 100%.
>>>>>
>>>>>
>>>>> Dan.
>>>>>
>>>>>
>>>>
>>>> Thanks for reading the proposal, Dan.  I agree with you , chance of 
>>>> finding  delimiter character inside a binary data is very high. I 
>>>> will go with the option 1. Your assumption is correct , it applies 
>>>> to all the binary data.
>>>
>>>
>>>
>>>
>>> I also agree, it seems like a reasonable first step to default binary 
>>> export/import to external files.
>>>
>>> I am probably missing something here though.  I thought for char data 
>>> there is an algorithm that handles delimiter data within it.  Why does
>>> that not work for binary data also?  Is it a codeset issue?
>>>
>>
>> Yes. For character data, double delimiters are used if there are 
>> delimiter characters inside the data. i.e  say if  a column contains
>> 'he said "derby is solid database" ' , then it is written to the 
>> export file as "he said ""derby is a solid database "" " . So on 
>> export of the data, data is modified before writing to the file.
>>
>> It may be possible to do the same thing by interpreting the binary 
>> data in specified code-set and  add an extra delimiter for every 
>> delimiter found on export and do the reverse on import.  But unlike 
>> the character data , if the binary data is changed and if user import 
>> it to some other application, the data may mean/look completely 
>> different if the added extra delimiter characters are not removed.
> 
> Again I think the separate file/no delimiter solution is a good first
> approach, I just wanted to understand the issue.  As you point out there
> are multiple usage scenario's here:
> 1) someone has a derby db and wants to export for use into another derby 
> db.
> 2) someone has a derby db and wants to export for use in another 
> application.
> 3) someone has some data from another app and wants to import into derby.
> 
> I think the separate file solution works for #1.  I don't know how well
> it works for option #2 and #3.  But at least for #2 it results in the
> raw data without need to process it.
> 


The separate file/no delimiter solution will also work for #2 and #3, 
if the other application imports/exports data in the same fashion. 
But I guess in reality other applications may not follow the same 
formats.

Do you think it would make #2 and #3 easier if Derby also supported 
import/export of binary data using a single file, similar to the 
character data?

If you think it would be useful, Derby can support the following:

1) Instead of throwing an error when a user attempts to import/export 
binary data using the existing procedures, they can be modified to 
handle binary data in the same file as the other data. Binary data is 
exported using double delimiters, similar to character data, and on 
import the extra delimiter characters are removed.

2) Import/export of binary data to a separate file.

If the user does not want the binary data to be modified, they can 
use option 2.


Thanks
-suresh

Re: (DERBY-378) implementing import/export of large objects...

Posted by Mike Matrigali <mi...@sbcglobal.net>.

Suresh Thalamati wrote:
> Mike Matrigali wrote:
> 
>>
>>
>> Suresh Thalamati wrote:
>>
>>> Daniel John Debrunner wrote:
>>>
>>>> Suresh Thalamati wrote:
>>>>
>>>>> BLOB:
>>>
>>>
>>>
>>> ....
>>>
>>>>>   1) Allow import/export of blob data only when they it is  stored 
>>>>> in an external file.
>>>>>
>>>>>   2) Write the blob data to the export file along with other data 
>>>>> types during export, assuming the blob data does not contain any 
>>>>> delimiters and throw an error if on import if it finds delimiters 
>>>>> inside the data, by interpreting it using the same code-set as the 
>>>>> other character data.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> I say option 1) and I assume it's for all binary data, not just 
>>>> BLOB, e.g. VARCHAR FOR BIT DATA etc. Seems like with binary data the 
>>>> chance of having a problematic character in the text file is close 
>>>> to 100%.
>>>>
>>>>
>>>> Dan.
>>>>
>>>>
>>>
>>> Thanks for reading the proposal, Dan.  I agree with you , chance of 
>>> finding  delimiter character inside a binary data is very high. I 
>>> will go with the option 1. Your assumption is correct , it applies to 
>>> all the binary data.
>>
>>
>>
>> I also agree, it seems like a reasonable first step to default binary 
>> export/import to external files.
>>
>> I am probably missing something here though.  I thought for char data 
>> there is an algorithm that handles delimiter data within it.  Why does
>> that not work for binary data also?  Is it a codeset issue?
>>
> 
> Yes. For character data, double delimiters are used if there are 
> delimiter characters inside the data. i.e  say if  a column contains
> 'he said "derby is solid database" ' , then it is written to the export 
> file as "he said ""derby is a solid database "" " . So on export of the 
> data, data is modified before writing to the file.
> 
> It may be possible to do the same thing by interpreting the binary data 
> in specified code-set and  add an extra delimiter for every delimiter 
> found on export and do the reverse on import.  But unlike the character 
> data , if the binary data is changed and if user import it to some other 
> application, the data may mean/look completely different if the added 
> extra delimiter characters are not removed.
Again I think the separate file/no delimiter solution is a good first
approach, I just wanted to understand the issue.  As you point out there
are multiple usage scenarios here:
1) someone has a derby db and wants to export for use into another derby db.
2) someone has a derby db and wants to export for use in another 
application.
3) someone has some data from another app and wants to import into derby.

I think the separate file solution works for #1.  I don't know how well
it works for option #2 and #3.  But at least for #2 it results in the
raw data without need to process it.

> 
> Another thing to note here is in a character data user knows there might 
> be delimiter characters inside and specify a delimiter character that is 
> not in the data, if he/she does not want data to modified on export. 
> Where as with binary data that is not possible at all.
> 
> If you think modifying a binary data on export is ok,  then it might be 
> possible to use same concept as character data.  My only concern here is 
> if we export image of a "cat"  and add delimiters to it by interpreting 
> it as character data, it may look  like a "tiger" if extra characters 
> added is not removed :-)
> 
> Thanks
> -suresh
> 
> 


Re: (DERBY-378) implementing import/export of large objects...

Posted by Suresh Thalamati <su...@gmail.com>.
Mike Matrigali wrote:
> 
> 
> Suresh Thalamati wrote:
> 
>> Daniel John Debrunner wrote:
>>
>>> Suresh Thalamati wrote:
>>>
>>>> BLOB:
>>
>>
>> ....
>>
>>>>   1) Allow import/export of blob data only when they it is  stored 
>>>> in an external file.
>>>>
>>>>   2) Write the blob data to the export file along with other data 
>>>> types during export, assuming the blob data does not contain any 
>>>> delimiters and throw an error if on import if it finds delimiters 
>>>> inside the data, by interpreting it using the same code-set as the 
>>>> other character data.
>>>
>>>
>>>
>>>
>>> I say option 1) and I assume it's for all binary data, not just BLOB, 
>>> e.g. VARCHAR FOR BIT DATA etc. Seems like with binary data the chance 
>>> of having a problematic character in the text file is close to 100%.
>>>
>>>
>>> Dan.
>>>
>>>
>>
>> Thanks for reading the proposal, Dan.  I agree with you , chance of 
>> finding  delimiter character inside a binary data is very high. I will 
>> go with the option 1. Your assumption is correct , it applies to all 
>> the binary data.
> 
> 
> I also agree, it seems like a reasonable first step to default binary 
> export/import to external files.
> 
> I am probably missing something here though.  I thought for char data 
> there is an algorithm that handles delimiter data within it.  Why does
> that not work for binary data also?  Is it a codeset issue?
> 

Yes. For character data, double delimiters are used if there are 
delimiter characters inside the data. For example, if a column 
contains 'he said "derby is a solid database"', it is written to the 
export file as "he said ""derby is a solid database""". So on export, 
the data is modified before being written to the file.

It may be possible to do the same thing by interpreting the binary 
data in the specified code-set, adding an extra delimiter for every 
delimiter found on export, and doing the reverse on import. But 
unlike character data, if the binary data is changed and the user 
imports it into some other application, the data may mean/look 
completely different if the added delimiter characters are not 
removed.

Another thing to note here is that with character data the user knows 
there might be delimiter characters inside the data, and can specify 
a delimiter character that does not occur in it if he/she does not 
want the data modified on export. With binary data that is not 
possible at all.

If you think modifying binary data on export is ok, then it might be 
possible to use the same concept as for character data. My only 
concern is that if we export an image of a "cat" and add delimiters 
to it by interpreting it as character data, it may look like a 
"tiger" if the extra characters are not removed :-)

Thanks
-suresh

Re: (DERBY-378) implementing import/export of large objects...

Posted by Mike Matrigali <mi...@sbcglobal.net>.

Suresh Thalamati wrote:
> Daniel John Debrunner wrote:
> 
>> Suresh Thalamati wrote:
>>
>>> BLOB:
> 
> ....
> 
>>>   1) Allow import/export of blob data only when they it is  stored in 
>>> an external file.
>>>
>>>   2) Write the blob data to the export file along with other data 
>>> types during export, assuming the blob data does not contain any 
>>> delimiters and throw an error if on import if it finds delimiters 
>>> inside the data, by interpreting it using the same code-set as the 
>>> other character data.
>>
>>
>>
>> I say option 1) and I assume it's for all binary data, not just BLOB, 
>> e.g. VARCHAR FOR BIT DATA etc. Seems like with binary data the chance 
>> of having a problematic character in the text file is close to 100%.
>>
>>
>> Dan.
>>
>>
> 
> Thanks for reading the proposal, Dan.  I agree with you , chance of 
> finding  delimiter character inside a binary data is very high. I will 
> go with the option 1. Your assumption is correct , it applies to all the 
> binary data.

I also agree, it seems like a reasonable first step to default binary 
export/import to external files.

I am probably missing something here though.  I thought for char data 
there is an algorithm that handles delimiter data within it.  Why does
that not work for binary data also?  Is it a codeset issue?

> 
> 
> Thanks
> -suresh
> 
> 
> 


Re: (DERBY-378) implementing import/export of large objects...

Posted by Suresh Thalamati <su...@gmail.com>.
Daniel John Debrunner wrote:
> Suresh Thalamati wrote:
> 
>> BLOB:
....

>>   1) Allow import/export of blob data only when they it is  stored in 
>> an external file.
>>
>>   2) Write the blob data to the export file along with other data 
>> types during export, assuming the blob data does not contain any 
>> delimiters and throw an error if on import if it finds delimiters 
>> inside the data, by interpreting it using the same code-set as the 
>> other character data.
> 
> 
> I say option 1) and I assume it's for all binary data, not just BLOB, 
> e.g. VARCHAR FOR BIT DATA etc. Seems like with binary data the chance of 
> having a problematic character in the text file is close to 100%.
> 
> 
> Dan.
> 
> 

Thanks for reading the proposal, Dan.  I agree with you; the chance 
of finding a delimiter character inside binary data is very high. I 
will go with option 1. Your assumption is correct; it applies to all 
the binary data.


Thanks
-suresh


Re: (DERBY-378) implementing import/export of large objects...

Posted by Daniel John Debrunner <dj...@apache.org>.
Suresh Thalamati wrote:

> BLOB:
> 
> o Blob data is written to export file as it is stored in the database, 
> it is not converted to hex or any another format. character code-set 
> conversion is not done for the binary data.
> 
> o If the data contains the delimiters , then it will be a problem if 
> stored in the same file with other data types. It may not be such a good 
> idea to attempt to interpret binary data to find the delimiters inside 
> the data and inject double delimiters into the data. That leaves us with 
> following two options :
> 
> 
>   1) Allow import/export of blob data only when they it is  stored in an 
> external file.
> 
>   2) Write the blob data to the export file along with other data types 
> during export, assuming the blob data does not contain any delimiters 
> and throw an error if on import if it finds delimiters inside the data, 
> by interpreting it using the same code-set as the other character data.

I say option 1) and I assume it's for all binary data, not just BLOB, 
e.g. VARCHAR FOR BIT DATA etc. Seems like with binary data the chance of 
having a problematic character in the text file is close to 100%.


Dan.