You are viewing a plain text version of this content. The canonical link for it is here.
Posted to derby-user@db.apache.org by Albert Kam <mo...@gmail.com> on 2008/06/25 12:59:21 UTC

CSV with fails with doublequote inside the value

Hello Apache Derby,

I'm using db-derby-10.4.1.3-bin, and when i'm importing a csv file like this
:

*ij> CALL SYSCS_UTIL.SYSCS_IMPORT_TABLE
(null,'SMS_OUT','/home/albert/mnt/linuxdb/rnd/sofco_sms/CSVs/sms_out.csv',null,null,null,0);
*
an error pops out like this :
*ERROR XIE0R: Import error on line 11 of file
/home/albert/mnt/linuxdb/rnd/sofco_sms/CSVs/sms_out.csv: Data found on line
11 for column 3 after the stop delimiter.
ERROR XIE03: Data found on line 11 for column 3 after the stop delimiter.
*
I check the file, and in the line 11, it goes like this :
"92","+628xxxxxxx","*Testinggg -__-"*","2006-08-03
11:43:42","0","0","-1","-1","-1"
(Notice the -__-"") the content should be *Testinggg -__-"*

So, i tried changing the *Testinggg -__-"* into *Testinggg -__-\"*, but
still errornous ..

Later after googling for a while, i found out about this :
http://www.perlmonks.org/?node_id=678257

Here's one interesting quote :
*

CSV as defined by RFC 4180 <http://tools.ietf.org/html/rfc4180> does not
"escape" double quotes with a backslash, but rather by an additional set of
double quotes. Your parser fails to handle this format properly.

CSV is hard.
*
Is this a bug or is it something that i did ? For now, i removed the
doublequotes and it worked like charm.

Regards,
Albert Kam

-- 
Do not pursue the past. Do not lose yourself in the future.
The past no longer is. The future has not yet come.
Looking deeply at life as it is in the very here and now,
the practitioner dwells in stability and freedom.
(Thich Nhat Hanh)

Re: CSV with fails with doublequote inside the value

Posted by Rick Hillegas <Ri...@Sun.COM>.
Hi Albert,

Derby's understanding of CSV predates RFC 4180. For instance, Derby 
trims leading and trailing spaces from imported columns although this 
behavior is forbidden by RFC 4180. Derby's CSV dialect is described in 
the Tools Guide in a section titled "File format for input and output". 
This description is impressionistic rather than rigorous. I don't think 
that the grammar of Derby's CSV dialect is published anywhere. Other 
than that sketch in the Tools Guide, I think that the only contract is 
that Derby promises to be able to re-import anything which Derby itself 
exported.

In practice, I think that this means that plenty of edge cases plague 
Derby when it imports files exported by other data sources. A lot of 
these edge cases can be handled by using the middle arguments of 
SYSCS_IMPORT_TABLE to fine-tune delimiters and code sets. If you can't 
fine-tune SYSCS_IMPORT_TABLE well enough, you can always write your own 
table function to wrap the foreign data file and then import the file 
like this

  insert into MyTable
  select * from table( MyTableFunction( '/path/to/the/data/file' ) ) s

Given the guidelines sketched in the Tools Guide, I would expect Derby 
to object to the file you are trying to import. If you want Derby to 
include the extra double-quote in the imported text, then I would escape 
it with another double-quote. For instance, the following file:

"foo", 1, "bar"
"wibble""", 2, "wombat"

imports successfully for me:

ij> call syscs_util.syscs_import_table
( 'APP', 'T', 'doubleQuote.csv', null, null, null, 0 );
0 rows inserted/updated/deleted
ij> select * from t;
A                                                 |B          
|C                                                
-----------------------------------------------------------------------------------------------------------------
foo                                               |1          
|bar                                              
wibble"                                           |2          
|wombat                                           

2 rows selected

Hope this helps,
-Rick


Albert Kam wrote:
> Hello Apache Derby,
>
> I'm using db-derby-10.4.1.3-bin, and when i'm importing a csv file 
> like this :
>
> *ij> CALL SYSCS_UTIL.SYSCS_IMPORT_TABLE 
> (null,'SMS_OUT','/home/albert/mnt/linuxdb/rnd/sofco_sms/CSVs/sms_out.csv',null,null,null,0);
> *
> an error pops out like this :
> *ERROR XIE0R: Import error on line 11 of file 
> /home/albert/mnt/linuxdb/rnd/sofco_sms/CSVs/sms_out.csv: Data found on 
> line 11 for column 3 after the stop delimiter.
> ERROR XIE03: Data found on line 11 for column 3 after the stop delimiter.
> *
> I check the file, and in the line 11, it goes like this :
> "92","+628xxxxxxx","*Testinggg -__-"*","2006-08-03 
> 11:43:42","0","0","-1","-1","-1"
> (Notice the -__-"") the content should be *Testinggg -__-"*
>
> So, i tried changing the *Testinggg -__-"* into *Testinggg -__-\"*, 
> but still errornous ..
>
> Later after googling for a while, i found out about this : 
> http://www.perlmonks.org/?node_id=678257
>
> Here's one interesting quote :
> /
>
> CSV as defined by RFC 4180 <http://tools.ietf.org/html/rfc4180> does 
> not "escape" double quotes with a backslash, but rather by an 
> additional set of double quotes. Your parser fails to handle this 
> format properly.
>
> CSV is hard.
>
> /
> Is this a bug or is it something that i did ? For now, i removed the 
> doublequotes and it worked like charm.
>
> Regards,
> Albert Kam
>
> -- 
> Do not pursue the past. Do not lose yourself in the future.
> The past no longer is. The future has not yet come.
> Looking deeply at life as it is in the very here and now,
> the practitioner dwells in stability and freedom.
> (Thich Nhat Hanh)