Posted to user@turbine.apache.org by Jeff Painter <pa...@kiasoft.com> on 2003/11/11 20:37:22 UTC
question on CSVParser
org.apache.turbine.util.CSVParser is giving me some trouble
I have an import routine that will probably have many fields left blank.
Unfortunately it looks like if the first field is left blank, it throws
off that whole line for retrieving values with ValueParser.
example
row 1 always contains my headers
System ID, Bank, Branch Number, Address
subsequent rows carry data
100,"Bank of America", 1230, "123 Main St."
,"Bank of America", 1432, "923 Front St."
I want to keep the format of the csv consistent for allowing multiple
updates as well as new data entry en masse.
For line 1, all the data comes through correctly. If ValueParser can find
a System ID, it will attempt to update the info rather than create a
new entry. My goal is that for line 2, it will see no value for System ID
and then attempt to create a new entry.
However, from my logs, all the fields are shifted one place to the left
since System ID is blank
log output:
[Tue Nov 11 14:18:18 EST 2003] -- INFO -- Adding new PT_BRANCH
[Tue Nov 11 14:18:18 EST 2003] -- DEBUG -- ID not found, input as new participant.
[Tue Nov 11 14:18:18 EST 2003] -- DEBUG -- Bank: 8830
[Tue Nov 11 14:18:18 EST 2003] -- DEBUG -- Branch #: Franklinton
[Tue Nov 11 14:18:18 EST 2003] -- DEBUG -- Branch: SWL - SW Rural (Lake Charles)
[Tue Nov 11 14:18:18 EST 2003] -- DEBUG -- Region: 946 Pearl Street
so it should have been branch # 8830 and not Franklinton... bleah
Any pointers on how to fix this, or do I have to set up two different
routines, one for imports and one for updates? I haven't tested to see
whether fields left blank in the middle of a row are accounted for correctly.
I'm using tdk-2.2_01 release of turbine and utils
--
Regards,
Jeffery Painter
- --
painter@kiasoft.com http://kiasoft.com
PGP FP: 9CE8 83A2 33FA 32B1 0AB1 4E62 E4CB E4DA 5913 EFBC
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)
iD8DBQE/qEQE5Mvk2lkT77wRAnMJAJ9vJ6qOkg/mvqqIpz7troCEQJ8bFACglu/U
YNXabx7DZOV2Hd9LwSTmGpY=
=dWiu
-----END PGP SIGNATURE-----
---------------------------------------------------------------------
To unsubscribe, e-mail: turbine-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: turbine-user-help@jakarta.apache.org
RE: question on CSVParser
Posted by Jeff Painter <pa...@kiasoft.com>.
On Wed, 12 Nov 2003, Greg Kerdemelidis wrote:
>
> Hi Jeff!
>
> We've just been doing the same thing down here. CSV bank data threw me out
> as well. I used the java.util.StringTokenizer rather than the utils classes,
> which kindly mishandled empty values
> ("123,,223.69,12-11-2003,8495-399,DIRECT DEBIT"). If it's missing the first
> field, you can bet it'll do the same for empty fields in-line.
>
> We solved this problem by implementing a "strict" tokenizer:
This looks great... I'll give it a try... if I come up with any
improvements I'll be sure to send them along.
--
Regards,
Jeffery Painter
RE: question on CSVParser
Posted by Jeff Painter <pa...@kiasoft.com>.
I wanted to post an update to my findings in case anyone else needs a
better CSVParser... I was leery of switching over to StringTokenizer as I had
already written most of the code to use the Turbine CSVParser methods,
so I went back to Ostermiller's parser and implemented some routines in a
helper class that seem to be more in the mindset of Turbine.
I have posted my class here for anyone who is interested to review
http://www.kiasoft.com/opensource/CsvParser.java
It may not be the most efficient routine but I have been able to use this
class and import about 55 records per second with 20 columns of data
creating 5 distinct database entries and creating the foreign key links on
these objects. Under my initial test it only took about 18 seconds to
import 1000 records (creating 5000 objects) which is probably the most I
will have to deal with.
This class does require ostermiller's csv libraries
http://ostermiller.org/utils/CSV.html
The class implements the Iterator interface. As it traverses the data, it
removes each record it has returned, which helps a little with
performance.
to use:
    CsvParser parser = new CsvParser( fileObject );
    while ( parser.hasNext() )
    {
        parser.next();
        String field_1 = parser.getColumn("fieldname_1");
        String field_2 = parser.getColumn("fieldname_2");
    }
and if a field is blank, it does as expected and returns the correct field
as a blank value.. even if it is the first field :-)
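The idea of looking a value up by its header name can be sketched in a few lines of plain Java. This is not the CsvParser class linked above and it does not use the Ostermiller library; HeaderRow and its methods are hypothetical names for illustration, and the naive split ignores quoting. The -1 limit passed to String.split is what keeps leading and trailing empty fields:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only, not the CsvParser class from the post.
class HeaderRow {
    private final Map<String, String> values = new HashMap<>();

    /** Pairs each header name with the field at the same position in the row. */
    HeaderRow(String headerLine, String dataLine) {
        String[] names = headerLine.split(",", -1);
        String[] fields = dataLine.split(",", -1); // -1 keeps empty fields
        for (int i = 0; i < names.length; i++) {
            // a short row is padded with blanks rather than shifted
            String field = i < fields.length ? fields[i].trim() : "";
            values.put(names[i].trim(), field);
        }
    }

    /** Returns the field under the given header, or "" if it was blank or missing. */
    String getColumn(String name) {
        String v = values.get(name);
        return v == null ? "" : v;
    }

    public static void main(String[] args) {
        HeaderRow row = new HeaderRow("System ID,Bank,Branch Number",
                                      ",Bank of America,1432");
        // a blank System ID means the row should be treated as a new entry
        System.out.println(row.getColumn("System ID").isEmpty());
        System.out.println(row.getColumn("Branch Number"));
    }
}
```

Because every value is stored under its header, a blank first column stays blank instead of pushing the other fields one place to the left.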
I liked the way that the turbine CSVParser was built on top of ValueParser
for pulling column key value pairs and tried to implement a similar
method.
One drawback is that all fields are returned as String objects,
but this was the easiest hack to get the job done for now. It is easy
enough to convert data types as needed.
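As a rough illustration of that conversion step, a small helper can turn a column string into an int and fall back to a default when the field is blank. FieldConvert and toInt are hypothetical names, not part of CsvParser or Turbine:

```java
// Hypothetical helper for illustration only.
class FieldConvert {
    /** Parses a column value as an int; blank or null values yield fallback. */
    static int toInt(String value, int fallback) {
        if (value == null || value.trim().isEmpty()) {
            return fallback;
        }
        // trim() copes with the space after the comma in rows like: 100,"Bank", 1230
        return Integer.parseInt(value.trim());
    }

    public static void main(String[] args) {
        System.out.println(toInt(" 1230", -1)); // numeric field with a leading space
        System.out.println(toInt("", -1));      // blank field falls back to -1
    }
}
```

A value that is neither blank nor numeric still throws NumberFormatException, so the caller decides whether to skip or reject that row.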
I would appreciate any improvements on this if anyone is interested.
--
Regards,
Jeffery Painter
RE: question on CSVParser
Posted by Greg Kerdemelidis <gk...@snap.net.nz>.
Hi Jeff!
We've just been doing the same thing down here. CSV bank data threw me out
as well. I used the java.util.StringTokenizer rather than the utils classes,
which kindly mishandled empty values
("123,,223.69,12-11-2003,8495-399,DIRECT DEBIT"). If it's missing the first
field, you can bet it'll do the same for empty fields in-line.
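For anyone who wants to see the behaviour in isolation, here is a small sketch (the class name EmptyFieldDemo is made up for illustration). java.util.StringTokenizer treats a run of delimiters as a single separator, so empty fields never come back as tokens:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class for illustration only.
class EmptyFieldDemo {
    /** Collects every token StringTokenizer returns for a comma-delimited line. */
    static List<String> tokenize(String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, ",");
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "123,,223.69,12-11-2003,8495-399,DIRECT DEBIT";
        // six fields in the line, but the empty second field is skipped: prints 5
        System.out.println(tokenize(line).size());
        // a leading empty field disappears the same way: prints [Bank of America, 1432]
        System.out.println(tokenize(",Bank of America,1432"));
    }
}
```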
We solved this problem by implementing a "strict" tokenizer:
------[com.genix.commons.tools.StrictTokenizer]--------------
package com.genix.commons.tools;

import java.util.Enumeration;
import java.util.NoSuchElementException;
import java.util.Vector;

/**
 * Splits a string on a single delimiter character and, unlike
 * java.util.StringTokenizer, returns an empty token for every empty field.
 *
 * @author Greg Kerdemelidis (greg@genixsystems.com)
 */
public class StrictTokenizer implements Enumeration
{
    String _data;
    char _delim;
    Vector _tokens;
    int _count;

    public StrictTokenizer(String input, char delim)
    {
        _data=input;
        _delim=delim;
        parse();
        _count=0;
    }

    private void parse()
    {
        char[] dat = _data.toCharArray();
        _tokens = new Vector();
        char[] buff = new char[100]; // bad bad bad: a field over 100 chars overflows
        int buffc = 0;
        for (int i = 0; i < dat.length; i++)
        {
            char c = dat[i];
            if(c!=_delim)
                buff[buffc++]=c;
            else
            {
                // a count of 0 yields "", so an empty field becomes an empty token
                _tokens.add(String.copyValueOf(buff,0,buffc));
                buffc=0; // reset buff pointer
            }
        }
        // the last field has no trailing delimiter, so add it here
        _tokens.add(String.copyValueOf(buff,0,buffc));
    }

    public String getNext() throws NoSuchElementException
    {
        if(_count>=_tokens.size())
            throw new NoSuchElementException();
        return (String)_tokens.get(_count++);
    }

    public String nextToken() throws NoSuchElementException
    {
        return getNext();
    }

    // true while there are tokens left to read
    public boolean hasNext()
    {
        return _count<_tokens.size();
    }

    // total number of tokens found, including empty ones
    public int getTokens()
    {
        return _tokens.size();
    }

    // move the read position to the given token index
    public void reset(int num)
    {
        _count=num;
    }

    public void reset()
    {
        this.reset(0);
    }

    public boolean hasMoreElements()
    {
        return hasNext();
    }

    public Object nextElement()
    {
        return getNext();
    }
}
-------[end]-------------
Use it the same as the java.util.StringTokenizer:
String inputLine = CSV.readLine();
StrictTokenizer tok = new StrictTokenizer(inputLine, ',');
String first = tok.getNext();
String second = tok.getNext();
It'll only throw a NoSuchElementException if you iterate off the end of the
string.
The programming police won't like my "char[] buff = new char[100];", but
then neither do I. YMMV - it's perfect for our use as no CSV input line is
greater than 100 chars.
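If the 100-character limit is a worry, one possible variation is to collect each field in a StringBuilder, which grows as needed. The sketch below is not the StrictTokenizer class itself; GrowableSplit and splitStrict are hypothetical names, and like StrictTokenizer it does not understand quoted fields, so a comma inside quotes still splits the field.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical variation for illustration only.
class GrowableSplit {
    /** Splits line on delim, keeping empty fields, with no fixed field length. */
    static List<String> splitStrict(String line, char delim) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == delim) {
                fields.add(field.toString()); // may be "" for an empty field
                field.setLength(0);           // start the next field
            } else {
                field.append(c);
            }
        }
        fields.add(field.toString()); // last field has no trailing delimiter
        return fields;
    }

    public static void main(String[] args) {
        // four fields, the first one empty
        System.out.println(splitStrict(",Bank of America,1432,923 Front St.", ',').size());
    }
}
```

On a JDK that predates StringBuilder, StringBuffer offers the same append and setLength calls.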
Hope that helps, mate!
Regards,
-Greg