Posted to user@turbine.apache.org by Jeff Painter <pa...@kiasoft.com> on 2003/11/11 20:37:22 UTC

question on CSVParser

 org.apache.turbine.util.CSVParser is giving me some trouble

I have an import routine that will probably have many fields left blank. 
Unfortunately it looks like if the first field is left blank, it throws 
off that whole line for retrieving values with ValueParser.

example

	row 1 always contains my headers
	System ID, Bank, Branch Number, Address

	subsequent rows carry data
	100,"Bank of America", 1230, "123 Main St."
	,"Bank of America", 1432, "923 Front St."

I want to keep the format of the csv consistent for allowing multiple 
updates as well as new data entry en masse.

For line 1, all the data comes through correctly. If ValueParser can find 
a system id, then it will attempt to update the info rather than create a 
new entry. My goal is that for line 2, it will see no value for System ID 
and then attempt to create a new entry.

However, from my logs, all the fields are shifted one place to the left 
since System Id is blank

log output:

[Tue Nov 11 14:18:18 EST 2003] -- INFO -- Adding new PT_BRANCH
[Tue Nov 11 14:18:18 EST 2003] -- DEBUG -- ID not found, input as new participant.
[Tue Nov 11 14:18:18 EST 2003] -- DEBUG -- Bank: 8830
[Tue Nov 11 14:18:18 EST 2003] -- DEBUG -- Branch #: Franklinton
[Tue Nov 11 14:18:18 EST 2003] -- DEBUG -- Branch: SWL - SW Rural (Lake Charles)
[Tue Nov 11 14:18:18 EST 2003] -- DEBUG -- Region: 946 Pearl Street

so it should have been branch # 8830 and not Franklinton... bleah

Any pointers on how to fix this, or do I have to set up two different 
routines... one for imports and one for updates? I haven't tested to see 
if fields left blank in between are accounted for correctly.

I'm using tdk-2.2_01 release of turbine and utils

-- 
Regards,

Jeffery Painter

- --
painter@kiasoft.com                     http://kiasoft.com
PGP FP: 9CE8 83A2 33FA 32B1 0AB1  4E62 E4CB E4DA 5913 EFBC

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)
 
iD8DBQE/qEQE5Mvk2lkT77wRAnMJAJ9vJ6qOkg/mvqqIpz7troCEQJ8bFACglu/U
YNXabx7DZOV2Hd9LwSTmGpY=
=dWiu
-----END PGP SIGNATURE-----


---------------------------------------------------------------------
To unsubscribe, e-mail: turbine-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: turbine-user-help@jakarta.apache.org

RE: question on CSVParser

Posted by Jeff Painter <pa...@kiasoft.com>.

On Wed, 12 Nov 2003, Greg Kerdemelidis wrote:

> 
> Hi Jeff!
> 
> We've just been doing the same thing down here. CSV bank data threw me out
> as well. I used the java.util.StringTokenizer rather than the utils classes,
> which kindly mishandled empty values
> ("123,,223.69,12-11-2003,8495-399,DIRECT DEBIT"). If it's missing the first
> field, you can bet it'll do the same for empty fields in-line. 
> 
> We solved this problem by implementing a "strict" tokenizer:

This looks great... I'll give it a try... if I come up with any 
improvements I'll be sure to send them along.

-- 
Regards,

Jeffery Painter


RE: question on CSVParser

Posted by Jeff Painter <pa...@kiasoft.com>.

On Wed, 12 Nov 2003, Greg Kerdemelidis wrote:

> 
> Hi Jeff!
> 
> We've just been doing the same thing down here. CSV bank data threw me out
> as well. I used the java.util.StringTokenizer rather than the utils classes,
> which kindly mishandled empty values
> ("123,,223.69,12-11-2003,8495-399,DIRECT DEBIT"). If it's missing the first
> field, you can bet it'll do the same for empty fields in-line. 
> 
> We solved this problem by implementing a "strict" tokenizer:
> 

I wanted to post an update to my findings in case anyone else needs a 
better CSVParser... I was leery of switching over to StringTokenizer as I had 
already written most of the code to utilize the turbine CSVParser methods, 
so I went back to Ostermiller's parser and implemented some routines in a 
helper class that seem to be more in the mindset of Turbine.

I have posted my class here for anyone who is interested to review

	http://www.kiasoft.com/opensource/CsvParser.java

It may not be the most efficient routine but I have been able to use this 
class and import about 55 records per second with 20 columns of data 
creating 5 distinct database entries and creating the foreign key links on 
these objects. Under my initial test it only took about 18 seconds to 
import 1000 records (creating 5000 objects) which is probably the most I 
will have to deal with.

This class does require Ostermiller's CSV libraries 

	http://ostermiller.org/utils/CSV.html

The class implements the Iterator interface; as it traverses the data, it 
removes each record it has already returned, which helps a little with 
performance.

to use:

	CsvParser parser = new CsvParser( fileObject );
	while ( parser.hasNext() )
	{
		parser.next();
		String field_1 = parser.getColumn("fieldname_1");
		String field_2 = parser.getColumn("fieldname_2");
	}

and if a field is blank, it does as expected and returns the correct field 
as a blank value.. even if it is the first field :-)

I liked the way that the turbine CSVParser was built on top of ValueParser 
for pulling column key value pairs and tried to implement a similar 
method. 

One drawback is that all fields are returned as String objects, 
but this was the easiest hack to get the job done for now. It is easy 
enough to convert data types as needed.
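As an illustration of converting String fields as needed, here is a minimal sketch. It is not the CsvParser class linked above: the RowValues class and its getString/getInt methods are hypothetical names made up for this example, and the naive split assumes no quoted fields containing commas.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: pairs header names with the values of one data row.
public class RowValues {
    private final Map<String, String> values = new HashMap<String, String>();

    public RowValues(String headerLine, String dataLine) {
        // A limit of -1 makes split() keep empty fields, including
        // leading and trailing ones. Quoted commas are NOT handled here.
        String[] headers = headerLine.split(",", -1);
        String[] fields = dataLine.split(",", -1);
        for (int i = 0; i < headers.length; i++) {
            String value = i < fields.length ? fields[i].trim() : "";
            values.put(headers[i].trim(), value);
        }
    }

    // Returns the field as a String, or "" if the column is unknown.
    public String getString(String name) {
        String v = values.get(name);
        return v == null ? "" : v;
    }

    // Returns the field as an int, or defaultValue if blank or not numeric.
    public int getInt(String name, int defaultValue) {
        String v = getString(name);
        if (v.length() == 0) {
            return defaultValue;
        }
        try {
            return Integer.parseInt(v);
        } catch (NumberFormatException e) {
            return defaultValue;
        }
    }

    public static void main(String[] args) {
        RowValues row = new RowValues("System ID,Bank,Branch Number",
                                      ",Bank of America,1432");
        // A blank System ID falls back to the default, signalling a new entry.
        System.out.println(row.getInt("System ID", -1));
        System.out.println(row.getString("Bank"));
        System.out.println(row.getInt("Branch Number", 0));
    }
}
```

A blank System ID then shows up as the default value rather than shifting the other columns, which matches the update-or-create decision described earlier in the thread.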

I would appreciate any improvements on this if anyone is interested.

-- 
Regards,

Jeffery Painter


RE: question on CSVParser

Posted by Greg Kerdemelidis <gk...@snap.net.nz>.

Hi Jeff!

We've just been doing the same thing down here. CSV bank data threw me out
as well. I used the java.util.StringTokenizer rather than the utils classes,
which kindly mishandled empty values
("123,,223.69,12-11-2003,8495-399,DIRECT DEBIT"). If it's missing the first
field, you can bet it'll do the same for empty fields in-line. 
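To make the empty-field problem concrete, here is a small sketch using only java.util.StringTokenizer, which treats a run of consecutive delimiters as a single separator and so never returns an empty token. The class name EmptyFieldDemo and the tokenize helper are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class EmptyFieldDemo {
    // Collects the tokens exactly as java.util.StringTokenizer reports them.
    public static List<String> tokenize(String line) {
        List<String> out = new ArrayList<String>();
        StringTokenizer tok = new StringTokenizer(line, ",");
        while (tok.hasMoreTokens()) {
            out.add(tok.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // Six fields were written, but the empty second field is skipped,
        // so every later value moves one position to the left.
        List<String> fields =
            tokenize("123,,223.69,12-11-2003,8495-399,DIRECT DEBIT");
        System.out.println(fields.size() + " tokens: " + fields);

        // A blank first field disappears in the same way.
        System.out.println(tokenize(",a,b"));
    }
}
```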

We solved this problem by implementing a "strict" tokenizer:

------[com.genix.commons.tools.StrictTokenizer]--------------

package com.genix.commons.tools;

import java.util.Enumeration;
import java.util.NoSuchElementException;
import java.util.Vector;

/**
 * @author Greg Kerdemelidis (greg@genixsystems.com)
 */
public class StrictTokenizer implements Enumeration
{
	String _data;
	char _delim;
	Vector _tokens;
	int _count;
	
	public StrictTokenizer(String input, char delim)
	{
		_data=input;
		_delim=delim;
		parse();
		_count=0;
	}
	
	private void parse()
	{
		char[] dat = _data.toCharArray();
		_tokens = new Vector();
		
		char[] buff = new char[100];		// bad bad bad
		int buffc = 0;
		
		for (int i = 0; i < dat.length; i++)
		{
			char c = dat[i];
			
			if(c!=_delim)
				buff[buffc++]=c;
			else
			{
				// an empty field (buffc==0) comes out as ""
				_tokens.add(String.copyValueOf(buff,0,buffc));
				buffc=0;	// reset buff pointer
			}
		}
		
		// the last field has no trailing delimiter, so add it here
		_tokens.add(String.copyValueOf(buff,0,buffc));
	}
	
	public String getNext() throws NoSuchElementException
	{
		if(_count==_tokens.size())
			throw new NoSuchElementException();
		return (String)_tokens.get(_count++);
	}
	
	public String nextToken() throws NoSuchElementException
	{
		return getNext();
	}
	
	public boolean hasNext()
	{
		// true while there are still tokens left to return
		return _count<_tokens.size();
	}
	
	public int getTokens()
	{
		return _tokens.size();
	}
	
	public void reset(int num)
	{
		_count=num;
	}
	
	public void reset()
	{
		this.reset(0);
	}

	public boolean hasMoreElements()
	{
		return hasNext();
	}

	public Object nextElement()
	{
		return getNext();
	}
}

-------[end]-------------

Use it the same as the java.util.StringTokenizer:

String inputLine = CSV.readLine();
StrictTokenizer tok = new StrictTokenizer(inputLine, ',');

String first = tok.getNext();	
String second = tok.getNext();

It'll only throw a NoSuchElementException if you iterate off the end of the
string.

The programming police won't like my "char[] buff = new char[100];", but
then neither do I. YMMV - it's perfect for our use as no CSV input line is
greater than 100 chars.
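For input where a field might run past 100 characters, one possible variant avoids the fixed buffer by cutting substrings directly out of the input. This is only a sketch, not part of the StrictTokenizer class above; the class name UnboundedSplit and its split method are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class UnboundedSplit {
    // Splits on a single delimiter character, keeping empty fields,
    // without any fixed-size buffer.
    public static List<String> split(String input, char delim) {
        List<String> tokens = new ArrayList<String>();
        int start = 0;
        int end;
        while ((end = input.indexOf(delim, start)) >= 0) {
            // An empty field (start == end) yields "".
            tokens.add(input.substring(start, end));
            start = end + 1;
        }
        // Whatever follows the last delimiter is the final field.
        tokens.add(input.substring(start));
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split(",a,,b", ','));
    }
}
```

Like the tokenizer above, this treats every delimiter as a field boundary, so a comma inside a quoted field would still be split.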

Hope that helps, mate!

Regards,

-Greg




