Posted to common-user@hadoop.apache.org by Andy Sautins <an...@returnpath.net> on 2009/02/10 21:18:20 UTC

Best practices on splitting an input line?

 

   I have a question.  I've dabbled with different ways of tokenizing
an input file line for processing.  In my somewhat limited tests, I've
noticed some fairly significant performance differences between
tokenizing methods.  For example, when splitting a line into tokens
(tab-delimited in my case), Scanner seems to be the slowest, followed
by String.split, with StringTokenizer being the fastest.
StringTokenizer, for my application, has the unfortunate
characteristic of not returning blank tokens (i.e., parsing
"a,b,c,,d" returns "a","b","c","d" instead of "a","b","c","","d").
The WordCount example uses StringTokenizer, which makes sense to me,
except I'm currently getting hung up on it not returning blank
tokens.  I did run across the com.Ostermiller.util StringTokenizer
replacement that handles null/blank tokens
(http://ostermiller.org/utils/StringTokenizer.html), which seems
usable, but it sure seems like someone else has already solved this
problem better than I have.
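
   To make the blank-token difference concrete, here is a quick,
untested sketch of the kind of comparison I mean, using a tab-delimited
line like my data:

import java.util.Arrays;
import java.util.StringTokenizer;

public class SplitDemo {
    public static void main(String[] args) {
        String line = "a\tb\tc\t\td";

        // String.split keeps interior empty tokens; a negative limit
        // keeps trailing empty tokens as well.
        String[] fields = line.split("\t", -1);
        System.out.println(Arrays.toString(fields));  // [a, b, c, , d]

        // StringTokenizer silently collapses consecutive delimiters,
        // so the empty field between c and d disappears.
        StringTokenizer st = new StringTokenizer(line, "\t");
        while (st.hasMoreTokens()) {
            System.out.print(st.nextToken() + " ");   // a b c d
        }
    }
}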

 

   So, my question is: is there a "best practice" for splitting an
input line, especially when empty tokens are expected (i.e., two
consecutive delimiter characters)?

 

   Any thoughts would be appreciated

 

   Thanks

 

   Andy


Re: Best practices on splitting an input line?

Posted by Steve Loughran <st...@apache.org>.
Stefan Podkowinski wrote:
> I'm currently using OpenCSV, which can be found at
> http://opencsv.sourceforge.net/, but haven't done any performance
> tests on it yet. In my case simply splitting strings would not work
> anyway, since I need to handle quotes and separators within quoted
> values, e.g. "a","a,b","c".

I've used it in the past and found it pretty reliable. Again, no 
performance tests, just reading in CSV files exported from spreadsheets.

Re: Best practices on splitting an input line?

Posted by Stefan Podkowinski <sp...@gmail.com>.
I'm currently using OpenCSV, which can be found at
http://opencsv.sourceforge.net/, but haven't done any performance
tests on it yet. In my case simply splitting strings would not work
anyway, since I need to handle quotes and separators within quoted
values, e.g. "a","a,b","c".


Re: Best practices on splitting an input line?

Posted by Rasit OZDAS <ra...@gmail.com>.
Hi, Andy

Your problem seems to be a general Java problem rather than a Hadoop one.
You may get better help in a Java forum.
String.split uses regular expressions, which you definitely don't need.
I would write my own split function, without regular expressions.
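
Roughly what I have in mind, as an untested sketch that splits on a
single-character delimiter and keeps empty tokens:

import java.util.ArrayList;
import java.util.List;

public class SimpleSplit {

    // Splits on a single-character delimiter and keeps empty tokens,
    // without the regex machinery behind String.split.
    public static List<String> split(String line, char delim) {
        List<String> tokens = new ArrayList<String>();
        int start = 0;
        int idx;
        while ((idx = line.indexOf(delim, start)) >= 0) {
            tokens.add(line.substring(start, idx));
            start = idx + 1;
        }
        tokens.add(line.substring(start));  // last (possibly empty) token
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("a\tb\tc\t\td", '\t'));  // [a, b, c, , d]
    }
}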

This link may help you better understand the underlying operations:
http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter10/stringBufferToken.html#split

There is also a StringTokenizer constructor that can return the delimiters as well:
StringTokenizer(String string, String delimiters, boolean returnDelimiters);
(I would write my own, though.)
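
If anyone does want to try the returnDelimiters route, here is a rough,
untested sketch of rebuilding the empty fields from consecutive
delimiters:

import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizerWithDelims {
    public static void main(String[] args) {
        String line = "a\tb\tc\t\td";
        // Ask StringTokenizer to return the delimiters themselves, then
        // treat two delimiters in a row as an empty field.
        StringTokenizer st = new StringTokenizer(line, "\t", true);
        List<String> fields = new ArrayList<String>();
        boolean lastWasDelim = true;  // start of line behaves like "just after a delimiter"
        while (st.hasMoreTokens()) {
            String tok = st.nextToken();
            if (tok.equals("\t")) {
                if (lastWasDelim) {
                    fields.add("");   // consecutive delimiters -> empty field
                }
                lastWasDelim = true;
            } else {
                fields.add(tok);
                lastWasDelim = false;
            }
        }
        if (lastWasDelim) {
            fields.add("");           // trailing delimiter -> trailing empty field
        }
        System.out.println(fields);   // [a, b, c, , d]
    }
}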

Rasit




-- 
M. Raşit ÖZDAŞ