You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@poi.apache.org by Felix Dreher <dr...@molgen.mpg.de> on 2009/11/24 12:22:56 UTC

Reading Excel files: character encoding / whole number issues

Hi all,

I'm trying to read Excel files (classes HSSFWorkbook, HSSFSheet, 
HSSFCell etc.). There are two issues that I'm currently struggling with:

1) non-ASCII characters like é or ü are read correctly on Windows, but 
not on Linux (where these characters are all converted to question 
marks). I tried to use the "Locale.setDefault(myLocale)" method, but 
with no success.

2) The tables I would like to read contain lots of whole numbers, which 
I would like to treat as strings because they are IDs or order numbers 
(i.e. I would like to simply read all cells in the same way as Excel 
displays them). However, they get read as decimal numbers, e.g. the 
number 1234 in an Excel table (formatted as 'General') becomes 1234.0 in 
its POI representation.
I tried two workarounds, but they didn't work:
a) open the Excel file, format all cells as Text, save.
b) use the HSSFDataFormatter class to define a 'Decimal format' with 
actually no decimals (e.g. DecimalFormat df = new DecimalFormat ("#"))

Any help would be greatly appreciated!

Thank you very much,
Felix


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: Reading Excel files: character encoding / whole number issues

Posted by MSB <ma...@tiscali.co.uk>.
Hello Felix

I missed your reply for some reason; not to matter thgough.

Pleased that you managed to get things working but it is very odd that you
are experiencing problems running under Linux. I am fairly confident that we
have other list members running that operating system and I do not think
they experience any issues. Sadly, I have never used Linux and so cannot
comment but I hope we will hear from someone wuth more experience soon.

All the best with the API and if you need any more help simply post to this
list.

Yours

Mark B


Felix Dreher-2 wrote:
> 
> Hi Chris, Hi Mark,
> 
> thank you very much for the quick response!
> 
> MSB wrote:
>> Just to expand a little on Chris's reply, when you are reading data from
>> the
>> sheet, the correct way to use the HSSFDataFormatter class is something
>> like
>> this;
>>
>> HSSFDataFormatter formatter = new HSSFDataFormatter();
>> HSSFWorkbok workbook = new HSSFWorkbook...........
>> HSSFSheet sheet = workbook.getSheet(0);
>> HSSFRow row = sheet.getRow(0);
>> HSSFCell cell = row.getCell(0);
>> System.out.println(formatter.formatCellValue(cell));
>>
>> That way, the HSSFDataFormatter should use the cell style the user
>> applied
>> to the cell when the created the workbook/worksheet and you ought to see
>> what you expect output to the screen.
>>
>> Yours
>>
>> Mark B
>>
>>
>>   
> The HSSFDataFormatter.formatCellValue() function works fine and gives 
> the expected results. However, again on Linux (plus OpenOffice) it 
> didn't work for some reason. So I decided to reduce the number of 
> complicating factors and to simply use a Windows machine for this project.
> 
> 
>> ChrisLott wrote:
>>   
>>> Please don't assume that the view shown on the screen is a close match 
>>> to the data stored internally!  M$FT-Excel stores all numeric data as 
>>> floating-point numbers.  Then it applies suitable formatting and other 
>>> rules when it displays to you.  Working in POI you see the raw internal 
>>> data.  Storing and reading date/time values is a particular joy!
>>>
>>> Re character encoding: POI has cell encoding features, and copes with 
>>> UTF-8 and UTF-16.  Java is perfectly capable of dealing with special 
>>> characters.  You don't mention where on Linux the data is "converted to 
>>> question marks".  Did you use an IDE like Eclipse and view the data 
>>> right there immediately when it's read?
> No, I ran the Java program (jar) on the command-line and printed the 
> output to a text file. To see how it looks like directly in Netbeans, I 
> just tried it out and there it is the same output (questions marks 
> instead of special characters).
> Anyway, as I already said, the easiest workaround for now will be to run 
> the program on a Windows machine... :-)
> 
> Best regards,
> Felix
> 
> 
> 
> 
>>>   Also search Nabble for UTF-8 
>>> and encoding.  I got this hit:
>>>
>>> http://old.nabble.com/how-to-read-unicode-data-from-xls-sheet-td13601249.html#a13604018
>>>
>>> chris...
>>>
>>> p.s. Gruess aus New Jersey!
>>>
>>>
>>> Felix Dreher wrote:
>>>     
>>>> Hi all,
>>>>
>>>> I'm trying to read Excel files (classes HSSFWorkbook, HSSFSheet, 
>>>> HSSFCell etc.). There are two issues that I'm currently struggling
>>>> with:
>>>>
>>>> 1) non-ASCII characters like é or ü are read correctly on Windows, but 
>>>> not on Linux (where these characters are all converted to question 
>>>> marks). I tried to use the "Locale.setDefault(myLocale)" method, but 
>>>> with no success.
>>>>
>>>> 2) The tables I would like to read contain lots of whole numbers, which 
>>>> I would like to treat as strings because they are IDs or order numbers 
>>>> (i.e. I would like to simply read all cells in the same way as Excel 
>>>> displays them). However, they get read as decimal numbers, e.g. the 
>>>> number 1234 in an Excel table (formatted as 'General') becomes 1234.0
>>>> in 
>>>> its POI representation.
>>>> I tried two workarounds, but they didn't work:
>>>> a) open the Excel file, format all cells as Text, save.
>>>> b) use the HSSFDataFormatter class to define a 'Decimal format' with 
>>>> actually no decimals (e.g. DecimalFormat df = new DecimalFormat ("#"))
>>>>
>>>> Any help would be greatly appreciated!
>>>>
>>>> Thank you very much,
>>>> Felix
>>>>       
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
>>> For additional commands, e-mail: user-help@poi.apache.org
>>>
>>>
>>>
>>>     
>>
>>   
> 
> 
> -- 
>  ___________________________________________________________
> 
>  Felix Dreher
>  Max Planck Institute for Molecular Genetics
>  Department of Vertebrate Genomics
>  Bioinformatics Group
> 
>  Ihnestr. 73 (mail) | Fabeckstr. 60-62 (visitors)
>  D-14195 Berlin, Germany
>  phone: +49-30-84131745 | mobile: +49-163-7542426
>  ___________________________________________________________
> 
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
> For additional commands, e-mail: user-help@poi.apache.org
> 
> 
> 

-- 
View this message in context: http://old.nabble.com/Reading-Excel-files%3A-character-encoding---whole-number-issues-tp26494243p26499540.html
Sent from the POI - User mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: Reading Excel files: character encoding / whole number issues

Posted by Felix Dreher <dr...@molgen.mpg.de>.
Hi Chris, Hi Mark,

thank you very much for the quick response!

MSB wrote:
> Just to expand a little on Chris's reply, when you are reading data from the
> sheet, the correct way to use the HSSFDataFormatter class is something like
> this;
>
> HSSFDataFormatter formatter = new HSSFDataFormatter();
> HSSFWorkbok workbook = new HSSFWorkbook...........
> HSSFSheet sheet = workbook.getSheet(0);
> HSSFRow row = sheet.getRow(0);
> HSSFCell cell = row.getCell(0);
> System.out.println(formatter.formatCellValue(cell));
>
> That way, the HSSFDataFormatter should use the cell style the user applied
> to the cell when the created the workbook/worksheet and you ought to see
> what you expect output to the screen.
>
> Yours
>
> Mark B
>
>
>   
The HSSFDataFormatter.formatCellValue() function works fine and gives 
the expected results. However, again on Linux (plus OpenOffice) it 
didn't work for some reason. So I decided to reduce the number of 
complicating factors and to simply use a Windows machine for this project.


> ChrisLott wrote:
>   
>> Please don't assume that the view shown on the screen is a close match 
>> to the data stored internally!  M$FT-Excel stores all numeric data as 
>> floating-point numbers.  Then it applies suitable formatting and other 
>> rules when it displays to you.  Working in POI you see the raw internal 
>> data.  Storing and reading date/time values is a particular joy!
>>
>> Re character encoding: POI has cell encoding features, and copes with 
>> UTF-8 and UTF-16.  Java is perfectly capable of dealing with special 
>> characters.  You don't mention where on Linux the data is "converted to 
>> question marks".  Did you use an IDE like Eclipse and view the data 
>> right there immediately when it's read?
No, I ran the Java program (jar) on the command-line and printed the 
output to a text file. To see how it looks like directly in Netbeans, I 
just tried it out and there it is the same output (questions marks 
instead of special characters).
Anyway, as I already said, the easiest workaround for now will be to run 
the program on a Windows machine... :-)

Best regards,
Felix




>>   Also search Nabble for UTF-8 
>> and encoding.  I got this hit:
>>
>> http://old.nabble.com/how-to-read-unicode-data-from-xls-sheet-td13601249.html#a13604018
>>
>> chris...
>>
>> p.s. Gruess aus New Jersey!
>>
>>
>> Felix Dreher wrote:
>>     
>>> Hi all,
>>>
>>> I'm trying to read Excel files (classes HSSFWorkbook, HSSFSheet, 
>>> HSSFCell etc.). There are two issues that I'm currently struggling with:
>>>
>>> 1) non-ASCII characters like é or ü are read correctly on Windows, but 
>>> not on Linux (where these characters are all converted to question 
>>> marks). I tried to use the "Locale.setDefault(myLocale)" method, but 
>>> with no success.
>>>
>>> 2) The tables I would like to read contain lots of whole numbers, which 
>>> I would like to treat as strings because they are IDs or order numbers 
>>> (i.e. I would like to simply read all cells in the same way as Excel 
>>> displays them). However, they get read as decimal numbers, e.g. the 
>>> number 1234 in an Excel table (formatted as 'General') becomes 1234.0 in 
>>> its POI representation.
>>> I tried two workarounds, but they didn't work:
>>> a) open the Excel file, format all cells as Text, save.
>>> b) use the HSSFDataFormatter class to define a 'Decimal format' with 
>>> actually no decimals (e.g. DecimalFormat df = new DecimalFormat ("#"))
>>>
>>> Any help would be greatly appreciated!
>>>
>>> Thank you very much,
>>> Felix
>>>       
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
>> For additional commands, e-mail: user-help@poi.apache.org
>>
>>
>>
>>     
>
>   


-- 
 ___________________________________________________________

 Felix Dreher
 Max Planck Institute for Molecular Genetics
 Department of Vertebrate Genomics
 Bioinformatics Group

 Ihnestr. 73 (mail) | Fabeckstr. 60-62 (visitors)
 D-14195 Berlin, Germany
 phone: +49-30-84131745 | mobile: +49-163-7542426
 ___________________________________________________________




---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: Reading Excel files: character encoding / whole number issues

Posted by MSB <ma...@tiscali.co.uk>.
Just to expand a little on Chris's reply, when you are reading data from the
sheet, the correct way to use the HSSFDataFormatter class is something like
this;

HSSFDataFormatter formatter = new HSSFDataFormatter();
HSSFWorkbok workbook = new HSSFWorkbook...........
HSSFSheet sheet = workbook.getSheet(0);
HSSFRow row = sheet.getRow(0);
HSSFCell cell = row.getCell(0);
System.out.println(formatter.formatCellValue(cell));

That way, the HSSFDataFormatter should use the cell style the user applied
to the cell when the created the workbook/worksheet and you ought to see
what you expect output to the screen.

Yours

Mark B


ChrisLott wrote:
> 
> Please don't assume that the view shown on the screen is a close match 
> to the data stored internally!  M$FT-Excel stores all numeric data as 
> floating-point numbers.  Then it applies suitable formatting and other 
> rules when it displays to you.  Working in POI you see the raw internal 
> data.  Storing and reading date/time values is a particular joy!
> 
> Re character encoding: POI has cell encoding features, and copes with 
> UTF-8 and UTF-16.  Java is perfectly capable of dealing with special 
> characters.  You don't mention where on Linux the data is "converted to 
> question marks".  Did you use an IDE like Eclipse and view the data 
> right there immediately when it's read?  Also search Nabble for UTF-8 
> and encoding.  I got this hit:
> 
> http://old.nabble.com/how-to-read-unicode-data-from-xls-sheet-td13601249.html#a13604018
> 
> chris...
> 
> p.s. Gruess aus New Jersey!
> 
> 
> Felix Dreher wrote:
>> Hi all,
>> 
>> I'm trying to read Excel files (classes HSSFWorkbook, HSSFSheet, 
>> HSSFCell etc.). There are two issues that I'm currently struggling with:
>> 
>> 1) non-ASCII characters like é or ü are read correctly on Windows, but 
>> not on Linux (where these characters are all converted to question 
>> marks). I tried to use the "Locale.setDefault(myLocale)" method, but 
>> with no success.
>> 
>> 2) The tables I would like to read contain lots of whole numbers, which 
>> I would like to treat as strings because they are IDs or order numbers 
>> (i.e. I would like to simply read all cells in the same way as Excel 
>> displays them). However, they get read as decimal numbers, e.g. the 
>> number 1234 in an Excel table (formatted as 'General') becomes 1234.0 in 
>> its POI representation.
>> I tried two workarounds, but they didn't work:
>> a) open the Excel file, format all cells as Text, save.
>> b) use the HSSFDataFormatter class to define a 'Decimal format' with 
>> actually no decimals (e.g. DecimalFormat df = new DecimalFormat ("#"))
>> 
>> Any help would be greatly appreciated!
>> 
>> Thank you very much,
>> Felix
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
> For additional commands, e-mail: user-help@poi.apache.org
> 
> 
> 

-- 
View this message in context: http://old.nabble.com/Reading-Excel-files%3A-character-encoding---whole-number-issues-tp26494243p26495141.html
Sent from the POI - User mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: Reading Excel files: character encoding / whole number issues

Posted by Chris Lott <ma...@invest-faq.com>.
Please don't assume that the view shown on the screen is a close match 
to the data stored internally!  M$FT-Excel stores all numeric data as 
floating-point numbers.  Then it applies suitable formatting and other 
rules when it displays to you.  Working in POI you see the raw internal 
data.  Storing and reading date/time values is a particular joy!

Re character encoding: POI has cell encoding features, and copes with 
UTF-8 and UTF-16.  Java is perfectly capable of dealing with special 
characters.  You don't mention where on Linux the data is "converted to 
question marks".  Did you use an IDE like Eclipse and view the data 
right there immediately when it's read?  Also search Nabble for UTF-8 
and encoding.  I got this hit:

http://old.nabble.com/how-to-read-unicode-data-from-xls-sheet-td13601249.html#a13604018

chris...

p.s. Gruess aus New Jersey!


Felix Dreher wrote:
> Hi all,
> 
> I'm trying to read Excel files (classes HSSFWorkbook, HSSFSheet, 
> HSSFCell etc.). There are two issues that I'm currently struggling with:
> 
> 1) non-ASCII characters like é or ü are read correctly on Windows, but 
> not on Linux (where these characters are all converted to question 
> marks). I tried to use the "Locale.setDefault(myLocale)" method, but 
> with no success.
> 
> 2) The tables I would like to read contain lots of whole numbers, which 
> I would like to treat as strings because they are IDs or order numbers 
> (i.e. I would like to simply read all cells in the same way as Excel 
> displays them). However, they get read as decimal numbers, e.g. the 
> number 1234 in an Excel table (formatted as 'General') becomes 1234.0 in 
> its POI representation.
> I tried two workarounds, but they didn't work:
> a) open the Excel file, format all cells as Text, save.
> b) use the HSSFDataFormatter class to define a 'Decimal format' with 
> actually no decimals (e.g. DecimalFormat df = new DecimalFormat ("#"))
> 
> Any help would be greatly appreciated!
> 
> Thank you very much,
> Felix


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org