Posted to dev@poi.apache.org by bu...@apache.org on 2012/11/01 09:01:59 UTC

[Bug 54084] New: Some Unicode chars(e.g chinese chars) are not written corectly in xlsx file.

https://issues.apache.org/bugzilla/show_bug.cgi?id=54084

          Priority: P2
            Bug ID: 54084
          Assignee: dev@poi.apache.org
           Summary: Some Unicode chars(e.g chinese chars) are not written
                    corectly in xlsx file.
          Severity: normal
    Classification: Unclassified
          Reporter: l_alexandra2010@yahoo.com
          Hardware: PC
            Status: NEW
           Version: 3.8
         Component: SXSSF
           Product: POI

Set the value of an SXSSFCell to a string that contains Chinese characters:
cell.setCellValue("

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 54084] Some Unicode chars(e.g chinese chars) are not written corectly in xlsx file.

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=54084

--- Comment #7 from sumedh <su...@gmail.com> ---
Created attachment 30251
  --> https://issues.apache.org/bugzilla/attachment.cgi?id=30251&action=edit
Greek alphabet beyond BMP

Please find attached a UTF-16 (little-endian) file with Greek characters from
beyond the Basic Multilingual Plane.


[Bug 54084] Some Unicode chars(e.g chinese chars) are not written corectly in xlsx file.

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=54084

Alexandra Luca <l_...@yahoo.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW


[Bug 54084] Some Unicode chars(e.g chinese chars) are not written corectly in xlsx file.

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=54084

--- Comment #5 from sumedh <su...@gmail.com> ---
I also found that supplementary characters (encoded as surrogate pairs in
UTF-16) are not written correctly.

For example, the character "\uD835\uDF4B", the 4-byte surrogate-pair encoding
(big endian) of Unicode U+1D74B, "mathematical bold italic small phi", is
converted to ? when it is exported to Excel.
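
The relationship between the two UTF-16 code units and the single code point
described above can be checked with the JDK alone. This is a minimal sketch;
the class and method names are made up for illustration and nothing here is
POI code.

```java
final class SurrogatePairDemo {
    /** Combines a UTF-16 high/low surrogate pair into one Unicode code point. */
    static int decode(char high, char low) {
        if (!Character.isSurrogatePair(high, low)) {
            throw new IllegalArgumentException("not a valid surrogate pair");
        }
        return Character.toCodePoint(high, low);
    }

    public static void main(String[] args) {
        String phi = "\uD835\uDF4B";
        // Two char units, but only one code point.
        System.out.println(phi.length());                        // 2
        System.out.println(phi.codePointCount(0, phi.length())); // 1
        System.out.printf("U+%X%n", decode(phi.charAt(0), phi.charAt(1))); // U+1D74B
    }
}
```

Any code that processes such a string one char at a time sees two surrogate
halves rather than one character, which is a common way for supplementary
characters to get lost.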


[Bug 54084] Some Unicode chars(e.g chinese chars) are not written corectly in xlsx file.

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=54084

Alexandra Luca <l_...@yahoo.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 OS|                            |All

--- Comment #1 from Alexandra Luca <l_...@yahoo.com> ---
The Chinese characters are replaced with ? in the xlsx file:

??????खफआछ??????


[Bug 54084] Some Unicode chars(e.g chinese chars) are not written corectly in xlsx file.

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=54084

--- Comment #10 from stanescu florentina <st...@gmail.com> ---
What is the status of this defect? Is anybody still working on a fix?


[Bug 54084] Some Unicode chars(e.g chinese chars) are not written corectly in xlsx file.

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=54084

Yegor Kozlov <ye...@dinom.ru> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO

--- Comment #4 from Yegor Kozlov <ye...@dinom.ru> ---
I can't reproduce the problem with the latest build from trunk. Can you please
upload a unit test that demonstrates the problem?

I see that in the corrupted file Unicode characters are garbled, but as of
POI-3.9, we don't write raw Unicode: every character above ASCII is written in
the &#charCode; form, which means that the problem is most certainly fixed in
trunk.

Links to download nightly builds are on http://poi.apache.org/

Yegor

(In reply to comment #3)
> Created attachment 29537 [details]
> the 2 xlsx files
> 
> Here are 2 xlsx files.
> The first file (TestUnicode.xlsx) is used to load the data from it to the
> database.
> 
> The data is inserted correctly in the database, and is then displayed
> correctly on the UI.
> We are trying to export the same data to another xlsx file, but the
> characters are not encoded correctly. Both files use the same font
> (Calibri 11).
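
The &#charCode; form mentioned in comment #4 can be sketched as follows. This
is not POI's implementation; the class and method names are hypothetical, and
the sketch deliberately does not escape XML specials such as & or <. It
iterates by code point, so a surrogate pair becomes one numeric reference.

```java
final class NumericCharRefEscaper {
    /** Replaces every code point above 0x7F with a decimal &#N; reference. */
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);       // reads a full surrogate pair if present
            if (cp > 0x7F) {
                out.append("&#").append(cp).append(';');
            } else {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);       // advance by 1 or 2 char units
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeNonAscii("a\uD835\uDF4B")); // a&#120651;
    }
}
```

Escaping char by char instead would turn the same pair into
&#55349;&#57163;, and references to lone surrogates are not legal XML
characters. That is only an illustration of the pitfall, not a diagnosis of
this bug.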


[Bug 54084] Some Unicode chars(e.g chinese chars) are not written corectly in xlsx file.

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=54084

Nick Burch <ap...@gagravarr.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO

--- Comment #2 from Nick Burch <ap...@gagravarr.org> ---
Could you please upload a unit test that shows the problem?

Also, are you sure that you're correctly getting the characters into Java
without breaking the encoding, and are you sure that the font you're using can
correctly render the characters?
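
One way to check the first question is to round-trip the string through the
charset used on the way into Java. The sketch below uses only the JDK; the
class and method names are hypothetical. A charset that cannot represent a
character silently substitutes ?, which looks just like the reported symptom.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingLossDemo {
    /** Encodes text with the given charset and decodes it again. */
    static String roundTrip(String text, Charset charset) {
        // String.getBytes(Charset) replaces unmappable characters
        // with the charset's replacement bytes instead of failing.
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // Two Chinese characters and one Devanagari letter, written as
        // escapes so the source file's own encoding does not matter.
        String sample = "\u4E2D\u6587\u0916";
        System.out.println(roundTrip(sample, StandardCharsets.UTF_8).equals(sample)); // true
        System.out.println(roundTrip(sample, StandardCharsets.ISO_8859_1));           // ???
    }
}
```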


[Bug 54084] Some Unicode chars(e.g chinese chars) are not written corectly in xlsx file.

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=54084

--- Comment #8 from sumedh <su...@gmail.com> ---
Created attachment 30252
  --> https://issues.apache.org/bugzilla/attachment.cgi?id=30252&action=edit
Greek alphabet beyond BMP - Manually created xlsx

Please find attached a manually created Excel file with these characters. MS
Excel correctly writes the values to the shared string table. SXSSF writes
???? (inline) for them.


[Bug 54084] Some Unicode chars(e.g chinese chars) are not written corectly in xlsx file.

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=54084

Dominik Stadler <do...@gmx.at> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW

--- Comment #9 from Dominik Stadler <do...@gmx.at> ---
I worked on reproducing the reported problems with Greek characters. This seems
to happen when loading shared strings from the XLSX file. The XML file is
encoded correctly (UTF-8 codes, e.g. from
http://www.fileformat.info/info/unicode/char/1d74a/index.htm), and the
characters appear in OpenOffice and when opening the file in a text editor.

Initial loading of the Workbook using XSSF also works and the cell contains the
necessary data. However, after writing the data out and reading it back in, it
no longer matches.

As far as I can see, the shared strings are read incorrectly, which breaks
writing the data back out.

I could debug the code as far as the point where xmlbeans handles the string,
where it seems to be fine. As soon as SstDocumentImpl takes over, it seems to
become corrupted, but I currently cannot debug there because the .class files
are stripped... :(

For now I have added a test case, testBug54084Unicode(), to the special test
class TestUnfixedBugs.java. It verifies the problem; no fix is available yet.


[Bug 54084] Some Unicode chars(e.g chinese chars) are not written corectly in xlsx file.

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=54084

--- Comment #11 from Yaniv Kunda <ya...@kundas.net> ---
I've tried to debug it using POI's TestUnfixedBugs, but the loss is happening
deep inside XMLBeans.
It is probably due to https://issues.apache.org/jira/browse/XMLBEANS-332


[Bug 54084] Some Unicode chars(e.g chinese chars) are not written corectly in xlsx file.

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=54084

--- Comment #6 from Nick Burch <ap...@gagravarr.org> ---
If you write that character in Excel, how does Excel encode it to the file?
(Might be worth checking both the raw xml inside the .xlsx, and how POI sees
it)
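
Since an .xlsx file is a ZIP package, the raw XML can be inspected with the
JDK's own ZIP support. This is a minimal sketch: the class and method names
are hypothetical, the part name xl/sharedStrings.xml is the conventional
location of the shared string table and is assumed here, the part is assumed
to be UTF-8 encoded, and InputStream.readAllBytes() needs Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

final class XlsxPartReader {
    /** Returns the text of one part (ZIP entry) of an .xlsx package. */
    static String readPart(Path xlsx, String partName) throws IOException {
        try (ZipFile zip = new ZipFile(xlsx.toFile())) {
            ZipEntry entry = zip.getEntry(partName);
            if (entry == null) {
                throw new IOException("no such part: " + partName);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                // Assumption: the part is UTF-8, as its XML declaration should state.
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java XlsxPartReader some-file.xlsx
        System.out.println(readPart(Paths.get(args[0]), "xl/sharedStrings.xml"));
    }
}
```

Comparing this output for the Excel-written file and the POI-written file
shows whether the character is stored raw, as a numeric reference, or already
as ?.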


[Bug 54084] Some Unicode chars(e.g chinese chars) are not written corectly in xlsx file.

Posted by bu...@apache.org.
https://issues.apache.org/bugzilla/show_bug.cgi?id=54084

--- Comment #3 from Alexandra Luca <l_...@yahoo.com> ---
Created attachment 29537
  --> https://issues.apache.org/bugzilla/attachment.cgi?id=29537&action=edit
the 2 xlsx files

Here are 2 xlsx files.
The first file (TestUnicode.xlsx) is used to load the data from it to the
database.

The data is inserted correctly in the database, and is then displayed
correctly on the UI.
We are trying to export the same data to another xlsx file, but the characters
are not encoded correctly. Both files use the same font (Calibri 11).
