Posted to dev@poi.apache.org by bu...@apache.org on 2003/03/10 13:59:26 UTC
DO NOT REPLY [Bug 17824] New: -
about reading ms. doc file
DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://nagoya.apache.org/bugzilla/show_bug.cgi?id=17824>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND
INSERTED IN THE BUG DATABASE.
Summary: about reading ms. doc file
Product: POI
Version: unspecified
Platform: Sun
OS/Version: Other
Status: NEW
Severity: Major
Priority: Other
Component: HDF
AssignedTo: poi-dev@jakarta.apache.org
ReportedBy: tdyildirim@yahoo.com
I have a serious problem when reading an MS doc file with the HDF classes. If
the data is not Unicode and contains only English characters, there is no
problem. But when the document uses Unicode or UTF-8 characters, not all of the
data is read: reading stops partway through the text. For example, if the
demo.doc document contains something like: ����
then reading it returns: ���
and the loss keeps increasing like this.
My example is given below.
public class Deneme {
    public static void main(String[] args) {
        testDoc deneme = new testDoc("demo.doc", "demo.txt");
        deneme.getText();
    }
}
-----------------------------
//------- this code writes the doc file to txt -------
//------- get the HDF libs from jakarta.poi (scratchpad at the moment) -------
import org.apache.poi.hdf.extractor.util.*;
import org.apache.poi.hdf.extractor.data.*;
import org.apache.poi.hdf.extractor.*;
import java.util.*;
import java.io.*;
import javax.swing.*;
import java.awt.*;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.apache.poi.poifs.filesystem.POIFSDocument;
import org.apache.poi.poifs.filesystem.DocumentEntry;
import org.apache.poi.util.LittleEndian;
class testDoc extends Deneme {
    String origFileName;
    String tempFile;
    WordDocument wd;

    testDoc(String origFileName, String tempFile) {
        this.tempFile = tempFile;
        this.origFileName = origFileName;
    }

    public void getText() {
        try {
            wd = new WordDocument(origFileName);
            //Writer out = new BufferedWriter(new FileWriter(tempFile)); // the old version
            Writer out = new OutputStreamWriter(new FileOutputStream(tempFile), "utf-8");
            wd.writeAllText(out);
            out.flush();
            out.close();
        }
        catch (Exception eN) {
            System.out.println("Error reading document:" + origFileName + "\n" + eN.toString());
        }
    } // end of getText
} // end of class
------------------------
The problem starts in
wd.writeAllText(out);
Looking at this method, the end integer does not reach the real end point of
the text when the MS doc file contains Unicode text.
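To help narrow this down, the output side can be ruled out separately. The following is only a minimal sketch using the JDK alone, not part of POI: the class name Utf8RoundTrip, the method roundTrip, and the four-letter sample string are all hypothetical. It pushes a non-ASCII string through the same kind of UTF-8 OutputStreamWriter used in getText() and decodes the bytes again. If the character counts match, the writer is presumably not where the text is lost, which would be consistent with the loss happening inside writeAllText.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {

    // Writes the text through a UTF-8 OutputStreamWriter into memory,
    // then decodes the resulting bytes back into a String.
    static String roundTrip(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(bytes, StandardCharsets.UTF_8)) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical sample: four non-ASCII letters, not the text from demo.doc.
        String sample = "\u015F\u011F\u00FC\u0131";
        String back = roundTrip(sample);
        System.out.println(sample.length() + " characters in, "
                + back.length() + " characters out");
    }
}
```

Running the same comparison on the real demo.txt (reading it back as UTF-8 and counting characters) would show how many characters are missing for a given document.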
Thank you for your support.