Posted to dev@poi.apache.org by bu...@apache.org on 2003/03/10 13:59:26 UTC

DO NOT REPLY [Bug 17824] New: - about reading ms. doc file

DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG 
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://nagoya.apache.org/bugzilla/show_bug.cgi?id=17824>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND 
INSERTED IN THE BUG DATABASE.

http://nagoya.apache.org/bugzilla/show_bug.cgi?id=17824

about reading ms. doc file

           Summary: about reading ms. doc file
           Product: POI
           Version: unspecified
          Platform: Sun
        OS/Version: Other
            Status: NEW
          Severity: Major
          Priority: Other
         Component: HDF
        AssignedTo: poi-dev@jakarta.apache.org
        ReportedBy: tdyildirim@yahoo.com


When I read an MS Word .doc file using the HDF classes, I run into a serious problem. If the document contains only non-Unicode English characters, everything works fine. But when the document contains Unicode (UTF-8) text, not all of the data is read: extraction stops partway through. For example, if demo.doc contains ����, the extracted output contains only ���, and the amount of missing text grows with the size of the document.

My example code is given below:

public class Deneme {

	public static void main(String[] args) {
		testDoc deneme = new testDoc("demo.doc", "demo.txt");
		deneme.getText();
	}
}

-----------------------------
//------- this code writes a doc file to txt -----------
//------- get the HDF libs from jakarta.poi (scratchpad at the moment) -------
//------------------------------------------------------------------------------
import org.apache.poi.hdf.extractor.*;
import java.io.*;

class testDoc extends Deneme {
	String origFileName;
	String tempFile;
	WordDocument wd;

	testDoc(String origFileName, String tempFile) {
		this.tempFile = tempFile;
		this.origFileName = origFileName;
	}

	public void getText() {
		try {
			wd = new WordDocument(origFileName);
			// Writer out = new BufferedWriter(new FileWriter(tempFile)); // the old version
			Writer out = new OutputStreamWriter(new FileOutputStream(tempFile), "utf-8");

			wd.writeAllText(out);
			out.flush();
			out.close();
		}
		catch (Exception eN) {
			System.out.println("Error reading document: " + origFileName + "\n" + eN.toString());
		}
	} // end of getText

} // end of class
 
 
------------------------
The problem starts in wd.writeAllText(out): looking at that method, the end integer does not reach the true end point when the .doc file contains Unicode text.
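A plausible cause (my guess only, not confirmed against the POI source): Word stores a text run either compressed at one byte per character or as UTF-16LE at two bytes per character, so an end offset counted in bytes reads only half the characters of a Unicode run. The class name and numbers below are purely illustrative; this minimal sketch just shows the byte-count vs. character-count mismatch:

```java
import java.io.UnsupportedEncodingException;

public class OffsetDemo {
	public static void main(String[] args) throws UnsupportedEncodingException {
		String text = "abcd"; // 4 characters

		// Compressed (8-bit) storage: 1 byte per character.
		byte[] compressed = text.getBytes("ISO-8859-1");

		// Unicode storage: UTF-16LE, 2 bytes per character.
		byte[] unicode = text.getBytes("UTF-16LE");

		System.out.println(compressed.length); // prints 4
		System.out.println(unicode.length);    // prints 8

		// If an end offset measured in bytes is reused as a character
		// count, it is correct for 8-bit text but must be halved for
		// Unicode text -- otherwise the reader overshoots or, in the
		// reverse direction, stops after only part of the run.
		int charsInCompressed = compressed.length;
		int charsInUnicode = unicode.length / 2;
		System.out.println(charsInCompressed + " " + charsInUnicode); // prints 4 4
	}
}
```

If writeAllText computes its end position this way for compressed text only, that would match the symptom of Unicode documents being cut short.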

Thank you for your support.