Posted to dev@poi.apache.org by ra...@epiphany.com on 2003/03/10 19:55:48 UTC
RE: DO NOT REPLY [Bug 17824] New: - about reading ms. doc file
Which build are you using? The WordDocument class has been deprecated in
1.10. I am facing the same issue with 1.10: the UTF-8 input stream gets
converted to junk characters in the output file.
code snippet attached:
public void testWordUsingPOI() throws Exception {
    FileInputStream inputStream = null;
    FileOutputStream outputStream = null;
    long startTime = System.currentTimeMillis();
    try {
        inputStream = new FileInputStream("C:\\KMS\\TEXTMINING\\TextExtraction\\tests\\input\\word\\Chinese.doc");
        outputStream = new FileOutputStream("C:\\KMS\\TEXTMINING\\TextExtraction\\tests\\output\\chinese.doc");
    } catch (FileNotFoundException fnfe) {
        fnfe.printStackTrace();
    } catch (Exception e) {
        e.printStackTrace();
    }
    WordExtractor wordExtractor = new WordExtractor();
    if (inputStream != null) {
        long intermediateTime = System.currentTimeMillis();
        String output = wordExtractor.extractText(inputStream);
        long timeUsedonlyForExtraction = System.currentTimeMillis() - intermediateTime;
        System.out.println("Time for only extraction " + timeUsedonlyForExtraction);
        try {
            BufferedWriter out = new BufferedWriter(new OutputStreamWriter(outputStream, "UTF-8"));
            out.write(output);
            out.flush();
            out.close();
        } catch (IOException ioe) {
            ioe.printStackTrace();
        }
    }
}
Has anyone tried extracting text from UTF-8 Word or Excel files?
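To narrow down where the junk characters come from, it may help to check the extracted String before it is written. The sketch below is only an illustration: EncodingCheck, dumpCodePoints and writeUtf8 are hypothetical names, not part of POI, and the sample string stands in for whatever the extractor returns. If the code points are already wrong, the problem is in the extraction; if they are right, the problem is in how the output file is written or viewed.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for diagnosis; not part of POI.
class EncodingCheck {

    // Renders each code point as U+XXXX so the string can be inspected
    // without depending on the console's or editor's encoding.
    static String dumpCodePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    // Writes the string through a writer with an explicit UTF-8 charset
    // and returns the bytes that would end up in the output file.
    static byte[] writeUtf8(String s) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(bytes, StandardCharsets.UTF_8)) {
            out.write(s);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Assumption: stand-in for the text returned by the extractor.
        String extracted = "\u4E2D\u6587";
        System.out.println(dumpCodePoints(extracted));

        // Round trip: decoding the written bytes as UTF-8 should give
        // back the same string if the write step is not at fault.
        byte[] written = writeUtf8(extracted);
        String readBack = new String(written, StandardCharsets.UTF_8);
        System.out.println("round trip ok: " + readBack.equals(extracted));
    }
}
```

If the round trip succeeds but the file still looks wrong, the viewer may simply be opening it with a different charset.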
-----Original Message-----
From: bugzilla@apache.org [mailto:bugzilla@apache.org]
Sent: Monday, March 10, 2003 4:59 AM
To: poi-dev@jakarta.apache.org
Subject: DO NOT REPLY [Bug 17824] New: - about reading ms. doc file
http://nagoya.apache.org/bugzilla/show_bug.cgi?id=17824
about reading ms. doc file
Summary: about reading ms. doc file
Product: POI
Version: unspecified
Platform: Sun
OS/Version: Other
Status: NEW
Severity: Major
Priority: Other
Component: HDF
AssignedTo: poi-dev@jakarta.apache.org
ReportedBy: tdyildirim@yahoo.com
When I read an MS Word .doc file using the HDF classes, I run into a big
problem. If my data is not Unicode and contains only English characters,
there is no problem. But when the document uses Unicode or UTF-8
characters, not all of the data is read: reading stops before the end of
the text. For example, if demo.doc contains:
ýýüü
then what we read back is:
ýýü
and the amount of missing text keeps increasing in this way.
My example is given below.
public class Deneme {
    public static void main(String[] args) {
        testDoc deneme = new testDoc("demo.doc", "demo.txt");
        deneme.getText();
    }
}
-----------------------------
//------- this code writes doc file to txt-----------
//------ go get hdf libs from jakarta.poi (scratchpad at the moment) ------
//--------------------------------------------------------------------------
import org.apache.poi.hdf.extractor.util.*;
import org.apache.poi.hdf.extractor.data.*;
import org.apache.poi.hdf.extractor.*;
import java.util.*;
import java.io.*;
import javax.swing.*;
import java.awt.*;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.apache.poi.poifs.filesystem.POIFSDocument;
import org.apache.poi.poifs.filesystem.DocumentEntry;
import org.apache.poi.util.LittleEndian;
class testDoc extends Deneme {
    String origFileName;
    String tempFile;
    WordDocument wd;

    testDoc(String origFileName, String tempFile) {
        this.tempFile = tempFile;
        this.origFileName = origFileName;
    }

    public void getText() {
        try {
            wd = new WordDocument(origFileName);
            //Writer out = new BufferedWriter(new FileWriter(tempFile)); // old version
            Writer out = new OutputStreamWriter(new FileOutputStream(tempFile), "utf-8");
            wd.writeAllText(out);
            out.flush();
            out.close();
        }
        catch (Exception eN) {
            System.out.println("Error reading document:" + origFileName + "\n" + eN.toString());
        }
    } // end of getText
} // end of class
------------------------
The problem starts in
wd.writeAllText(out);
When we look at this method, we see that the end integer does not reach
the actual end point when the MS Word file is Unicode.
Thank you for your support.
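One way an end offset can fall short on Unicode text is when a character count is used where a byte count is needed. The sketch below only illustrates that general idea, assuming the text run is stored as 16-bit little-endian characters. It has not been checked against the HDF source, it does not reproduce the exact ýýü output reported above, and OffsetSketch and decode are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; none of these names come from the HDF classes.
class OffsetSketch {

    // Decodes the first byteCount bytes of a 16-bit text run as UTF-16LE.
    static String decode(byte[] run, int byteCount) {
        return new String(run, 0, byteCount, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        // The sample text from the report, written with escapes so the
        // source file's own encoding does not matter.
        String text = "\u00FD\u00FD\u00FC\u00FC";
        byte[] run = text.getBytes(StandardCharsets.UTF_16LE);
        int charCount = text.length();

        // Assumed mistake: the character count is used as the byte length,
        // so only half of the 16-bit run is decoded.
        String tooShort = decode(run, charCount);

        // Two bytes per character covers the whole run.
        String complete = decode(run, charCount * 2);

        System.out.println("bytes in run:        " + run.length);
        System.out.println("chars when too short: " + tooShort.length());
        System.out.println("chars when complete:  " + complete.length());
    }
}
```

With 8-bit text the two counts are equal, which would be consistent with plain English documents reading correctly.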