Posted to dev@poi.apache.org by ra...@epiphany.com on 2003/03/10 19:55:48 UTC

RE: DO NOT REPLY [Bug 17824] New: - about reading ms. doc file

Which build are you using? The WordDocument class has been deprecated in 1.10. I
am facing the same issue with 1.10, where UTF-8 text read from the input stream
comes out as junk characters in the output file.

Code snippet attached:

    public void testWordUsingPOI() throws Exception {
        FileInputStream inputStream = null;
        FileOutputStream outputStream = null;
        long startTime = System.currentTimeMillis();
        try {
            inputStream = new FileInputStream(
                    "C:\\KMS\\TEXTMINING\\TextExtraction\\tests\\input\\word\\Chinese.doc");
            outputStream = new FileOutputStream(
                    "C:\\KMS\\TEXTMINING\\TextExtraction\\tests\\output\\chinese.doc");
        } catch (FileNotFoundException fnfe) {
            fnfe.printStackTrace();
        } catch (Exception e) {
            e.printStackTrace();
        }
        WordExtractor wordExtractor = new WordExtractor();
        if (inputStream != null) {
            long intermediateTime = System.currentTimeMillis();
            String output = wordExtractor.extractText(inputStream);
            long timeUsedOnlyForExtraction = System.currentTimeMillis() - intermediateTime;
            System.out.println("Time for only extraction " + timeUsedOnlyForExtraction);
            try {
                BufferedWriter out = new BufferedWriter(
                        new OutputStreamWriter(outputStream, "UTF-8"));
                out.write(output);
                out.flush();
                out.close();
            } catch (IOException ioe) {
                ioe.printStackTrace();
            }
        }
    }

Has anyone tried extracting text from UTF-8 Word or Excel files?
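For what it's worth, the UTF-8 writing side can be checked in isolation with plain JDK classes, no POI involved. This is a minimal sketch (class name and sample string are my own) confirming that an OutputStreamWriter constructed with an explicit "UTF-8" charset, as in the snippet above, round-trips non-ASCII text without producing junk characters:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class Utf8RoundTrip {
    public static void main(String[] args) throws Exception {
        String original = "English plus non-ASCII: ýýüü 中文";

        // Write the string with an explicit UTF-8 encoder, as in the
        // testWordUsingPOI snippet above.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        Writer out = new OutputStreamWriter(bytes, "UTF-8");
        out.write(original);
        out.close();

        // Read it back with the same charset; a charset mismatch at this
        // step is what typically produces "junk characters" in the file.
        BufferedReader in = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(bytes.toByteArray()), "UTF-8"));
        String roundTripped = in.readLine();
        in.close();

        System.out.println(original.equals(roundTripped)); // prints true
    }
}
```

If this round-trip works but the extracted file is still garbled, the damage is happening before the writer, i.e. inside the extraction itself.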



-----Original Message-----
From: bugzilla@apache.org [mailto:bugzilla@apache.org]
Sent: Monday, March 10, 2003 4:59 AM
To: poi-dev@jakarta.apache.org
Subject: DO NOT REPLY [Bug 17824] New: - about reading ms. doc file


DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG 
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://nagoya.apache.org/bugzilla/show_bug.cgi?id=17824>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND 
INSERTED IN THE BUG DATABASE.

http://nagoya.apache.org/bugzilla/show_bug.cgi?id=17824

about reading ms. doc file

           Summary: about reading ms. doc file
           Product: POI
           Version: unspecified
          Platform: Sun
        OS/Version: Other
            Status: NEW
          Severity: Major
          Priority: Other
         Component: HDF
        AssignedTo: poi-dev@jakarta.apache.org
        ReportedBy: tdyildirim@yahoo.com


When I read an MS Word .doc file using the HDF classes, I hit a big problem.
If my data is not Unicode and contains only English characters, there is no
problem. But when the document uses Unicode or the UTF-8 charset, not all of
the data is read; the extraction stops before the end. For example, if
demo.doc contains:  ýýüü
then reading it back we get:  ýýü
and the amount lost keeps increasing like this.

My example is given below:

public class Deneme {

    public static void main(String[] args) {
        testDoc deneme = new testDoc("demo.doc", "demo.txt");
        deneme.getText();
    }
}

// ---------------------------------------------------------------------------
// This code writes a .doc file out as txt.
// Get the HDF libs from jakarta POI (in the scratchpad at the moment).
// ---------------------------------------------------------------------------
import org.apache.poi.hdf.extractor.util.*;
import org.apache.poi.hdf.extractor.data.*;
import org.apache.poi.hdf.extractor.*;
import java.util.*;
import java.io.*;
import javax.swing.*;

import java.awt.*;

import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.apache.poi.poifs.filesystem.POIFSDocument;
import org.apache.poi.poifs.filesystem.DocumentEntry;

import org.apache.poi.util.LittleEndian;

class testDoc extends Deneme {
    String origFileName;
    String tempFile;
    WordDocument wd;

    testDoc(String origFileName, String tempFile) {
        this.tempFile = tempFile;
        this.origFileName = origFileName;
    }

    public void getText() {
        try {
            wd = new WordDocument(origFileName);
            // Writer out = new BufferedWriter(new FileWriter(tempFile)); // the old version
            Writer out = new OutputStreamWriter(new FileOutputStream(tempFile), "utf-8");
            wd.writeAllText(out);
            out.flush();
            out.close();
        } catch (Exception eN) {
            System.out.println("Error reading document: " + origFileName
                    + "\n" + eN.toString());
        }
    } // end of getText

} // end of class
 
 
------------------------
The problem starts in
wd.writeAllText(out);
when we look at this method, we see that the end integer does not reach the
true end point when the document is a Unicode MS Word file.
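I don't know the HDF internals, so the exact arithmetic below is a guess, but the symptom (characters missing from the tail of the text, with the loss growing as more non-ASCII characters appear) is the classic signature of a byte count being used where a character count is needed, or vice versa. Unicode text in a .doc file is stored as two bytes per character (UTF-16LE), so an end offset computed in the wrong unit truncates the decoded string. A hypothetical illustration of that failure mode, using only the JDK:

```java
public class TruncationDemo {
    public static void main(String[] args) throws Exception {
        String text = "ýýüü"; // 4 characters, as in demo.doc

        // Unicode text in a .doc file is stored as UTF-16LE:
        // two bytes per character for these code points.
        byte[] stored = text.getBytes("UTF-16LE"); // 8 bytes

        // Suppose the end offset holds the character count (4) but is
        // treated as a byte count: only the first 4 bytes, i.e. the
        // first 2 characters, get decoded.
        int wrongEnd = text.length(); // 4, but 8 was needed
        String truncated = new String(stored, 0, wrongEnd, "UTF-16LE");

        System.out.println(truncated); // prints "ýý" -- the tail is lost
    }
}
```

If that is what is happening, the fix would be to scale the end offset by the character width (or halve a byte offset before using it as a character index) whenever the text run is marked as Unicode.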

Thank you for your support.

---------------------------------------------------------------------
To unsubscribe, e-mail: poi-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: poi-dev-help@jakarta.apache.org