Posted to user@poi.apache.org by "Koundinya (Sudhakar Chavali)" <su...@yahoo.com> on 2004/03/31 10:30:36 UTC

MS Word 2.X to Text (Source Code)

Hello All,


This is sample code for parsing MS Word 2.x documents.
Please let me know if I should make any changes to it. Your
help is always welcome and appreciated.


Thanks & Regards,
Sudhakar


//Beginning of Source  Code





/**
 * <p>Title: Word Document Parser</p>
 * <p>Description: This parser extracts the text of Microsoft Word
 * documents of version 2.0</p>
 * <p>Copyright: Open Source Code</p>
 * @author Sudhakar Chavali Sharma
 * @version 1.0
 */

public class Word2 {
  public Word2() {
  }

  public static void main(String[] args) throws Exception{
    Word2 word21 = new Word2();
    System.out.println(word21.getText(args[0])) ;
  }
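  /**
   * Takes the document name as an argument and reads the
   * document to get the parsed text.
   * @param file path of the Word 2.x document
   * @return String
   * @throws java.lang.Exception
   */
  public String getText(String file) throws Exception
  {
    // Read the whole file; available() is not guaranteed to report the full size.
    byte bytes[] = java.nio.file.Files.readAllBytes(java.nio.file.Paths.get(file));
    // Decode as ISO-8859-1 so that every byte maps to exactly one char (0-255).
    // The earlier new String(bytes, length) call selected the deprecated
    // String(byte[], int hibyte) constructor and corrupted the characters.
    String buffer = new String(bytes, "ISO-8859-1");
    return ParseWord2(buffer, buffer.length());
  }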

  /**
   * Parses the Word document (version 2.0) buffer into a plain
   * text buffer.
   * @param sourceBuffer
   * @param sourceLength
   * @return String
   */
  String ParseWord2(String sourceBuffer, long sourceLength) {

    int counter; //source buffer pointer
    long quitcounter; //pointer to quit the parsing
    int incrementer; // general incrementer, used in loops
    String destinationString; //destination string;
    counter = 384; //starting position of text
    /*
     Traverse the buffer until the pointer reaches the document length
     */
    destinationString = "";
    while (counter < sourceLength) {
      quitcounter = 0;
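      if (sourceBuffer.charAt(counter) == 0) {
        // Count the zero bytes that follow; ten in a row end the parsing.
        // The bounds check avoids reading past the end of the buffer.
        for (incrementer = 1; incrementer <= 10
             && counter + incrementer < sourceLength; incrementer++) {
          if (sourceBuffer.charAt(counter + incrementer) == 0) {
            quitcounter = quitcounter + 1;
          }
          else {
            break;
          }
        }
      }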
      if (quitcounter >= 10) {

        break;
      }

      if (sourceBuffer.charAt(counter) == 19) {
        // Disabled extra condition: && (sourceBuffer[counter+1]='t') &&
        // (sourceBuffer[counter+2]='o') && (sourceBuffer[counter+3]='c')
        counter = counter + 1;
        // Skip everything up to and including character 20.
        while (counter < sourceLength) {
          if (sourceBuffer.charAt(counter) == 20) {
            counter = counter + 1;
            break;
          }
          counter = counter + 1;
        }
        // Copy the text that follows, up to character 21.
        while (counter < sourceLength) {
          if (sourceBuffer.charAt(counter) == 21) {
            counter = counter + 1;
            break;
          }
          destinationString = destinationString +
              (char) sourceBuffer.charAt(counter);
          counter = counter + 1;
        }
      }
      else {
        if ( (counter + 1 < sourceLength) &&
            (sourceBuffer.charAt(counter) == 13) &&
            (sourceBuffer.charAt(counter + 1) == 7)) {
          if ( (counter + 3 < sourceLength) &&
              (sourceBuffer.charAt(counter + 2) == 13) &&
              (sourceBuffer.charAt(counter + 3) == 7)) {
            /* This is a row break in a table */

            destinationString = destinationString + (char) 13;
            destinationString = destinationString + (char) 10;
            counter = counter + 4;
          }
          else {
            /* This is a column break in a table */

            destinationString = destinationString + (char) 9;
            counter = counter + 2;
          }
        }
        else {
          // This is for column breaks
          if ( (counter + 2 < sourceLength) &&
              (sourceBuffer.charAt(counter) == 13) &&
              (sourceBuffer.charAt(counter + 1) == 10) &&
              (sourceBuffer.charAt(counter + 2) == 14)) {
            destinationString = destinationString + (char) 13;
            destinationString = destinationString + (char) 10;
            counter = counter + 3;
          }
          else if ( (counter + 2 < sourceLength) &&
                   (sourceBuffer.charAt(counter) == 13) &&
                   (sourceBuffer.charAt(counter + 1) == 10) &&
                   (sourceBuffer.charAt(counter + 2) == 12)) {
            /* This is a page break */
            destinationString = destinationString + (char) 13;
            destinationString = destinationString + (char) 10;
            counter = counter + 3;
          }
          else {
            /* Normal flow of characters */
            if (sourceBuffer.charAt(counter) != 0) {
              destinationString = destinationString +
                  (char) sourceBuffer.charAt(counter);
            }
            counter = counter + 1;
          }
        }
      }
    }
    return destinationString;
  }
}





// End of Source Code
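
A minimal sketch of the same scanning idea, not part of the code posted above. It collects the output in a StringBuilder instead of repeated String concatenation and checks bounds before every look-ahead. The class name Word2ScanSketch and the method extract are hypothetical. It covers only the 19/20/21 markers and the 13/7 table marks handled above, leaves out the run-of-zeros stop condition, and assumes the buffer was decoded one byte per char.

```java
class Word2ScanSketch {

  // Returns the text found in buf, scanning from index start.
  static String extract(String buf, int start) {
    StringBuilder out = new StringBuilder();
    int n = buf.length();
    int i = start;
    while (i < n) {
      char c = buf.charAt(i);
      if (c == 19) {
        // Skip everything up to the separator (20) ...
        i++;
        while (i < n && buf.charAt(i) != 20) {
          i++;
        }
        i++; // step over the separator
        // ... then copy the text up to the end mark (21).
        while (i < n && buf.charAt(i) != 21) {
          out.append(buf.charAt(i));
          i++;
        }
        i++; // step over the end mark
      } else if (c == 13 && i + 1 < n && buf.charAt(i + 1) == 7) {
        if (i + 3 < n && buf.charAt(i + 2) == 13 && buf.charAt(i + 3) == 7) {
          out.append("\r\n"); // row break in a table
          i += 4;
        } else {
          out.append('\t'); // column break in a table
          i += 2;
        }
      } else {
        if (c != 0) {
          out.append(c); // normal character; zero bytes are dropped
        }
        i++;
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    // Hand-made sample: two table cells, a row break, one 19/20/21 span, plain text.
    String sample = "A\r\u0007" + "B\r\u0007\r\u0007"
        + "\u0013toc\u0014" + "C" + "\u0015" + "D\u0000";
    String text = extract(sample, 0);
    System.out.println(text.replace("\t", "<TAB>").replace("\r\n", "<CRLF>"));
  }
}
```

With a real document the start index would be 384, as in ParseWord2 above; the sample in main starts at 0 only because it is a hand-made string.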

=====
"No one can earn a million dollars honestly."- William Jennings Bryan (1860-1925) 

"Make everything as simple as possible, but not simpler."- Albert Einstein (1879-1955)

"It is dangerous to be sincere unless you are also stupid."- George Bernard Shaw (1856-1950)

---------------------------------------------------------------------
To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: poi-user-help@jakarta.apache.org


Re: MS Word 2.X to Text (Source Code)

Posted by "Koundinya (Sudhakar Chavali)" <su...@yahoo.com>.
I didn't get you.

I haven't understood what you are expecting. Can you ask it in
more detail?


Thanks,
Sudhakar
--- supranormale <su...@msn.com> wrote:
> If I want only the text that the user wrote inside a Word
> document, how can I do that?


Re: Textmining Memory Performance

Posted by Ryan Ackley <sa...@cfl.rr.com>.
Send me your document; there may be some other problem. I don't think the
StringBuffer would cause that, because text doesn't require that much memory
(3MB of text == about 5 books).

-Ryan

----- Original Message ----- 
From: "Koundinya (Sudhakar Chavali)" <su...@yahoo.com>
To: "POI Users List" <po...@jakarta.apache.org>
Sent: Saturday, April 03, 2004 1:59 AM
Subject: Textmining Memory Performance


> Hi Ryan,
>
> When I tested the application with large documents, it raised
> java.lang.OutOfMemoryError. When I analysed your source code, I
> found that you are using StringBuffer. StringBuffer is worthwhile
> for performance, but I think its memory use is a serious issue.
>
> When I did a test run on StringBuffer memory usage, the JVM
> raised an exception once the buffer reached 13786405 bytes, even
> though I have plenty of free physical memory (I am using 256 MB
> of RAM). I think this much RAM should be enough to parse a 15 MB
> document, but it is not. Any ideas for improving the performance
> of the application in this scenario?
>
> Thanks,
> sudhakar


Textmining Memory Performance

Posted by "Koundinya (Sudhakar Chavali)" <su...@yahoo.com>.
Hi Ryan,

When I tested the application with large documents, it raised
java.lang.OutOfMemoryError. When I analysed your source code, I
found that you are using StringBuffer. StringBuffer is worthwhile
for performance, but I think its memory use is a serious issue.

When I did a test run on StringBuffer memory usage, the JVM
raised an exception once the buffer reached 13786405 bytes, even
though I have plenty of free physical memory (I am using 256 MB
of RAM). I think this much RAM should be enough to parse a 15 MB
document, but it is not. Any ideas for improving the performance
of the application in this scenario?

Thanks,
sudhakar
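
A rough back-of-the-envelope sketch of how a buffer of that size could exhaust the heap. It is an estimate under stated assumptions, not a measurement of the textmining code. It assumes the buffer stores two bytes per char, grows by allocating a new array of about twice the old capacity while the old one is still live, and it ignores object headers. It also treats the reported 13786405 as a character count. The class and method names are hypothetical.

```java
class BufferGrowthEstimate {

  // Assumption: the buffer is backed by a char[] with two bytes per char.
  static final long BYTES_PER_CHAR = 2;

  // Bytes live while the buffer grows: the old array plus the new,
  // roughly doubled one, because the contents must be copied across.
  static long peakBytesDuringGrowth(long oldCapacityChars) {
    long newCapacityChars = oldCapacityChars * 2 + 2; // assumed growth rule
    return (oldCapacityChars + newCapacityChars) * BYTES_PER_CHAR;
  }

  public static void main(String[] args) {
    long chars = 13786405L; // size reported in the message above
    System.out.println("estimated peak bytes during growth: "
        + peakBytesDuringGrowth(chars));
    // The heap limit is set by the JVM (see -Xmx), not by physical RAM.
    System.out.println("JVM max heap in bytes: "
        + Runtime.getRuntime().maxMemory());
  }
}
```

If the maximum heap reported by the second line is smaller than the estimated peak, an OutOfMemoryError is possible even with free physical memory. Raising the limit with -Xmx, or pre-sizing the buffer with new StringBuffer(expectedLength) to avoid the doubling step, are two things to try.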


Re: MS Word 2.X to Text (Source Code)

Posted by supranormale <su...@msn.com>.
If I want only the text that the user wrote inside a Word document, how can I do that?