Posted to slide-user@jakarta.apache.org by Ryan Rhodes <ry...@hotmail.com> on 2004/08/01 08:07:59 UTC

RE: MSPowerPointExtractor problem

Hi Ralph,

I haven't tested the PPT extractor with any other languages.  I remember
reading about other people having problems with different character sets
though.

Could you send a before and after example file here or to bugzilla?

-Ryan Rhodes


-----Original Message-----
From: Ralph Scheuer [mailto:ralph.scheuer@gmx.net] 
Sent: Wednesday, July 28, 2004 10:01 AM
To: slide
Subject: MSPowerPointExtractor problem

Hello everybody,

When I was searching for a Java class to extract text from PowerPoint 
files, I accidentally discovered Slide.

I pulled the MSPowerPointExtractor class and some other stuff it 
depends on via CVS and tried it for some text extraction.

The method I used looks very similar to the provided example main 
method (see below).

However, when I tried to extract text from a German PowerPoint 
presentation, I had some problems with the encoding. I did not know 
which encoding to use; converting the output to ISO Latin 1 with my 
text editor solved only part of the problem (some German umlauts were 
displayed correctly, some were not).

Is this a known issue or am I doing something wrong? Any hints for me?

Thanks in advance.

Ralph Scheuer

BTW, I am using Mac OS X 10.3.4 with JDK 1.4.2_03; the native encoding 
on this platform is MacRoman.
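
Which encoding a JVM treats as its default can be checked directly. The following is a minimal sketch, not code from the thread; the class and method names are made up for illustration. Charset.defaultCharset() only exists from Java 5 on, so on a 1.4 JVM like the one mentioned above only the file.encoding system property would be available.

```java
import java.nio.charset.Charset;

public class DefaultCharsetCheck {

    /** Name of the charset the JVM uses when no charset is given explicitly. */
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        // Both values depend on the platform and on JVM start-up options.
        System.out.println("default charset: " + defaultCharsetName());
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
    }
}
```

Calls such as String.getBytes() or new String(byte[]) without a charset argument fall back to this default, so their result can differ between Mac OS X, Windows and Linux.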


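     public static String contentStringForData(NSData data){
	
	StringBuffer buf = new StringBuffer();
	try{
	    ByteArrayInputStream input = data.stream();
	    MSPowerPointExtractor ex = new MSPowerPointExtractor(null, null);
	
	    Reader reader = ex.extract(input);
	
	    // Test for end of stream (-1) before appending, so that the
	    // end marker is not added to the text as (char)-1.
	    int c;
	    while( (c = reader.read()) != -1 )
		{
		    buf.append((char)c);
		}
	}catch(Exception e){
	    // Report the failure instead of silently returning partial text.
	    e.printStackTrace();
	}
	
	return buf.toString();
     }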

---------------------------------------------------------------------
To unsubscribe, e-mail: slide-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: slide-user-help@jakarta.apache.org

[Solved]: MSPowerPointExtractor problem

Posted by Ralph Scheuer <ra...@gmx.net>.

Hi everybody,

just a quick note:

In the meantime, I managed to solve the problem. Ryan's and Sudhakar's 
sources work flawlessly (at least with German special characters) after 
adding one additional method to the source (see below).

The extracted String needs to be interpreted with "Cp1252" encoding.

It may well be that this is a Mac-specific encoding problem - I cannot 
verify the Windows or Linux behavior here.

Anyway, the following code solved the problem for me.

Again, thanks for all the great work you have done.

Ralph Scheuer

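     private static String convertEncoding(String incoming){
         String outgoing = null;
         try {
             // getBytes() re-encodes the text with the platform default
             // charset (MacRoman here); the bytes are then decoded as Cp1252.
             outgoing = new String(incoming.getBytes(), "Cp1252");
         } catch (Exception e) {
             SDLogger.catchException(e);
         }
         return outgoing;
     }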
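
The effect of a conversion like this can be reproduced in isolation. The following is a minimal sketch, not code from the thread: it assumes the slide text is stored as Cp1252 bytes and that those bytes were decoded with the Mac default charset, which would match the characters reported in the browser output elsewhere in this thread. The class and method names are hypothetical, and both charsets are named explicitly instead of relying on the platform default. "x-MacRoman" is the JDK's name for that charset and may be missing from a stripped-down runtime.

```java
import java.nio.charset.Charset;

public class EncodingRepairSketch {

    static final Charset CP1252 = Charset.forName("windows-1252");
    // Assumption: the wrong charset applied during extraction was MacRoman.
    static final Charset MAC_ROMAN = Charset.forName("x-MacRoman");

    /** Decodes Cp1252 bytes with the wrong charset, producing garbled text. */
    static String misdecode(byte[] cp1252Bytes) {
        return new String(cp1252Bytes, MAC_ROMAN);
    }

    /** Recovers the original bytes from the garbled text and decodes them as Cp1252. */
    static String repair(String garbled) {
        return new String(garbled.getBytes(MAC_ROMAN), CP1252);
    }

    public static void main(String[] args) {
        String original = "\u00D6kologie, M\u00FChsal"; // "Ökologie, Mühsal"
        String garbled = misdecode(original.getBytes(CP1252));
        String repaired = repair(garbled);
        System.out.println(repaired.equals(original));
    }
}
```

The round trip is lossless here because MacRoman assigns a character to every byte value, so re-encoding the garbled text gives back exactly the original bytes. With a method that calls getBytes() without a charset, the same repair only works on a platform whose default charset is the one that caused the damage.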

PS: If there are no objections, I would like to contact the POI 
developer team and file a bug in bugzilla as I have the feeling that 
the code both of you have provided would be ideally suited for 
integrating some variant of it into the POI framework.

Kind regards.

Ralph

On 02.08.2004 at 13:13, Koundinya (Sudhakar Chavali) wrote:

> Hmmmmm,
>
> Basically we have concentrated on English language. So we never faced 
> any problems. It become a
> new task for our team now :-)
>
> Thanks to Ralph in pointing that problem.
>
> We Will work on related and let the Jakarta team knows :-)
>
> Regards
> Sudhakar
>



Re: MSPowerPointExtractor problem

Posted by "Koundinya (Sudhakar Chavali)" <su...@yahoo.com>.

Hmmmmm,

Basically, we have concentrated on the English language, so we never faced any problems. It has
become a new task for our team now :-) 

Thanks to Ralph for pointing out that problem.

We will work on it and let the Jakarta team know :-)

Regards
Sudhakar







=====
"No one can earn a million dollars honestly."- William Jennings Bryan (1860-1925) 

"Make everything as simple as possible, but not simpler."- Albert Einstein (1879-1955)

"It is dangerous to be sincere unless you are also stupid."- George Bernard Shaw (1856-1950)


	
		

Re: MSPowerPointExtractor problem

Posted by Ralph Scheuer <ra...@gmx.net>.

Ryan,

thanks for your reply.

I have also seen the posts from Sudhakar on this subject, who seems to 
be contributing a whole lot of code here, which is a great thing. The 
problem also persists in that code, though, so I think we should solve 
this encoding problem in your code first (which is simpler; the fix 
could later be integrated into Sudhakar's code if it is checked in).

I have tested this with a simple PPT file containing just the following 
text:

Umlaut-Test
Ökologie, Mühsal, Größe, Grätsche

I get the following console output with this text:

Umlaut-Test
\326kologie, M\374hsal, Gr\374\337e, Gr\344tsche

Here is the output I get in a web browser (through a web app, "view 
HTML source" mode):

Umlaut-Test ÷kologie, M¸hsal, Gr¸?e, Gr?tsche
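
The octal escapes in the console output can be read as raw byte values. The following is a minimal sketch, assuming the console printed each non-ASCII byte as a three-digit octal escape and that the bytes are Cp1252 text; the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

public class OctalEscapeDecoder {

    /** True if the characters in [from, to) are all octal digits. */
    static boolean isOctal(String s, int from, int to) {
        for (int i = from; i < to; i++) {
            char ch = s.charAt(i);
            if (ch < '0' || ch > '7') {
                return false;
            }
        }
        return true;
    }

    /** Turns escapes such as \326 into single bytes; other characters are assumed to be ASCII. */
    static byte[] toBytes(String escaped) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < escaped.length()) {
            char ch = escaped.charAt(i);
            if (ch == '\\' && i + 3 < escaped.length() && isOctal(escaped, i + 1, i + 4)) {
                out.write(Integer.parseInt(escaped.substring(i + 1, i + 4), 8));
                i += 4;
            } else {
                out.write(ch);
                i++;
            }
        }
        return out.toByteArray();
    }

    /** Decodes the escaped console text under the assumption that the bytes are Cp1252. */
    static String decode(String escaped) {
        return new String(toBytes(escaped), Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // The backslash is doubled so that Java does not treat \326 as an escape itself.
        System.out.println(decode("\\326kologie, M\\374hsal"));
    }
}
```

Octal 326 is 0xD6, which is Ö in Cp1252 but ÷ in MacRoman, and octal 374 is 0xFC, which is ü in Cp1252 but ¸ in MacRoman. That is consistent with the browser output above.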

German "umlaute" and other special characters work fine that way 
whenever I extract text from Word documents or Excel spreadsheets using 
POI and Ryan Ackley's TextMining framework.

Just for the record: I have only tested this on my own configuration 
(Mac OS X 10.3.4, Java 1.4.2_03), so I have no idea how these classes 
might behave on Linux or Windows. Can anybody confirm this? I have seen 
some German names on this list ;-)

Thanks for all the work you put into this.

Ralph Scheuer




Re: PPTExtractor Testing

Posted by Stephane James Vaucher <va...@cirano.qc.ca>.

I'll try it out this week, I'll let you know my results.

sv

On Tue, 3 Aug 2004, Koundinya (Sudhakar Chavali) wrote:

> Hello All,
>
> I think the world has started testing the code I provided for PowerPoint-to-text extraction.
> I would welcome any feedback or constructive criticism on the code so that I can modify
> it and share the result with you.
>
> Thanks & Best Regards
> Sudhakar Chavali Sharma
>


---------------------------------------------------------------------
To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: poi-user-help@jakarta.apache.org

Re: PPTExtractor Testing

Posted by "Koundinya (Sudhakar Chavali)" <su...@yahoo.com>.

Hello All,

I think the world has started testing the code I provided for PowerPoint-to-text extraction.
I would welcome any feedback or constructive criticism on the code so that I can modify
it and share the result with you.

Thanks & Best Regards
Sudhakar Chavali Sharma


Re: PPTExtractor Testing

Posted by Avik Sengupta <av...@itellix.com>.

Sudhakar, 

This is indeed very interesting, and surely useful to many people here.
A couple of questions as we think of a plan:

Do you intend for this code to be part of POI? If so, do you think you
will be around to maintain and enhance this code? 

Regards
-
Avik


On Mon, 2004-08-02 at 17:03, Koundinya (Sudhakar Chavali) wrote:
> To All,
> 
> In the earlier mails I have sent,
> 
> You find
> 
> PPT2Text, PPTSlide, PPTClientTextBox, PPTConstants, MSPowerPointDocumentParserException classes.
> Use the only specified classes and compile them.
> 
> For Testing here is an example
> 
> String docName="MyPPT.ppt"; //or it works for .pps files also
> PPT2Text ppt2Text = new PPT2Text(docName);
> ppt2Text.read();
> String pptText = ppt2Text.getText();
> 
> 
> 
> 



PPTExtractor Testing

Posted by "Koundinya (Sudhakar Chavali)" <su...@yahoo.com>.

To All,

In the earlier mails I sent, you will find the classes

PPT2Text, PPTSlide, PPTClientTextBox, PPTConstants and MSPowerPointDocumentParserException.
Use only these classes and compile them.

For testing, here is an example:

String docName="MyPPT.ppt"; //or it works for .pps files also
PPT2Text ppt2Text = new PPT2Text(docName);
ppt2Text.read();
String pptText = ppt2Text.getText();





Re: MSPowerPointExtractor problem

Posted by "Koundinya (Sudhakar Chavali)" <su...@yahoo.com>.

The main classes you can find there are

PPT2Text
PPTSlide
PPTConstants
PPTClientTextBox

Use only those classes.

The other classes serve other purposes.

Regards
Sudhakar






Re: MSPowerPointExtractor problem

Posted by Ralph Scheuer <ra...@gmx.net>.

Sudhakar,

thanks for this great contribution to the community.

I have already tested the code you provided, and while everything seems  
fine so far, there is one reference in the code that makes the compiler  
fail:

In the PPTDocument constructor, you refer to a class called  
MSDocumentSummary.

However, you did not provide this class. Did you intend to provide it  
or should that reference to it just be killed (thereby eliminating the  
summary feature)? Or is this something that still needs to be  
implemented?

Anyway, thanks again for sharing this code.

Ralph Scheuer


On 01.08.2004 at 22:44, Koundinya (Sudhakar Chavali) wrote:

> Hello All,
>
> This was my first contribution
> http://wiki.apache.org/jakarta-lucene-data/attachments/PowerPoint/attachments/PPT2Text.java
> for jakarta team. And it seems another expert (Ryan Rhodes, ryanshaerhodes@hotmail.com)
> has already started working on that based on my first given contribution.
>
> That sounds great to me.
>
> So In order to increase the development process for Powerpoint  
> extractor, I just wanted to
> contribute Our team efforts in developing the Powerpoint extractor
>
> Authors :- Sudhakar Chavali (sudhakar_koundinya@yahoo.com) and Hari  
> Shanker Goud
> (harigoud@yahoo.com)
>
>
> Have a look on the below source codes
>
>
> Regards
> Sudhakar



RE: MSPowerPointExtractor problem

Posted by "Koundinya (Sudhakar Chavali)" <su...@yahoo.com>.

Hello All,

This was my first contribution for the Jakarta team:
http://wiki.apache.org/jakarta-lucene-data/attachments/PowerPoint/attachments/PPT2Text.java
It seems another expert (Ryan Rhodes, ryanshaerhodes@hotmail.com) has already started working
on it, based on that first contribution.

That sounds great to me.

So, in order to speed up development of the PowerPoint extractor, I wanted to contribute our
team's work on it.

Authors: Sudhakar Chavali (sudhakar_koundinya@yahoo.com) and Hari Shanker Goud
(harigoud@yahoo.com)


Have a look at the source code below.


Regards
Sudhakar

____________________________________________________________________________________

/**
 * Title: DocumentParserException class
 * Description: Root exception class for the errors that can be raised
 * by the different parsers
 * @author Sudhakar
 * @version 1.0
 */

public class DocumentParserException
    extends Exception {

  /**
   * Constructs a new exception with null as its detail message.
   */

  public DocumentParserException() {
  }

  /**
   * Constructs a new exception with the specified detail message.
   * @param message
   */

  public DocumentParserException(String message) {
    super(message);
  }

  /**
   * Constructs a new exception with the specified detail message and cause.
   * @param message
   * @param cause
   */
  public DocumentParserException(String message, Throwable cause) {
    super(message, cause);
  }

}
_____________________________________________________________________________________

import java.io.*;

/**
 *
 * Title: Summary Base
 * Description: A generic interface that reads the document's summary information and returns
 * it through different methods
 * @author Sudhakar Chavali
 * @version 1.0
 */
public interface SummaryBase {
  /**
   * A method returns the Document's Author
   * @return String
   */
  public String getDocAuthor();

  /**
   * A method that returns the Document Created Date
   * @return String
   */
  public String getDocCreatedDate();

  /**
   * A method that returns the Document's Key words
   * @return String
   */
  public String getDocKeywords();

  /**
   * A method that returns the Document's comments
   * @return String
   */
  public String getDocComments();

  /**
   * A method that returns the Document Name
   * @return String
   */
  public String getDocName();

  /**
   * A method that returns the Document's Subject
   * @return String
   */
  public String getDocSubject();

  /**
   * A method that returns the Document's title
   */

  public String getDocTitle();

  /**
   * A method that reads the document's Summary Information
   * @throws DocumentParserException
   */
  public void read() throws DocumentParserException;

  /**
   * A method that writes the Document's summary information as an XML into the file
   * @param strXMLFile
   * @throws DocumentParserException
   */
  public void write(String strXMLFile) throws 
      DocumentParserException;

  /**
   * A method that writes the document's summary information as an XML into OutputStream Object
   * @param out
   * @throws DocumentParserException
   */
  public void write(OutputStream out) throws 
      DocumentParserException;

  /**
   * A method that returns the Document's summary as an XML String
   * @return String
   * @throws DocumentParserException
   */
  public String getSummaryAsXML() throws 
      DocumentParserException;

  /**
   * A method that returns document's summary information as normal text
   * @return String
   * @throws DocumentParserException
   */
  public String getSummaryAsText() throws 
      DocumentParserException;
}

______________________________________________________________________________________

import java.io.*;

/**
 * A generic document that reads the document's content and parses it into plain ASCII text
 * using the different methods.
 */
public interface Document {

  /**
   * A method that returns the document's text after parsing. This method should be called
   * after calling the read method
   * @return String
   * @see #read()
   * @throws DocumentParserException
   */
  public abstract String getText() throws 
      DocumentParserException;

  /**
   * A method that returns the parsed text as a byte array. This method should be called
   * after calling the read method
   * @return byte[]
   * @throws DocumentParserException
   */
  public abstract byte[] getBytes() throws 
      DocumentParserException;

  /**
   * A method that writes the parsed text into the OutputStream object. This method should be
   * called after calling the read method
   * @param out
   * @throws DocumentParserException
   */
  public abstract void write(OutputStream out) throws 
      DocumentParserException, Exception;

  /**
   * A method that reads and parses the document into Normal text
   * @throws DocumentParserException
   */
  public abstract void read() throws 
      DocumentParserException, Exception;

  /**
   * A method that returns the document summary information as Normal Text
   * @return String
   */
  public abstract String getDocumentSummaryAsText();

  /**
   * A method that returns the document summary information as an XML String
   * @return String
   */
  public abstract String getDocumentSummaryAsXML();

  /**
   * A method that returns the InputStream which contains the document text as normal text
   * @return InputStream
   * @throws DocumentParserException
   */
  public abstract InputStream getDocumentTextStream() throws 
      DocumentParserException;

  /**
   * A method that returns the InputStream which contains the document summary information as
   * an XML String
   * @return InputStream
   * @throws DocumentParserException
   */
  public abstract InputStream getDocumentSummaryStream() throws 
      DocumentParserException;

}


______________________________________________________________________________________

import java.util.Vector;

/**
 * <p>Title: PPTSlide</p>
 * <p>Description: A class that holds the Powerpoint slide</p>
 * @author Hari Shanker, Sudhakar Chavali
 * @version 1.0
 */
class PPTSlide {
  /**
   * Holds the Slide Number
   */
  protected long slideNumber;

  /**
   * Holds the contents of the Slide
   */
  protected Vector contents;

  /**
   * Initialise the Object for holding the contents of Power Point Slide
   * @param number
   */
  public PPTSlide(long number) {
    slideNumber = number;
    contents = new Vector();
  }

  /**
   * Add the Content of Slide to this Object
   * @param content
   */

  public void addContent(String content) {
    contents.addElement(content);
  }

  /**
   * returns the contents of slide as a vector object
   * @return Vector
   */
  public Vector getContent() {
    return contents;
  }

  /**
   * returns the slide value
   * @return long
   */

  public long getSlideNumber() {
    return slideNumber;
  }

}

_____________________________________________________________________________________





import java.io.*;

/**
 * <p>Title: Parsers</p>
 * <p>Description: Class that parses the Power Point Document Content and it's summary to text</p>
 * @author Sudhakar Chavali
 * @version 1.0
 */

public class PPTDocument
    implements Document {
  /**
   * Checks whether the summary information was already read or not
   */

  private boolean bRead = false;

  /**
   * Holds the Powerpoint Document Summary Content
   */
  private SummaryBase summary = null;

  /**
   * Holds the path of the Powerpoint Document
   */
  private String docName = null;

  /**
   * Holds the Powerpoint Document   Text
   */
  private String pptText = "";

  /**
   * Constructor for setting the PowerPoint document path for initialising the POI object
   * @param docName
   */
  public PPTDocument(String docName) {
    this.docName = docName;
    // MSDocumentSummary was not included with the sources posted in this thread
    summary = new MSDocumentSummary(docName);

  }

  /**
   * returns the parsed Powerpoint Document text
   * @return String
   */
  public String getText() {

    return pptText;

  }

  /**
   * returns the Powerpoint Document text as bytes
   * @return byte[]
   */
  public byte[] getBytes() {
    return getText().getBytes();
  }

  /**
   * Writes the Powerpoint Document Text into OutputStream object
   * @param out
   * @throws MSPowerPointDocumentParserException
   */
  public void write(OutputStream out) throws
      MSPowerPointDocumentParserException {
    try {
      out.write(getText().getBytes());
    }
    catch (Throwable _docError) {
      throw new MSPowerPointDocumentParserException(
          "Error Raised while writing the text into OutputStream Object \nError Cause : " +
          _docError, _docError);
    }

  }

  /**
   * Reads the Powerpoint document for getting the text from it.
   * @throws MSPowerPointDocumentParserException
   */
  public void read() throws MSPowerPointDocumentParserException {
    PPT2Text ppt2Text = new PPT2Text(docName);
    ppt2Text.read();
    pptText = ppt2Text.getText();

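    // Normalise every line ending (\r\n, \r or \n) to \r\n in a single pass,
    // so that an existing \r\n pair is not turned into two line breaks
    pptText = pptText.replaceAll("\r\n|\r|\n", "\r\n");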

  }

  /**
   * returns the document summary as tab delimited text
   * @return
   */
  public String getDocumentSummaryAsText() {
    try {
      if (!bRead) {
        summary.read(); //read the summary object
        bRead = true; //ensure that summary information was read only one time
      }
      return summary.getSummaryAsText(); //Build and return the Summary as Normal text

    }
    catch (Exception ex) {
      return "";
    }

  }

  /**
   * returns the document summary as xml
   * @return
   */
  public String getDocumentSummaryAsXML() {
    try {
      if (!bRead) {
        summary.read(); //read the summary information of a document
        bRead = true; //ensure that summary information was read only one time
      }
      return summary.getSummaryAsXML(); //build and return the summary as XML string
    }
    catch (Exception ex) {
      return "";
    }

  }

  /**
   * returns the document text as Stream Object
   * @return InputStream
   * @throws MSPowerPointDocumentParserException
   */
  public InputStream getDocumentTextStream() throws
      MSPowerPointDocumentParserException {
    try {
      // Wrap the document text in an InputStream object
      ByteArrayInputStream in = new ByteArrayInputStream(this.getBytes());
      return (InputStream) in; //return InputStream Object
    }
    catch (Throwable _documentError) {
      //EXCEPTION RAISED WHILE CREATING THE InputStream OBJECT
      throw new MSPowerPointDocumentParserException(
          "Unable to return the document text as an InputStream\n\tException Root :" +
          _documentError);
    }

  }

  /**
   * returns the document summary as stream object
   * @return InputStream
   * @throws MSPowerPointDocumentParserException
   */
  public InputStream getDocumentSummaryStream() throws
      MSPowerPointDocumentParserException {
    try {
      /*
       A method that reads the Document Summary Stream in  InputStream Object
       */
      // Wrap the summary information in an InputStream object
      ByteArrayInputStream in = new ByteArrayInputStream(
          getDocumentSummaryAsText().getBytes());
      return (InputStream) in; //return InputStream Object
    }
    catch (Throwable _documentError) {
      //error raised while creating the document summary info stream object
      // throw it
      throw new MSPowerPointDocumentParserException(
          "Unable to get Document Summary Information as Stream\n\tException Root:  " +
          _documentError);
    }

  }

}
________________________________________________________________________________________________



/**
 * <p>Title: PPTConstants</p>
 * <p>Description: A class that holds the Powerpoint constants</p>
 * @author Sudhakar Chavali
 * @version 1.0
 */

public interface PPTConstants {


  public static final int PPT_SLIDEPERSISTANT_ATOM = 1011;
  public static final int PPT_DIAGRAMGROUP_ATOM = 61448;
  public static final int PPT_TEXTCHAR_ATOM = 4000;
  public static final int PPT_TEXTBYTE_ATOM = 4008;
  public static final int PPT_USEREDIT_ATOM = 4085;
  public static final int PPT_MASTERSLIDE = 1024;
}
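
A note on the constants above: they are record type IDs, which the parser below reads from bytes 2-3 of each record, with the record size in bytes 4-7. A minimal sketch of reading such an 8-byte little-endian header with plain java.nio follows. The class name RecordHeaderSketch and the helper readHeader are hypothetical and not part of the posted code; the field layout is inferred from the offsets PPT2Text uses.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class RecordHeaderSketch {

    /** Hypothetical holder for the three fields of one record header. */
    static final class RecordHeader {
        final int info;    // bytes 0-1: what PPT2Text calls "rinfo"
        final int type;    // bytes 2-3: record type, e.g. 4008 or 4000
        final long length; // bytes 4-7: size of the record body in bytes

        RecordHeader(int info, int type, long length) {
            this.info = info;
            this.type = type;
            this.length = length;
        }
    }

    /** Reads an 8-byte header starting at offset; all fields are little-endian. */
    static RecordHeader readHeader(byte[] data, int offset) {
        ByteBuffer buf = ByteBuffer.wrap(data, offset, 8).order(ByteOrder.LITTLE_ENDIAN);
        int info = buf.getShort() & 0xFFFF;       // unsigned 16-bit
        int type = buf.getShort() & 0xFFFF;       // unsigned 16-bit
        long length = buf.getInt() & 0xFFFFFFFFL; // unsigned 32-bit
        return new RecordHeader(info, type, length);
    }

    public static void main(String[] args) {
        // Hand-built header: type 4008 (0x0FA8), body length 5.
        byte[] sample = {0x00, 0x00, (byte) 0xA8, 0x0F, 0x05, 0x00, 0x00, 0x00};
        RecordHeader h = readHeader(sample, 0);
        System.out.println("type=" + h.type + " length=" + h.length);
    }
}
```

If the header really is 8 bytes long, the record body starts at offset + 8, which may be worth checking against the "+ 6" offsets used in the loops below.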

_________________________________________________________________________________________________






/**
 * <p>Title: PPTClientTextBox</p>
 * <p>Description: A class that holds the Powerpoint Client Text box content</p>
 * @author Hari Shanker, Sudhakar Chavali
 * @version 1.0
 */

import java.util.*;
import java.io.*;

public class PPTClientTextBox {
  /**
   * Holds the current id of a client text box
   */
  protected long currentID;

  /**
   * holds the content of client text box
   */
  protected String content;

  /**
   * Instantiates the client text box object
   * @param number
   */
  public PPTClientTextBox(long number) {
    currentID = number;
    this.content = "";
  }

  /**
   * Instantiates the client text box object
   * @param number
   * @param content
   */
  public PPTClientTextBox(long number, String content) {
    currentID = number;
    this.content = content;
  }

  /**
   * Sets the content of a client text box
   * @param content
   */
  public void setContent(String content) {
    this.content = content;
  }

  /**
   * returns the content of a client text box
   * @return
   */

  public String getContent() {
    return content;
  }

  /**
   * returns the current client text box id
   * @return long
   */
  public long getID() {
    return currentID;
  }
}
_________________________________________________________________________________________________



import java.io.*;
import java.util.*;
import org.apache.poi.hpsf.*;
import org.apache.poi.poifs.eventfilesystem.*;
import org.apache.poi.util.HexDump;
import org.apache.poi.util.LittleEndian;
import org.apache.poi.hdf.extractor.*;

/**
 * <p>Title: PPT2Text</p>
 * <p>Description: A class that parses the Powerpoint document content to text </p>
 * @author Hari Shanker, Sudhakar Chavali
 * @version 1.0
 */

public class PPT2Text
    implements PPTConstants {
  /**
   *
   * <p>Title: PPTListener</p>
   * <p>Description:  Class that used to handle the Power Point Events</p>
   * @author Hari Shanker,Sudhakar Chavali
   * @version 1.0
   */

  class PPTListener
      implements POIFSReaderListener {

    public void processPOIFSReaderEvent(POIFSReaderEvent event) {

      try {

        org.apache.poi.poifs.filesystem.DocumentInputStream _documentStream = null;

        // Checking for PowerPoint Document Stream
        if (!event.getName().startsWith("PowerPoint Document")) {
          return;
        }

        _documentStream = event.getStream();

        byte pptdata[] = new byte[_documentStream.available()];
        _documentStream.read(pptdata, 0, _documentStream.available());

        int sNum = 0;

        long offset = 0, offsetEnd = 0;
        long offsetPD = 0, oldoffsetPD = 0, docRef = 0, maxPresist = 0;

        // Traverse Bytearray to get CurrentUserEditAtom

        // Call to extract the Text in all PlaceHolders

        // To hold PPTClientTextBox objects for mapping into Slide Objects
        java.util.Hashtable _containerClientTextBox = new java.util.Hashtable();

        // Traverse ByteArray to identify edit paths of ClientTextBoxes
        for (long i = 0; i < pptdata.length - 20; i++) {
          long type = LittleEndian.getUShort(pptdata, (int) i + 2);
          long size = LittleEndian.getUInt(pptdata, (int) i + 4);

          if (type == PPT_USEREDIT_ATOM) { // Checking the Record Header (UserEditAtom)
            long lastSlideID = LittleEndian.getInt(pptdata, (int) i + 8);
            long version = LittleEndian.getUInt(pptdata, (int) i + 12);
            offset = LittleEndian.getUInt(pptdata, (int) i + 16);
            offsetPD = LittleEndian.getUInt(pptdata,
                                            (int) i + 20);

            // Call to extract ClientTextBox text in each UserEditAtom
            _containerClientTextBox = extractClientTextBoxes(
                _containerClientTextBox, offset, pptdata, offsetPD);

          }
        }

        Vector slides = extractPlaceHoders(offset, pptdata, offsetPD);

        if (slides.size() == 0) {
          slides.addElement(new PPTSlide(256));
        }

        PPTSlide _slide = (PPTSlide) slides.get(slides.size() - 1);

        // "enum" is a reserved word from Java 5 onwards, so the loop variable is named "boxes"
        for (Enumeration boxes = _containerClientTextBox.elements();
             boxes.hasMoreElements(); ) {
          PPTClientTextBox _clientTextBox = (PPTClientTextBox) boxes.nextElement();
          _slide.addContent(_clientTextBox.getContent());

        }

        //Merging ClientTextBox data with Slide Data

        // Printing the text from Slides vector object (need further modification)

        for (int i = 0; i < slides.size(); i++) {

          _slide = (PPTSlide) slides.get(i);

          Vector scontent = _slide.getContent();
//          StringBuffer _stringbuffer = new StringBuffer();
          for (int j = 0; j < scontent.size(); j++) {
            pptTextBuffer.append(scontent.get(j).toString());
          }

        }

      }
      catch (Throwable ex) {

        return;
      }
    }
  }

  /**
   * Method that returns the client text boxes of a slide
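   * @param _containerClientTextBox table of client text boxes collected so far, keyed by group ID
   * @param offset position in pptdata at which scanning starts
   * @param pptdata the bytes of the PowerPoint Document stream
   * @param offsetPD offset of the persist directory; scanning stops before it
   * @return Hashtable
   * @throws Throwable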
   */

  public java.util.Hashtable extractClientTextBoxes(java.util.Hashtable
      _containerClientTextBox, long offset, byte[] pptdata, long offsetPD) throws
      Throwable {

    // To hold temporary data
    ByteArrayOutputStream _outStream = new ByteArrayOutputStream();

    PPTClientTextBox _clientTextBox = null;

    // Traversing the byte array up to the persist directory position
    for (long i = offset; i < offsetPD - 20; i++) {

      try {

        long rinfo = LittleEndian.getUShort(pptdata, (int) i);
        long type = LittleEndian.getUShort(pptdata, (int) i + 2); // Record Type
        long size = LittleEndian.getUInt(pptdata, (int) i + 4); // Record Size

        if (type == PPT_DIAGRAMGROUP_ATOM) { //Record type is of Drawing Group

          // Total number of objects
          long shapeCount = LittleEndian.getUInt(pptdata, (int) i + 8);
          // Group ID + number of objects
          long _currentID = LittleEndian.getInt(pptdata, (int) i + 12);

          _currentID = ( (int) (_currentID / 1024)) * 1024;

          if (_currentID == PPT_MASTERSLIDE) { // Ignore Master Slide objects
            i++;
            continue;
          }

          //Check for the ClientTextBox GroupID existence
          if (!_containerClientTextBox.containsKey(new Long(_currentID))) {
            _clientTextBox = new PPTClientTextBox(_currentID);
            _containerClientTextBox.put(new Long(_currentID), _clientTextBox);
          }
          else {
            // If exists get Client Textbox Group
            _clientTextBox = (PPTClientTextBox) _containerClientTextBox.get(new
                Long(_currentID));
            _clientTextBox.setContent("");
          }

          // Iterating the bytearray for TextCharAtoms and TextBytesAtom
          for (long j = i + 8; j < offsetPD - 20; j++) {
            try {
              long nrinfo = LittleEndian.getUShort(pptdata, (int) j);
              long ntype = LittleEndian.getUShort(pptdata, (int) j + 2); //Record Type
              long nsize = LittleEndian.getUInt(pptdata, (int) j + 4); // Record size

              if (ntype == PPT_DIAGRAMGROUP_ATOM) { // Break the loop if next GroupID found
                i = j - 1;
                break;
              }
              else if (ntype == PPT_TEXTBYTE_ATOM) { //TextByteAtom record

                _outStream = new ByteArrayOutputStream();
                long ii = 0;
                for (ii = j + 6; ii <= j + 6 + nsize; ii++) { // TODO: move this loop into a function
                  short ch = Utils.convertBytesToShort(pptdata, (int) ii + 2);
                  if (ch == 0 || ch == 16 || ch == 13 || ch == 10) {
                    _outStream.write( (byte) '\r');

                  }
                  else if (ch == 0x201c) { // for left double quote
                    _outStream.write( (byte) 147);
                  }
                  else if (ch == 0x201d) { // for right double quote
                    _outStream.write( (byte) 148);

                  }
                  else if (ch == 0x2019) { // for right single quote
                    _outStream.write( (byte) 146);
                  }
                  else if (ch == 0x2018) { // for left single quote
                    _outStream.write( (byte) 145);
                  }
                  else if (ch == 0x2013) { // for '-' character
                    _outStream.write( (byte) 150);
                  }
                  else {
                    _outStream.write( (byte) ch);
                  }
                }

                // Setting the identified text for Current groupID
                _clientTextBox.setContent(_clientTextBox.getContent() +
                                          new String(_outStream.toByteArray()));

              }
              else if (ntype == PPT_TEXTCHAR_ATOM) { // TextCharAtom record
                _outStream = new ByteArrayOutputStream();
                String strTempContent = new String(pptdata, (int) j + 6,
                    (int) (nsize) + 2);
                byte bytes[] = strTempContent.getBytes();

                for (int ii = 0; ii < bytes.length - 1; ii += 2) { // TODO: move this loop into a function
                  short ch = Utils.convertBytesToShort(bytes, ii);
                  if (ch == 0 || ch == 16 || ch == 13 || ch == 10) {
                    _outStream.write( (byte) '\r');

                  }

                  else if (ch == 0x201c) {
                    _outStream.write( (byte) 147);

                  }
                  else if (ch == 0x201d) {
                    _outStream.write( (byte) 148);

                  }
                  else if (ch == 0x2019) {
                    _outStream.write( (byte) 146);
                  }
                  else if (ch == 0x2018) {
                    _outStream.write( (byte) 145);
                  }
                  else if (ch == 0x2013) { // for - character
                    _outStream.write( (byte) 150);
                  }

                  else {
                    _outStream.write( (byte) ch);
                  }
                }

                // Setting the identified text for Current groupID
                _clientTextBox.setContent(_clientTextBox.getContent() +
                                          new String(_outStream.toByteArray()));
              }
            }
            catch (Throwable e) {
              break;
            }
          }
        }
      }
      catch (Throwable ee) {
        return _containerClientTextBox;
      }
    }
    return _containerClientTextBox;
  }

  /**
   * Method that returns the Powerpoint place holders
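   * @param offset position in pptdata at which scanning starts
   * @param pptdata the bytes of the PowerPoint Document stream
   * @param offsetPD offset of the persist directory (not used by this method)
   * @return Vector
   * @throws Throwable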
   */

  public Vector extractPlaceHoders(long offset, byte[] pptdata, long offsetPD) throws
      Throwable {

    int sNum = 0;

    Vector slides = new Vector(); // Holds the data of all slides

    PPTSlide currentSlide = null; // Object to hold current slide data

    // To store data found in TextCharAtoms and TextBytesAtoms
    ByteArrayOutputStream _outStream = new ByteArrayOutputStream();

    for (long i = offset; i < pptdata.length - 20; i++) {
      try {
        long rinfo = LittleEndian.getUShort(pptdata, (int) i);
        long type = LittleEndian.getUShort(pptdata, (int) i + 2);
        long size = LittleEndian.getUInt(pptdata, (int) i + 4);

        if (type == PPT_TEXTBYTE_ATOM) { //TextByteAtom record

          _outStream = new ByteArrayOutputStream();
          long ii = 0;
          for (ii = i + 6; ii <= i + 6 + size; ii++) {
            short ch = Utils.convertBytesToShort(pptdata, (int) ii + 2);

            if (ch == 0 || ch == 16 || ch == 13 || ch == 10) {
              _outStream.write( (byte) '\r');

            }
            else if (ch == 0x201c) { // for left double quote
              _outStream.write( (byte) 147);
            }
            else if (ch == 0x201d) { // for right double quote
              _outStream.write( (byte) 148);
            }
            else if (ch == 0x2019) { // for right single quote
              _outStream.write( (byte) 146);
            }
            else if (ch == 0x2018) { // for left single quote
              _outStream.write( (byte) 145);
            }
            else if (ch == 0x2013) { // for '-' character
              _outStream.write( (byte) 150);
            }
            else {
              _outStream.write(ch);
            }
          }

          // Setting the identified text for Current Slide
          currentSlide.addContent(_outStream.toString());

        }
        else if (type == PPT_TEXTCHAR_ATOM) { //TextCharAtom record
          _outStream = new ByteArrayOutputStream();
          String strTempContent = new String(pptdata, (int) i + 6,
                                             (int) (size) + 2);
          byte bytes[] = strTempContent.getBytes();

          for (int ii = 0; ii < bytes.length - 1; ii += 2) {
            short ch = Utils.convertBytesToShort(bytes, ii);

            if (ch == 0 || ch == 16 || ch == 13 || ch == 10) {
              _outStream.write( (byte) '\r');
            }
            else if (ch == 0x201c) {
              _outStream.write( (byte) 147);
            }
            else if (ch == 0x201d) {
              _outStream.write( (byte) 148);
            }
            else if (ch == 0x2019) {
              _outStream.write( (byte) 146);
            }
            else if (ch == 0x2018) {
              _outStream.write( (byte) 145);
            }
            else if (ch == 0x2013) { // for - character
              _outStream.write( (byte) 150);
            }

            else {
              _outStream.write( (byte) ch);
            }
          }

          // Setting the identified text for Current Slide
          currentSlide.addContent(_outStream.toString());
        }
        else if (type == PPT_SLIDEPERSISTANT_ATOM) { // SlidePersistAtom record type
          if (sNum != 0) {
            _outStream = new ByteArrayOutputStream();

            long slideID = LittleEndian.getUInt(pptdata,
                                                (int) i + 20);

            currentSlide = new PPTSlide(slideID);
            //currentSlide.addContent(_outStream.toString());
            slides.addElement(currentSlide);
          }
          sNum++;
        }
        else
        if (type == PPT_DIAGRAMGROUP_ATOM) { //DG
          break;
        }
      }
      catch (Throwable ee) {

      }

      /*******************************************************************/

    }

    return slides;
  }

  /**
   * Constructor that takes a Powerpoint document name as an argument for getting the text
   * @param fileName
   */
  public PPT2Text(String fileName) {
    this.docName = fileName;
  }

  /**
   * Method that reads the Powerpoint document for parsing the text
   * @throws MSPowerPointDocumentParserException
   */
  public void read() throws MSPowerPointDocumentParserException {
    try {
      POIFSReader reader = new POIFSReader();
      reader.registerListener(new PPTListener());
      reader.read(new FileInputStream(docName));
    }
    catch (Throwable _docError) {
      throw new MSPowerPointDocumentParserException(
          "Unable to read the PPT Document \nError Cause : " + _docError,
          _docError);
    }

  }

  /**
   * returns the PowerPoint text
   * @return String
   */
  public String getText() {

    return pptTextBuffer.toString();
  }

  /**
   * Holds the Powerpoint document name
   */
  private String docName;

  /**
   * Holds the parsed Powerpoint Text
   */
  private StringBuffer pptTextBuffer = new StringBuffer();

}
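
On the encoding problem reported earlier in this thread: the loops above narrow each 16-bit character to a single byte and then build a String with the platform default charset, so the result depends on the machine (MacRoman in Ralph's case). A minimal sketch of decoding the two text record bodies with explicit charsets instead follows. It assumes a JDK with java.nio.charset.StandardCharsets (Java 7 or later), that a type 4000 body is UTF-16LE, and that a type 4008 body holds one byte per character, the low byte of a code unit whose high byte is zero. That matches my reading of the file format but should be checked against the format documentation. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class TextAtomDecodeSketch {

    /** Decodes a type 4000 body: two bytes per character, little-endian. */
    static String decodeTextChars(byte[] data, int start, int length) {
        return new String(data, start, length, StandardCharsets.UTF_16LE);
    }

    /**
     * Decodes a type 4008 body: one byte per character. Treating each byte
     * as code point 0x00-0xFF is exactly what ISO-8859-1 decoding does.
     */
    static String decodeTextBytes(byte[] data, int start, int length) {
        return new String(data, start, length, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] wide = {(byte) 0xFC, 0x00, 0x62, 0x00}; // u-umlaut, 'b' as UTF-16LE
        byte[] narrow = {(byte) 0xE4, 0x68};           // a-umlaut, 'h' as single bytes
        System.out.println(decodeTextChars(wide, 0, wide.length));
        System.out.println(decodeTextBytes(narrow, 0, narrow.length));
    }
}
```

Decoding straight to a Java String also makes the special cases for 0x201c, 0x201d, 0x2018, 0x2019 and 0x2013 unnecessary, since those characters survive as themselves. The caller then picks the output encoding when writing, for example getBytes with an explicit charset.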
_________________________________________________________________________________________________







/**
 * Title: MSPowerPointDocumentParserException
 * Description: An exception class used for reporting the runtime errors raised by the
 * Powerpoint Document class
 * @author Sudhakar Chavali
 * @see microsoft.powerpoint.PowerPointDocument
 * @version 1.0
 */

public class MSPowerPointDocumentParserException
    extends DocumentParserException {

  /**
   * A constructor that builds the Exception object
   * @param message
   */
  public MSPowerPointDocumentParserException(String message) {
    super(message);
  }

  /**
   * A constructor that builds the Exception object
   * @param message
   * @param cause
   */

  public MSPowerPointDocumentParserException(String message, Throwable cause) {
    super(message, cause);
  }

}





=====
"No one can earn a million dollars honestly."- William Jennings Bryan (1860-1925) 

"Make everything as simple as possible, but not simpler."- Albert Einstein (1879-1955)

"It is dangerous to be sincere unless you are also stupid."- George Bernard Shaw (1856-1950)


		


RE: MSPowerPointExtractor problem

Posted by "Koundinya (Sudhakar Chavali)" <su...@yahoo.com>.

Hello All,

My first contribution to the Jakarta team was
http://wiki.apache.org/jakarta-lucene-data/attachments/PowerPoint/attachments/PPT2Text.java,
and it seems another expert (Ryan Rhodes, ryanshaerhodes@hotmail.com) has already started
building on it.

That sounds great to me.

To help speed up development of the Powerpoint extractor, I would like to contribute our
team's work on it.

Authors: Sudhakar Chavali (sudhakar_koundinya@yahoo.com) and Hari Shanker Goud
(harigoud@yahoo.com)


Please have a look at the source code below.


Regards
Sudhakar

____________________________________________________________________________________

/**
 * Title: DocumentParserException class
 * Description: The root exception class for the runtime errors that can be raised
 * by the different parsers
 * @author Sudhakar
 * @version 1.0
 */

public class DocumentParserException
    extends Exception {

  /**
   * Constructs a new exception with null as its detail message.
   */

  public DocumentParserException() {
  }

  /**
   * Constructs a new exception with the specified detail message.
   * @param message
   */

  public DocumentParserException(String message) {
    super(message);
  }

  /**
   * Constructs a new exception with the specified detail message.
   * @param message
   * @param cause
   */
  public DocumentParserException(String message, Throwable cause) {
    super(message, cause);
  }

}
_____________________________________________________________________________________

import java.io.*;

/**
 *
 * Title: Summary Base
 * Description: A generic interface that reads the document's summary information and returns it
 * through different methods
 * @author Sudhakar Chavali
 * @version 1.0
 */
public interface SummaryBase {
  /**
   * A method that returns the Document's Author
   * @return String
   */
  public String getDocAuthor();

  /**
   * A method that returns the Document Created Date
   * @return String
   */
  public String getDocCreatedDate();

  /**
   * A method that returns the Document's Key words
   * @return String
   */
  public String getDocKeywords();

  /**
   * A method that returns the Document's comments
   * @return String
   */
  public String getDocComments();

  /**
   * A method that returns the Document Name
   * @return String
   */
  public String getDocName();

  /**
   * A method that returns the Document's Subject
   * @return String
   */
  public String getDocSubject();

  /**
   * A method that returns the Document's title
   */

  public String getDocTitle();

  /**
   * A method that reads the document's Summary Information
   * @throws DocumentParserException
   */
  public void read() throws DocumentParserException;

  /**
   * A method that writes the Document's summary information as an XML into the file
   * @param strXMLFile
   * @throws DocumentParserException
   */
  public void write(String strXMLFile) throws 
      DocumentParserException;

  /**
   * A method that writes the document's summary information as an XML into OutputStream Object
   * @param out
   * @throws DocumentParserException
   */
  public void write(OutputStream out) throws 
      DocumentParserException;

  /**
   * A method that returns the Document's summary as an XML String
   * @return String
   * @throws DocumentParserException
   */
  public String getSummaryAsXML() throws 
      DocumentParserException;

  /**
   * A method that returns document's summary information as normal text
   * @return String
   * @throws DocumentParserException
   */
  public String getSummaryAsText() throws 
      DocumentParserException;
}

______________________________________________________________________________________

import java.io.*;

/**
 * A generic document that reads the document's text and parses it into normal ASCII text using
 * the different methods.
 */
public interface Document {

  /**
   * A method that returns the document's text after parsing. This method should be called after
   * calling the read method
   * @return String
   * @see #read()
   * @throws DocumentParserException
   */
  public abstract String getText() throws 
      DocumentParserException;

  /**
   * A method that returns the parsed text as byte array. This method should be called after
   * calling the read method
   * @return byte[]
   * @throws DocumentParserException
   */
  public abstract byte[] getBytes() throws 
      DocumentParserException;

  /**
   * A method that writes the parsed text into the OutputStream object. This method should be
   * called after calling the read method
   * @param out
   * @throws DocumentParserException
   */
  public abstract void write(OutputStream out) throws 
      DocumentParserException, Exception;

  /**
   * A method that reads and parses the document into Normal text
   * @throws DocumentParserException
   */
  public abstract void read() throws 
      DocumentParserException, Exception;

  /**
   * A method that returns the document summary information as Normal Text
   * @return String
   */
  public abstract String getDocumentSummaryAsText();

  /**
   * A method that returns the document summary information as an XML String
   * @return String
   */
  public abstract String getDocumentSummaryAsXML();

  /**
   * A method that returns the InputStream which contains the document text as normal text
   * @return InputStream
   * @throws DocumentParserException
   */
  public abstract InputStream getDocumentTextStream() throws 
      DocumentParserException;

  ;

  /**
   * A method that returns the InputStream which contains the document summary information as an
   * XML String
   * @return InputStream
   * @throws DocumentParserException
   */
  public abstract InputStream getDocumentSummaryStream() throws 
      DocumentParserException;

  ;

}


______________________________________________________________________________________

/**
 * <p>Title: PPTSlide</p>
 * <p>Description: A class that holds the Powerpoint slide</p>
 * @author Hari Shanker, Sudhakar Chavali
 * @version 1.0
 */

import java.util.*;
import java.io.*;

class PPTSlide {
  /**
   * Holds the Slide Number
   */
  protected long slideNumber;

  /**
   * Holds the contents of the Slide
   */
  protected Vector contents;

  /**
   * Initialise the Object for holding the contents of Power Point Slide
   * @param number
   */
  public PPTSlide(long number) {
    slideNumber = number;
    contents = new Vector();
  }

  /**
   * Add the Content of Slide to this Object
   * @param content
   */

  public void addContent(String content) {
    contents.addElement(content);
  }

  /**
   * returns the contents of slide as a vector object
   * @return Vector
   */
  public Vector getContent() {
    return contents;
  }

  /**
   * returns the slide value
   * @return long
   */

  public long getSlideNumber() {
    return slideNumber;
  }

}

_____________________________________________________________________________________





import java.io.OutputStream;
import java.io.InputStream;
import java.io.*;

/**
 * <p>Title: Parsers</p>
 * <p>Description: Class that parses the Power Point Document Content and its summary to text</p>
 * @author Sudhakar Chavali
 * @version 1.0
 */

public class PPTDocument
    implements Document {
  /**
   * Checks whether the summary information was already read or not
   */

  private boolean bRead = false;

  /**
   * Holds the Powerpoint Document Summary Content
   */
  private SummaryBase summary = null;

  /**
   * Holds the Powerpoint document name (path)
   */
  private String docName = null;

  /**
   * Holds the Powerpoint Document   Text
   */
  private String pptText = "";

  /**
   * Constructor for setting  the PowerPoint document path for initialising the POI object
   * @param docName
   */
  public PPTDocument(String docName) {
    this.docName = docName;
    summary = new MSDocumentSummary(docName);

  }

  /**
   * returns the parsed Powerpoint Document text
   * @return String
   */
  public String getText() {

    return pptText;

  }

  /**
   * returns the Powerpoint Document text as bytes
   * @return byte[]
   */
  public byte[] getBytes() {
    return getText().getBytes();
  }

  /**
   * Writes the Powerpoint Document Text into OutputStream object
   * @param out
   * @throws MSPowerPointDocumentParserException
   */
  public void write(OutputStream out) throws
      MSPowerPointDocumentParserException {
    try {
      out.write(getText().getBytes());
    }
    catch (Throwable _docError) {
      throw new MSPowerPointDocumentParserException(
          "Error Raised while writing the text into OutputStream Object \nError Cause : " +
          _docError, _docError);
    }

  }

  /**
   * Reads the Powerpoint document for getting the text from it.
   * @throws MSPowerPointDocumentParserException
   */
  public void read() throws MSPowerPointDocumentParserException {
    PPT2Text ppt2Text = new PPT2Text(docName);
    ppt2Text.read();
    pptText = ppt2Text.getText();

    pptText = pptText.replaceAll("\r", "\n");
    pptText = pptText.replaceAll("\n", "\r\n");

  }

  /**
   * returns the document summary as tab delimited text
   * @return
   */
  public String getDocumentSummaryAsText() {
    try {
      if (!bRead) {
        summary.read(); //read the summary object
        bRead = true; //ensure that summary information was read only one time
      }
      return summary.getSummaryAsText(); //Build and return the Summary as Normal text

    }
    catch (Exception ex) {
      return "";
    }

  }

  /**
   * returns the document summary as xml
   * @return
   */
  public String getDocumentSummaryAsXML() {
    try {
      if (!bRead) {
        summary.read(); //read the summary information of a document
        bRead = true; //ensure that summary information was read only one time
      }
      return summary.getSummaryAsXML(); //build and return the summary as XML string
    }
    catch (Exception ex) {
      return "";
    }

  }

  /**
   * returns the document text as Stream Object
   * @return InputStream
   * @throws MSPowerPointDocumentParserException
   */
  public InputStream getDocumentTextStream() throws
      MSPowerPointDocumentParserException {
    try {
      // Wrap the document text in an InputStream object
      ByteArrayInputStream in = new ByteArrayInputStream(this.getBytes());
      return (InputStream) in; //return InputStream Object
    }
    catch (Throwable _documentError) {
      //EXCEPTION RAISED WHILE CREATING THE InputStream OBJECT
      throw new MSPowerPointDocumentParserException(
          "Unable to return the document text as an InputStream\n\tException Root :" +
          _documentError);
    }

  }

  /**
   * returns the document summary as stream object
   * @return InputStream
   * @throws MSPowerPointDocumentParserException
   */
  public InputStream getDocumentSummaryStream() throws
      MSPowerPointDocumentParserException {
    try {
      /*
       A method that reads the Document Summary Stream in  InputStream Object
       */
      // Wrap the summary information in an InputStream object
      ByteArrayInputStream in = new ByteArrayInputStream(
          getDocumentSummaryAsText().getBytes());
      return (InputStream) in; //return InputStream Object
    }
    catch (Throwable _documentError) {
      //error raised while creating the document summary info stream object
      // throw it
      throw new MSPowerPointDocumentParserException(
          "Unable to get Document Summary Information as Stream\n\tException Root:  " +
          _documentError);
    }

  }

}
________________________________________________________________________________________________



/**
 * <p>Title: PPTConstants</p>
 * <p>Description: A class that holds the Powerpoint constants</p>
 * @author Sudhakar Chavali
 * @version 1.0
 */

public interface PPTConstants {


  public static final int PPT_SLIDEPERSISTANT_ATOM = 1011;
  public static final int PPT_DIAGRAMGROUP_ATOM = 61448;
  public static final int PPT_TEXTCHAR_ATOM = 4000;
  public static final int PPT_TEXTBYTE_ATOM = 4008;
  public static final int PPT_USEREDIT_ATOM = 4085;
  public static final int PPT_MASTERSLIDE = 1024;
}
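
[Editor's sketch, not part of the original post] The constants above are record type codes. The PPT2Text listing further down reads a 16-bit value at the record offset, a 16-bit type at offset + 2 and a 32-bit size at offset + 4, all little-endian. A minimal sketch of that header read, assuming an 8-byte header laid out that way; the class name RecordHeaderSketch and its fields are hypothetical, and it uses java.nio instead of the POI LittleEndian helper so that it stands alone:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not part of the posted sources.
final class RecordHeaderSketch {
    static final int HEADER_SIZE = 8; // assumed: 2 + 2 + 4 bytes

    final int verAndInstance; // first 16 bits (called "rinfo" in the listing)
    final int type;           // record type, e.g. 4000 or 4008
    final long length;        // record size as an unsigned 32-bit value

    RecordHeaderSketch(int verAndInstance, int type, long length) {
        this.verAndInstance = verAndInstance;
        this.type = type;
        this.length = length;
    }

    // Reads one header starting at the given offset of the stream bytes.
    static RecordHeaderSketch parse(byte[] data, int offset) {
        if (offset < 0 || data.length - offset < HEADER_SIZE) {
            throw new IllegalArgumentException("no complete header at offset " + offset);
        }
        ByteBuffer buf = ByteBuffer.wrap(data, offset, HEADER_SIZE)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        int verAndInstance = buf.getShort() & 0xFFFF; // mask to keep the value unsigned
        int type = buf.getShort() & 0xFFFF;
        long length = buf.getInt() & 0xFFFFFFFFL;
        return new RecordHeaderSketch(verAndInstance, type, length);
    }

    public static void main(String[] args) {
        // Hand-made header: type 0x0FA8 (4008), length 5.
        byte[] sample = {0x00, 0x00, (byte) 0xA8, 0x0F, 0x05, 0x00, 0x00, 0x00};
        RecordHeaderSketch h = parse(sample, 0);
        System.out.println("type=" + h.type + " length=" + h.length);
    }
}
```

Masking after getShort() and getInt() matters because Java has no unsigned integer types: without it a type such as 61448 would come back negative.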

_________________________________________________________________________________________________






/**
 * <p>Title: PPTClientTextBox</p>
 * <p>Description:  A class that holds the Powerpoint Client Text box content</p>
 * @author Hari Shanker, Sudhakar Chavali
 * @version 1.0
 */

import java.util.*;
import java.io.*;

public class PPTClientTextBox {
  /**
   * Holds the current id of a client text box
   */
  protected long currentID;

  /**
   * holds the content of client text box
   */
  protected String content;

  /**
   * Instantiates the client text box object
   * @param number
   */
  public PPTClientTextBox(long number) {
    currentID = number;
    this.content = "";
  }

  /**
   * Instantiates the client text box object
   * @param number
   * @param content
   */
  public PPTClientTextBox(long number, String content) {
    currentID = number;
    this.content = content;
  }

  /**
   * Sets the content of a client text box
   * @param content
   */
  public void setContent(String content) {
    this.content = content;
  }

  /**
   * returns the content of a client text box
   * @return
   */

  public String getContent() {
    return content;
  }

  /**
   * returns the current client text box id
   * @return long
   */
  public long getID() {
    return currentID;
  }
}
_________________________________________________________________________________________________



import java.io.*;
import java.util.*;
import org.apache.poi.hpsf.*;
import org.apache.poi.poifs.eventfilesystem.*;
import org.apache.poi.util.HexDump;
import org.apache.poi.util.LittleEndian;
import org.apache.poi.hdf.extractor.*;

/**
 * <p>Title: PPT2Text</p>
 * <p>Description: A class that parses the Powerpoint document content to text </p>
 * @author Hari Shanker, Sudhakar Chavali
 * @version 1.0
 */

public class PPT2Text
    implements PPTConstants {
  /**
   *
   * <p>Title: PPTListener</p>
   * <p>Description:  Class used to handle the Power Point events</p>
   * @author Hari Shanker,Sudhakar Chavali
   * @version 1.0
   */

  class PPTListener
      implements POIFSReaderListener {

    public void processPOIFSReaderEvent(POIFSReaderEvent event) {

      try {

        org.apache.poi.poifs.filesystem.DocumentInputStream _documentStream = null;

        // Checking for PowerPoint Document Stream
        if (!event.getName().startsWith("PowerPoint Document")) {
          return;
        }

        _documentStream = event.getStream();

        byte pptdata[] = new byte[_documentStream.available()];
        _documentStream.read(pptdata, 0, _documentStream.available());

        int sNum = 0;

        long offset = 0, offsetEnd = 0;
        long offsetPD = 0, oldoffsetPD = 0, docRef = 0, maxPresist = 0;

        // Traverse Bytearray to get CurrentUserEditAtom

        // Call to extract the Text in all PlaceHolders

        // To hold PPTClientTextBox objects for mapping into Slide Objects
        java.util.Hashtable _containerClientTextBox = new java.util.Hashtable();

        // Traverse ByteArray to identify edit paths of ClientTextBoxes
        for (long i = 0; i < pptdata.length - 20; i++) {
          long type = LittleEndian.getUShort(pptdata, (int) i + 2);
          long size = LittleEndian.getUInt(pptdata, (int) i + 4);

          if (type == PPT_USEREDIT_ATOM) { // Checking the Record Header (UserEditAtom)
            long lastSlideID = LittleEndian.getInt(pptdata, (int) i + 8);
            long version = LittleEndian.getUInt(pptdata, (int) i + 12);
            offset = LittleEndian.getUInt(pptdata, (int) i + 16);
            offsetPD = LittleEndian.getUInt(pptdata,
                                            (int) i + 20);

            // Call to extract ClientTextBox text in each UserEditAtom
            _containerClientTextBox = extractClientTextBoxes(
                _containerClientTextBox, offset, pptdata, offsetPD);

          }
        }

        Vector slides = extractPlaceHoders(offset, pptdata, offsetPD);

        if (slides.size() == 0) {
          slides.addElement(new PPTSlide(256));
        }

        PPTSlide _slide = (PPTSlide) slides.get(slides.size() - 1);

        // "enum" is a reserved word since Java 5, so the variable is named "boxes"
        for (Enumeration boxes = _containerClientTextBox.elements();
             boxes.hasMoreElements(); ) {
          PPTClientTextBox _clientTextBox = (PPTClientTextBox) boxes.nextElement();
          _slide.addContent(_clientTextBox.getContent());

        }

        //Merging ClientTextBox data with Slide Data

        // Printing the text from Slides vector object (need further modification)

        for (int i = 0; i < slides.size(); i++) {

          _slide = (PPTSlide) slides.get(i);

          Vector scontent = _slide.getContent();
//          StringBuffer _stringbuffer = new StringBuffer();
          for (int j = 0; j < scontent.size(); j++) {
            pptTextBuffer.append(scontent.get(j).toString());
          }

        }

      }
      catch (Throwable ex) {

        return;
      }
    }
  }

  /**
   * Method that returns the client text boxes of a slide
   * @param _containerClientTextBox
   * @param offset
   * @param pptdata
   * @param offsetPD
   * @return Hashtable
   * @throws Throwable
   */

  public java.util.Hashtable extractClientTextBoxes(java.util.Hashtable
      _containerClientTextBox, long offset, byte[] pptdata, long offsetPD) throws
      Throwable {

    //To hold temporary data
    ByteArrayOutputStream _outStream = new ByteArrayOutputStream();

    PPTClientTextBox _clientTextBox = null;

    // Traversing the bytearray up to the Persist directory position
    for (long i = offset; i < offsetPD - 20; i++) {

      try {

        long rinfo = LittleEndian.getUShort(pptdata, (int) i);
        long type = LittleEndian.getUShort(pptdata, (int) i + 2); // Record Type
        long size = LittleEndian.getUInt(pptdata, (int) i + 4); // Record Size

        if (type == PPT_DIAGRAMGROUP_ATOM) { //Record type is of Drawing Group

          long shapeCount = LittleEndian.getUInt(pptdata, (int) i + 8); // Total number of objects
          long _currentID = LittleEndian.getInt(pptdata, (int) i + 12); // Group ID + number of objects

          _currentID = ( (int) (_currentID / 1024)) * 1024;

          if (_currentID == PPT_MASTERSLIDE) { // Ignore Master Slide objects
            i++;
            continue;
          }

          //Check for the ClientTextBox GroupID existence
          if (!_containerClientTextBox.containsKey(new Long(_currentID))) {
            _clientTextBox = new PPTClientTextBox(_currentID);
            _containerClientTextBox.put(new Long(_currentID), _clientTextBox);
          }
          else {
            // If exists get Client Textbox Group
            _clientTextBox = (PPTClientTextBox) _containerClientTextBox.get(new
                Long(_currentID));
            _clientTextBox.setContent("");
          }

          // Iterating the bytearray for TextCharAtoms and TextBytesAtom
          for (long j = i + 8; j < offsetPD - 20; j++) {
            try {
              long nrinfo = LittleEndian.getUShort(pptdata, (int) j);
              long ntype = LittleEndian.getUShort(pptdata, (int) j + 2); //Record Type
              long nsize = LittleEndian.getUInt(pptdata, (int) j + 4); // Record size

              if (ntype == PPT_DIAGRAMGROUP_ATOM) { // Break the loop if next GroupID found
                i = j - 1;
                break;
              }
              else if (ntype == PPT_TEXTBYTE_ATOM) { //TextByteAtom record

                _outStream = new ByteArrayOutputStream();
                long ii = 0;
                for (ii = j + 6; ii <= j + 6 + nsize; ii++) { // TODO: move this loop into a function
                  short ch = Utils.convertBytesToShort(pptdata, (int) ii + 2);
                  if (ch == 0 || ch == 16 || ch == 13 || ch == 10) {
                    _outStream.write( (byte) '\r');

                  }
                  else if (ch == 0x201c) { // for left double quote
                    _outStream.write( (byte) 147);
                  }
                  else if (ch == 0x201d) { // for right double quote
                    _outStream.write( (byte) 148);

                  }
                  else if (ch == 0x2019) { // for right single quote
                    _outStream.write( (byte) 146);
                  }
                  else if (ch == 0x2018) { // for left single quote
                    _outStream.write( (byte) 145);
                  }
                  else if (ch == 0x2013) { // for '-' character
                    _outStream.write( (byte) 150);
                  }
                  else {
                    _outStream.write( (byte) ch);
                  }
                }

                // Setting the identified text for Current groupID
                _clientTextBox.setContent(_clientTextBox.getContent() +
                                          new String(_outStream.toByteArray()));

              }
              else if (ntype == PPT_TEXTCHAR_ATOM) { // TextCharAtom record
                _outStream = new ByteArrayOutputStream();
                String strTempContent = new String(pptdata, (int) j + 6,
                    (int) (nsize) + 2);
                byte bytes[] = strTempContent.getBytes();

                for (int ii = 0; ii < bytes.length - 1; ii += 2) { // TODO: move this loop into a function
                  short ch = Utils.convertBytesToShort(bytes, ii);
                  if (ch == 0 || ch == 16 || ch == 13 || ch == 10) {
                    _outStream.write( (byte) '\r');

                  }

                  else if (ch == 0x201c) {
                    _outStream.write( (byte) 147);

                  }
                  else if (ch == 0x201d) {
                    _outStream.write( (byte) 148);

                  }
                  else if (ch == 0x2019) {
                    _outStream.write( (byte) 146);
                  }
                  else if (ch == 0x2018) {
                    _outStream.write( (byte) 145);
                  }
                  else if (ch == 0x2013) { // for - character
                    _outStream.write( (byte) 150);
                  }

                  else {
                    _outStream.write( (byte) ch);
                  }
                }

                // Setting the identified text for Current groupID
                _clientTextBox.setContent(_clientTextBox.getContent() +
                                          new String(_outStream.toByteArray()));
              }
            }
            catch (Throwable e) {
              break;
            }
          }
        }
      }
      catch (Throwable ee) {
        return _containerClientTextBox;
      }
    }
    return _containerClientTextBox;
  }

  /**
   * Method that returns the Powerpoint place holders
   * @param offset
   * @param pptdata
   * @param offsetPD
   * @return Vector
   * @throws Throwable
   */

  public Vector extractPlaceHoders(long offset, byte[] pptdata, long offsetPD) throws
      Throwable {

    int sNum = 0;

    Vector slides = new Vector(); // Holds the data of all slides

    PPTSlide currentSlide = null; // Object to hold current slide data

    // To store data found in TextCharAtoms and TextBytesAtoms
    ByteArrayOutputStream _outStream = new ByteArrayOutputStream();

    for (long i = offset; i < pptdata.length - 20; i++) {
      try {
        long rinfo = LittleEndian.getUShort(pptdata, (int) i);
        long type = LittleEndian.getUShort(pptdata, (int) i + 2);
        long size = LittleEndian.getUInt(pptdata, (int) i + 4);

        if (type == PPT_TEXTBYTE_ATOM) { //TextByteAtom record

          _outStream = new ByteArrayOutputStream();
          long ii = 0;
          for (ii = i + 6; ii <= i + 6 + size; ii++) {
            short ch = Utils.convertBytesToShort(pptdata, (int) ii + 2);

            if (ch == 0 || ch == 16 || ch == 13 || ch == 10) {
              _outStream.write( (byte) '\r');

            }
            else if (ch == 0x201c) { // for left double quote
              _outStream.write( (byte) 147);
            }
            else if (ch == 0x201d) { // for right double quote
              _outStream.write( (byte) 148);
            }
            else if (ch == 0x2019) { // for right single quote
              _outStream.write( (byte) 146);
            }
            else if (ch == 0x2018) { // for left single quote
              _outStream.write( (byte) 145);
            }
            else if (ch == 0x2013) { // for '-' character
              _outStream.write( (byte) 150);
            }
            else {
              _outStream.write(ch);
            }
          }

          // Setting the identified text for Current Slide
          currentSlide.addContent(_outStream.toString());

        }
        else if (type == PPT_TEXTCHAR_ATOM) { //TextCharAtom record
          _outStream = new ByteArrayOutputStream();
          String strTempContent = new String(pptdata, (int) i + 6,
                                             (int) (size) + 2);
          byte bytes[] = strTempContent.getBytes();

          for (int ii = 0; ii < bytes.length - 1; ii += 2) {
            short ch = Utils.convertBytesToShort(bytes, ii);

            if (ch == 0 || ch == 16 || ch == 13 || ch == 10) {
              _outStream.write( (byte) '\r');
            }
            else if (ch == 0x201c) {
              _outStream.write( (byte) 147);
            }
            else if (ch == 0x201d) {
              _outStream.write( (byte) 148);
            }
            else if (ch == 0x2019) {
              _outStream.write( (byte) 146);
            }
            else if (ch == 0x2018) {
              _outStream.write( (byte) 145);
            }
            else if (ch == 0x2013) { // for - character
              _outStream.write( (byte) 150);
            }

            else {
              _outStream.write( (byte) ch);
            }
          }

          // Setting the identified text for Current Slide
          currentSlide.addContent(_outStream.toString());
        }
        else if (type == PPT_SLIDEPERSISTANT_ATOM) { // SlidePersistAtom record type
          if (sNum != 0) {
            _outStream = new ByteArrayOutputStream();

            long slideID = LittleEndian.getUInt(pptdata,
                                                (int) i + 20);

            currentSlide = new PPTSlide(slideID);
            //currentSlide.addContent(_outStream.toString());
            slides.addElement(currentSlide);
          }
          sNum++;
        }
        else
        if (type == PPT_DIAGRAMGROUP_ATOM) { //DG
          break;
        }
      }
      catch (Throwable ee) {

      }

      /*******************************************************************/

    }

    return slides;
  }

  /**
   * Constructor that takes a Powerpoint document name as an argument for getting the text
   * @param fileName
   */
  public PPT2Text(String fileName) {
    this.docName = fileName;
  }

  /**
   * Method that reads the Powerpoint document for parsing the text
   * @throws MSPowerPointDocumentParserException
   */
  public void read() throws MSPowerPointDocumentParserException {
    try {
      POIFSReader reader = new POIFSReader();
      reader.registerListener(new PPTListener());
      reader.read(new FileInputStream(docName));
    }
    catch (Throwable _docError) {
      throw new MSPowerPointDocumentParserException(
          "Unable to read the PPT Document \nError Cause : " + _docError,
          _docError);
    }

  }

  /**
   * returns the PowerPoint text
   * @return String
   */
  public String getText() {

    return pptTextBuffer.toString();
  }

  /**
   * Holds the Powerpoint document name
   */
  private String docName;

  /**
   * Holds the parsed Powerpoint Text
   */
  private StringBuffer pptTextBuffer = new StringBuffer();

}
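
[Editor's sketch, not part of the original post] The listing above narrows every character to a single byte and then rebuilds strings with new String(byte[]), which uses the platform default charset. That may be related to the character set problems reported on the list, since the result then depends on the machine running the extractor. A minimal sketch of decoding the two text atoms with an explicit charset, assuming that a TextCharAtom (4000) payload is UTF-16LE and that a TextBytesAtom (4008) payload holds the low byte of each character, which ISO-8859-1 maps one to one. The class and method names are hypothetical, and StandardCharsets needs Java 7 or later, which is newer than the JDK used in this thread:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the posted sources.
final class TextAtomDecodeSketch {
    static final int TEXT_CHARS_ATOM = 4000; // PPT_TEXTCHAR_ATOM in PPTConstants
    static final int TEXT_BYTES_ATOM = 4008; // PPT_TEXTBYTE_ATOM in PPTConstants

    // Decodes the payload of a text atom; start and length describe the
    // payload only, without the record header.
    static String decode(int type, byte[] data, int start, int length) {
        switch (type) {
            case TEXT_CHARS_ATOM:
                // assumed: two bytes per character, little-endian
                return new String(data, start, length, StandardCharsets.UTF_16LE);
            case TEXT_BYTES_ATOM:
                // assumed: one byte per character, high byte implied as 0x00
                return new String(data, start, length, StandardCharsets.ISO_8859_1);
            default:
                throw new IllegalArgumentException("not a text atom: " + type);
        }
    }

    public static void main(String[] args) {
        byte[] payload = "Gr\u00fc\u00dfe".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(decode(TEXT_CHARS_ATOM, payload, 0, payload.length));
    }
}
```

Decoding straight into a String keeps characters such as curly quotes (0x201c, 0x201d) intact, so the special cases that map them to Windows-1252 byte values would not be needed on this path.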
_________________________________________________________________________________________________







/**
 * Title: MSPowerPointDocumentParserException
 * Description: This is an exception class used for catching the runtime exceptions raised by the
 * Powerpoint document class
 * @author Sudhakar Chavali
 * @see microsoft.powerpoint.PowerPointDocument
 * @version 1.0
 */

public class MSPowerPointDocumentParserException
    extends DocumentParserException {

  /**
   * A constructor that builds the Exception object
   * @param message
   */
  public MSPowerPointDocumentParserException(String message) {
    super(message);
  }

  /**
   * A constructor that builds the Exception object
   * @param message
   * @param cause
   */

  public MSPowerPointDocumentParserException(String message, Throwable cause) {
    super(message, cause);
  }

}





=====
"No one can earn a million dollars honestly."- William Jennings Bryan (1860-1925) 

"Make everything as simple as possible, but not simpler."- Albert Einstein (1879-1955)

"It is dangerous to be sincere unless you are also stupid."- George Bernard Shaw (1856-1950)


		
__________________________________
Do you Yahoo!?
Yahoo! Mail - You care about security. So do we.
http://promotions.yahoo.com/new_mail

---------------------------------------------------------------------
To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: poi-user-help@jakarta.apache.org

Power Point Extractor Source Codes - Mail Continued

Posted by "Koundinya (Sudhakar Chavali)" <su...@yahoo.com>.

The main classes you can find there are

PPT2Text
PPTSlide
PPTConstants
PPTClientTextBox


---------------------------------------------------------------------
To unsubscribe, e-mail: slide-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: slide-user-help@jakarta.apache.org

Re: Power Point Extractor Source Codes

Posted by "Koundinya (Sudhakar Chavali)" <su...@yahoo.com>.

Here is the Ryan Rhodes Contribution

http://www.mail-archive.com/slide-dev@jakarta.apache.org/msg10445.html


Power Point Extractor Source Codes

Posted by "Koundinya (Sudhakar Chavali)" <su...@yahoo.com>.

Hello All,

This was my first contribution to the Jakarta team:
http://wiki.apache.org/jakarta-lucene-data/attachments/PowerPoint/attachments/PPT2Text.java
It seems another expert (Ryan Rhodes, ryanshaerhodes@hotmail.com) has already
started working on it, based on that first contribution.

That sounds great to me.

So, to speed up development of the Powerpoint extractor, I wanted to
contribute our team's work on it.

Authors :- Sudhakar Chavali (sudhakar_koundinya@yahoo.com) and Hari Shanker Goud
(harigoud@yahoo.com)


Have a look at the source code below.


Regards
Sudhakar

____________________________________________________________________________________

/**
 * Title: DocumentParserException class
 * Description: This is the root exception class for runtime errors that can be raised
 * by the different parsers
 * @author Sudhakar
 * @version 1.0
 */

public class DocumentParserException
    extends Exception {

  /**
   * Constructs a new exception with null as its detail message.
   */

  public DocumentParserException() {
  }

  /**
   * Constructs a new exception with the specified detail message.
   * @param message
   */

  public DocumentParserException(String message) {
    super(message);
  }

  /**
   * Constructs a new exception with the specified detail message.
   * @param message
   * @param cause
   */
  public DocumentParserException(String message, Throwable cause) {
    super(message, cause);
  }

}
_____________________________________________________________________________________

import java.io.*;

/**
 *
 * Title: Summary Base
 * Description: A generic interface that reads the document's summary information and returns it
 * through different methods
 * @author Sudhakar Chavali
 * @version 1.0
 */
public interface SummaryBase {
  /**
   * A method that returns the Document's Author
   * @return String
   */
  public String getDocAuthor();

  /**
   * A method that returns the Document Created Date
   * @return String
   */
  public String getDocCreatedDate();

  /**
   * A method that returns the Document's Key words
   * @return String
   */
  public String getDocKeywords();

  /**
   * A method that returns the Document's comments
   * @return String
   */
  public String getDocComments();

  /**
   * A method that returns the Document Name
   * @return String
   */
  public String getDocName();

  /**
   * A method that returns the Document's Subject
   * @return String
   */
  public String getDocSubject();

  /**
   * A method that returns the Document's title
   */

  public String getDocTitle();

  /**
   * A method that reads the document's Summary Information
   * @throws DocumentParserException
   */
  public void read() throws DocumentParserException;

  /**
   * A method that writes the Document's summary information as an XML into the file
   * @param strXMLFile
   * @throws DocumentParserException
   */
  public void write(String strXMLFile) throws 
      DocumentParserException;

  /**
   * A method that writes the document's summary information as an XML into OutputStream Object
   * @param out
   * @throws DocumentParserException
   */
  public void write(OutputStream out) throws 
      DocumentParserException;

  /**
   * A method that returns the Document's summary as an XML String
   * @return String
   * @throws DocumentParserException
   */
  public String getSummaryAsXML() throws 
      DocumentParserException;

  /**
   * A method that returns document's summary information as normal text
   * @return String
   * @throws DocumentParserException
   */
  public String getSummaryAsText() throws 
      DocumentParserException;
}

______________________________________________________________________________________

import java.io.*;

/**
 * A generic document that reads the document's text and parses it into normal ASCII text using
 * the different methods.
 */
public interface Document {

  /**
   * A method that returns the document's text after parsing. This method should be called after
   * calling the read method
   * @return String
   * @see #read()
   * @throws DocumentParserException
   */
  public abstract String getText() throws 
      DocumentParserException;

  /**
   * A method that returns the parsed text as byte array. This method should be called after
   * calling the read method
   * @return byte[]
   * @throws DocumentParserException
   */
  public abstract byte[] getBytes() throws 
      DocumentParserException;

  /**
   * A method that writes the parsed text into the OutputStream object. This method should be
   * called after calling the read method
   * @param out
   * @throws DocumentParserException
   */
  public abstract void write(OutputStream out) throws 
      DocumentParserException, Exception;

  /**
   * A method that reads and parses the document into Normal text
   * @throws DocumentParserException
   */
  public abstract void read() throws 
      DocumentParserException, Exception;

  /**
   * A method that returns the document summary information as Normal Text
   * @return String
   */
  public abstract String getDocumentSummaryAsText();

  /**
   * A method that returns the document summary information as an XML String
   * @return String
   */
  public abstract String getDocumentSummaryAsXML();

  /**
   * A method that returns the InputStream which contains the document text as
   * normal text
   * @return InputStream
   * @throws DocumentParserException
   */
  public abstract InputStream getDocumentTextStream() throws
      DocumentParserException;

  /**
   * A method that returns the InputStream which contains the document summary information as an
   * XML String
   * @return InputStream
   * @throws DocumentParserException
   */
  public abstract InputStream getDocumentSummaryStream() throws
      DocumentParserException;

}


______________________________________________________________________________________

/**
 * <p>Title: PPTSlide</p>
 * <p>Description: A class that holds the Powerpoint slide</p>
 * @author Hari Shanker, Sudhakar Chavali
 * @version 1.0
 */

import java.util.*;
import java.io.*;

class PPTSlide {
  /**
   * Holds the Slide Number
   */
  protected long slideNumber;

  /**
   * Holds the contents of the Slide
   */
  protected Vector contents;

  /**
   * Initialise the Object for holding the contents of Power Point Slide
   * @param number
   */
  public PPTSlide(long number) {
    slideNumber = number;
    contents = new Vector();
  }

  /**
   * Add the Content of Slide to this Object
   * @param content
   */

  public void addContent(String content) {
    contents.addElement(content);
  }

  /**
   * returns the contents of slide as a vector object
   * @return Vector
   */
  public Vector getContent() {
    return contents;
  }

  /**
   * returns the slide value
   * @return long
   */

  public long getSlideNumber() {
    return slideNumber;
  }

}

_____________________________________________________________________________________





import java.io.OutputStream;
import java.io.InputStream;
import java.io.*;

/**
 * <p>Title: Parsers</p>
 * <p>Description: Class that parses the Power Point Document Content and its summary to text</p>
 * @author Sudhakar Chavali
 * @version 1.0
 */

public class PPTDocument
    implements Document {
  /**
   * Checks whether the summary information was already read or not
   */

  private boolean bRead = false;

  /**
   * Holds the Powerpoint Document Summary Content
   */
  private SummaryBase summary = null;

  /**
   * Holds the Powerpoint document name
   */
  private String docName = null;

  /**
   * Holds the Powerpoint Document   Text
   */
  private String pptText = "";

  /**
   * Constructor for setting  the PowerPoint document path for initialising the POI object
   * @param docName
   */
  public PPTDocument(String docName) {
    this.docName = docName;
    summary = new MSDocumentSummary(docName);

  }

  /**
   * returns the parsed Powerpoint Document text
   * @return String
   */
  public String getText() {

    return pptText;

  }

  /**
   * returns the Powerpoint Document text as bytes
   * @return byte[]
   */
  public byte[] getBytes() {
    return getText().getBytes();
  }

  /**
   * Writes the Powerpoint Document Text into OutputStream object
   * @param out
   * @throws MSPowerPointDocumentParserException
   */
  public void write(OutputStream out) throws
      MSPowerPointDocumentParserException {
    try {
      out.write(getText().getBytes());
    }
    catch (Throwable _docError) {
      throw new MSPowerPointDocumentParserException(
          "Error Raised while writing the text into OutputStream Object \nError Cause : " +
          _docError, _docError);
    }

  }

  /**
   * Reads the Powerpoint document for getting the text from it.
   * @throws MSPowerPointDocumentParserException
   */
  public void read() throws MSPowerPointDocumentParserException {
    PPT2Text ppt2Text = new PPT2Text(docName);
    ppt2Text.read();
    pptText = ppt2Text.getText();

    pptText = pptText.replaceAll("\r", "\n");
    pptText = pptText.replaceAll("\n", "\r\n");

  }

  /**
   * returns the document summary as tab delimited text
   * @return
   */
  public String getDocumentSummaryAsText() {
    try {
      if (!bRead) {
        summary.read(); //read the summary object
        bRead = true; //ensure that summary information was read only one time
      }
      return summary.getSummaryAsText(); //Build and return the Summary as Normal text

    }
    catch (Exception ex) {
      return "";
    }

  }

  /**
   * returns the document summary as xml
   * @return
   */
  public String getDocumentSummaryAsXML() {
    try {
      if (!bRead) {
        summary.read(); //read the summary information of a document
        bRead = true; //ensure that summary information was read only one time
      }
      return summary.getSummaryAsXML(); //build and return the summary as XML string
    }
    catch (Exception ex) {
      return "";
    }

  }

  /**
   * returns the document text as Stream Object
   * @return InputStream
   * @throws MSPowerPointDocumentParserException
   */
  public InputStream getDocumentTextStream() throws
      MSPowerPointDocumentParserException {
    try {
      //Write the document text into an InputStream object
      ByteArrayInputStream in = new ByteArrayInputStream(this.getBytes());
      return (InputStream) in; //return InputStream Object
    }
    catch (Throwable _documentError) {
      //EXCEPTION RAISED WHILE CREATING THE InputStream OBJECT
      throw new MSPowerPointDocumentParserException(
          "Unable to return the document text as an InputStream\n\tException Root :" +
          _documentError);
    }

  }

  /**
   * returns the document summary as stream object
   * @return InputStream
   * @throws MSPowerPointDocumentParserException
   */
  public InputStream getDocumentSummaryStream() throws
      MSPowerPointDocumentParserException {
    try {
      // Wrap the summary text bytes in an InputStream object
      ByteArrayInputStream in = new ByteArrayInputStream(
          getDocumentSummaryAsText().getBytes());
      return in;
    }
    catch (Throwable _documentError) {
      // Error raised while creating the summary stream; rethrow it with the cause attached
      throw new MSPowerPointDocumentParserException(
          "Unable to get Document Summary Information as Stream\n\tException Root:  " +
          _documentError, _documentError);
    }

  }

}
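
A note on the byte conversions in PPTDocument: getBytes() and write(OutputStream) call String.getBytes() with no argument, which encodes with the platform default charset, so the same text can produce different bytes on, say, a MacRoman and a Cp1252 machine. The following is only a minimal sketch of naming the charset explicitly. The class ExplicitCharsetWriter and its methods are hypothetical and not part of the posted sources, UTF-8 is an assumed target encoding, and StandardCharsets needs Java 7 or later.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the posted sources.
final class ExplicitCharsetWriter {

    // Encodes the text as UTF-8 regardless of the platform default charset.
    static byte[] encodeUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Writes the UTF-8 bytes of the text to the given stream.
    static void writeUtf8(String text, OutputStream out) throws IOException {
        out.write(encodeUtf8(text));
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeUtf8("Gr\u00FC\u00DFe", out); // the German word "Gruesse" with u-umlaut and sharp s
        // U+00FC and U+00DF take two bytes each in UTF-8, so 5 characters become 7 bytes.
        System.out.println(out.size() + " bytes");
    }
}
```

Whoever reads the stream back has to decode with the same charset, so the choice would need to be documented alongside the Document interface.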
________________________________________________________________________________________________



/**
 * <p>Title: PPTConstants</p>
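 * <p>Description: An interface that holds the PowerPoint record type constants</p>
 * @author Sudhakar Chavali
 * @version 1.0
 */

public interface PPTConstants {

  // Record type values, read from bytes 2-3 of a record header
  public static final int PPT_SLIDEPERSISTANT_ATOM = 1011; // 0x03F3, SlidePersistAtom
  public static final int PPT_DIAGRAMGROUP_ATOM = 61448; // 0xF008, drawing (Dg) record
  public static final int PPT_TEXTCHAR_ATOM = 4000; // 0x0FA0, TextCharsAtom
  public static final int PPT_TEXTBYTE_ATOM = 4008; // 0x0FA8, TextBytesAtom
  public static final int PPT_USEREDIT_ATOM = 4085; // 0x0FF5, UserEditAtom

  // Not a record type: the ID value that the parser treats as the master slide
  public static final int PPT_MASTERSLIDE = 1024;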
}
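
The parser in PPT2Text compares these constants against a 16-bit value read at offset + 2 and takes a 32-bit size from offset + 4, which corresponds to an 8-byte little-endian record header. As a rough sketch of that layout only (PptRecordHeader is a hypothetical helper, not part of the posted sources, and it uses java.nio instead of POI's LittleEndian class so that it runs on its own):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not part of the posted sources.
final class PptRecordHeader {
    static final int SIZE = 8; // header length in bytes

    final int verAndInstance; // bytes 0-1: version and instance bits
    final int type;           // bytes 2-3: record type, e.g. 4008
    final long length;        // bytes 4-7: payload length in bytes

    private PptRecordHeader(int verAndInstance, int type, long length) {
        this.verAndInstance = verAndInstance;
        this.type = type;
        this.length = length;
    }

    // Reads one little-endian header starting at the given offset.
    static PptRecordHeader parse(byte[] data, int offset) {
        ByteBuffer buf = ByteBuffer.wrap(data, offset, SIZE).order(ByteOrder.LITTLE_ENDIAN);
        int verAndInstance = buf.getShort() & 0xFFFF; // unsigned 16-bit
        int type = buf.getShort() & 0xFFFF;           // unsigned 16-bit
        long length = buf.getInt() & 0xFFFFFFFFL;     // unsigned 32-bit
        return new PptRecordHeader(verAndInstance, type, length);
    }

    public static void main(String[] args) {
        // Header for a 3-byte payload with record type 0x0FA8 (4008), followed by "abc".
        byte[] sample = {0x00, 0x00, (byte) 0xA8, 0x0F, 0x03, 0x00, 0x00, 0x00, 'a', 'b', 'c'};
        PptRecordHeader header = parse(sample, 0);
        System.out.println(header.type + " " + header.length); // prints "4008 3"
    }
}
```

With a header object like this, a reader could skip from one record to the next by adding SIZE + length to the offset, rather than testing every byte position as the posted loops do.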

_________________________________________________________________________________________________






/**
 * <p>Title: PPTClientTextBox</p>
 * <p>Description: A class that holds the PowerPoint client text box content</p>
 * @author Hari Shanker, Sudhakar Chavali
 * @version 1.0
 */

import java.util.*;
import java.io.*;

public class PPTClientTextBox {
  /**
   * Holds the current id of a client text box
   */
  protected long currentID;

  /**
   * holds the content of client text box
   */
  protected String content;

  /**
   * Instantiates the client text box object
   * @param number
   */
  public PPTClientTextBox(long number) {
    currentID = number;
    this.content = "";
  }

  /**
   * Instantiates the client text box object
   * @param number
   * @param content
   */
  public PPTClientTextBox(long number, String content) {
    currentID = number;
    this.content = content;
  }

  /**
   * Sets the content of a client text box
   * @param content
   */
  public void setContent(String content) {
    this.content = content;
  }

  /**
   * returns the content of a client text box
   * @return
   */

  public String getContent() {
    return content;
  }

  /**
   * returns the current client text box id
   * @return long
   */
  public long getID() {
    return currentID;
  }
}
_________________________________________________________________________________________________



import java.io.*;
import java.util.*;
import org.apache.poi.hpsf.*;
import org.apache.poi.poifs.eventfilesystem.*;
import org.apache.poi.util.HexDump;
import org.apache.poi.util.LittleEndian;
import org.apache.poi.hdf.extractor.*;

/**
 * <p>Title: PPT2Text</p>
 * <p>Description: A class that parses the Powerpoint document content to text </p>
 * @author Hari Shanker, Sudhakar Chavali
 * @version 1.0
 */

public class PPT2Text
    implements PPTConstants {
  /**
   *
   * <p>Title: PPTListener</p>
   * <p>Description: Class that is used to handle the PowerPoint events</p>
   * @author Hari Shanker,Sudhakar Chavali
   * @version 1.0
   */

  class PPTListener
      implements POIFSReaderListener {

    public void processPOIFSReaderEvent(POIFSReaderEvent event) {

      try {

        org.apache.poi.poifs.filesystem.DocumentInputStream _documentStream = null;

        // Checking for PowerPoint Document Stream
        if (!event.getName().startsWith("PowerPoint Document")) {
          return;
        }

        _documentStream = event.getStream();

        byte pptdata[] = new byte[_documentStream.available()];
        _documentStream.read(pptdata, 0, _documentStream.available());

        int sNum = 0;

        long offset = 0, offsetEnd = 0;
        long offsetPD = 0, oldoffsetPD = 0, docRef = 0, maxPresist = 0;

        // Traverse Bytearray to get CurrentUserEditAtom

        // Call to extract the Text in all PlaceHolders

        // To hold PPTClientTextBox objects for mapping into Slide Objects
        java.util.Hashtable _containerClientTextBox = new java.util.Hashtable();

        // Traverse the byte array to identify edit paths of ClientTextBoxes
        for (long i = 0; i < pptdata.length - 20; i++) {
          long type = LittleEndian.getUShort(pptdata, (int) i + 2);
          long size = LittleEndian.getUInt(pptdata, (int) i + 4);

          if (type == PPT_USEREDIT_ATOM) { // Checking the Record Header (UserEditAtom)
            long lastSlideID = LittleEndian.getInt(pptdata, (int) i + 8);
            long version = LittleEndian.getUInt(pptdata, (int) i + 12);
            offset = LittleEndian.getUInt(pptdata, (int) i + 16);
            offsetPD = LittleEndian.getUInt(pptdata,
                                            (int) i + 20);

            // Call to extract ClientTextBox text in each UserEditAtom
            _containerClientTextBox = extractClientTextBoxes(
                _containerClientTextBox, offset, pptdata, offsetPD);

          }
        }

        Vector slides = extractPlaceHoders(offset, pptdata, offsetPD);

        if (slides.size() == 0) {
          slides.addElement(new PPTSlide(256));
        }

        PPTSlide _slide = (PPTSlide) slides.get(slides.size() - 1);

        for (Enumeration textBoxes = _containerClientTextBox.elements();
             textBoxes.hasMoreElements(); ) {
          PPTClientTextBox _clientTextBox = (PPTClientTextBox) textBoxes.nextElement();
          _slide.addContent(_clientTextBox.getContent());

        }

        //Merging ClientTextBox data with Slide Data

        // Printing the text from Slides vector object (need further modification)

        for (int i = 0; i < slides.size(); i++) {

          _slide = (PPTSlide) slides.get(i);

          Vector scontent = _slide.getContent();
//          StringBuffer _stringbuffer = new StringBuffer();
          for (int j = 0; j < scontent.size(); j++) {
            pptTextBuffer.append(scontent.get(j).toString());
          }

        }

      }
      catch (Throwable ex) {

        return;
      }
    }
  }

  /**
   * Method that returns the client text boxes of a slide
   * @param clientTextBoxContainer
   * @param offset
   * @param pptBytes
   * @param offsetPD
   * @return Hashtable
   * @throws MSPowerPointDocumentParserException
   */

  public java.util.Hashtable extractClientTextBoxes(java.util.Hashtable
      _containerClientTextBox, long offset, byte[] pptdata, long offsetPD) throws
      Throwable {

    // To hold temporary data
    ByteArrayOutputStream _outStream = new ByteArrayOutputStream();

    PPTClientTextBox _clientTextBox = null;

    // Traverse the byte array up to the persist directory position
    for (long i = offset; i < offsetPD - 20; i++) {

      try {

        long rinfo = LittleEndian.getUShort(pptdata, (int) i);
        long type = LittleEndian.getUShort(pptdata, (int) i + 2); // Record Type
        long size = LittleEndian.getUInt(pptdata, (int) i + 4); // Record Size

        if (type == PPT_DIAGRAMGROUP_ATOM) { //Record type is of Drawing Group

          long shapeCount = LittleEndian.getUInt(pptdata, (int) i + 8); // Total number of objects
          // Group ID + number of objects
          long _currentID = LittleEndian.getInt(pptdata, (int) i + 12);

          _currentID = ( (int) (_currentID / 1024)) * 1024;

          if (_currentID == PPT_MASTERSLIDE) { // Ignore Master Slide objects
            i++;
            continue;
          }

          //Check for the ClientTextBox GroupID existence
          if (!_containerClientTextBox.containsKey(new Long(_currentID))) {
            _clientTextBox = new PPTClientTextBox(_currentID);
            _containerClientTextBox.put(new Long(_currentID), _clientTextBox);
          }
          else {
            // If exists get Client Textbox Group
            _clientTextBox = (PPTClientTextBox) _containerClientTextBox.get(new
                Long(_currentID));
            _clientTextBox.setContent("");
          }

          // Iterating the bytearray for TextCharAtoms and TextBytesAtom
          for (long j = i + 8; j < offsetPD - 20; j++) {
            try {
              long nrinfo = LittleEndian.getUShort(pptdata, (int) j);
              long ntype = LittleEndian.getUShort(pptdata, (int) j + 2); //Record Type
              long nsize = LittleEndian.getUInt(pptdata, (int) j + 4); // Record size

              if (ntype == PPT_DIAGRAMGROUP_ATOM) { // Break the loop if next GroupID found
                i = j - 1;
                break;
              }
              else if (ntype == PPT_TEXTBYTE_ATOM) { //TextByteAtom record

                _outStream = new ByteArrayOutputStream();
                long ii = 0;
                // TODO: move this loop into a helper function
                for (ii = j + 6; ii <= j + 6 + nsize; ii++) {
                  short ch = Utils.convertBytesToShort(pptdata, (int) ii + 2);
                  if (ch == 0 || ch == 16 || ch == 13 || ch == 10) {
                    _outStream.write( (byte) '\r');

                  }
                  else if (ch == 0x201c) { // for left double quote
                    _outStream.write( (byte) 147);
                  }
                  else if (ch == 0x201d) { // for right double quote
                    _outStream.write( (byte) 148);

                  }
                  else if (ch == 0x2019) { // for right single quote
                    _outStream.write( (byte) 146);
                  }
                  else if (ch == 0x2018) { // for left single quote
                    _outStream.write( (byte) 145);
                  }
                  else if (ch == 0x2013) { // for '-' character
                    _outStream.write( (byte) 150);
                  }
                  else {
                    _outStream.write( (byte) ch);
                  }
                }

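                // Setting the identified text for the current group ID.
                // NOTE: new String(byte[]) decodes with the platform default charset, and
                // the bytes 145-150 written above are Windows-1252 codes, so the result
                // depends on the platform encoding.
                _clientTextBox.setContent(_clientTextBox.getContent() +
                                          new String(_outStream.toByteArray()));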

              }
              else if (ntype == PPT_TEXTCHAR_ATOM) { // TextCharAtom record
                _outStream = new ByteArrayOutputStream();
                // Read the two-byte character values straight from the record bytes.
                // Decoding them to a String and re-encoding with the platform default
                // charset can corrupt them.
                int start = (int) j + 6;
                int end = start + (int) (nsize) + 2;

                // TODO: move this loop into a helper function
                for (int ii = start; ii < end - 1; ii += 2) {
                  short ch = Utils.convertBytesToShort(pptdata, ii);
                  if (ch == 0 || ch == 16 || ch == 13 || ch == 10) {
                    _outStream.write( (byte) '\r');

                  }

                  else if (ch == 0x201c) {
                    _outStream.write( (byte) 147);

                  }
                  else if (ch == 0x201d) {
                    _outStream.write( (byte) 148);

                  }
                  else if (ch == 0x2019) {
                    _outStream.write( (byte) 146);
                  }
                  else if (ch == 0x2018) {
                    _outStream.write( (byte) 145);
                  }
                  else if (ch == 0x2013) { // for - character
                    _outStream.write( (byte) 150);
                  }

                  else {
                    _outStream.write( (byte) ch);
                  }
                }

                // Setting the identified text for Current groupID
                _clientTextBox.setContent(_clientTextBox.getContent() +
                                          new String(_outStream.toByteArray()));
              }
            }
            catch (Throwable e) {
              break;
            }
          }
        }
      }
      catch (Throwable ee) {
        return _containerClientTextBox;
      }
    }
    return _containerClientTextBox;
  }

  /**
   * Method that returns the Powerpoint place holders
   * @param offset
   * @param pptBytes
   * @param offsetPD
   * @return Vector
   * @throws MSPowerPointDocumentParserException
   */

  public Vector extractPlaceHoders(long offset, byte[] pptdata, long offsetPD) throws
      Throwable {

    int sNum = 0;

    Vector slides = new Vector(); // Holds the data of all slides

    PPTSlide currentSlide = null; // Object to hold current slide data

    // To store data found in TextCharAtoms and TextBytesAtoms
    ByteArrayOutputStream _outStream = new ByteArrayOutputStream();

    for (long i = offset; i < pptdata.length - 20; i++) {
      try {
        long rinfo = LittleEndian.getUShort(pptdata, (int) i);
        long type = LittleEndian.getUShort(pptdata, (int) i + 2);
        long size = LittleEndian.getUInt(pptdata, (int) i + 4);

        if (type == PPT_TEXTBYTE_ATOM) { //TextByteAtom record

          _outStream = new ByteArrayOutputStream();
          long ii = 0;
          for (ii = i + 6; ii <= i + 6 + size; ii++) {
            short ch = Utils.convertBytesToShort(pptdata, (int) ii + 2);

            if (ch == 0 || ch == 16 || ch == 13 || ch == 10) {
              _outStream.write( (byte) '\r');

            }
            else if (ch == 0x201c) { // for left double quote
              _outStream.write( (byte) 147);
            }
            else if (ch == 0x201d) { // for right double quote
              _outStream.write( (byte) 148);
            }
            else if (ch == 0x2019) { // for right single quote
              _outStream.write( (byte) 146);
            }
            else if (ch == 0x2018) { // for left single quote
              _outStream.write( (byte) 145);
            }
            else if (ch == 0x2013) { // for '-' character
              _outStream.write( (byte) 150);
            }
            else {
              _outStream.write(ch);
            }
          }

          // Setting the identified text for Current Slide
          currentSlide.addContent(_outStream.toString());

        }
        else if (type == PPT_TEXTCHAR_ATOM) { //TextCharAtom record
          _outStream = new ByteArrayOutputStream();
          // Read the two-byte character values straight from the record bytes.
          // Decoding them to a String and re-encoding with the platform default
          // charset can corrupt them.
          int start = (int) i + 6;
          int end = start + (int) (size) + 2;

          for (int ii = start; ii < end - 1; ii += 2) {
            short ch = Utils.convertBytesToShort(pptdata, ii);

            if (ch == 0 || ch == 16 || ch == 13 || ch == 10) {
              _outStream.write( (byte) '\r');
            }
            else if (ch == 0x201c) {
              _outStream.write( (byte) 147);
            }
            else if (ch == 0x201d) {
              _outStream.write( (byte) 148);
            }
            else if (ch == 0x2019) {
              _outStream.write( (byte) 146);
            }
            else if (ch == 0x2018) {
              _outStream.write( (byte) 145);
            }
            else if (ch == 0x2013) { // for - character
              _outStream.write( (byte) 150);
            }

            else {
              _outStream.write( (byte) ch);
            }
          }

          // Setting the identified text for Current Slide
          currentSlide.addContent(_outStream.toString());
        }
        else if (type == PPT_SLIDEPERSISTANT_ATOM) { // SlidePersistAtom record type
          if (sNum != 0) {
            _outStream = new ByteArrayOutputStream();

            long slideID = LittleEndian.getUInt(pptdata,
                                                (int) i + 20);

            currentSlide = new PPTSlide(slideID);
            //currentSlide.addContent(_outStream.toString());
            slides.addElement(currentSlide);
          }
          sNum++;
        }
        else
        if (type == PPT_DIAGRAMGROUP_ATOM) { //DG
          break;
        }
      }
      catch (Throwable ee) {

      }

      /*******************************************************************/

    }

    return slides;
  }

  /**
   * Constructor that takes a Powerpoint document name as an argument for getting the text
   * @param fileName
   */
  public PPT2Text(String fileName) {
    this.docName = fileName;
  }

  /**
   * Method that reads the Powerpoint document for parsing the text
   * @throws MSPowerPointDocumentParserException
   */
  public void read() throws MSPowerPointDocumentParserException {
    try {
      POIFSReader reader = new POIFSReader();
      reader.registerListener(new PPTListener());
      reader.read(new FileInputStream(docName));
    }
    catch (Throwable _docError) {
      throw new MSPowerPointDocumentParserException(
          "Unable to read the PPT Document \nError Cause : " + _docError,
          _docError);
    }

  }

  /**
   * returns the PowerPoint text
   * @return String
   */
  public String getText() {

    return pptTextBuffer.toString();
  }

  /**
   * Holds the Powerpoint document name
   */
  private String docName;

  /**
   * Holds the parsed Powerpoint Text
   */
  private StringBuffer pptTextBuffer = new StringBuffer();

}
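
On the character-set problems reported with this kind of extractor: both text branches above narrow each character to a single byte and then build a String with the platform default charset, so the output depends on where the code runs. The sketch below shows one alternative, assuming the payload layouts implied by the constants: record type 4000 (PPT_TEXTCHAR_ATOM) holds two-byte little-endian characters, and record type 4008 (PPT_TEXTBYTE_ATOM) holds one byte per character in the range U+0000 to U+00FF. PptTextDecoder is a hypothetical helper, not part of the posted sources, it has not been tried on real presentations, and StandardCharsets needs Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the posted sources.
final class PptTextDecoder {

    // Payload of record type 4000: UTF-16 little-endian code units (assumed layout).
    static String decodeTextChars(byte[] data, int offset, int length) {
        return new String(data, offset, length, StandardCharsets.UTF_16LE);
    }

    // Payload of record type 4008: one byte per character (assumed layout).
    // ISO-8859-1 maps bytes 0x00-0xFF to U+0000-U+00FF one to one.
    static String decodeTextBytes(byte[] data, int offset, int length) {
        return new String(data, offset, length, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // The same five characters (G, r, U+00FC, U+00DF, e) in both encodings.
        byte[] chars = {0x47, 0x00, 0x72, 0x00, (byte) 0xFC, 0x00, (byte) 0xDF, 0x00, 0x65, 0x00};
        byte[] bytes = {0x47, 0x72, (byte) 0xFC, (byte) 0xDF, 0x65};
        String fromChars = decodeTextChars(chars, 0, chars.length);
        String fromBytes = decodeTextBytes(bytes, 0, bytes.length);
        System.out.println(fromChars.equals(fromBytes)); // prints "true"
    }
}
```

Decoding this way keeps the text as Java characters from the start, so the special cases for the curly quotes and the dash (0x2018 to 0x201d, 0x2013) would no longer need to be mapped to byte values by hand.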
_________________________________________________________________________________________________







/**
 * Title: MSPowerPointDocumentParserException
 * Description: An exception class used to report runtime errors raised while parsing a
 * PowerPoint document
 * @author Sudhakar Chavali
 * @see microsoft.powerpoint.PowerPointDocument
 * @version 1.0
 */

public class MSPowerPointDocumentParserException
    extends DocumentParserException {

  /**
   * A constructor that builds the Exception object
   * @param message
   */
  public MSPowerPointDocumentParserException(String message) {
    super(message);
  }

  /**
   * A constructor that builds the Exception object
   * @param message
   * @param cause
   */

  public MSPowerPointDocumentParserException(String message, Throwable cause) {
    super(message, cause);
  }

}





=====
"No one can earn a million dollars honestly."- William Jennings Bryan (1860-1925) 

"Make everything as simple as possible, but not simpler."- Albert Einstein (1879-1955)

"It is dangerous to be sincere unless you are also stupid."- George Bernard Shaw (1856-1950)


	
		


Power Point Extractor Source Codes

Posted by "Koundinya (Sudhakar Chavali)" <su...@yahoo.com>.

Hello All,

This was my first contribution to the Jakarta team:
http://wiki.apache.org/jakarta-lucene-data/attachments/PowerPoint/attachments/PPT2Text.java
It seems another expert (Ryan Rhodes, ryanshaerhodes@hotmail.com) has already started
working on it, based on that first contribution.

That sounds great to me.

To speed up development of the PowerPoint extractor, I would like to contribute our
team's work on it.

Authors: Sudhakar Chavali (sudhakar_koundinya@yahoo.com) and Hari Shanker Goud
(harigoud@yahoo.com)


Please have a look at the source code below.


Regards
Sudhakar

____________________________________________________________________________________

/**
 * Title: DocumentParserException class
 * Description: Root exception class for the runtime errors that the different parsers
 * can raise
 * @author Sudhakar
 * @version 1.0
 */

public class DocumentParserException
    extends Exception {

  /**
   * Constructs a new exception with null as its detail message.
   */

  public DocumentParserException() {
  }

  /**
   * Constructs a new exception with the specified detail message.
   * @param message
   */

  public DocumentParserException(String message) {
    super(message);
  }

  /**
   * Constructs a new exception with the specified detail message and cause.
   * @param message
   * @param cause
   */
  public DocumentParserException(String message, Throwable cause) {
    super(message, cause);
  }

}
_____________________________________________________________________________________

import java.io.*;

/**
 *
 * Title: Summary Base
 * Description: A generic interface that reads a document's summary information and returns
 * it through its accessor methods
 * @author Sudhakar Chavali
 * @version 1.0
 */
public interface SummaryBase {
  /**
   * A method that returns the Document's Author
   * @return String
   */
  public String getDocAuthor();

  /**
   * A method that returns the Document Created Date
   * @return String
   */
  public String getDocCreatedDate();

  /**
   * A method that returns the Document's Key words
   * @return String
   */
  public String getDocKeywords();

  /**
   * A method that returns the Document's comments
   * @return String
   */
  public String getDocComments();

  /**
   * A method that returns the Document Name
   * @return String
   */
  public String getDocName();

  /**
   * A method that returns the Document's Subject
   * @return String
   */
  public String getDocSubject();

  /**
   * A method that returns the Document's title
   */

  public String getDocTitle();

  /**
   * A method that reads the document's Summary Information
   * @throws DocumentParserException
   */
  public void read() throws DocumentParserException;

  /**
   * A method that writes the Document's summary information as an XML into the file
   * @param strXMLFile
   * @throws DocumentParserException
   */
  public void write(String strXMLFile) throws 
      DocumentParserException;

  /**
   * A method that writes the document's summary information as an XML into OutputStream Object
   * @param out
   * @throws DocumentParserException
   */
  public void write(OutputStream out) throws 
      DocumentParserException;

  /**
   * A method that returns the Document's summary as an XML String
   * @return String
   * @throws DocumentParserException
   */
  public String getSummaryAsXML() throws 
      DocumentParserException;

  /**
   * A method that returns document's summary information as normal text
   * @return String
   * @throws DocumentParserException
   */
  public String getSummaryAsText() throws 
      DocumentParserException;
}

______________________________________________________________________________________

import java.io.*;

/**
 * A generic document that reads the document's text and parses it into plain ASCII text
 * using the different methods.
 */
public interface Document {

  /**
   * A method that returns the document's text after parsing. This method should be called
   * after calling the read method
   * @return String
   * @see #read()
   * @throws DocumentParserException
   */
  public abstract String getText() throws 
      DocumentParserException;

  /**
   * A method that returns the parsed text as a byte array. This method should be called
   * after calling the read method
   * @return byte[]
   * @throws DocumentParserException
   */
  public abstract byte[] getBytes() throws 
      DocumentParserException;

  /**
   * A method that writes the parsed text into the OutputStream object. This method should
   * be called after calling the read method
   * @param out
   * @throws DocumentParserException
   */
  public abstract void write(OutputStream out) throws 
      DocumentParserException, Exception;

  /**
   * A method that reads and parses the document into Normal text
   * @throws DocumentParserException
   */
  public abstract void read() throws 
      DocumentParserException, Exception;

  /**
   * A method that returns the document summary information as Normal Text
   * @return String
   */
  public abstract String getDocumentSummaryAsText();

  /**
   * A method that returns the document summary information as an XML String
   * @return String
   */
  public abstract String getDocumentSummaryAsXML();

  /**
   * A method that returns an InputStream containing the parsed document text
   * @return InputStream
   * @throws DocumentParserException
   */
  public abstract InputStream getDocumentTextStream() throws 
      DocumentParserException;

  /**
   * A method that returns the InputStream which contains the document summary information
   * as an XML String
   * @return InputStream
   * @throws DocumentParserException
   */
  public abstract InputStream getDocumentSummaryStream() throws 
      DocumentParserException;

}


______________________________________________________________________________________

/**
 * <p>Title: PPTSlide</p>
 * <p>Description: A class that holds the Powerpoint slide</p>
 * @author Hari Shanker, Sudhakar Chavali
 * @version 1.0
 */

import java.util.*;
import java.io.*;

class PPTSlide {
  /**
   * Holds the Slide Number
   */
  protected long slideNumber;

  /**
   * Holds the contents of the Slide
   */
  protected Vector contents;

  /**
   * Initialise the Object for holding the contents of Power Point Slide
   * @param number
   */
  public PPTSlide(long number) {
    slideNumber = number;
    contents = new Vector();
  }

  /**
   * Add the Content of Slide to this Object
   * @param content
   */

  public void addContent(String content) {
    contents.addElement(content);
  }

  /**
   * returns the contents of slide as a vector object
   * @return Vector
   */
  public Vector getContent() {
    return contents;
  }

  /**
   * returns the slide value
   * @return long
   */

  public long getSlideNumber() {
    return slideNumber;
  }

}

_____________________________________________________________________________________





import java.io.OutputStream;
import java.io.InputStream;
import java.io.*;

/**
 * <p>Title: Parsers</p>
 * <p>Description: Class that parses the Power Point Document Content and it's summary to text</p>
 * @author Sudhakar Chavali
 * @version 1.0
 */

public class PPTDocument
    implements Document {
  /**
   * Checks whether the summary information was already read or not
   */

  private boolean bRead = false;

  /**
   * Holds the Powerpoint Document Summary Content
   */
  private SummaryBase summary = null;

  /**
   * Holds the PowerPoint document name (file path)
   */
  private String docName = null;

  /**
   * Holds the Powerpoint Document   Text
   */
  private String pptText = "";

  /**
   * Constructor for setting  the PowerPoint document path for initialising the POI object
   * @param docName
   */
  public PPTDocument(String docName) {
    this.docName = docName;
    summary = new MSDocumentSummary(docName);

  }

  /**
   * returns the parsed Powerpoint Document text
   * @return String
   */
  public String getText() {

    return pptText;

  }

  /**
   * returns the Powerpoint Document text as bytes
   * @return byte[]
   */
  public byte[] getBytes() {
    return getText().getBytes();
  }

  /**
   * Writes the Powerpoint Document Text into OutputStream object
   * @param out
   * @throws MSPowerPointDocumentParserException
   */
  public void write(OutputStream out) throws
      MSPowerPointDocumentParserException {
    try {
      out.write(getText().getBytes());
    }
    catch (Throwable _docError) {
      throw new MSPowerPointDocumentParserException(
          "Error Raised while writing the text into OutputStream Object \nError Cause : " +
          _docError, _docError);
    }

  }

  /**
   * Reads the Powerpoint document for getting the text from it.
   * @throws MSPowerPointDocumentParserException
   */
  public void read() throws MSPowerPointDocumentParserException {
    PPT2Text ppt2Text = new PPT2Text(docName);
    ppt2Text.read();
    pptText = ppt2Text.getText();

    pptText = pptText.replaceAll("\r", "\n");
    pptText = pptText.replaceAll("\n", "\r\n");

  }

  /**
   * Returns the document summary as tab-delimited text.
   * @return the summary text, or an empty string if it cannot be read
   */
  public String getDocumentSummaryAsText() {
    try {
      if (!bRead) {
        summary.read(); //read the summary object
        bRead = true; //ensure that summary information was read only one time
      }
      return summary.getSummaryAsText(); //Build and return the Summary as Normal text

    }
    catch (Exception ex) {
      return "";
    }

  }

  /**
   * Returns the document summary as XML.
   * @return the summary as an XML string, or an empty string if it cannot be read
   */
  public String getDocumentSummaryAsXML() {
    try {
      if (!bRead) {
        summary.read(); //read the summary information of a document
        bRead = true; //ensure that summary information was read only one time
      }
      return summary.getSummaryAsXML(); //build and return the summary as XML string
    }
    catch (Exception ex) {
      return "";
    }

  }

  /**
   * returns the document text as Stream Object
   * @return InputStream
   * @throws MSPowerPointDocumentParserException
   */
  public InputStream getDocumentTextStream() throws
      MSPowerPointDocumentParserException {
    try {
      // Wrap the document text bytes in an InputStream object
      ByteArrayInputStream in = new ByteArrayInputStream(this.getBytes());
      return in; // a ByteArrayInputStream is already an InputStream
    }
    catch (Throwable _documentError) {
      // Exception raised while creating the InputStream object; keep it as the cause
      throw new MSPowerPointDocumentParserException(
          "Unable to return the document text as an InputStream\n\tException Root :" +
          _documentError, _documentError);
    }

  }

  /**
   * returns the document summary as stream object
   * @return InputStream
   * @throws MSPowerPointDocumentParserException
   */
  public InputStream getDocumentSummaryStream() throws
      MSPowerPointDocumentParserException {
    try {
      // Wrap the summary text bytes in an InputStream object
      ByteArrayInputStream in = new ByteArrayInputStream(
          getDocumentSummaryAsText().getBytes());
      return in;
    }
    catch (Throwable _documentError) {
      // Error raised while creating the summary stream; rethrow it with the cause attached
      throw new MSPowerPointDocumentParserException(
          "Unable to get Document Summary Information as Stream\n\tException Root:  " +
          _documentError, _documentError);
    }

  }

}
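
One detail in PPTDocument.read(): it turns every "\r" into "\n" and then every "\n" into "\r\n". If the parsed text ever contains a "\r\n" pair already, the two passes produce a doubled line break. A minimal sketch of a single-pass alternative follows. The class LineEndings and its method names are hypothetical and not part of the posted sources, and twoPass exists only to mirror the posted logic for comparison.

```java
// Hypothetical helper, not part of the posted sources.
final class LineEndings {

    // Single pass: "\r\n" is tried first, so an existing pair counts as one line break.
    static String toCrLf(String text) {
        return text.replaceAll("\r\n|\r|\n", "\r\n");
    }

    // Mirrors the two replaceAll calls in PPTDocument.read() for comparison.
    static String twoPass(String text) {
        return text.replaceAll("\r", "\n").replaceAll("\n", "\r\n");
    }

    public static void main(String[] args) {
        String mixed = "a\rb\nc\r\nd";
        // Count the '\n' characters produced by each variant.
        System.out.println(toCrLf(mixed).split("\n", -1).length - 1);  // prints "3"
        System.out.println(twoPass(mixed).split("\n", -1).length - 1); // prints "4"
    }
}
```

Since the posted extractor writes only "\r" for line breaks, the two variants should agree on its current output. The difference would only show up if the parser starts emitting "\r\n" itself.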
________________________________________________________________________________________________



/**
 * <p>Title: PPTConstants</p>
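 * <p>Description: An interface that holds the PowerPoint record type constants</p>
 * @author Sudhakar Chavali
 * @version 1.0
 */

public interface PPTConstants {

  // Record type values, read from bytes 2-3 of a record header
  public static final int PPT_SLIDEPERSISTANT_ATOM = 1011; // 0x03F3, SlidePersistAtom
  public static final int PPT_DIAGRAMGROUP_ATOM = 61448; // 0xF008, drawing (Dg) record
  public static final int PPT_TEXTCHAR_ATOM = 4000; // 0x0FA0, TextCharsAtom
  public static final int PPT_TEXTBYTE_ATOM = 4008; // 0x0FA8, TextBytesAtom
  public static final int PPT_USEREDIT_ATOM = 4085; // 0x0FF5, UserEditAtom

  // Not a record type: the ID value that the parser treats as the master slide
  public static final int PPT_MASTERSLIDE = 1024;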
}

_________________________________________________________________________________________________






/**
 * <p>Title: PPTClientTextBox</p>
 * <p>Description: A class that holds the PowerPoint client text box content</p>
 * @author Hari Shanker, Sudhakar Chavali
 * @version 1.0
 */

import java.util.*;
import java.io.*;

public class PPTClientTextBox {
  /**
   * Holds the current id of a client text box
   */
  protected long currentID;

  /**
   * holds the content of client text box
   */
  protected String content;

  /**
   * Instantiates the client text box object
   * @param number
   */
  public PPTClientTextBox(long number) {
    currentID = number;
    this.content = "";
  }

  /**
   * Instantiates the client text box object
   * @param number
   * @param content
   */
  public PPTClientTextBox(long number, String content) {
    currentID = number;
    this.content = content;
  }

  /**
   * Sets the content of a client text box
   * @param content
   */
  public void setContent(String content) {
    this.content = content;
  }

  /**
   * returns the content of a client text box
   * @return String
   */

  public String getContent() {
    return content;
  }

  /**
   * returns the current client text box id
   * @return long
   */
  public long getID() {
    return currentID;
  }
}
_________________________________________________________________________________________________



import java.io.*;
import java.util.*;
import org.apache.poi.hpsf.*;
import org.apache.poi.poifs.eventfilesystem.*;
import org.apache.poi.util.HexDump;
import org.apache.poi.util.LittleEndian;
import org.apache.poi.hdf.extractor.*;

/**
 * <p>Title: PPT2Text</p>
 * <p>Description: A class that parses the Powerpoint document content to text </p>
 * @author Hari Shanker, Sudhakar Chavali
 * @version 1.0
 */

public class PPT2Text
    implements PPTConstants {
  /**
   *
   * <p>Title: PPTListener</p>
   * <p>Description: Class used to handle the PowerPoint events</p>
   * @author Hari Shanker,Sudhakar Chavali
   * @version 1.0
   */

  class PPTListener
      implements POIFSReaderListener {

    public void processPOIFSReaderEvent(POIFSReaderEvent event) {

      try {

        org.apache.poi.poifs.filesystem.DocumentInputStream _documentStream = null;

        // Checking for PowerPoint Document Stream
        if (!event.getName().startsWith("PowerPoint Document")) {
          return;
        }

        _documentStream = event.getStream();

        byte pptdata[] = new byte[_documentStream.available()];
        _documentStream.read(pptdata, 0, _documentStream.available());

        int sNum = 0;

        long offset = 0, offsetEnd = 0;
        long offsetPD = 0, oldoffsetPD = 0, docRef = 0, maxPresist = 0;

        // Traverse Bytearray to get CurrentUserEditAtom

        // Call to extract the Text in all PlaceHolders

        // To hold PPTClientTextBox objects for mapping into Slide Objects
        java.util.Hashtable _containerClientTextBox = new java.util.Hashtable();

        // Traverse the byte array to identify edit paths of ClientTextBoxes
        for (long i = 0; i < pptdata.length - 20; i++) {
          long type = LittleEndian.getUShort(pptdata, (int) i + 2);
          long size = LittleEndian.getUInt(pptdata, (int) i + 4);

          if (type == PPT_USEREDIT_ATOM) { // Checking the Record Header (UserEditAtom)
            long lastSlideID = LittleEndian.getInt(pptdata, (int) i + 8);
            long version = LittleEndian.getUInt(pptdata, (int) i + 12);
            offset = LittleEndian.getUInt(pptdata, (int) i + 16);
            offsetPD = LittleEndian.getUInt(pptdata,
                                            (int) i + 20);

            // Call to extract ClientTextBox text in each UserEditAtom
            _containerClientTextBox = extractClientTextBoxes(
                _containerClientTextBox, offset, pptdata, offsetPD);

          }
        }

        Vector slides = extractPlaceHoders(offset, pptdata, offsetPD);

        if (slides.size() == 0) {
          slides.addElement(new PPTSlide(256));
        }

        PPTSlide _slide = (PPTSlide) slides.get(slides.size() - 1);

        // "enum" is a reserved word from Java 5 on, so the enumeration needs another name
        for (Enumeration boxes = _containerClientTextBox.elements();
             boxes.hasMoreElements(); ) {
          PPTClientTextBox _clientTextBox = (PPTClientTextBox) boxes.nextElement();
          _slide.addContent(_clientTextBox.getContent());

        }

        //Merging ClientTextBox data with Slide Data

        // Printing the text from Slides vector object (need further modification)

        for (int i = 0; i < slides.size(); i++) {

          _slide = (PPTSlide) slides.get(i);

          Vector scontent = _slide.getContent();
//          StringBuffer _stringbuffer = new StringBuffer();
          for (int j = 0; j < scontent.size(); j++) {
            pptTextBuffer.append(scontent.get(j).toString());
          }

        }

      }
      catch (Throwable ex) {

        return;
      }
    }
  }

  /**
   * Method that returns the client text boxes of a slide
   * @param _containerClientTextBox
   * @param offset
   * @param pptdata
   * @param offsetPD
   * @return Hashtable
   * @throws Throwable
   */

  public java.util.Hashtable extractClientTextBoxes(java.util.Hashtable
      _containerClientTextBox, long offset, byte[] pptdata, long offsetPD) throws
      Throwable {

    // To hold temporary data
    ByteArrayOutputStream _outStream = new ByteArrayOutputStream();

    PPTClientTextBox _clientTextBox = null;

    // Traversing the byte array up to the persist directory position
    for (long i = offset; i < offsetPD - 20; i++) {

      try {

        long rinfo = LittleEndian.getUShort(pptdata, (int) i);
        long type = LittleEndian.getUShort(pptdata, (int) i + 2); // Record Type
        long size = LittleEndian.getUInt(pptdata, (int) i + 4); // Record Size

        if (type == PPT_DIAGRAMGROUP_ATOM) { //Record type is of Drawing Group

          long shapeCount = LittleEndian.getUInt(pptdata, (int) i + 8); // Total number of objects
          long _currentID = LittleEndian.getInt(pptdata, (int) i + 12); // Group ID + number of objects

          _currentID = ( (int) (_currentID / 1024)) * 1024;

          if (_currentID == PPT_MASTERSLIDE) { // Ignore Master Slide objects
            i++;
            continue;
          }

          //Check for the ClientTextBox GroupID existence
          if (!_containerClientTextBox.containsKey(new Long(_currentID))) {
            _clientTextBox = new PPTClientTextBox(_currentID);
            _containerClientTextBox.put(new Long(_currentID), _clientTextBox);
          }
          else {
            // If exists get Client Textbox Group
            _clientTextBox = (PPTClientTextBox) _containerClientTextBox.get(new
                Long(_currentID));
            _clientTextBox.setContent("");
          }

          // Iterating the bytearray for TextCharAtoms and TextBytesAtom
          for (long j = i + 8; j < offsetPD - 20; j++) {
            try {
              long nrinfo = LittleEndian.getUShort(pptdata, (int) j);
              long ntype = LittleEndian.getUShort(pptdata, (int) j + 2); //Record Type
              long nsize = LittleEndian.getUInt(pptdata, (int) j + 4); // Record size

              if (ntype == PPT_DIAGRAMGROUP_ATOM) { // Break the loop if next GroupID found
                i = j - 1;
                break;
              }
              else if (ntype == PPT_TEXTBYTE_ATOM) { //TextByteAtom record

                _outStream = new ByteArrayOutputStream();
                long ii = 0;
                for (ii = j + 6; ii <= j + 6 + nsize; ii++) { // TODO: move this loop into a function
                  short ch = Utils.convertBytesToShort(pptdata, (int) ii + 2);
                  if (ch == 0 || ch == 16 || ch == 13 || ch == 10) {
                    _outStream.write( (byte) '\r');

                  }
                  else if (ch == 0x201c) { // for left double quote
                    _outStream.write( (byte) 147);
                  }
                  else if (ch == 0x201d) { // for right double quote
                    _outStream.write( (byte) 148);

                  }
                  else if (ch == 0x2019) { // for right single quote
                    _outStream.write( (byte) 146);
                  }
                  else if (ch == 0x2018) { // for left single quote
                    _outStream.write( (byte) 145);
                  }
                  else if (ch == 0x2013) { // for '-' character
                    _outStream.write( (byte) 150);
                  }
                  else {
                    _outStream.write( (byte) ch);
                  }
                }

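                // Setting the identified text for Current groupID.
                // Decode explicitly as windows-1252 (bytes 145-150 above are windows-1252
                // quote and dash codes) instead of the platform default charset.
                _clientTextBox.setContent(_clientTextBox.getContent() +
                                          new String(_outStream.toByteArray(), "windows-1252"));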

              }
              else if (ntype == PPT_TEXTCHAR_ATOM) { // TextCharAtom record
                _outStream = new ByteArrayOutputStream();
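                // ISO-8859-1 maps every byte to one char, so this round trip keeps the
                // raw bytes intact whatever the platform default charset is
                String strTempContent = new String(pptdata, (int) j + 6,
                    (int) (nsize) + 2, "ISO-8859-1");
                byte bytes[] = strTempContent.getBytes("ISO-8859-1");

                for (int ii = 0; ii < bytes.length - 1; ii += 2) { // TODO: move this loop into a function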
                  short ch = Utils.convertBytesToShort(bytes, ii);
                  if (ch == 0 || ch == 16 || ch == 13 || ch == 10) {
                    _outStream.write( (byte) '\r');

                  }

                  else if (ch == 0x201c) {
                    _outStream.write( (byte) 147);

                  }
                  else if (ch == 0x201d) {
                    _outStream.write( (byte) 148);

                  }
                  else if (ch == 0x2019) {
                    _outStream.write( (byte) 146);
                  }
                  else if (ch == 0x2018) {
                    _outStream.write( (byte) 145);
                  }
                  else if (ch == 0x2013) { // for - character
                    _outStream.write( (byte) 150);
                  }

                  else {
                    _outStream.write( (byte) ch);
                  }
                }

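                // Setting the identified text for Current groupID.
                // Decode explicitly as windows-1252 (bytes 145-150 above are windows-1252
                // quote and dash codes) instead of the platform default charset.
                _clientTextBox.setContent(_clientTextBox.getContent() +
                                          new String(_outStream.toByteArray(), "windows-1252"));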
              }
            }
            catch (Throwable e) {
              break;
            }
          }
        }
      }
      catch (Throwable ee) {
        return _containerClientTextBox;
      }
    }
    return _containerClientTextBox;
  }

  /**
   * Method that returns the PowerPoint placeholders
   * @param offset
   * @param pptdata
   * @param offsetPD
   * @return Vector
   * @throws Throwable
   */

  public Vector extractPlaceHoders(long offset, byte[] pptdata, long offsetPD) throws
      Throwable {

    int sNum = 0;

    Vector slides = new Vector(); // Holds the data of all slides

    PPTSlide currentSlide = null; // Object to hold current slide data

    // To store data found in TextCharAtoms and TextBytesAtoms
    ByteArrayOutputStream _outStream = new ByteArrayOutputStream();

    for (long i = offset; i < pptdata.length - 20; i++) {
      try {
        long rinfo = LittleEndian.getUShort(pptdata, (int) i);
        long type = LittleEndian.getUShort(pptdata, (int) i + 2);
        long size = LittleEndian.getUInt(pptdata, (int) i + 4);

        if (type == PPT_TEXTBYTE_ATOM) { //TextByteAtom record

          _outStream = new ByteArrayOutputStream();
          long ii = 0;
          for (ii = i + 6; ii <= i + 6 + size; ii++) {
            short ch = Utils.convertBytesToShort(pptdata, (int) ii + 2);

            if (ch == 0 || ch == 16 || ch == 13 || ch == 10) {
              _outStream.write( (byte) '\r');

            }
            else if (ch == 0x201c) { // for left double quote
              _outStream.write( (byte) 147);
            }
            else if (ch == 0x201d) { // for right double quote
              _outStream.write( (byte) 148);
            }
            else if (ch == 0x2019) { // for right single quote
              _outStream.write( (byte) 146);
            }
            else if (ch == 0x2018) { // for left single quote
              _outStream.write( (byte) 145);
            }
            else if (ch == 0x2013) { // for '-' character
              _outStream.write( (byte) 150);
            }
            else {
              _outStream.write(ch);
            }
          }

          // Setting the identified text for Current Slide.
          // Decode explicitly as windows-1252 rather than the platform default charset.
          currentSlide.addContent(_outStream.toString("windows-1252"));

        }
        else if (type == PPT_TEXTCHAR_ATOM) { //TextCharAtom record
          _outStream = new ByteArrayOutputStream();
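          // ISO-8859-1 maps every byte to one char, so this round trip keeps the
          // raw bytes intact whatever the platform default charset is
          String strTempContent = new String(pptdata, (int) i + 6,
                                             (int) (size) + 2, "ISO-8859-1");
          byte bytes[] = strTempContent.getBytes("ISO-8859-1");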

          for (int ii = 0; ii < bytes.length - 1; ii += 2) {
            short ch = Utils.convertBytesToShort(bytes, ii);

            if (ch == 0 || ch == 16 || ch == 13 || ch == 10) {
              _outStream.write( (byte) '\r');
            }
            else if (ch == 0x201c) {
              _outStream.write( (byte) 147);
            }
            else if (ch == 0x201d) {
              _outStream.write( (byte) 148);
            }
            else if (ch == 0x2019) {
              _outStream.write( (byte) 146);
            }
            else if (ch == 0x2018) {
              _outStream.write( (byte) 145);
            }
            else if (ch == 0x2013) { // for - character
              _outStream.write( (byte) 150);
            }

            else {
              _outStream.write( (byte) ch);
            }
          }

          // Setting the identified text for Current Slide.
          // Decode explicitly as windows-1252 rather than the platform default charset.
          currentSlide.addContent(_outStream.toString("windows-1252"));
        }
        else if (type == PPT_SLIDEPERSISTANT_ATOM) { // SlidePersistAtom record type
          if (sNum != 0) {
            _outStream = new ByteArrayOutputStream();

            long slideID = LittleEndian.getUInt(pptdata,
                                                (int) i + 20);

            currentSlide = new PPTSlide(slideID);
            //currentSlide.addContent(_outStream.toString());
            slides.addElement(currentSlide);
          }
          sNum++;
        }
        else
        if (type == PPT_DIAGRAMGROUP_ATOM) { //DG
          break;
        }
      }
      catch (Throwable ee) {

      }

      /*******************************************************************/

    }

    return slides;
  }

  /**
   * Constructor that takes a Powerpoint document name as an argument for getting the text
   * @param fileName
   */
  public PPT2Text(String fileName) {
    this.docName = fileName;
  }

  /**
   * Method that reads the Powerpoint document for parsing the text
   * @throws MSPowerPointDocumentParserException
   */
  public void read() throws MSPowerPointDocumentParserException {
    try {
      POIFSReader reader = new POIFSReader();
      reader.registerListener(new PPTListener());
      reader.read(new FileInputStream(docName));
    }
    catch (Throwable _docError) {
      throw new MSPowerPointDocumentParserException(
          "Unable to read the PPT Document \nError Cause : " + _docError,
          _docError);
    }

  }

  /**
   * returns the PowerPoint text
   * @return String
   */
  public String getText() {

    return pptTextBuffer.toString();
  }

  /**
   * Holds the Powerpoint document name
   */
  private String docName;

  /**
   * Holds the parsed Powerpoint Text
   */
  private StringBuffer pptTextBuffer = new StringBuffer();

}
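
Decoding the atom bytes with an explicit charset, rather than the platform default, is one way to avoid the kind of umlaut problem reported in this thread. The sketch below is a hedged illustration and not the code from PPT2Text: the PptTextDecoder class and its method names are made up for this example, and treating 8-bit text atoms (type 4008) as windows-1252 is an assumption (ISO-8859-1 would be the other obvious candidate). The 16-bit text atoms (type 4000) hold little-endian UTF-16 code units.

```java
import java.nio.charset.Charset;

// Illustrative helper, not part of the posted PPT2Text sources.
final class PptTextDecoder {

  // Assumption: 8-bit text atoms (type 4008) are treated as windows-1252.
  private static final Charset BYTE_TEXT = Charset.forName("windows-1252");

  // 16-bit text atoms (type 4000) hold little-endian UTF-16 code units.
  private static final Charset CHAR_TEXT = Charset.forName("UTF-16LE");

  // Decodes one byte per character, independent of the platform default charset.
  static String decodeTextBytes(byte[] data, int offset, int length) {
    return new String(data, offset, length, BYTE_TEXT);
  }

  // Decodes two bytes per character, low byte first.
  static String decodeTextChars(byte[] data, int offset, int length) {
    return new String(data, offset, length, CHAR_TEXT);
  }

  public static void main(String[] args) {
    // The German word "Gruesse" written with u-umlaut and sharp s, one byte per character
    byte[] bytes = { 0x47, 0x72, (byte) 0xFC, (byte) 0xDF, 0x65 };
    // A-umlaut followed by a left double quotation mark, two bytes per character
    byte[] chars = { (byte) 0xC4, 0x00, 0x1C, 0x20 };
    System.out.println(decodeTextBytes(bytes, 0, bytes.length));
    System.out.println(decodeTextChars(chars, 0, chars.length));
  }
}
```

Because the result is an ordinary Java String, characters outside the Latin-1 range survive, which is not the case when each 16-bit value is cast down to a single byte.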
_________________________________________________________________________________________________







/**
 * Title: MSPowerPointDocumentParserException
 * Description: An exception class used for catching the runtime exceptions of the PowerPoint document class
 * @author Sudhakar Chavali
 * @see microsoft.powerpoint.PowerPointDocument
 * @version 1.0
 */

public class MSPowerPointDocumentParserException
    extends DocumentParserException {

  /**
   * A constructor that builds the Exception object
   * @param message
   */
  public MSPowerPointDocumentParserException(String message) {
    super(message);
  }

  /**
   * A constructor that builds the Exception object
   * @param message
   * @param cause
   */

  public MSPowerPointDocumentParserException(String message, Throwable cause) {
    super(message, cause);
  }

}





=====
"No one can earn a million dollars honestly."- William Jennings Bryan (1860-1925) 

"Make everything as simple as possible, but not simpler."- Albert Einstein (1879-1955)

"It is dangerous to be sincere unless you are also stupid."- George Bernard Shaw (1856-1950)



---------------------------------------------------------------------
To unsubscribe, e-mail: slide-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: slide-user-help@jakarta.apache.org

RE: MSPowerPointExtractor problem

Posted by "Koundinya (Sudhakar Chavali)" <su...@yahoo.com>.

Check this,

http://wiki.apache.org/jakarta-lucene-data/attachments/PowerPoint/attachments/PPT2Text.java
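
One more thing about the contentStringForData method from the original post, separate from the charset question: its do/while loop appends the character before testing for end of stream, so the -1 that read() returns at the end is cast to char and ends up in the result as \uFFFF. A minimal sketch of a loop that tests first follows; the ReaderText class and the readAll method are names made up for this example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Illustrative helper, not part of Slide or of the posted sources.
final class ReaderText {

  // Reads the whole reader into a String without appending the end-of-stream marker.
  static String readAll(Reader reader) throws IOException {
    StringBuffer buf = new StringBuffer();
    int c;
    // Test for -1 before appending, so the marker never reaches the buffer
    while ((c = reader.read()) != -1) {
      buf.append((char) c);
    }
    return buf.toString();
  }

  public static void main(String[] args) throws IOException {
    String text = readAll(new StringReader("some slide text"));
    System.out.println(text.length());
  }
}
```

The Reader returned by the extractor could be passed to readAll in the same way as the StringReader here.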


