Posted to user@poi.apache.org by prakash jaya <pr...@hotmail.com> on 2005/10/25 15:31:43 UTC

how can i extract text from Powerpointfiles,Ms word files

Hi friends,

I want to extract text from PPT and Word files. Is it possible to extract
text from Word and PPT files with the POI library? I would be thankful if
someone could provide a solution.


---------------------------------------------------------------------
To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
Mailing List:     http://jakarta.apache.org/site/mail2.html#poi
The Apache Jakarta Poi Project:  http://jakarta.apache.org/poi/


Re: how can i extract text from Powerpointfiles,Ms word files

Posted by Nick Burch <ni...@torchbox.com>.
On Thu, 27 Oct 2005, prakash jaya wrote:
> But how do I get text from MS Word files? I ran the HWPFDocument.java
> class. According to the specification it takes two arguments (one is the
> source file, the other is the destination file). It does not give any
> result, and it does not create a destination file either. Could you
> please give a solution to this problem? I would be thankful.

Use the code snippet I gave in my first email reply to you

Nick


Re: how can i extract text from Powerpointfiles,Ms word files

Posted by eswar <es...@cgvakindia.com>.
Hi Jaya,

Better to search the Apache CVS website, where you can get the example
files. I don't know the exact link, but you should be able to get the
correct example from that CVS. Search for the HWPFD download.

It is very simple. If not, mail me.

with regards
eswar

Re: how can i extract text from Powerpointfiles,Ms word files

Posted by ac...@apache.org.
Just a quick reminder. I, and most committers who have looked at it, don't
recommend using the present HWPF codebase. We will have a new codebase,
which we're negotiating. I've been meaning to update the documentation
to reflect this.

The present HWPF codebase is very bad, devoid of meaningful unit tests, 
chokes on a lot of different docs, and will do a better job of 
corrupting data than writing it.

-Andy


-- 
Andrew C. Oliver
SuperLink Software, Inc.

Java to Excel using POI
http://www.superlinksoftware.com/services/poi
Commercial support including features added/implemented, bugs fixed.



Re: how can i extract text from Powerpointfiles,Ms word files

Posted by prakash jaya <pr...@hotmail.com>.
Hello friends, thanks for your suggestions.

I tried it and got some result, but I am not getting the whole document. I
am getting only the first line of the document. Please give a solution to
this. Here is my code:

////////////////////////////////////////////////////////////////////
import java.io.*;

import org.apache.poi.hwpf.usermodel.*;
import org.apache.poi.hwpf.HWPFDocument;

public class Test11
{
  public Test11()
  {
  }

  public static void main(String[] args) throws IOException
  {
    try
    {
      HWPFDocument doc = new HWPFDocument(new FileInputStream(fin));
      Range r = doc.getRange();
      FileOutputStream out = new FileOutputStream("d:\\example.txt");

      for (int x = 0; x < r.numSections(); x++)
      {
        Section s = r.getSection(x);
        for (int y = 0; y < s.numParagraphs(); y++)
        {
          Paragraph p = s.getParagraph(y);
          for (int z = 0; z < p.numCharacterRuns(); z++)
          {
            // character run
            CharacterRun run = p.getCharacterRun(z);
            // character run text
            String text = run.text();
            byte[] b1 = text.getBytes();

            // show us the text
            out.write(b1);
          }

          out.close();
        }
      }
    }
    catch (Throwable t)
    {
      t.printStackTrace();
    }
  }
}



///////////////////////////////////////////////////////
my original document is:


I want to read a powerpoit file "A" and write it content to create another 
powerpoint "B".

The simple way is to use FileInputStream to read a byte array from file 
A.ppt and FileOutputStream to write the byte array to B.ppt. It's work.

But today, i don't want to use raw byte array to write to B.ppt 
immediately(Just the program's demand, i do not want to do this too!!:<). I 
translate the byte array to "String", then translate it back to byte array 
and write it to B.ppt. One problem happens!!


this program output is:
I want to read a powerpoit file "A" and write it content to create another 
powerpoint "B".


Please give a solution to this problem. I would be thankful.
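
One likely reason only the first paragraph reaches the file: in the code
above, out.close() sits inside the paragraph loop. The stream is closed
after the first paragraph, so the next out.write() throws an IOException
and the catch block ends the run. The sketch below shows the corrected
shape using only the standard library, so it runs without POI. The
paragraph strings stand in for the text of each HWPF CharacterRun, and the
names CloseAfterLoop and writeAll are hypothetical. It also names the
charset explicitly rather than relying on the platform default, which is an
added assumption and not part of the original code.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class CloseAfterLoop {

    // Writes every paragraph first; try-with-resources closes the stream
    // exactly once, after the loop has finished.
    static void writeAll(List<String> paragraphs, Path target) throws IOException {
        try (OutputStream out = Files.newOutputStream(target)) {
            for (String paragraph : paragraphs) {
                out.write(paragraph.getBytes(StandardCharsets.UTF_8));
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in data; in the thread this text would come from run.text().
        List<String> paragraphs = Arrays.asList(
                "first paragraph\n", "second paragraph\n", "third paragraph\n");
        Path target = Files.createTempFile("example", ".txt");
        writeAll(paragraphs, target);
        System.out.print(new String(Files.readAllBytes(target), StandardCharsets.UTF_8));
    }
}
```

In the posted program the equivalent change would be to move out.close()
below the outermost for loop.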

Re: how can i extract text from Powerpointfiles,Ms word files

Posted by Rama Subba Reddy <su...@yahoo.co.in>.
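Hello,
  use the following code to extract the text:

HWPFDocument doc = new HWPFDocument(fin);   // fin: an InputStream on the .doc file
Range range = doc.getRange();
int totParagraphs = range.numParagraphs();
for (int i = 0; i < totParagraphs; i++) {
    Paragraph para = range.getParagraph(i);
    // get each text run from para, then get the text and properties from the run
    for (int j = 0; j < para.numCharacterRuns(); j++) {
        CharacterRun run = para.getCharacterRun(j);
        String text = run.text();
    }
}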

Re: how can i extract text from Powerpointfiles,Ms word files

Posted by prakash jaya <pr...@hotmail.com>.
Hello friend, good morning.

I am getting text from PowerPoint presentations using the PowerPointExtractor
class of POI. But how do I get text from MS Word files? I ran the
HWPFDocument.java class. According to the specification it takes two
arguments (one is the source file, the other is the destination file). It
does not give any result, and it does not create a destination file either.
Could you please give a solution to this problem? I would be thankful.


Re: how can i extract text from Powerpointfiles,Ms word files

Posted by Nick Burch <ni...@torchbox.com>.
On Wed, 26 Oct 2005, prakash jaya wrote:
> By running the PowerPointExtractor.java class I am getting the output on
> the console (System.out). I want to get it in a text file, so I changed the
> PowerPointExtractor.java class, compiled it, and replaced the old .class in
> the jar (poi-scratchpad-3.0) with the new one. But it still gives the
> output on the same target (System.out). Below is my changed code:
>
> 	PrintWriter out = new PrintWriter (new FileOutputStream("d:\\vipin.txt"));
> 	PowerPointExtractor ppe = new PowerPointExtractor(file);
> 	System.out.println(ppe.getText(true,notes));
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 	out.println(ppe.getText(true,notes));
> 	out.close();
> 	ppe.close();
>
> Why is it giving the same output?

Just a little bit of a "Doh!" moment on your part...

You're printing the extracted text to System.out, in addition to your 
file, hence you're seeing it on your screen

Nick
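
The point can be sketched without POI at all. In the sketch below the
hypothetical extractedText parameter stands in for the result of
ppe.getText(true, notes), and the class name, method name and temporary
file are placeholders. The only change from the posted code is that the
System.out.println call is gone, so the text goes to the file alone.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class WriteTextToFile {

    // Writes the text to the target file only; nothing is echoed to System.out.
    static void writeToFile(String extractedText, Path target) throws IOException {
        try (PrintWriter out = new PrintWriter(
                Files.newBufferedWriter(target, StandardCharsets.UTF_8))) {
            out.println(extractedText);
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder text; in the thread this would come from ppe.getText(true, notes).
        String extractedText = "Slide 1 title";
        Path target = Files.createTempFile("slides", ".txt");
        writeToFile(extractedText, target);
    }
}
```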


Re: how can i extract text from Powerpointfiles,Ms word files

Posted by prakash jaya <pr...@hotmail.com>.
Hi friend, thanks for your advice.

But I have one doubt:

By running the PowerPointExtractor.java class I am getting the output on the
console (System.out). I want to get it in a text file, so I changed the
PowerPointExtractor.java class, compiled it, and replaced the old .class in
the jar (poi-scratchpad-3.0) with the new one. But it still gives the output
on the same target (System.out). Below is my changed code:


	PrintWriter out = new PrintWriter (new FileOutputStream("d:\\vipin.txt"));
	PowerPointExtractor ppe = new PowerPointExtractor(file);
	System.out.println(ppe.getText(true,notes));
	out.println(ppe.getText(true,notes));
	out.close();
	ppe.close();

Why is it giving the same output? Please give a solution to this problem.

regards,
prakash.


Re: how can i extract text from Powerpointfiles,Ms word files

Posted by prakash jaya <pr...@hotmail.com>.
Thank you for your advice, but I didn't understand what the CVS build and the
main JAR are. I copied the scratchpad JAR, poi-3.0-alpha1-20051025.jar, and
poi-contrib-3.0-alpha1-20051025.jar to my JDK 1.5 (lib) and JRE 1.5 (lib/ext)
directories to run the PowerPointExtractor.java file. Will that work?
Otherwise, please tell me what I should do. I would be thankful.


Re: how can i extract text from Powerpointfiles,Ms word files

Posted by Nick Burch <ni...@torchbox.com>.
On Tue, 25 Oct 2005, prakash jaya wrote:
> I want to extract text from PPT and Word files. Is it possible to extract
> text from Word and PPT files with the POI library? I would be thankful if
> someone could provide a solution.

PowerPoint Text Extraction:
 	http://jakarta.apache.org/poi/hslf/quick-guide.html

Word Text Extraction:

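 	import java.io.*;
 	import org.apache.poi.hwpf.usermodel.*;
 	import org.apache.poi.hwpf.HWPFDocument;

 	public String getWordText(InputStream inputStream) throws IOException {
 		// The HWPFDocument constructor reads the stream and can throw IOException
 		HWPFDocument wdoc = new HWPFDocument(inputStream);
 		// extractText is a helper whose body is not shown in this message
 		String docText = extractText(wdoc);
 		return docText;
 	}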


For both of these, you'll need a recent CVS build, and you'll need both 
the main JAR, and the scratchpad JAR. If you don't fancy doing that 
yourself, grab copies from:
 	http://encore.torchbox.com/poi-cvs-build/

Nick
