Posted to user@poi.apache.org by Suba Suresh <su...@wolfram.com> on 2006/06/26 19:11:41 UTC

PowerPoint extractor

Hi!

	I just want to extract the text to index it in Lucene. I am using the 
poi-3.0 jar files, with the hslf package from the scratchpad jar.

My code is as follows

	PowerPointExtractor ppExtractor =
		new PowerPointExtractor(new FileInputStream("filename.ppt"));
	String text = ppExtractor.getText();

but I am getting the following exceptions. What am I doing wrong?

Exception in thread "main" java.lang.NullPointerException
	at org.apache.poi.hslf.usermodel.SlideShow.buildSlidesAndNotes(SlideShow.java:211)
	at org.apache.poi.hslf.usermodel.SlideShow.<init>(SlideShow.java:83)
	at org.apache.poi.hslf.extractor.PowerPointExtractor.<init>(PowerPointExtractor.java:85)
	at indexing.PowerPointHandler.getDocument(PowerPointHandler.java:22)
	at indexing.IndexFiles.indexFile(IndexFiles.java:132)
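
A minimal sketch of how an indexing loop might survive a failure like the one above. The extraction call is wrapped so that an unchecked exception thrown inside the library (such as this NullPointerException) yields an empty string for that one file instead of aborting the whole run. It does not fix the underlying problem in the library. SafeExtract, TextExtractor and extractOrEmpty are hypothetical names introduced only for illustration and are not part of POI; in real use the lambda would wrap the PowerPointExtractor call shown earlier.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class SafeExtract {

    /** Hypothetical stand-in for a call such as new PowerPointExtractor(in).getText(). */
    public interface TextExtractor {
        String extract(InputStream in) throws IOException;
    }

    /** Returns the extracted text, or "" if the extractor fails on this input. */
    public static String extractOrEmpty(TextExtractor extractor, InputStream in) {
        // try-with-resources closes the stream whether or not extraction succeeds.
        try (InputStream stream = in) {
            String text = extractor.extract(stream);
            return text == null ? "" : text;
        } catch (IOException | RuntimeException e) {
            // Log and carry on, so that one bad file does not stop the indexer.
            System.err.println("Skipping unreadable document: " + e);
            return "";
        }
    }

    public static void main(String[] args) {
        InputStream empty = new ByteArrayInputStream(new byte[0]);
        // Simulate an extractor that throws an unchecked exception.
        String text = extractOrEmpty(in -> { throw new NullPointerException("simulated"); }, empty);
        System.out.println("[" + text + "]");
    }
}
```

Returning an empty string keeps the caller simple; a real indexer might prefer to record the file name so the failing file can be attached to a bug report.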
	

thanks,
suba suresh

---------------------------------------------------------------------
To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
Mailing List:     http://jakarta.apache.org/site/mail2.html#poi
The Apache Jakarta Poi Project:  http://jakarta.apache.org/poi/


Re: PowerPoint extractor

Posted by Nick Burch <ni...@torchbox.com>.
On Wed, 28 Jun 2006, Suba Suresh wrote:
> I have given the link to the file and opened a bug report. Yesterday's
> build was giving me lots of "No core record found with ID
> 235 with PersistPtr lookup" warnings.

OK, I'll take a look when I'm back from ApacheCon.

Nick



Re: PowerPoint extractor

Posted by Suba Suresh <su...@wolfram.com>.
I have given the link to the file and opened a bug report. Yesterday's 
build was giving me lots of "No core record found with ID 
235 with PersistPtr lookup" warnings.

thanks,
suba suresh.





Re: PowerPoint extractor

Posted by Suba Suresh <su...@wolfram.com>.
I tried the July 4th build. The warnings are gone. Thank you.

I used the following code to index a couple of small Excel files with 
Lucene. I don't know how effective the search is going to be, since it 
is still in the implementation stage. If there are any errors, please 
let me know.

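import java.io.IOException;
import java.io.InputStream;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.poi.poifs.filesystem.DocumentInputStream;
import org.apache.poi.poifs.filesystem.POIFSDocument;

// DocumentHandler and DocumentHandlerException are defined elsewhere in the
// indexing code (not shown).
public class ExcelHandler implements DocumentHandler {

	private final String fileName;

	public ExcelHandler(String name) {
		fileName = name;
	}

	public Document getDocument(InputStream is) throws DocumentHandlerException {
		Document doc = new Document();
		try {
			// Read the raw bytes of the file through POIFS.
			POIFSDocument pdoc = new POIFSDocument(fileName, is);
			DocumentInputStream docis = new DocumentInputStream(pdoc);
			byte[] content = new byte[docis.available()];
			docis.read(content);
			docis.close();

			// Keep printable ASCII bytes as characters and turn everything else
			// into a space. (Appending new Byte(b).toString() would store the
			// decimal value of each byte instead of the text.)
			StringBuffer textBuf = new StringBuffer();
			for (int i = 0; i < content.length; i++) {
				byte b = content[i];
				textBuf.append((b >= 0x20 && b <= 0x7E) ? (char) b : ' ');
			}
			String text = textBuf.toString();
			if (text.trim().length() > 0) {
				// TOKENIZED makes the body searchable; Field.Index.NO would
				// only store it without indexing it.
				doc.add(new Field("body", text, Field.Store.YES, Field.Index.TOKENIZED));
			}
		} catch (IOException io) {
			throw new DocumentHandlerException("Cannot parse Excel Document", io);
		}
		return doc;
	}
}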

Separately, in another file, I am indexing the filename, file path, and 
date as keywords. Hope it helps.

thanks,
suba suresh.







Re: PowerPoint extractor

Posted by Nick Burch <ni...@torchbox.com>.
On Tue, 27 Jun 2006, Suba Suresh wrote:
> Thank you for all the pointers. They are a great help. I used today's
> build, and it worked fine for WordDocument. I have not tried the metadata
> yet. For PowerPoint, I am getting the following warnings from the
> PowerPoint extractor for just one file. Am I doing anything wrong? I
> didn't change my code.

These errors should now have gone. Can you try a new svn checkout /
tomorrow's SVN build?


> Also, since some of the Excel files were not in 97-2002 format, I used the
> POIFSFileSystem and read them as a byte stream, storing the result as a
> text string. I hope that is fine.

If you have some code for getting some basic text out of Excel 95 files,
we'd be interested in hosting it. I'm sure that something that outputs
text that can be fed to lucene would be useful for a lot of people, even
if that's all the excel 95 support we have.

Nick



Re: PowerPoint extractor

Posted by Suba Suresh <su...@wolfram.com>.
Thank you for all the pointers. They are a great help. I used today's 
build, and it worked fine for WordDocument. I have not tried the metadata 
yet. For PowerPoint, I am getting the following warnings from the 
PowerPoint extractor for just one file. Am I doing anything wrong? I 
didn't change my code.

No core record found with ID 3 based on PersistPtr lookup
No core record found with ID 10 based on PersistPtr lookup
No core record found with ID 12 based on PersistPtr lookup
No core record found with ID 13 based on PersistPtr lookup
No core record found with ID 16 based on PersistPtr lookup
......
......
......
......
No core record found with ID 246 based on PersistPtr lookup

PowerPointExtractor ppExtractor =
     new PowerPointExtractor(new FileInputStream("filename.ppt"));
String text = ppExtractor.getText();

Also, since some of the Excel files were not in 97-2002 format, I used the 
POIFSFileSystem and read them as a byte stream, storing the result as a 
text string. I hope that is fine.
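
A minimal sketch of one way to turn such a byte stream into text that is usable for indexing, assuming the strings of interest are stored as single-byte ASCII. It keeps only runs of printable characters above a minimum length, much like the Unix strings tool, so that stray binary bytes do not end up in the index. PrintableRuns, extract and minRun are hypothetical names introduced here for illustration; nothing in this sketch comes from POI.

```java
public class PrintableRuns {

    /** Joins runs of at least minRun printable ASCII bytes with single spaces. */
    public static String extract(byte[] data, int minRun) {
        StringBuilder out = new StringBuilder();
        StringBuilder run = new StringBuilder();
        for (byte b : data) {
            if (b >= 0x20 && b <= 0x7E) {
                run.append((char) b);      // printable ASCII: extend the current run
            } else {
                flush(run, out, minRun);   // anything else ends the run
            }
        }
        flush(run, out, minRun);
        return out.toString();
    }

    /** Appends the run to out if it is long enough, then clears it. */
    private static void flush(StringBuilder run, StringBuilder out, int minRun) {
        if (run.length() >= minRun) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(run);
        }
        run.setLength(0);
    }

    public static void main(String[] args) {
        byte[] sample = {0x09, 0x08, 'T', 'o', 't', 'a', 'l', 0x00, 'a', 'b', 0x01, 'S', 'a', 'l', 'e', 's'};
        System.out.println(extract(sample, 4)); // prints "Total Sales"
    }
}
```

Text stored as 16-bit characters will be missed by a scan like this, so it is only a rough fallback for files that a proper extractor cannot read.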

thanks,
suba suresh.





Re: PowerPoint extractor

Posted by Nick Burch <ni...@torchbox.com>.
On Mon, 26 Jun 2006, Suba Suresh wrote:
> I can go to the link and upload the file to bugzilla. Is there any
> procedure I have to follow? What is the link to bugzilla?

Just follow the "Bug Database" link from the sidebar when at
http://jakarta.apache.org/poi/. That said, I've updated the slide building
code today, so your problem might now be fixed. Try a new SVN build, and
report back :)

> As an aside, I am trying to do the same with a Word document using the
> POI hdf library. I just want to extract the text. How can I do it

You'll be better off with hwpf. See another post to the list today for a
guide.

> and also how can I extract metadata from all the Microsoft format
> files.

For that, you'll want hpsf:
	http://jakarta.apache.org/poi/hpsf/index.html

Nick



Re: PowerPoint extractor

Posted by Suba Suresh <su...@wolfram.com>.
I can go to the link and upload the file to bugzilla. Is there any 
procedure I have to follow? What is the link to bugzilla?

As an aside, I am trying to do the same with a Word document using the 
POI hdf library. I just want to extract the text. How can I do that, and 
how can I extract metadata from all the Microsoft format files?

thanks,
suba suresh





Re: PowerPoint extractor

Posted by Nick Burch <ni...@torchbox.com>.
On Mon, 26 Jun 2006, Suba Suresh wrote:
> but I am getting the following exceptions. What am I doing wrong?

Can you upload your problem powerpoint file to bugzilla? We're shortly 
going to be changing the block of code this broke in, and that way we can 
be sure we've fixed this bug (along with a couple of others)

Nick
