Posted to java-user@lucene.apache.org by Santosh <sa...@softprosys.com> on 2004/08/24 14:15:47 UTC

worddoucments search

Can Lucene search Word documents? If so, please give me information about it.

regards
Santosh kumar


-----------------------SOFTPRO DISCLAIMER------------------------------

Information contained in this E-MAIL and any attachments are
confidential being  proprietary to SOFTPRO SYSTEMS  is 'privileged'
and 'confidential'.

If you are not an intended or authorised recipient of this E-MAIL or
have received it in error, You are notified that any use, copying or
dissemination  of the information contained in this E-MAIL in any
manner whatsoever is strictly prohibited. Please delete it immediately
and notify the sender by E-MAIL.

In such a case reading, reproducing, printing or further dissemination
of this E-MAIL is strictly prohibited and may be unlawful.

SOFTPRO SYSYTEMS does not REPRESENT or WARRANT that an attachment
hereto is free from computer viruses or other defects. 

The opinions expressed in this E-MAIL and any ATTACHEMENTS may be
those of the author and are not necessarily those of SOFTPRO SYSTEMS.
------------------------------------------------------------------------

Textmining.org IS NOT POI (was Re: worddoucments search)

Posted by Ryan Ackley <sa...@cfl.rr.com>.
Go to http://www.textmining.org for a platform independent library to
extract text from Word documents. I wrote 99.99% of the Word component of
POI and all of the textmining.org library.

I have seen several discussions and web pages that point to textmining.org
and say that I simply wrap POI classes (for example, the JGuru FAQ at
http://www.jguru.com). This is totally false.

* The textmining.org library is optimized for extracting text. POI is not.
* The textmining.org library supports extracting text from Word 6/95. POI
does not.
* The textmining.org library does not extract deleted text that is still in
the document for the purposes of revision marking. POI does not handle this.

-Ryan Ackley



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: worddoucments search

Posted by Chandan Tamrakar <ch...@ccnep.com.np>.
Please look at the Apache POI project:
http://jakarta.apache.org

Text can be extracted from Word documents using the POI APIs and then indexed.

regards



Introduction to Lucene [was Re: worddoucments search]

Posted by Steven Rowe <sa...@syr.edu>.
A collection of links to introductory level Lucene articles (including 
one in simplified Chinese and one in Turkish) is available on the 
Lucene Wiki at:

<URL:http://wiki.apache.org/jakarta-lucene/IntroductionToLucene>

Steve



Re: worddoucments search

Posted by Chandan Tamrakar <ch...@ccnep.com.np>.
Santosh,
please read the Lucene API documentation.

Once you can get a string from a Word document using the textmining.org API,
try writing it to a temporary file and indexing that.

If you are able to index PDFs and plain files, what trouble would you face
indexing a string extracted from Word documents? Please also read and search
the previous postings; they should help you understand Lucene better.




RE: worddoucments search

Posted by Karthik N S <ka...@controlnet.co.in>.
Hi Santosh,

Please...

If you have downloaded the Lucene (zip) bundle, first try reading
docs/index.html, which is in the bundle. If you are still in trouble,
then approach the forum for help. [Unnecessarily asking silly questions
will be ignored.]


Karthik






Re: worddoucments search

Posted by Otis Gospodnetic <ot...@yahoo.com>.
That part you have to do yourself.  It is easy: create a new Document,
create an appropriate Field, give it a name and the string value you got
from the textmining.org library, add the Field to your Document, and then
add the Document to the index with IndexWriter.

Look at one of the articles about Lucene to get started.  I wrote one
called something like "Introduction to Text Indexing with Lucene".  You
probably want to read that one to get going.

Otis
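
As a rough illustration of the flow described above, the following toy index
in plain Java takes an already-extracted string, attaches it to a named
document, and adds it to an index that can then be searched. This is NOT
Lucene's API: in Lucene the corresponding pieces are Document, Field and
IndexWriter, and their exact signatures depend on the Lucene version. The
class ToyIndex and its methods are hypothetical names invented for this
sketch, and the sample strings stand in for text extracted from Word files.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical stand-in for an index; not part of Lucene.
class ToyIndex {
    // Maps each lower-cased term to the ids of the documents containing it.
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    // Document names, addressed by document id.
    private final List<String> names = new ArrayList<>();

    // Adds one document: 'text' plays the role of the string extracted
    // from a Word file, 'name' identifies the source file.
    public int addDocument(String name, String text) {
        int id = names.size();
        names.add(name);
        // Split on anything that is not a letter or a digit.
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
            }
        }
        return id;
    }

    // Returns the names of all documents containing the single term,
    // in the order the documents were added.
    public List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        for (int id : postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>())) {
            hits.add(names.get(id));
        }
        return hits;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addDocument("report.doc", "Quarterly sales report");
        index.addDocument("memo.doc", "Sales meeting memo");
        System.out.println(index.search("sales")); // prints [report.doc, memo.doc]
        System.out.println(index.search("memo"));  // prints [memo.doc]
    }
}
```

A real Lucene index also handles analysis, scoring and on-disk storage, so
this sketch only shows the shape of the "extract, wrap, add" steps.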



Re: worddoucments search

Posted by Santosh <sa...@softprosys.com>.
I have gone through textmining.org and I am able to extract the text as a
string. But how can I get it into Lucene's document format?


Re: worddoucments search

Posted by Otis Gospodnetic <ot...@yahoo.com>.
As I just answered in a separate email to Ryan, we used the textmining.org
library, too, as an example of something that is easier to use than
POI.  It's been a while since I wrote that chapter, so it slipped my
mind when I replied.  Yes, use textmining.org first; you'll be able to
include it in your code in 2 minutes.  Good stuff.

Otis





Re: worddoucments search

Posted by Ryan Ackley <sa...@cfl.rr.com>.
Otis,

Why didn't you use the textmining.org library? You even asked me to fix a
bug for the book, which I did. Also, the code would have been about three
lines.

-Ryan



Re: worddoucments search

Posted by Don Vaillancourt <do...@webimpact.com>.
Woo hoo for Otis and Erik!



-- 
*Don Vaillancourt
Director of Software Development
*
*WEB IMPACT INC.*
phone: 416-815-2000 ext. 245
fax: 416-815-2001
email: donv@web-impact.com <ma...@webimpact.com>
web: http://www.web-impact.com



/ This email message is intended only for the addressee(s)
and contains information that may be confidential and/or
copyright. If you are not the intended recipient please
notify the sender by reply email and immediately delete
this email. Use, disclosure or reproduction of this email
by anyone other than the intended recipient(s) is strictly
prohibited. No representation is made that this email or
any attachments are free of viruses. Virus scanning is
recommended and is the responsibility of the recipient.
/

Re: worddoucments search

Posted by Otis Gospodnetic <ot...@yahoo.com>.
For Lucene in Action, Erik and I wrote a little extensible framework for
indexing various documents, including MS Word.  We used POI, so the
solution works on Winblows, UNIX/Linux, OSX....  I think the code is
a bit too big for the list, but the book will be out soon.  Erik and I
are going through copy and tech editing right now.  POI:
http://jakarta.apache.org/poi .

Otis




Re: worddoucments search

Posted by Don Vaillancourt <do...@webimpact.com>.
I could be wrong, but I don't think that there is an indexer for Word
documents.

There's a Python version of Lucene called Lupy with a Python indexer for
all sorts of document types (http://www.methods.co.nz/docindexer/).
Would anyone be willing to port those over?  Although the MSWord indexer
only works on MSWindows and you may need MSWord for it to work.  Man,
that's no good.

I think that we'd need to ask the OpenOffice people for help on this.


Santosh wrote:

>Can lucene be able to search word documents? if so please give me information about it
>
>regards
>Santosh kumar
>
>


-- 
*Don Vaillancourt
Director of Software Development
*
*WEB IMPACT INC.*
phone: 416-815-2000 ext. 245
fax: 416-815-2001
email: donv@web-impact.com <ma...@webimpact.com>
web: http://www.web-impact.com



/ This email message is intended only for the addressee(s)
and contains information that may be confidential and/or
copyright. If you are not the intended recipient please
notify the sender by reply email and immediately delete
this email. Use, disclosure or reproduction of this email
by anyone other than the intended recipient(s) is strictly
prohibited. No representation is made that this email or
any attachments are free of viruses. Virus scanning is
recommended and is the responsibility of the recipient.
/

Re: worddoucments search

Posted by Ryan Ackley <sa...@cfl.rr.com>.
Code example for textmining.org library:

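import java.io.FileInputStream;
// WordExtractor comes from the textmining.org library
FileInputStream in = new FileInputStream("test.doc");
WordExtractor extractor = new WordExtractor();

// Extract the document text from the stream
String str = extractor.extractText(in);
in.close();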
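
A minimal sketch of how a call like the one above might be wrapped so the stream is always closed, even when extraction fails. The TextExtractor interface and the ExtractToText class are hypothetical names introduced only for this illustration; they are not part of the textmining.org library or of POI. An adapter around a real extractor would implement the interface, and the returned string is what would then be handed to the indexer.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical interface; an adapter around a real extractor would implement it.
interface TextExtractor {
    String extractText(InputStream in) throws Exception;
}

final class ExtractToText {
    // Opens the file, hands the stream to the extractor, and always closes it.
    static String extract(Path file, TextExtractor extractor) throws Exception {
        try (InputStream in = Files.newInputStream(file)) {
            return extractor.extractText(in);
        }
    }
}
```

Keeping the extractor behind a small interface like this also makes it possible to swap extraction libraries without touching the indexing code.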




Re: worddoucments search

Posted by Ryan Ackley <sa...@cfl.rr.com>.
Natarajan,

This won't work because there is no class called WordTextPiece in POI. This
is just code regurgitated from the internals of the textmining.org
libraries. When you steal my code at least give me credit. It is required by
the Apache license I distribute the textmining.org libraries under.

-Ryan



RE: worddoucments search

Posted by "Natarajan.T" <na...@crimsonlogic.co.in>.
Hi Santosh,

Try the code below (POI.jar should be on your classpath).


// Uses POIFSFileSystem, DocumentEntry and DocumentInputStream
// (org.apache.poi.poifs.filesystem) and LittleEndian (org.apache.poi.util),
// plus java.io and java.util.
// findText() and WordTextPiece are helpers that are not shown here.
public String getContent(InputStream reader) throws IOException {
    ArrayList text = new ArrayList();
    POIFSFileSystem fsys = new POIFSFileSystem(reader);

    // Read the main "WordDocument" stream into memory
    DocumentEntry headerProps =
        (DocumentEntry) fsys.getRoot().getEntry("WordDocument");
    DocumentInputStream din = fsys.createDocumentInputStream("WordDocument");
    byte[] header = new byte[headerProps.getSize()];
    din.read(header);
    din.close();

    // Get the information we need from the header
    int info = LittleEndian.getShort(header, 0xa);
    boolean useTable1 = (info & 0x200) != 0;

    // Get the location of the piece table
    int complexOffset = LittleEndian.getInt(header, 0x1a2);

    String tableName = useTable1 ? "1Table" : "0Table";

    // Read the table stream that holds the piece table
    DocumentEntry table = (DocumentEntry) fsys.getRoot().getEntry(tableName);
    byte[] tableStream = new byte[table.getSize()];
    din = fsys.createDocumentInputStream(tableName);
    din.read(tableStream);
    din.close();

    // Collect the text pieces described by the piece table
    int multiple = findText(tableStream, complexOffset, text);

    // Decode every piece and append it, so the whole text is returned
    // rather than only the last piece
    StringBuffer sb = new StringBuffer();
    int size = text.size();
    for (int x = 0; x < size; x++) {
        WordTextPiece nextPiece = (WordTextPiece) text.get(x);
        int start = nextPiece.getStart();
        int length = nextPiece.getLength();

        if (nextPiece.usesUnicode()) {
            sb.append(new String(header, start, length * multiple, "UTF-16LE"));
        } else {
            sb.append(new String(header, start, length, "ISO-8859-1"));
        }
    }

    reader.close();
    return sb.toString();
}
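
The header flag test and the piece-decoding step in the code above can be tried in isolation with only the standard library. This is a rough sketch, not the textmining.org or POI implementation: TextPiece and WordPieceSketch are hypothetical names made up for the illustration, and the factor of 2 bytes per character for Unicode pieces is an assumption standing in for the "multiple" value that findText() returns.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical stand-in for the piece class the code above relies on.
final class TextPiece {
    final int start;       // byte offset of the piece in the main stream
    final int length;      // number of characters in the piece
    final boolean unicode; // true: UTF-16LE, false: one byte per character

    TextPiece(int start, int length, boolean unicode) {
        this.start = start;
        this.length = length;
        this.unicode = unicode;
    }
}

final class WordPieceSketch {
    // Mirrors the flag test above: bit 0x200 of the little-endian short at 0xa.
    static String tableName(byte[] header) {
        int info = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN).getShort(0xa);
        return (info & 0x200) != 0 ? "1Table" : "0Table";
    }

    // Decodes every piece and concatenates the results in order.
    static String decode(byte[] stream, List<TextPiece> pieces) {
        StringBuilder sb = new StringBuilder();
        for (TextPiece p : pieces) {
            if (p.unicode) {
                // Assumption: 2 bytes per character for Unicode pieces
                sb.append(new String(stream, p.start, p.length * 2, StandardCharsets.UTF_16LE));
            } else {
                sb.append(new String(stream, p.start, p.length, StandardCharsets.ISO_8859_1));
            }
        }
        return sb.toString();
    }
}
```

Appending each decoded piece to a single buffer is what lets the method return the text of the whole document instead of one fragment.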


Regards,
Natarajan.



-----Original Message-----
From: Santosh [mailto:santosh.s@softprosys.com] 
Sent: Tuesday, August 24, 2004 5:46 PM
To: Lucene Users List
Subject: worddoucments search

Can lucene be able to search word documents? if so please give me
information about it

regards
Santosh kumar

