
Posted to java-user@lucene.apache.org by Prasad KVSH <Pr...@ness.com> on 2012/03/07 10:40:26 UTC

Help on DOCX and XLSX

Dear All,

 

We have started using Lucene version 3.0.3. We have different types of
documents (PDF, XLS, XLSX, DOC, DOCX, TXT, etc.) in a specified folder.

We have created an index on these files (using IndexFiles.java). The
index takes 17.2 MB for 69.4 MB of documents. It was created using
StandardAnalyzer with limited index fields. We are able to search for a
given text only in PDF (text content only), *.doc and *.xls (MS Office
97-2003 formats) files.

 

Now I need help with indexing .docx and .xlsx files. How can I run
indexing on these files? They are currently ignored when we do a string
search.

 

Writer is defined as below:

IndexWriter writer = new IndexWriter(
    FSDirectory.open(INDEX_DIR),
    new StandardAnalyzer(Version.LUCENE_CURRENT),
    true,
    IndexWriter.MaxFieldLength.LIMITED);

 

Another question is about the size of the index folder: can we optimize
(reduce) it?

 

Thanks

Prasad

Re: Help on DOCX and XLSX

Posted by Ian Lea <ia...@gmail.com>.

So you want to index different fields and search on those fields, and
are asking whether you can do that in Lucene?  The answer is yes.

I still think you should look at Solr, but if you are determined to use
Lucene, get hold of a copy of the second edition of Lucene in Action:
http://www.manning.com/hatcher3/
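
A rough sketch of what indexing and searching on such key fields could
look like against the Lucene 3.0.x API, assuming the Lucene core jar is
on the classpath. The field names (contents, docTypeId, docDate), the
yyyyMMdd integer encoding of the date, and the class name
MetadataFieldsSketch are assumptions made up for illustration.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class MetadataFieldsSketch {

    // One Lucene document: analyzed text plus two key fields (hypothetical names).
    static Document buildDoc(String content, String typeId, int yyyymmdd) {
        Document doc = new Document();
        doc.add(new Field("contents", content, Field.Store.NO, Field.Index.ANALYZED));
        // NOT_ANALYZED keeps the ID as a single exact-match term.
        doc.add(new Field("docTypeId", typeId, Field.Store.YES, Field.Index.NOT_ANALYZED));
        // NumericField makes the date usable in range queries.
        doc.add(new NumericField("docDate", Field.Store.YES, true).setIntValue(yyyymmdd));
        return doc;
    }

    // Small in-memory index with two sample documents (made-up data).
    static Directory sampleIndex() throws Exception {
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_30),
                true, IndexWriter.MaxFieldLength.LIMITED);
        writer.addDocument(buildDoc("quarterly sales report", "INVOICE", 20120115));
        writer.addDocument(buildDoc("quarterly sales report", "MEMO", 20120220));
        writer.close();
        return dir;
    }

    // Text search restricted by document type and an inclusive date range.
    static int countHits(Directory dir, String text, String typeId, int from, int to)
            throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
        BooleanQuery query = new BooleanQuery();
        query.add(new QueryParser(Version.LUCENE_30, "contents", analyzer).parse(text),
                BooleanClause.Occur.MUST);
        query.add(new TermQuery(new Term("docTypeId", typeId)), BooleanClause.Occur.MUST);
        query.add(NumericRangeQuery.newIntRange("docDate", from, to, true, true),
                BooleanClause.Occur.MUST);
        IndexSearcher searcher = new IndexSearcher(dir, true); // read-only searcher
        try {
            return searcher.search(query, 10).totalHits;
        } finally {
            searcher.close();
        }
    }

    public static void main(String[] args) throws Exception {
        Directory dir = sampleIndex();
        System.out.println(countHits(dir, "sales", "INVOICE", 20120101, 20120131));
    }
}
```

Because the type ID and date live in the same index as the text, the
filtering happens inside the search itself, so a separate database
lookup afterwards should not be needed for these keys.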


--
Ian.



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

RE: Help on DOCX and XLSX

Posted by Prasad KVSH <Pr...@ness.com>.

Hi Ian,

Thanks for your quick reply.

Our documents have the following common key information:

1. Document Type ID
2. Document Date
3. Document Author ID
4. Document Status
5. Document Group ID

While creating the index, we would like to add the above key values
alongside the content, so that a search on Document Type ID or a date
range does not have to read the entire index.  Can we implement this
approach?

Currently the text search is performed on the index, and then we filter
the documents by reading each document's record from a database table
for the above key values.

Thanks
Prasad



Re: Help on DOCX and XLSX

Posted by Ian Lea <ia...@gmail.com>.

You'll have to find something that parses the formats you are
interested in and extracts the text you want.  Apache Tika comes to
mind.
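
For background on why parsers for the older formats skip these files:
.docx and .xlsx are ZIP archives of XML parts rather than the binary
format used by .doc and .xls. A parser library such as Tika takes care
of this, along with many other formats. Purely as an illustration, and
not how Tika itself is implemented, here is a minimal sketch that pulls
paragraph text out of a .docx using only the JDK. The class and method
names (DocxTextSketch, extractText, tinyDocx) are made up for this
example; the part name word/document.xml and the namespace are those of
the Office Open XML format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class DocxTextSketch {
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of each w:p paragraph, one per line, or "" if the
    // archive has no word/document.xml part. Sketch only: paragraphs nested
    // inside text boxes may be counted twice.
    static String extractText(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.getName().equals("word/document.xml")) {
                    continue;
                }
                // Copy the entry first so the XML parser cannot close the zip stream.
                ByteArrayOutputStream part = new ByteArrayOutputStream();
                byte[] buf = new byte[8192];
                int n;
                while ((n = zip.read(buf)) != -1) {
                    part.write(buf, 0, n);
                }
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                Document xml = factory.newDocumentBuilder()
                        .parse(new ByteArrayInputStream(part.toByteArray()));
                StringBuilder text = new StringBuilder();
                NodeList paragraphs = xml.getElementsByTagNameNS(W_NS, "p");
                for (int i = 0; i < paragraphs.getLength(); i++) {
                    NodeList runs = ((Element) paragraphs.item(i))
                            .getElementsByTagNameNS(W_NS, "t");
                    for (int j = 0; j < runs.getLength(); j++) {
                        text.append(runs.item(j).getTextContent());
                    }
                    text.append('\n');
                }
                return text.toString();
            }
        }
        return "";
    }

    // Builds a tiny in-memory archive with a single paragraph, for demonstration.
    static byte[] tinyDocx(String body) throws Exception {
        String xml = "<w:document xmlns:w=\"" + W_NS + "\"><w:body><w:p><w:r><w:t>"
                + body + "</w:t></w:r></w:p></w:body></w:document>";
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write(xml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] docx = tinyDocx("hello lucene");
        System.out.print(extractText(new ByteArrayInputStream(docx)));
    }
}
```

The returned string is what you would hand to Lucene as the content
field. For .xlsx the cell text lives in other parts (for example
xl/sharedStrings.xml), so a real indexer is better off delegating to a
parser library than extending this by hand.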

Why are you using such an old version of Lucene?  Why aren't you using
Solr?  That might just work for you out of the box.  See also
http://www.lucidimagination.com/devzone/technical-articles/content-extraction-tika

As for the size, I wouldn't worry about it.  Disk space is cheap.  If
you really do care, scan the FAQ at
http://wiki.apache.org/lucene-java/LuceneFAQ.  Lots of useful info on
all sorts of things.

--
Ian.

