Posted to users@jackrabbit.apache.org by Patrick Welfringer <pa...@gmail.com> on 2013/12/18 10:51:06 UTC

Can Lucene be configured to avoid downloading file contents?

Hi,



*Can anyone familiar with Lucene please share their insight?*

The question is this: *is there any way to configure Lucene to index only
certain whitelisted metadata*, or exclude blacklisted metadata?



Indeed, we believe that excluding the “file” metadata could dramatically
reduce the time it takes Lucene to download and process the large number of
PDF files in our particular setup.



We don’t need file contents to be indexed, only other metadata like
“creation date”, “keywords” etc.
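
To make the whitelist idea concrete, here is a minimal sketch in plain Java. It is not Jackrabbit or Lucene API; the class name, the method and the property names ("jcr:created", "keywords", "jcr:data") are assumptions chosen for illustration only. It simply shows what we mean: only whitelisted properties should ever reach the indexer, and the binary property should be skipped entirely.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class MetadataWhitelist {

    // Hypothetical whitelist; the real property names depend on the node types in use.
    static final Set<String> WHITELIST = Set.of("jcr:created", "keywords");

    /** Returns only the whitelisted properties, preserving their original order. */
    static Map<String, String> select(Map<String, String> properties) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : properties.entrySet()) {
            if (WHITELIST.contains(entry.getKey())) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> node = new LinkedHashMap<>();
        node.put("jcr:created", "2013-12-18T10:51:06Z");
        node.put("keywords", "invoice, 2013");
        node.put("jcr:data", "<large PDF binary>"); // stands in for the file contents
        // Only the whitelisted property names are printed; "jcr:data" is dropped.
        System.out.println(select(node).keySet());
    }
}
```

Whether the indexer can be told to apply such a filter before it fetches the binary is exactly what we are asking.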

The “Luke” tool tells us that none of the file contents are indexed. Yet
during the hour-long indexing run, we see all of the metadata being downloaded
and written to disk, including document contents.



If you can help us find a way to prevent Lucene from indexing the entire
Jackrabbit repository, you’ll cheer up many mailing list subscribers who
have similar issues!



Cheers,

Patrick

Re: Can Lucene be configured to avoid downloading file contents?

Posted by Patrick Welfringer <pa...@gmail.com>.
Hello Nilay,

thank you for your suggestion!


From what I understand about the EmptyParser, it doesn't prevent Lucene
from downloading the data handled by this EmptyParser - it just doesn't
do anything with that data.
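
To illustrate the distinction I mean, here is a small self-contained sketch in plain Java. Everything in it is hypothetical: ContentParser is not the Tika Parser interface, CountingSource merely stands in for a repository binary, and the two index methods are assumptions about how an indexer might behave, not a description of what Jackrabbit actually does. The point is only that a parser which ignores its input cannot, on its own, stop the caller from fetching that input first.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.function.Supplier;

class LazyBinaryDemo {

    /** Hypothetical parser hook; NOT the Tika Parser interface. */
    interface ContentParser {
        String parse(Supplier<InputStream> content);
    }

    /** Stand-in for a repository binary; counts how often it is opened ("downloaded"). */
    static class CountingSource implements Supplier<InputStream> {
        private final byte[] bytes;
        int opened = 0;

        CountingSource(byte[] bytes) {
            this.bytes = bytes;
        }

        @Override
        public InputStream get() {
            opened++;
            return new ByteArrayInputStream(bytes);
        }
    }

    /** Opens the content and reads it completely. */
    static byte[] readAll(Supplier<InputStream> content) {
        try (InputStream in = content.get()) {
            return in.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns no text and never opens the content. */
    static final ContentParser EMPTY = content -> "";

    /** Reads everything, as a full-text extractor would. */
    static final ContentParser FULL =
            content -> new String(readAll(content), StandardCharsets.UTF_8);

    /** Assumed behaviour A: the indexer spools the binary before calling the parser. */
    static String indexEagerly(CountingSource source, ContentParser parser) {
        byte[] spooled = readAll(source);
        return parser.parse(() -> new ByteArrayInputStream(spooled));
    }

    /** Assumed behaviour B: the indexer hands over the source; only the parser opens it. */
    static String indexLazily(CountingSource source, ContentParser parser) {
        return parser.parse(source);
    }

    public static void main(String[] args) {
        CountingSource eager = new CountingSource("pdf bytes".getBytes(StandardCharsets.UTF_8));
        CountingSource lazy = new CountingSource("pdf bytes".getBytes(StandardCharsets.UTF_8));
        indexEagerly(eager, EMPTY);
        indexLazily(lazy, EMPTY);
        System.out.println("eager opened: " + eager.opened);
        System.out.println("lazy opened: " + lazy.opened);
    }
}
```

With the empty parser, the eager variant still opens the binary once, while the lazy variant never opens it. If the indexer behaves like the eager variant, swapping in an EmptyParser would save the processing time but not the download.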


Am I wrong? Could it really prevent the file contents from being downloaded
by the Lucene indexer?


As an additional difficulty, we use an old version of Jackrabbit (1.6.5),
which does not seem to include the Tika parser classes. Do you think we
can still use the Tika EmptyParser (by adding it manually) or create our
own EmptyParser?



Just to be clear, the goal here is to not only avoid processing the data,
but even to stop Lucene from downloading this data.



Kind regards,

Patrick




On 18 December 2013 10:59, Nilay Parmar <ni...@cybage.com> wrote:

> Hello.
>
> Try using EmptyParser in your tika-config file for the document types
> whose content you do not want indexed.
>
> Thanks and regards,
> Nilay Parmar

RE: Can Lucene be configured to avoid downloading file contents?

Posted by Nilay Parmar <ni...@cybage.com>.
Hello.

Try using EmptyParser in your tika-config file for the document types whose content you do not want indexed.

Thanks and regards,
Nilay Parmar

