Posted to java-user@lucene.apache.org by sreedevi s <sr...@gmail.com> on 2015/02/10 09:24:18 UTC

Lucene search in attachments

Hi,
    What is the best way to search in attachments with Lucene? I am new
to Lucene and I am using version 4.10.2. By making use of Tika, I know I
can convert files to text and then index the text as another field. But for
large files that will not be the ideal solution. I believe the maximum number
of characters per field is 10,000. So what would be the ideal method to
search attachments then?


Best Regards,
Sreedevi S

RE: Lucene search in attachments

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,

> -----Original Message-----
> From: sreedevi s [mailto:sreedevi.payikkad@gmail.com]
> Sent: Tuesday, February 10, 2015 10:46 AM
> To: java-user@lucene.apache.org
> Subject: Re: Lucene search in attachments
> 
> Hi Uwe,
> Thank you for the update. I will remove the limit in Tika and check.
> So, my understanding is that Lucene currently does not have any restriction
> on the number of terms per field, but when a term is longer than 2^15 bytes
> it is silently ignored at indexing time; a message is logged to infoStream
> if enabled, but no error is thrown.

Yes. There is only a limit on a single term *after* text analysis. But keep in mind that some Analyzers, like StandardAnalyzer, have other limits way below that one. On the other hand, if you index your documents as "StringField" or with KeywordAnalyzer, there is no tokenization done at all; in that case the whole field is indexed as a single term, but that’s not useful for full-text search anyway. So use a suitable analyzer!

> Is that right?

Yes!

Uwe
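
To illustrate the distinction above, here is a minimal sketch in plain Java. It does not call Lucene: tokenize() and keyword() are hypothetical stand-ins for a tokenizing analyzer and for StringField/KeywordAnalyzer, and MAX_TERM_BYTES is set to 32766 on the assumption that it matches IndexWriter.MAX_TERM_LENGTH in Lucene 4.x (just under the 2^15 bytes mentioned in this thread). It only shows why a large text is unproblematic once it is split into many short terms, while the same text as one single term exceeds the per-term limit.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Plain-Java illustration only; none of this is Lucene API. */
final class TermLengthSketch {

    // Assumption: same value as IndexWriter.MAX_TERM_LENGTH in Lucene 4.x.
    static final int MAX_TERM_BYTES = 32766;

    /** Hypothetical stand-in for a tokenizing analyzer: lowercase, split on non-alphanumerics. */
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<String>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    /** Hypothetical stand-in for StringField/KeywordAnalyzer: the whole value is one term. */
    static List<String> keyword(String text) {
        return Collections.singletonList(text);
    }

    /** The limit applies to the encoded bytes of a term, not to its character count. */
    static int utf8Length(String term) {
        return term.getBytes(StandardCharsets.UTF_8).length;
    }

    /** True if every term is within the per-term byte limit. */
    static boolean allTermsFit(List<String> terms) {
        for (String t : terms) {
            if (utf8Length(t) > MAX_TERM_BYTES) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10000; i++) {
            sb.append("lucene ");
        }
        String text = sb.toString(); // 70,000 characters in total

        System.out.println(allTermsFit(tokenize(text))); // true: 10,000 terms of 6 bytes each
        System.out.println(allTermsFit(keyword(text)));  // false: one term of 70,000 bytes
    }
}
```

The total size of the field never enters the check; only the size of each individual term does.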



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Lucene search in attachments

Posted by sreedevi s <sr...@gmail.com>.
Hi Uwe,
Thank you for the update. I will remove the limit in Tika and check.
So, my understanding is that Lucene currently does not have any restriction
on the number of terms per field, but when a term is longer than 2^15 bytes
it is silently ignored at indexing time; a message is logged to infoStream
if enabled, but no error is thrown.
Is that right?



Best Regards,
Sreedevi S


RE: Lucene search in attachments

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,

There is no restriction to 10,000 characters inside Lucene, and there never was one. In earlier Lucene versions (a long time ago) there was an implicit restriction to 10,000 TERMS (not characters). This is no longer the case. If you still want this, you have to wrap your Analyzer: http://goo.gl/SRf45A

If you have a limitation to 10,000 characters somewhere, it might be your TIKA text extraction.
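
As a rough sketch of what "wrapping" to cap the number of terms means, the plain-Java example below wraps a token iterator and stops after a fixed number of tokens. This is not Lucene code and the names are hypothetical. Lucene does ship a LimitTokenCountAnalyzer that wraps another Analyzer for this purpose, but whether the shortened link above points to it has not been checked here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Plain-Java illustration only; none of this is Lucene API. */
final class TokenLimitSketch {

    /** Wraps a token source and stops emitting after maxTokens tokens. */
    static Iterator<String> limit(final Iterator<String> tokens, final int maxTokens) {
        return new Iterator<String>() {
            private int emitted = 0;

            @Override
            public boolean hasNext() {
                return emitted < maxTokens && tokens.hasNext();
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                emitted++;
                return tokens.next();
            }

            @Override
            public void remove() {
                throw new UnsupportedOperationException();
            }
        };
    }

    /** Drains an iterator into a list, for inspection. */
    static List<String> collect(Iterator<String> it) {
        List<String> out = new ArrayList<String>();
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("a", "b", "c", "d");
        // Only the first two tokens pass through; the rest are dropped.
        System.out.println(collect(limit(tokens.iterator(), 2))); // [a, b]
    }
}
```

Tokens beyond the cap are dropped without any error, which is why such a limit makes the tail of a large document unsearchable.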

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de




Re: Lucene search in attachments

Posted by sreedevi s <sr...@gmail.com>.
No, David. I can increase the value, or set it to -1 to make it unlimited,
but I still cannot be sure that my whole text will be searchable. That is
still a problem with large files, because only the part that is indexed
will be searchable.
I was looking for some alternatives.

Best Regards,
Sreedevi S


Re: Lucene search in attachments

Posted by David Pilato <da...@pilato.fr>.
I don’t understand.
If you don’t raise this restriction to a higher value (or to -1), not all of the text will be extracted, so only a subset of the text will be indexed.
Non-indexed parts of the text won’t be searchable.

Did I misunderstand your question?
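
The effect of such an extraction limit can be sketched in plain Java as follows. This is not Tika code: readUpTo() and truncate() are hypothetical helpers, and the only assumption carried over from the thread is that a limit of -1 means "no limit". The point is that anything past the limit never reaches the indexer and so cannot be found.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Plain-Java illustration only; none of this is Tika or Lucene API. */
final class ExtractionLimitSketch {

    /** Reads at most maxChars characters from the reader; a negative maxChars means unlimited. */
    static String readUpTo(Reader in, int maxChars) throws IOException {
        StringBuilder out = new StringBuilder();
        char[] buf = new char[4096];
        while (maxChars < 0 || out.length() < maxChars) {
            int want = maxChars < 0
                    ? buf.length
                    : Math.min(buf.length, maxChars - out.length());
            int n = in.read(buf, 0, want);
            if (n < 0) {
                break; // end of input
            }
            out.append(buf, 0, n);
        }
        return out.toString();
    }

    /** Convenience wrapper for in-memory text. */
    static String truncate(String text, int maxChars) {
        try {
            return readUpTo(new StringReader(text), maxChars);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String extracted = "alpha beta gamma needle";

        // With a limit of 10 characters, "needle" is cut off before indexing.
        System.out.println(truncate(extracted, 10).contains("needle")); // false

        // With -1 (unlimited), the whole text is kept.
        System.out.println(truncate(extracted, -1).contains("needle")); // true
    }
}
```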

-- 
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet <https://twitter.com/dadoonet> | @elasticsearchfr <https://twitter.com/elasticsearchfr> | @scrutmydocs <https://twitter.com/scrutmydocs>





Re: Lucene search in attachments

Posted by sreedevi s <sr...@gmail.com>.
Thank you, David. Yes, it restricts the text to 10,000 characters.
But for large files, what could be done in that case?

Best Regards,
Sreedevi S


Re: Lucene search in attachments

Posted by David Pilato <da...@pilato.fr>.
If you don’t index content, you won’t be able to search for it, I guess.
That said, Tika can apply a limit on the number of extracted characters. See indexedChars below:

tika().parseToString(new BytesStreamInput(content, false), metadata, indexedChars);

[1] https://github.com/elasticsearch/elasticsearch-mapper-attachments/blob/master/src/main/java/org/elasticsearch/index/mapper/attachment/AttachmentMapper.java#L456

-- 
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet <https://twitter.com/dadoonet> | @elasticsearchfr <https://twitter.com/elasticsearchfr> | @scrutmydocs <https://twitter.com/scrutmydocs>


