Posted to java-user@lucene.apache.org by nikhil desai <ni...@gmail.com> on 2013/06/10 19:24:22 UTC

Lucene Indexes explanation

Hello,

This is my first post to this group.

I have been using Lucene recently, and I have a question.

Where can I find a good explanation of indexes? Or rather, of how indexing
(not really the mathematical aspect) happens in Lucene: which attributes
(charTerm, offset, etc.) come into play, and how is it implemented? I checked
"Lucene in Action" and could not find much on the actual indexing or on which
classes are used.

I appreciate your help.

Thanks
NIKHIL

Re: Lucene Indexes explanation

Posted by Jack Krupansky <ja...@basetechnology.com>.
Sorry, but you really are going to need to work on your "Lucene Basics" 
before you tackle such an ambitious effort.

The Lucene Javadoc, the Solr wikis, Stack Overflow, the blogs of McCandless
et al., and Google search in general will cover a lot of the ground,
including those mysterious terms:

- Indexed terms
- Stored values
- Payloads
- DocValues

-- Jack Krupansky

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Lucene Indexes explanation

Posted by nikhil desai <ni...@gmail.com>.
I couldn't get much from what you said. Could you please elaborate? I would
appreciate it.

Re: Lucene Indexes explanation

Posted by Jack Krupansky <ja...@basetechnology.com>.
Your stored value could be very different from your indexed (searchable) 
value. You can also associate payloads with an indexed term. And there are 
DocValues as well.

-- Jack Krupansky
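
The distinction above can be sketched with a toy model in plain Java. This is
not Lucene code: ToyFieldStore and its methods are hypothetical names, the
payload is a String where a real payload would be bytes, and the per-document
number (here the text length) is an arbitrary stand-in for a DocValues column.
The point is only that what is searched, what is returned, and what rides
along with a term are kept separately.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only; none of these types are Lucene classes.
final class ToyFieldStore {
    // Stored values: returned for a hit, never searched.
    private final List<String> storedValues = new ArrayList<>();
    // Indexed terms: term -> (docId -> payload attached to that term).
    private final Map<String, Map<Integer, String>> postings = new HashMap<>();
    // One number per document, addressed by docId (the DocValues idea).
    private final List<Long> docValues = new ArrayList<>();

    // Adds a document whose searchable terms differ from its stored text.
    int add(String stored, Map<String, String> termsWithPayloads, long docValue) {
        int docId = storedValues.size();
        storedValues.add(stored);
        docValues.add(docValue);
        for (Map.Entry<String, String> e : termsWithPayloads.entrySet()) {
            postings.computeIfAbsent(e.getKey(), k -> new TreeMap<Integer, String>())
                    .put(docId, e.getValue());
        }
        return docId;
    }

    // Returns the ids of documents containing the term, in ascending order.
    List<Integer> search(String term) {
        List<Integer> ids = new ArrayList<>();
        Map<Integer, String> hits = postings.get(term);
        if (hits != null) {
            ids.addAll(hits.keySet());
        }
        return ids;
    }

    // Returns the stored (display) value of a document.
    String stored(int docId) {
        return storedValues.get(docId);
    }

    // Returns the payload recorded for the term in that document, or null.
    String payload(String term, int docId) {
        Map<Integer, String> hits = postings.get(term);
        return hits == null ? null : hits.get(docId);
    }

    // Returns the per-document column value.
    long docValue(int docId) {
        return docValues.get(docId);
    }
}
```

In this sketch a document stored as "LinkedIn is famous" could be indexed
under the term "SocialSite" with the payload "LinkedIn", so a hit on the tag
can hand back either the whole stored text or just the annotated word.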

Re: Lucene Indexes explanation

Posted by nikhil desai <ni...@gmail.com>.
Sure. Thanks, Jack.
I don't have much experience working with Lucene; however, here is what I
am trying to resolve.

I learned that custom attributes cannot be used for indexing or searching.
However, I wanted the attributes to be used for both. So I created custom
attributes and inserted them as tokens into the token stream, setting the
positionIncrement attribute to 0 for each. Since my new token stream now
contains the attributes (as tokens) and they are used during indexing, I can
search for documents based on the attributes (the tokens I newly inserted).
However, I still have an issue. By the way, I have a lot of attributes that
I need to assign to an individual token.

Example sentence: "LinkedIn is famous"
After passing through a custom analyzer and a few filters that I have
written, which append attributes to the tokens, the new token stream is
"LinkedIn Noun SocialSite famous JJ Positive". That is, LinkedIn is a noun
and also a social site, and famous is an adjective and also a positive
word; 'is' is removed because it does not make sense to index it.

This is now definitely searchable by attribute (here: Noun, SocialSite, JJ,
Positive).

However, since I put the entire text "LinkedIn is famous" in a Field when
adding the Document, a search for, say, "SocialSite" returns a Document that
has "LinkedIn is famous" as one of its fields.

Is it possible to get only "LinkedIn" as output rather than the entire
text, i.e. only the actual token (the token present in the original input)?
Another example: if I search for "Positive", I should get "famous" as output
and not the entire "LinkedIn is famous".

I know that if I put it in a Field in the document, I should be able to get
it back, but how do I add such a Field? Only when the tokens have passed
through the filters do we know which attributes are attached to them, so
when we call indexwriter.addDocument() we have no idea about the attributes.

The basic problem I see is that indexing is based on the new token stream,
which is good, but when the Document is retrieved, it has the original
token stream (the actual input), and that is what is given as output.

Does that make any sense? Or do I have a use case that does not go well
with Lucene?

Any help or comments are appreciated.
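
One way to think about getting "LinkedIn" back instead of the whole field:
give each injected tag the same character offsets as the word it annotates,
keep those offsets in the index, and slice the stored text with them at search
time. The sketch below is a toy model in plain Java, not Lucene code. TagIndex,
addDocument, search, and the hard-coded tag table are hypothetical stand-ins
for the custom filters described above, and whether Lucene's own offsets can
be wired up this way for a given setup is an assumption, not something
confirmed in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy inverted index; NOT Lucene. Each posting keeps the character span of
// the original word, even when the indexed term is an injected tag.
final class TagIndex {
    // One occurrence of a term: which document, and which span of its text.
    static final class Posting {
        final int docId;
        final int start;
        final int end;

        Posting(int docId, int start, int end) {
            this.docId = docId;
            this.start = start;
            this.end = end;
        }
    }

    // Hypothetical tagger output standing in for the custom filters.
    private static final Map<String, List<String>> TAGS = new HashMap<>();
    static {
        TAGS.put("linkedin", Arrays.asList("Noun", "SocialSite"));
        TAGS.put("famous", Arrays.asList("JJ", "Positive"));
    }

    private final List<String> storedTexts = new ArrayList<>();
    private final Map<String, List<Posting>> postings = new HashMap<>();

    // Stores the raw text, then indexes each tagged word plus its tags.
    // Tags reuse the word's span (the position-increment-0 idea).
    int addDocument(String text) {
        int docId = storedTexts.size();
        storedTexts.add(text);
        int i = 0;
        int n = text.length();
        while (i < n) {
            while (i < n && text.charAt(i) == ' ') i++;
            int start = i;
            while (i < n && text.charAt(i) != ' ') i++;
            if (start == i) break; // only trailing spaces were left
            String word = text.substring(start, i);
            List<String> tags = TAGS.get(word.toLowerCase(Locale.ROOT));
            if (tags == null) continue; // untagged words (here "is") are dropped
            add(word, docId, start, i);
            for (String tag : tags) add(tag, docId, start, i);
        }
        return docId;
    }

    private void add(String term, int docId, int start, int end) {
        postings.computeIfAbsent(term, k -> new ArrayList<Posting>())
                .add(new Posting(docId, start, end));
    }

    // Returns the original words whose own text or tag equals the query term.
    List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        List<Posting> found = postings.get(term);
        if (found != null) {
            for (Posting p : found) {
                hits.add(storedTexts.get(p.docId).substring(p.start, p.end));
            }
        }
        return hits;
    }
}
```

With "LinkedIn is famous" added, a search for "SocialSite" in this toy model
yields "LinkedIn" and a search for "Positive" yields "famous", because the tag
postings carry the spans of the words they were attached to.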

Re: Lucene Indexes explanation

Posted by Jack Krupansky <ja...@basetechnology.com>.
Even though you've posted for Lucene, you might want to consider taking a 
look at Solr because Solr has an Admin UI with an Analysis page which gives 
you a nice display of how index and query text is analyzed into tokens, 
terms, and attributes - all of which Solr inherits from Lucene.

And check out the unit tests for Lucene (and Solr) for indexing. Then you 
can actually step through code and see it happen.

Otherwise, google for blogs on various sub-topics of interest with specific 
terms.

OTOH... don't try diving too deeply until you've written and understood a 
fair amount of Java code using Lucene. Otherwise, you won't have enough 
context to understand or even ask intelligent questions.

-- Jack Krupansky
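
To make the token and attribute vocabulary concrete, here is a minimal sketch
in plain Java with no Lucene dependency. It is a toy model, not Lucene's API:
ToyToken, ToyAnalysis, and analyze are hypothetical names, and the whitespace
splitting, lowercasing, and three-word stop list are assumptions. It mimics the
kind of per-token information an analysis display shows: term text, start and
end character offsets, and a position increment, loosely corresponding to
Lucene's CharTermAttribute, OffsetAttribute, and PositionIncrementAttribute.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model of one analyzed token; NOT a Lucene class.
final class ToyToken {
    final String term;      // analogous to CharTermAttribute
    final int startOffset;  // analogous to OffsetAttribute, start (inclusive)
    final int endOffset;    // analogous to OffsetAttribute, end (exclusive)
    final int posInc;       // analogous to PositionIncrementAttribute

    ToyToken(String term, int startOffset, int endOffset, int posInc) {
        this.term = term;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
        this.posInc = posInc;
    }

    @Override
    public String toString() {
        return term + " [" + startOffset + "," + endOffset + ") posInc=" + posInc;
    }
}

final class ToyAnalysis {
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("is", "a", "the"));

    // Splits on whitespace, lowercases, and drops stop words.
    // A dropped stop word leaves a gap: the next token's increment grows by one.
    static List<ToyToken> analyze(String text) {
        List<ToyToken> out = new ArrayList<>();
        int pendingInc = 1;
        int i = 0;
        int n = text.length();
        while (i < n) {
            while (i < n && Character.isWhitespace(text.charAt(i))) i++;
            int start = i;
            while (i < n && !Character.isWhitespace(text.charAt(i))) i++;
            if (start == i) break; // only trailing whitespace was left
            String term = text.substring(start, i).toLowerCase(Locale.ROOT);
            if (STOP_WORDS.contains(term)) {
                pendingInc++;
            } else {
                out.add(new ToyToken(term, start, i, pendingInc));
                pendingInc = 1;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (ToyToken t : analyze("LinkedIn is famous")) {
            System.out.println(t);
        }
    }
}
```

For "LinkedIn is famous" this toy analyzer emits two tokens: "linkedin" at
offsets 0 to 8 with increment 1, and "famous" at offsets 12 to 18 with
increment 2, the gap marking where "is" was removed.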
