Posted to java-user@lucene.apache.org by Jordon Saardchit <js...@go2.com> on 2010/12/23 21:23:25 UTC

Get Analyzed/Tokenized Field List

Is there an easy way to retrieve a collection of fields (or field names) that are analyzed/tokenized from any given index?

Jordon
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Get Analyzed/Tokenized Field List

Posted by Erick Erickson <er...@gmail.com>.
Well, not to my knowledge. In fact, there's no guarantee that the *same*
index has the *same* analyzer used on the *same* field in different
documents, so I don't see how there could be a robust implementation of
what you want.

You could populate a field with a particular analyzer (or none at all),
close your writer and open another with any other random analyzer (or
none at all) for the same field and Lucene wouldn't complain.

Solr handles this with the schema file. I guess you could abstract the
field definitions into a library and use the library in both apps, but
otherwise the apps have to "just know".

Best
Erick
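
A minimal sketch of the "abstract the field definitions into a library" idea
above, assuming both the indexing app and the search app can depend on one
small shared jar. SharedFieldSchema, Treatment, and the field names "text",
"title", and "id" are hypothetical illustrations, not part of Lucene:

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical shared "schema" class that both apps would depend on.
 * It records, per field name, whether input is analyzed at index time,
 * so the search side can apply the same rule when building queries.
 */
final class SharedFieldSchema {

    /** How a field is treated at index time, and therefore at query time. */
    enum Treatment { ANALYZED, NOT_ANALYZED }

    private static final Map<String, Treatment> FIELDS;
    static {
        Map<String, Treatment> m = new LinkedHashMap<String, Treatment>();
        // Example field names only (assumptions); replace with the real list.
        m.put("text", Treatment.ANALYZED);
        m.put("title", Treatment.ANALYZED);
        m.put("id", Treatment.NOT_ANALYZED);
        FIELDS = Collections.unmodifiableMap(m);
    }

    private SharedFieldSchema() {}

    /** True if the search side should run its analyzer on input for this field. */
    static boolean isAnalyzed(String fieldName) {
        Treatment t = FIELDS.get(fieldName);
        if (t == null) {
            // Fail loudly rather than guess how an unknown field was indexed.
            throw new IllegalArgumentException("Unknown field: " + fieldName);
        }
        return t == Treatment.ANALYZED;
    }

    /** All field names the schema knows about, in declaration order. */
    static Iterable<String> fieldNames() {
        return FIELDS.keySet();
    }
}
```

The indexer would consult the same class when choosing how to add each field,
so the two apps cannot drift apart without a change to the shared jar.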

On Fri, Dec 24, 2010 at 1:16 PM, Jordon Saardchit <js...@go2.com> wrote:

> Heh, yes, all stuff I know.  My question was whether an index contains any
> metadata that reveals whether a given indexed field was analyzed, which I
> think you are saying it does not.
>
> Our searching and indexing are isolated into 2 completely separate packages
> which can be deployed independently of each other.  The only common
> dependency (obviously) is the index itself.  That being said, I was trying
> to determine from the search runtime whether a given fieldname/input pair
> should be analyzed when building the query, without any knowledge of how
> the index was created.
>
> Jordon

Re: Get Analyzed/Tokenized Field List

Posted by Jordon Saardchit <js...@go2.com>.
Heh, yes, all stuff I know.  My question was whether an index contains any metadata that reveals whether a given indexed field was analyzed, which I think you are saying it does not.

Our searching and indexing are isolated into 2 completely separate packages which can be deployed independently of each other.  The only common dependency (obviously) is the index itself.  That being said, I was trying to determine from the search runtime whether a given fieldname/input pair should be analyzed when building the query, without any knowledge of how the index was created.

Jordon

On Dec 23, 2010, at 5:59 PM, Erick Erickson wrote:

> I guess I'm missing the point. The fact that it is stored is irrelevant for
> searching. Stored fields really only govern whether
> Document.getField("fieldname") returns anything *after* the search. You can
> find out if a field is stored-only by asking IndexReader.getFieldNames
> for UNINDEXED, and you can search on anything that is INDEXED.
> 
> So if, say, you're creating a drop-down with a selection of fields to choose
> from, you should be able to get the list by looking for INDEXED.
> 
> But somewhere you've got to ensure that the analyzers used at index time are
> identical to or compatible with those used at query time. If all you're
> concerned with is building up a string like "+text:stuff +title:nonsense"
> and handing that off to the app that knows how the index was built (so it
> can use the right analyzers for the text and title fields when parsing the
> input), looking for INDEXED should be fine.
> 
> If you're *only* using your custom analyzer for searchable fields, it's
> fine because any INDEXED field can use your custom analyzer.
> 
> But if you use different analyzers for different searchable fields, there's
> no way I know of to analyze an index and answer the question "what analyzer
> was this field created with?" That knowledge is built a priori into the app
> as far as I know.
> 
> 
> Best
> Erick


Re: Get Analyzed/Tokenized Field List

Posted by Erick Erickson <er...@gmail.com>.
I guess I'm missing the point. The fact that it is stored is irrelevant for
searching. Stored fields really only govern whether
Document.getField("fieldname") returns anything *after* the search. You can
find out if a field is stored-only by asking IndexReader.getFieldNames
for UNINDEXED, and you can search on anything that is INDEXED.

So if, say, you're creating a drop-down with a selection of fields to choose
from, you should be able to get the list by looking for INDEXED.

But somewhere you've got to ensure that the analyzers used at index time are
identical to or compatible with those used at query time. If all you're
concerned with is building up a string like "+text:stuff +title:nonsense"
and handing that off to the app that knows how the index was built (so it
can use the right analyzers for the text and title fields when parsing the
input), looking for INDEXED should be fine.

If you're *only* using your custom analyzer for searchable fields, it's
fine because any INDEXED field can use your custom analyzer.

But if you use different analyzers for different searchable fields, there's
no way I know of to analyze an index and answer the question "what analyzer
was this field created with?" That knowledge is built a priori into the app
as far as I know.


Best
Erick
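
A minimal sketch of building a string like "+text:stuff +title:nonsense" from
field/input pairs, as described above, before handing it to the app that knows
which analyzers to use. RequiredClauseBuilder and build are hypothetical names,
not Lucene API, and the sketch assumes the inputs contain no query-syntax
special characters (it does no escaping):

```java
import java.util.Map;

/** Hypothetical helper; joins field/input pairs into required clauses. */
final class RequiredClauseBuilder {

    private RequiredClauseBuilder() {}

    /**
     * Returns "+field:input" clauses separated by single spaces, in the
     * iteration order of the map. Assumes inputs need no escaping.
     */
    static String build(Map<String, String> fieldToInput) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : fieldToInput.entrySet()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append('+').append(e.getKey()).append(':').append(e.getValue());
        }
        return sb.toString();
    }
}
```

Passing a LinkedHashMap keeps the clauses in insertion order; the receiving
app is still the one that decides how each field's input gets analyzed.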


On Thu, Dec 23, 2010 at 6:32 PM, Jordon Saardchit <js...@go2.com> wrote:

> The basic use case is determining the rules for building a query.  I've
> got an application that programmatically builds queries (without any
> pre-existing knowledge of the contents of the index it is searching).  We
> have a custom-designed analyzer and filter chain.  However, it is applied
> only to certain fields at index time.  The fields it is applied to are
> unstored.
>
> On the search side, I want to be able to determine at runtime which field
> the analyzer should be applied to, and which field not to.  I could be
> approaching the solution incorrectly, but I figured this would be a pretty
> common or natural use case.
>
> Jordon

Re: Get Analyzed/Tokenized Field List

Posted by Jordon Saardchit <js...@go2.com>.
The basic use case is determining the rules for building a query.  I've got an application that programmatically builds queries (without any pre-existing knowledge of the contents of the index it is searching).  We have a custom-designed analyzer and filter chain.  However, it is applied only to certain fields at index time.  The fields it is applied to are unstored.

On the search side, I want to be able to determine at runtime which field the analyzer should be applied to, and which field not to.  I could be approaching the solution incorrectly, but I figured this would be a pretty common or natural use case.

Jordon

On Dec 23, 2010, at 2:51 PM, Erick Erickson wrote:

> Ah, you didn't mention indexed but unstored in your original message,
> just indexed/analyzed....
> 
> I don't think you can (someone jump in here if I'm wrong, please). The
> problem is that Lucene doesn't require any sort of schema, so you are
> perfectly free to store a field in one document and NOT store it in another.
> All the variants specified in IndexReader.FieldOption can quickly be
> determined by just looking at the various index files. But you'd have to
> spin through all the *documents* in order to answer the question "is this
> field ever stored?" Sounds like a table scan in the DB world.
> 
> I don't think Lucene keeps metadata for this, and spinning through all the
> documents would be expensive...
> 
> Why do you want to know? Perhaps there's another way to satisfy the
> use case.
> 
> I could be way off base here; I'm speaking from general principles, not
> knowledge of the code...
> 
> Best
> Erick


Re: Get Analyzed/Tokenized Field List

Posted by Erick Erickson <er...@gmail.com>.
Ah, you didn't mention indexed but unstored in your original message,
just indexed/analyzed....

I don't think you can (someone jump in here if I'm wrong, please). The
problem is that Lucene doesn't require any sort of schema, so you are
perfectly free to store a field in one document and NOT store it in another.
All the variants specified in IndexReader.FieldOption can quickly be
determined by just looking at the various index files. But you'd have to
spin through all the *documents* in order to answer the question "is this
field ever stored?" Sounds like a table scan in the DB world.

I don't think Lucene keeps metadata for this, and spinning through all the
documents would be expensive...

Why do you want to know? Perhaps there's another way to satisfy the
use case.

I could be way off base here; I'm speaking from general principles, not
knowledge of the code...

Best
Erick

On Thu, Dec 23, 2010 at 4:43 PM, Jordon Saardchit <js...@go2.com> wrote:

> Yes, I have, and after testing each of the options in
> IndexReader.FieldOption, I cannot retrieve the names of fields that are
> indexed (analyzed) and unstored.  I figured this would be relatively easy
> to do and I was simply overlooking something.  Is it perhaps not possible
> to do this?
>
> Jordon

Re: Get Analyzed/Tokenized Field List

Posted by Jordon Saardchit <js...@go2.com>.
Yes, I have, and after testing each of the options in IndexReader.FieldOption, I cannot retrieve the names of fields that are indexed (analyzed) and unstored.  I figured this would be relatively easy to do and I was simply overlooking something.  Is it perhaps not possible to do this?

Jordon

On Dec 23, 2010, at 1:30 PM, Erick Erickson wrote:

> Have you looked at IndexReader.getFieldNames()?
> 
> Best
> Erick


Re: Get Analyzed/Tokenized Field List

Posted by Erick Erickson <er...@gmail.com>.
Have you looked at IndexReader.getFieldNames()?

Best
Erick

On Thu, Dec 23, 2010 at 3:23 PM, Jordon Saardchit <js...@go2.com> wrote:

> Is there an easy way to retrieve a collection of fields (or field names)
> that are analyzed/tokenized from any given index?
>
> Jordon