Posted to java-user@lucene.apache.org by DES <ma...@2des.de> on 2004/12/22 17:23:24 UTC
Indexing terms only
hi
I need to index my text so that the index contains only tokenized, stemmed words without stopwords etc. The text is German, so I tried to use GermanAnalyzer, but it stores the whole text, not terms. Please give me a tip on how to index terms only. Thanks!
DES
Re: Indexing terms only
Posted by Mike Snare <mi...@gmail.com>.
Thanks for correcting me. I use the reader version -- hence my confusion.
-Mike
On Wed, 22 Dec 2004 11:53:31 -0500, Erik Hatcher
<er...@ehatchersolutions.com> wrote:
>
> On Dec 22, 2004, at 11:36 AM, Mike Snare wrote:
> > Whether or not the text is stored in the index is a different concern
> > than how it is analyzed. If you want the text to be indexed, and not
> > stored, then use the Field.Text(String, String) method
>
> Correction: Field.Text(String, String) is a stored field. If you want
> unstored, use Field.UnStored(String, String).
> This is a bit confusing because Field.Text(String, Reader) is not
> stored. This confusion has been cleared up in the CVS version of
> Lucene; these methods will be deprecated in the 1.9 release and
> removed in the 2.0 release.
>
> Erik
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
Re: Indexing terms only
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Dec 22, 2004, at 11:36 AM, Mike Snare wrote:
> Whether or not the text is stored in the index is a different concern
> than how it is analyzed. If you want the text to be indexed, and not
> stored, then use the Field.Text(String, String) method
Correction: Field.Text(String, String) is a stored field. If you want
unstored, use Field.UnStored(String, String).
This is a bit confusing because Field.Text(String, Reader) is not
stored. This confusion has been cleared up in the CVS version of
Lucene; these methods will be deprecated in the 1.9 release and
removed in the 2.0 release.
Erik
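To make the correction above concrete, here is a minimal sketch in plain Java of the stored/indexed flags behind the three factory methods named in this message. ToyField is a hypothetical stand-in that only borrows the method names; it is not Lucene's Field class, and the flags simply restate what the thread says.

```java
import java.io.Reader;

// Toy model only: borrows the names of the Field factory methods discussed
// in this thread. It is NOT Lucene's implementation.
final class ToyField {
    final String name;
    final String storedValue; // null when the original text is not kept
    final boolean indexed;    // true when analyzed terms are searchable

    private ToyField(String name, String storedValue, boolean indexed) {
        this.name = name;
        this.storedValue = storedValue;
        this.indexed = indexed;
    }

    boolean isStored() {
        return storedValue != null;
    }

    // Text(String, String): indexed AND stored, per the correction above.
    static ToyField Text(String name, String value) {
        return new ToyField(name, value, true);
    }

    // Text(String, Reader): indexed, NOT stored. The reader is never
    // consumed in this toy model.
    static ToyField Text(String name, Reader value) {
        return new ToyField(name, null, true);
    }

    // UnStored(String, String): indexed, NOT stored.
    static ToyField UnStored(String name, String value) {
        return new ToyField(name, null, true);
    }
}
```

In this model, ToyField.Text("body", text).isStored() is true while ToyField.UnStored("body", text).isStored() is false, which is the difference that shows up when browsing an index with Luke.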
Re: Indexing terms only
Posted by Mike Snare <mi...@gmail.com>.
I've never used the German analyzer, so I don't know what stop words
it defines/uses. Someone else will have to answer that. Sorry.
On Wed, 22 Dec 2004 17:45:17 +0100, DES <ma...@2des.de> wrote:
> I actually use Field.Text(String,String) to add documents to my index. Maybe
> I do not understand the way an analyzer works, but I thought that all German
> articles (der, die, das etc) should be filtered out. However if I use Luke
> to view my index, the original text is completely stored in a field. And
> what I need is a term vector that I can create from an indexed document
> field. So this field should contain terms only.
>
> > Whether or not the text is stored in the index is a different concern
> > than how it is analyzed. If you want the text to be indexed, and not
> > stored, then use the Field.Text(String, String) method or the
> > appropriate constructor when adding a field to the Document. You'll
> > need to also store a reference to the actual file (URL, Path, etc) in
> > the document so it can be retrieved from the doc returned in the Hits
> > object.
> >
> > Or did I completely misunderstand the question?
> >
> > -Mike
> >
> > On Wed, 22 Dec 2004 17:23:24 +0100, DES <ma...@2des.de> wrote:
> >> hi
> >>
> >> I need to index my text so that the index contains only tokenized, stemmed
> >> words without stopwords etc. The text is German, so I tried to use
> >> GermanAnalyzer, but it stores the whole text, not terms. Please give me a tip
> >> on how to index terms only. Thanks!
> >>
> >> DES
> >>
Re: Indexing terms only
Posted by DES <ma...@2des.de>.
I actually use Field.Text(String,String) to add documents to my index. Maybe
I do not understand the way an analyzer works, but I thought that all German
articles (der, die, das etc) should be filtered out. However if I use Luke
to view my index, the original text is completely stored in a field. And
what I need is a term vector that I can create from an indexed document
field. So this field should contain terms only.
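The point of confusion here is that analysis and storage are separate: the analyzer decides which terms go into the index, while a stored field keeps the original text verbatim next to them. A rough sketch of the analysis side in plain Java follows. ToyGermanAnalysis and its five-word stop list are assumptions for illustration, not GermanAnalyzer's actual behaviour, and stemming is left out entirely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy analysis chain: tokenize, lowercase, drop stop words.
// Hypothetical helper, not part of Lucene.
final class ToyGermanAnalysis {
    // Illustrative stop words only; not GermanAnalyzer's real list.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("der", "die", "das", "und", "ist"));

    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        // Split on anything that is not a Unicode letter, so umlauts survive.
        for (String token : text.split("[^\\p{L}]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token on leading punctuation
            }
            String term = token.toLowerCase(Locale.GERMAN);
            if (!STOP_WORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }
}
```

Under these assumptions, analyze("Der Hund und die Katze") yields the terms hund and katze. Only such terms would be searchable; the original sentence appears in the index only if the field is also stored.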
> Whether or not the text is stored in the index is a different concern
> than how it is analyzed. If you want the text to be indexed, and not
> stored, then use the Field.Text(String, String) method or the
> appropriate constructor when adding a field to the Document. You'll
> need to also store a reference to the actual file (URL, Path, etc) in
> the document so it can be retrieved from the doc returned in the Hits
> object.
>
> Or did I completely misunderstand the question?
>
> -Mike
>
> On Wed, 22 Dec 2004 17:23:24 +0100, DES <ma...@2des.de> wrote:
>> hi
>>
>> I need to index my text so that the index contains only tokenized, stemmed
>> words without stopwords etc. The text is German, so I tried to use
>> GermanAnalyzer, but it stores the whole text, not terms. Please give me a tip
>> on how to index terms only. Thanks!
>>
>> DES
>>
Re: Indexing terms only
Posted by Mike Snare <mi...@gmail.com>.
Whether or not the text is stored in the index is a different concern
than how it is analyzed. If you want the text to be indexed, and not
stored, then use the Field.Text(String, String) method or the
appropriate constructor when adding a field to the Document. You'll
need to also store a reference to the actual file (URL, Path, etc) in
the document so it can be retrieved from the doc returned in the Hits
object.
Or did I completely misunderstand the question?
-Mike
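The idea of indexing the body without storing it, while storing only a reference to the file, might look roughly like the toy inverted index below. ToyIndex, add and search are hypothetical names, and the structure is a simplification; it is not how Lucene lays out its index or its Hits object.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index: body terms are indexed but not kept,
// only the file path is stored per document. Not Lucene code.
final class ToyIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final List<String> storedPaths = new ArrayList<>();

    // Index the terms of the body; keep only the path as a stored value.
    int add(String path, String body) {
        int docId = storedPaths.size();
        storedPaths.add(path);
        for (String token : body.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // Return the stored paths of all documents containing the term.
    List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        Set<Integer> docIds = postings.getOrDefault(
                term.toLowerCase(Locale.ROOT), Collections.<Integer>emptySet());
        for (int docId : docIds) {
            hits.add(storedPaths.get(docId));
        }
        return hits;
    }
}
```

A search then returns only the stored reference, and the caller loads the original text from that path when it needs to display it.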
On Wed, 22 Dec 2004 17:23:24 +0100, DES <ma...@2des.de> wrote:
> hi
>
> I need to index my text so that the index contains only tokenized, stemmed words without stopwords etc. The text is German, so I tried to use GermanAnalyzer, but it stores the whole text, not terms. Please give me a tip on how to index terms only. Thanks!
>
> DES
>