Posted to java-user@lucene.apache.org by DES <ma...@2des.de> on 2004/12/22 17:23:24 UTC

Indexing terms only

Hi

I need to index my text so that the index contains only tokenized, stemmed words without stopwords etc. The text is German, so I tried to use GermanAnalyzer, but it stores the whole text, not terms. Please give me a tip on how to index terms only. Thanks!

DES

Re: Indexing terms only

Posted by Mike Snare <mi...@gmail.com>.

Thanks for correcting me.  I use the reader version -- hence my confusion.

-Mike

On Wed, 22 Dec 2004 11:53:31 -0500, Erik Hatcher
<er...@ehatchersolutions.com> wrote:
> 
> On Dec 22, 2004, at 11:36 AM, Mike Snare wrote:
> > Whether or not the text is stored in the index is a different concern
> > than how it is analyzed.  If you want the text to be indexed, and not
> > stored, then use the Field.Text(String, String) method
> 
> Correction: Field.Text(String, String) is a stored field.  If you want
> unstored, use Field.UnStored(String, String).
> This is a bit confusing because Field.Text(String, Reader) is not
> stored.  This confusion has been cleared up in the CVS version of
> Lucene; the old methods will be deprecated in the 1.9 release and
> removed in the 2.0 release.
> 
>         Erik

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: Indexing terms only

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Dec 22, 2004, at 11:36 AM, Mike Snare wrote:
> Whether or not the text is stored in the index is a different concern
> than how it is analyzed.  If you want the text to be indexed, and not
> stored, then use the Field.Text(String, String) method

Correction: Field.Text(String, String) is a stored field.  If you want
unstored, use Field.UnStored(String, String).
This is a bit confusing because Field.Text(String, Reader) is not
stored.  This confusion has been cleared up in the CVS version of
Lucene; the old methods will be deprecated in the 1.9 release and
removed in the 2.0 release.

	Erik
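
To make the indexed-versus-stored distinction in this thread concrete, here is a minimal toy model in plain Java. It is not Lucene code and does not reflect how Lucene is implemented. The names IndexedVsStoredSketch, ToyField, and STOP_WORDS are hypothetical, the stop word list is a made-up sample rather than GermanAnalyzer's real list, and stemming is left out. The only point is that the analyzed terms and the stored original text are two separate things kept for the same field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy model (NOT Lucene code) of "indexed" versus "stored" for a single field. */
final class IndexedVsStoredSketch {

    // Hypothetical sample list; a real German stop word list is much longer.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("der", "die", "das", "und", "ist"));

    /** Stand-in for an analyzer: lowercase, split on non-letters, drop stop words. No stemming. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.GERMAN).split("[^\\p{L}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    /** One field: indexedTerms is what a search matches, storedValue is what a viewer displays. */
    static final class ToyField {
        final List<String> indexedTerms; // analyzer output, always built here
        final String storedValue;        // verbatim original text, or null when not stored

        ToyField(String text, boolean store) {
            this.indexedTerms = analyze(text);
            this.storedValue = store ? text : null;
        }
    }

    public static void main(String[] args) {
        String text = "Der Hund und die Katze";
        ToyField stored = new ToyField(text, true);    // analogous to a stored, indexed field
        ToyField unstored = new ToyField(text, false); // analogous to an unstored, indexed field
        System.out.println(stored.indexedTerms + " / " + stored.storedValue);
        // prints: [hund, katze] / Der Hund und die Katze
        System.out.println(unstored.indexedTerms + " / " + unstored.storedValue);
        // prints: [hund, katze] / null
    }
}
```

In this model both fields carry the same filtered terms. Only the first also keeps the original sentence, which is why a tool that displays stored values can show the full text even though the stop words were removed from the terms.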

Re: Indexing terms only

Posted by Mike Snare <mi...@gmail.com>.

I've never used the German analyzer, so I don't know what stop words
it defines/uses.  Someone else will have to answer that.  Sorry.

On Wed, 22 Dec 2004 17:45:17 +0100, DES <ma...@2des.de> wrote:
> I actually use Field.Text(String,String) to add documents to my index. Maybe
> I do not understand the way an analyzer works, but I thought that all German
> articles (der, die, das, etc.) should be filtered out. However, if I use Luke
> to view my index, the original text is completely stored in a field. And
> what I need is a term vector that I can create from an indexed document
> field. So this field should contain terms only.
> 
> > Whether or not the text is stored in the index is a different concern
> > than how it is analyzed.  If you want the text to be indexed, and not
> > stored, then use the Field.Text(String, String) method or the
> > appropriate constructor when adding a field to the Document.  You'll
> > also need to store a reference to the actual file (URL, path, etc.) in
> > the document so it can be retrieved from the doc returned in the Hits
> > object.
> >
> > Or did I completely misunderstand the question?
> >
> > -Mike
> >
> > On Wed, 22 Dec 2004 17:23:24 +0100, DES <ma...@2des.de> wrote:
> >> Hi
> >>
> >> I need to index my text so that the index contains only tokenized, stemmed
> >> words without stopwords etc. The text is German, so I tried to use
> >> GermanAnalyzer, but it stores the whole text, not terms. Please give me a
> >> tip on how to index terms only. Thanks!
> >>
> >> DES

Re: Indexing terms only

Posted by DES <ma...@2des.de>.

I actually use Field.Text(String,String) to add documents to my index. Maybe
I do not understand the way an analyzer works, but I thought that all German
articles (der, die, das, etc.) should be filtered out. However, if I use Luke
to view my index, the original text is completely stored in a field. And
what I need is a term vector that I can create from an indexed document
field. So this field should contain terms only.

> Whether or not the text is stored in the index is a different concern
> than how it is analyzed.  If you want the text to be indexed, and not
> stored, then use the Field.Text(String, String) method or the
> appropriate constructor when adding a field to the Document.  You'll
> also need to store a reference to the actual file (URL, path, etc.) in
> the document so it can be retrieved from the doc returned in the Hits
> object.
>
> Or did I completely misunderstand the question?
>
> -Mike
>
> On Wed, 22 Dec 2004 17:23:24 +0100, DES <ma...@2des.de> wrote:
>> Hi
>>
>> I need to index my text so that the index contains only tokenized, stemmed
>> words without stopwords etc. The text is German, so I tried to use
>> GermanAnalyzer, but it stores the whole text, not terms. Please give me a
>> tip on how to index terms only. Thanks!
>>
>> DES

Re: Indexing terms only

Posted by Mike Snare <mi...@gmail.com>.

Whether or not the text is stored in the index is a different concern
than how it is analyzed.  If you want the text to be indexed, and not
stored, then use the Field.Text(String, String) method or the
appropriate constructor when adding a field to the Document.  You'll
also need to store a reference to the actual file (URL, path, etc.) in
the document so it can be retrieved from the doc returned in the Hits
object.

Or did I completely misunderstand the question?

-Mike

On Wed, 22 Dec 2004 17:23:24 +0100, DES <ma...@2des.de> wrote:
> Hi
> 
> I need to index my text so that the index contains only tokenized, stemmed words without stopwords etc. The text is German, so I tried to use GermanAnalyzer, but it stores the whole text, not terms. Please give me a tip on how to index terms only. Thanks!
> 
> DES