Posted to dev@lucene.apache.org by Christoph Pächter <Pa...@htwg-konstanz.de> on 2007/01/31 12:25:13 UTC
Field methods and usage
Hi,
I was wondering whether there is a table anywhere (similar to Table 1.2, "An
overview of different field types, their characteristics, and their usage", in
Lucene in Action) listing the possible methods and their usage.
I have quickly created one (so it is not complete and may contain a lot of
errors ;-)
Store | TermVector             | Index        | Reasonable | Usage
YES   | NO                     | NO           | 1          | URLs, telephone number
YES   | WITH_OFFSETS           | NO           | 1          | ?
YES   | WITH_POSITIONS         | NO           | 1          | ?
YES   | WITH_POSITIONS_OFFSETS | NO           | 1          | ?
YES   | YES                    | NO           | 1          | ?
NO    | NO                     | NO           | 0          | DO NOT USE
NO    | WITH_OFFSETS           | NO           | ?          | ?
NO    | WITH_POSITIONS         | NO           | ?          | ?
NO    | WITH_POSITIONS_OFFSETS | NO           | ?          | ?
NO    | YES                    | NO           | ?          | ?
YES   | *                      | NO_NORMS     | 0          | no Analyzer, not store
NO    | NO                     | NO_NORMS     | 1          |
NO    | WITH_OFFSETS           | NO_NORMS     | 1          |
NO    | WITH_POSITIONS         | NO_NORMS     | 1          |
NO    | WITH_POSITIONS_OFFSETS | NO_NORMS     | 1          |
NO    | YES                    | NO_NORMS     | 1          |
YES   | NO                     | TOKENIZED    | 1          | Doc content
YES   | WITH_OFFSETS           | TOKENIZED    | 1          |
YES   | WITH_POSITIONS         | TOKENIZED    | 1          |
YES   | WITH_POSITIONS_OFFSETS | TOKENIZED    | 1          |
YES   | YES                    | TOKENIZED    | 1          |
NO    | NO                     | TOKENIZED    | 1          |
NO    | WITH_OFFSETS           | TOKENIZED    | ?          |
NO    | WITH_POSITIONS         | TOKENIZED    | ?          |
NO    | WITH_POSITIONS_OFFSETS | TOKENIZED    | ?          |
NO    | YES                    | TOKENIZED    | ?          |
YES   | *                      | UN_TOKENIZED | 0          | no Analyzer, not store
NO    | NO                     | UN_TOKENIZED | 1          |
NO    | WITH_OFFSETS           | UN_TOKENIZED | 1          |
NO    | WITH_POSITIONS         | UN_TOKENIZED | 1          |
NO    | WITH_POSITIONS_OFFSETS | UN_TOKENIZED | 1          |
NO    | YES                    | UN_TOKENIZED | 1          |
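One way to sanity-check a table like this is to enumerate every combination and apply the checks that, as far as I recall, the Lucene 2.x Field constructor performs: it rejects a field that is neither stored nor indexed, and it rejects term vectors on a field that is not indexed. The sketch below is self-contained and uses hypothetical local enums (Store, TermVector, Index inside a made-up FieldCombinations class) that only mirror the Lucene names; it does not call Lucene itself, so treat the rules as an assumption to verify against the Field source of your version.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the Lucene 2.x Field options; these are NOT the real classes.
class FieldCombinations {
    enum Store { YES, NO, COMPRESS }
    enum TermVector { NO, YES, WITH_POSITIONS, WITH_OFFSETS, WITH_POSITIONS_OFFSETS }
    enum Index { NO, TOKENIZED, UN_TOKENIZED, NO_NORMS }

    // Assumed rules (from memory of the Field constructor, please verify):
    // 1. a field that is neither indexed nor stored is rejected;
    // 2. term vectors on a field that is not indexed are rejected.
    static boolean isAccepted(Store store, TermVector termVector, Index index) {
        if (index == Index.NO && store == Store.NO) {
            return false;
        }
        if (index == Index.NO && termVector != TermVector.NO) {
            return false;
        }
        return true;
    }

    // Collects every combination the assumed rules would reject.
    static List<String> rejected() {
        List<String> result = new ArrayList<>();
        for (Store store : Store.values()) {
            for (TermVector termVector : TermVector.values()) {
                for (Index index : Index.values()) {
                    if (!isAccepted(store, termVector, index)) {
                        result.add(store + " | " + termVector + " | " + index);
                    }
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String combination : rejected()) {
            System.out.println("rejected: " + combination);
        }
    }
}
```

If these assumed rules are right, the rows above that combine a term vector with Index NO would be rejected regardless of the Store setting, while Store YES together with NO_NORMS or UN_TOKENIZED would be accepted.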
I think COMPRESS applies analogously to YES, but it should be used for binary
values and long documents (x words; what is the order of magnitude for x?).
Is something similar available somewhere?
Cheers
Christoph
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
Re: Field methods and usage
Posted by karl wettin <ka...@gmail.com>.
On 31 Jan 2007, at 12:25, Christoph Pächter wrote:
>
> I was wondering, if there is anywhere a table (similar to Table 1.2
> An overview of different field types, their characteristics, and their
> usage in Lucene in Action), listing the possible methods and their usage.
Implementations will differ, for example:
>
> Store |TermVector |Index |reasonable |Usage
> YES |NO |NO |1 |URLs, telephone number
You never have to store anything in the index; perhaps that
information is already persisted somewhere else?
Whether you use a term vector depends very little on what kind of
information you store in the field. It depends on what kind of analysis
you plan to run on the documents. Highlighting? More like this? Neural networks?
Some are more than happy with one large token. Other people might
want to tokenize the exact same information:
a URL into [protocol://host:port/path], a phone number into country,
area, and district parts.
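To make the "one large token" versus "tokenized" choice concrete, here is a minimal sketch in plain Java that produces both views of the same URL. It uses only java.net.URI from the standard library; the UrlTokens class and its two methods are hypothetical names for illustration and are not Lucene analyzers. In a real index the same decision would be made through the field's index setting and the analyzer.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, only to illustrate the two ways of treating the same value.
class UrlTokens {
    // One large token: the whole URL is a single term.
    static List<String> asSingleToken(String url) {
        return Collections.singletonList(url);
    }

    // Tokenized: protocol, host, port and path become separate terms.
    // Parts that are absent from the URL are simply skipped.
    static List<String> asParts(String url) {
        URI uri = URI.create(url);
        List<String> tokens = new ArrayList<>();
        if (uri.getScheme() != null) {
            tokens.add(uri.getScheme());
        }
        if (uri.getHost() != null) {
            tokens.add(uri.getHost());
        }
        if (uri.getPort() != -1) {
            tokens.add(String.valueOf(uri.getPort()));
        }
        if (uri.getPath() != null && !uri.getPath().isEmpty()) {
            tokens.add(uri.getPath());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String url = "http://example.org:8080/docs/index.html";
        System.out.println(asSingleToken(url));
        System.out.println(asParts(url));
    }
}
```

With the single-token view only an exact match on the full URL finds the document; with the parts view a search on the host alone would match as well.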
It is really up to each and every implementer to decide which settings are
best for them.
Also, a Lucene index is not made up of static rows and columns the
way a relational database is. The spoon does not exist. You can bend
it any way you want. Documents in a corpus can have fields that share
names but not settings. Perhaps you only want to index phone numbers
in a specific area code.
--
karl