
Posted to dev@lucene.apache.org by Christoph Pächter <Pa...@htwg-konstanz.de> on 2007/01/31 12:25:13 UTC

Field methods and usage

Hi,

I was wondering whether there is a table anywhere (similar to Table 1.2, "An
overview of different field types, their characteristics, and their usage", in
Lucene in Action) listing the possible methods and their usage.

I have put one together quickly (so it is not complete and may contain a lot
of mistakes ;-)

Store   |TermVector              |Index          |reasonable |Usage
YES     |NO                      |NO             |1          |URLs
                                                             |telephone number
YES     |WITH_OFFSETS            |NO             |1          |?
YES     |WITH_POSITIONS          |NO             |1          |?
YES     |WITH_POSITIONS_OFFSETS  |NO             |1          |?
YES     |YES                     |NO             |1          |?
NO      |NO                      |NO             |0          |DO NOT USE
NO      |WITH_OFFSETS            |NO             |?          |?
NO      |WITH_POSITIONS          |NO             |?          |?
NO      |WITH_POSITIONS_OFFSETS  |NO             |?          |?
NO      |YES                     |NO             |?          |?
YES     |*                       |NO_NORMS       |0          |no Analyzer,
                                                             |not store
NO      |NO                      |NO_NORMS       |1          |
NO      |WITH_OFFSETS            |NO_NORMS       |1          |
NO      |WITH_POSITIONS          |NO_NORMS       |1          |
NO      |WITH_POSITIONS_OFFSETS  |NO_NORMS       |1          |
NO      |YES                     |NO_NORMS       |1          |
YES     |NO                      |TOKENIZED      |1          |Doc content
YES     |WITH_OFFSETS            |TOKENIZED      |1          |
YES     |WITH_POSITIONS          |TOKENIZED      |1          |
YES     |WITH_POSITIONS_OFFSETS  |TOKENIZED      |1          |
YES     |YES                     |TOKENIZED      |1          |
NO      |NO                      |TOKENIZED      |1          |
NO      |WITH_OFFSETS            |TOKENIZED      |?          |
NO      |WITH_POSITIONS          |TOKENIZED      |?          |
NO      |WITH_POSITIONS_OFFSETS  |TOKENIZED      |?          |
NO      |YES                     |TOKENIZED      |?          |
YES     |*                       |UN_TOKENIZED   |0          |no Analyzer,
                                                             |not store
NO      |NO                      |UN_TOKENIZED   |1          |
NO      |WITH_OFFSETS            |UN_TOKENIZED   |1          |
NO      |WITH_POSITIONS          |UN_TOKENIZED   |1          |
NO      |WITH_POSITIONS_OFFSETS  |UN_TOKENIZED   |1          |
NO      |YES                     |UN_TOKENIZED   |1          |
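
The combinations in the table above can also be generated mechanically instead
of by hand. The following is only a sketch: the enums are stand-ins named after
the constants in the table, not the Lucene classes, and isAccepted() encodes an
assumption about which combinations the Field constructor in Lucene 2.x rejects
with an IllegalArgumentException (a field that is neither stored nor indexed,
and term vectors on a field that is not indexed). Please check it against the
Lucene version you actually use.

```java
import java.util.ArrayList;
import java.util.List;

public class FieldOptionTable {
    // Stand-in enums named after the constants in the table above.
    // These are NOT the Lucene classes themselves.
    public enum Store { YES, COMPRESS, NO }
    public enum Index { NO, NO_NORMS, TOKENIZED, UN_TOKENIZED }
    public enum TermVector { NO, YES, WITH_POSITIONS, WITH_OFFSETS, WITH_POSITIONS_OFFSETS }

    // Assumed rule: a field must be stored or indexed (or both), and
    // term vectors are only allowed on an indexed field.
    public static boolean isAccepted(Store store, Index index, TermVector tv) {
        if (index == Index.NO && store == Store.NO) {
            return false;
        }
        if (index == Index.NO && tv != TermVector.NO) {
            return false;
        }
        return true;
    }

    // Builds one "Store | TermVector | Index" row per accepted combination.
    public static List<String> acceptedRows() {
        List<String> rows = new ArrayList<>();
        for (Store store : Store.values()) {
            for (Index index : Index.values()) {
                for (TermVector tv : TermVector.values()) {
                    if (isAccepted(store, index, tv)) {
                        rows.add(store + " | " + tv + " | " + index);
                    }
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String row : acceptedRows()) {
            System.out.println(row);
        }
    }
}
```

Whether an accepted combination is also reasonable still depends on the
application, so the "reasonable" and "Usage" columns cannot be derived this way.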
                                
                                
I think COMPRESS applies analogously to YES, but should be used for binary
values and long documents (x words). (What is the order of magnitude of x?)

Is something similar available somewhere?

Cheers
Christoph

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: Field methods and usage

Posted by karl wettin <ka...@gmail.com>.

On 31 Jan 2007, at 12:25, Christoph Pächter wrote:
>
> I was wondering, if there is anywhere a table (similar to Table 1.2 An overview
> of different field types, their characteristics, and their usage in Lucene in
> Action), listing the possible methods and their usage.

Implementations will differ, for example:

>
> Store   |TermVector              |Index          |reasonable |Usage
> YES     |NO                      |NO             |1          |URLs
>                                                              |telephone number

You never have to store anything in the index; perhaps that information is
persisted somewhere else?

Whether you use a term vector depends very little on what kind of information
you put in the field. It depends on what kind of analysis you plan to run over
the documents. Highlighting? More like this? Neural networks?

Some are more than happy with one large token. Other people might want to
tokenize the exact same information: a URL into [protocol://host:port/path],
a phone number into country, area, and district parts.
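
As a rough illustration of the second approach, a URL can be split into its
protocol, host, port, and path with java.net.URI before the parts are indexed
as separate tokens. The class and method names below are made up for this
sketch, and it does not touch Lucene at all.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

public class UrlParts {
    // Splits a URL into the parts that are present: scheme, host, port, path.
    public static List<String> split(String url) {
        URI uri = URI.create(url);
        List<String> parts = new ArrayList<>();
        if (uri.getScheme() != null) {
            parts.add(uri.getScheme());
        }
        if (uri.getHost() != null) {
            parts.add(uri.getHost());
        }
        if (uri.getPort() != -1) {          // -1 means no explicit port
            parts.add(Integer.toString(uri.getPort()));
        }
        String path = uri.getPath();
        if (path != null && !path.isEmpty()) {
            parts.add(path);
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("http://lucene.apache.org:8080/java/docs"));
    }
}
```

Someone who is happy with one large token would skip the split and index the
whole string untokenized.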

It is really up to each implementer to decide which settings are best for
them.

Also, a Lucene index is not made up of static rows and columns the way a
relational database is. The spoon does not exist. You can bend it any way you
want. Documents in a corpus can have fields that share names but not settings.
Perhaps you only want to index phone numbers in a specific area code.
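
A minimal sketch of that last idea: the index setting for a "phone" field is
chosen per document. The enum is a stand-in named after the constants in the
table, not the Lucene class, and the area code is a made-up value. The sketch
assumes the field is always stored, so that a document with Index.NO still
carries a valid field.

```java
public class PerDocumentSettings {
    // Stand-in for the index setting; NOT the Lucene class.
    public enum Index { NO, UN_TOKENIZED }

    // Index the number as a single token only if it starts with the
    // wanted area code; otherwise leave the field unindexed.
    public static Index indexSettingFor(String phone, String areaCode) {
        return phone.startsWith(areaCode) ? Index.UN_TOKENIZED : Index.NO;
    }

    public static void main(String[] args) {
        System.out.println(indexSettingFor("555-1234", "555"));
        System.out.println(indexSettingFor("444-1234", "555"));
    }
}
```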

-- 
karl