Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2012/05/11 22:12:46 UTC
[Solr Wiki] Update of "SimplePreAnalyzedParser" by AndrzejBialecki
http://wiki.apache.org/solr/SimplePreAnalyzedParser
New page:
= SimplePreAnalyzedParser format =
This page describes the simple serialization format for the PreAnalyzedField type.
== General syntax ==
The format of the serialization is as follows:
{{{
content ::= version (stored)? tokens
version ::= digit+ " "
; stored field value - any "=" inside must be escaped!
stored ::= "=" text "="
tokens ::= (token ((" ") + token)*)*
token ::= text ("," attrib)*
attrib ::= name '=' value
name ::= text
value ::= text
}}}
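The grammar above can be sketched as a small parser. This is an illustrative Python sketch of my own, not the Solr implementation, and it ignores escape sequences for brevity:

```python
def parse_preanalyzed(s):
    """Split a serialized value into (version, stored, tokens) per the grammar.

    Escape handling is omitted here for brevity; tokens are returned as
    (text, attribute-dict) pairs.
    """
    # version is the run of digits before the first space
    version, _, rest = s.partition(" ")
    stored = None
    if rest.startswith("="):
        # the stored value is delimited by "=" ... "="
        end = rest.index("=", 1)
        stored = rest[1:end]
        rest = rest[end + 1:].lstrip(" ")
    tokens = []
    for tok in rest.split():
        # a token is the term text followed by ",name=value" attribute pairs
        text, *attrs = tok.split(",")
        tokens.append((text, dict(a.split("=", 1) for a in attrs)))
    return int(version), stored, tokens
```

For example, `parse_preanalyzed("1 one two,s=4,e=7 three")` yields version 1, no stored value, and three tokens, the second carrying `s` and `e` attributes.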
Special characters in "text" values can be escaped using the escape character \ . The following escape sequences are recognized:
{{{
"\ " - literal space character
"\," - literal , character
"\=" - literal = character
"\\" - literal \ character
"\n" - newline
"\r" - carriage return
"\t" - horizontal tab
}}}
Please note that Unicode sequences (e.g. \u0001) are not supported.
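The escape rules above can be expressed as a pair of helper functions. This is a hedged sketch (the names and tables are mine, not Solr API); unknown escape sequences are passed through as the escaped character:

```python
# Forward table: characters that must be escaped when serializing.
_ESCAPES = {" ": "\\ ", ",": "\\,", "=": "\\=", "\\": "\\\\",
            "\n": "\\n", "\r": "\\r", "\t": "\\t"}
# Reverse table: character following a backslash -> literal character.
_UNESCAPES = {"n": "\n", "r": "\r", "t": "\t",
              " ": " ", ",": ",", "=": "=", "\\": "\\"}

def escape(text):
    return "".join(_ESCAPES.get(ch, ch) for ch in text)

def unescape(text):
    out, i = [], 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            out.append(_UNESCAPES.get(text[i + 1], text[i + 1]))
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

Round-tripping any text through `escape` and then `unescape` returns the original string.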
== Supported attribute names ==
The following token attributes are supported, and identified with short symbolic names:
* `i` - position increment (integer)
* `s` - token offset, start position (integer)
* `e` - token offset, end position (integer)
* `y` - token type (string)
* `f` - token flags (hexadecimal integer)
* `p` - payload (bytes in hexadecimal format)
Token positions are tracked and implicitly added to the token stream. The start and end offsets consider only the term text and whitespace, and exclude the space taken by token attributes.
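Putting the attribute names together with the escape rules, a single token can be rendered like this. Again a hypothetical helper of my own, not part of Solr:

```python
def format_token(text, **attrs):
    """Render one token as 'text,name=value,...' using the short
    attribute names above (e.g. i, s, e)."""
    # escape backslash first so already-escaped characters are not doubled
    esc = lambda s: (s.replace("\\", "\\\\").replace(" ", "\\ ")
                      .replace(",", "\\,").replace("=", "\\="))
    parts = [esc(text)]
    parts += ["%s=%s" % (k, v) for k, v in attrs.items()]
    return ",".join(parts)
```

For instance, `format_token("one", s=123, e=128, i=22)` produces `one,s=123,e=128,i=22`, as in the examples below.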
== Example token streams ==
{{{
1 one two three
- version 1
- stored: 'null'
- tok: '(term=one,startOffset=0,endOffset=3)'
- tok: '(term=two,startOffset=4,endOffset=7)'
- tok: '(term=three,startOffset=8,endOffset=13)'
1 one two three
- version 1
- stored: 'null'
- tok: '(term=one,startOffset=1,endOffset=4)'
- tok: '(term=two,startOffset=6,endOffset=9)'
- tok: '(term=three,startOffset=12,endOffset=17)'
1 one,s=123,e=128,i=22 two three,s=20,e=22
- version 1
- stored: 'null'
- tok: '(term=one,positionIncrement=22,startOffset=123,endOffset=128)'
- tok: '(term=two,positionIncrement=1,startOffset=5,endOffset=8)'
- tok: '(term=three,positionIncrement=1,startOffset=20,endOffset=22)'
1 \ one\ \,,i=22,a=\, two\=
\n,\ =\ \
- version 1
- stored: 'null'
- tok: '(term= one ,,positionIncrement=22,startOffset=0,endOffset=6)'
- tok: '(term=two=
,positionIncrement=1,startOffset=7,endOffset=15)'
- tok: '(term=\,positionIncrement=1,startOffset=17,endOffset=18)'
1 ,i=22 ,i=33,s=2,e=20 ,
- version 1
- stored: 'null'
- tok: '(term=,positionIncrement=22,startOffset=0,endOffset=0)'
- tok: '(term=,positionIncrement=33,startOffset=2,endOffset=20)'
- tok: '(term=,positionIncrement=1,startOffset=2,endOffset=2)'
1 =This is the stored part with \=
\n \t escapes.=one two three
- version 1
- stored: 'This is the stored part with =
\n \t escapes.'
- tok: '(term=one,startOffset=0,endOffset=3)'
- tok: '(term=two,startOffset=4,endOffset=7)'
- tok: '(term=three,startOffset=8,endOffset=13)'
1 ==
- version 1
- stored: ''
- (no tokens)
1 =this is a test.=
- version 1
- stored: 'this is a test.'
- (no tokens)
}}}
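The stored-field examples above (with embedded `\=`, `\n`, and `\t` escapes) need an escape-aware scan to find the closing `=`. A minimal sketch, with illustrative names:

```python
def read_stored(s):
    """Given serialized text starting at '=', return (stored_value, rest).

    Escaped characters inside the stored value are decoded; an unescaped
    '=' terminates it, per the grammar.
    """
    assert s[0] == "="
    out, i = [], 1
    while i < len(s):
        ch = s[i]
        if ch == "\\" and i + 1 < len(s):
            # decode the escape; \n, \r, \t map to control characters,
            # everything else maps to the character itself
            out.append({"n": "\n", "r": "\r", "t": "\t"}.get(s[i + 1], s[i + 1]))
            i += 2
        elif ch == "=":
            # unescaped '=' closes the stored value
            return "".join(out), s[i + 1:]
        else:
            out.append(ch)
            i += 1
    raise ValueError("unterminated stored value")
```

Applied to a value like `=This is it \= ok.=one two`, this yields the stored text `This is it = ok.` and leaves `one two` to be parsed as tokens.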