Posted to solr-dev@lucene.apache.org by "Karl Wettin (JIRA)" <ji...@apache.org> on 2009/02/15 10:44:59 UTC

[jira] Issue Comment Edited: (SOLR-1020) PreAnalyzed field analyzer

    [ https://issues.apache.org/jira/browse/SOLR-1020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12673581#action_12673581 ] 

karl.wettin edited comment on SOLR-1020 at 2/15/09 1:43 AM:
------------------------------------------------------------

bq. Karl, would it make sense to use the NamedList format instead of a custom XML one? That way, you can use most of the existing parsing code. 

I don't know, would it? 

bq. Thoughts?

The reason I chose JSR-173 is that it allows unmarshalling one token at a time rather than all at once, i.e. I want to reuse the token instance in the TokenStream the Analyzer produces rather than unmarshal all of the data up front. My first thought was to parse the XML using a lexer, but some simple tests showed that the overhead of JSR-173 was very small compared to jflex. I am, however, considering jflex for the binary format.
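
To make the token-at-a-time idea concrete, a TokenStream along these lines can be built directly on top of a StAX reader. The following is only a sketch, not the code in the attached patch: it assumes the Lucene 2.4-era reusable-token API (next(Token)) and the element names from the XSD quoted below, and it leaves out payload handling.

{code:java}
import java.io.IOException;
import java.io.Reader;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

/** Sketch: pulls one <token> element at a time from a StAX (JSR-173) reader
 *  and fills a single reusable Token, so the whole marshalled stream never
 *  has to be held in memory at once. */
public class PreAnalyzedTokenStreamSketch extends TokenStream {

  private final XMLStreamReader xml;

  public PreAnalyzedTokenStreamSketch(Reader input) throws XMLStreamException {
    this.xml = XMLInputFactory.newInstance().createXMLStreamReader(input);
  }

  public Token next(final Token reusableToken) throws IOException {
    try {
      // Advance to the next <token> start element, if there is one.
      while (xml.hasNext()) {
        if (xml.next() == XMLStreamConstants.START_ELEMENT
            && "token".equals(xml.getLocalName())) {
          reusableToken.clear();
          return readToken(reusableToken);
        }
      }
    } catch (XMLStreamException e) {
      throw new IOException(e.toString());
    }
    return null; // end of the marshalled stream
  }

  /** Fills the reusable token from the child elements of one <token>. */
  private Token readToken(Token token) throws XMLStreamException {
    while (xml.hasNext()) {
      int event = xml.next();
      if (event == XMLStreamConstants.END_ELEMENT && "token".equals(xml.getLocalName())) {
        break;
      }
      if (event != XMLStreamConstants.START_ELEMENT) {
        continue;
      }
      String name = xml.getLocalName();
      if ("term".equals(name)) {
        token.setTermBuffer(xml.getElementText());
      } else if ("positionIncrement".equals(name)) {
        token.setPositionIncrement(Integer.parseInt(xml.getElementText()));
      } else if ("startOffset".equals(name)) {
        token.setStartOffset(Integer.parseInt(xml.getElementText()));
      } else if ("endOffset".equals(name)) {
        token.setEndOffset(Integer.parseInt(xml.getElementText()));
      } else if ("type".equals(name)) {
        token.setType(xml.getElementText());
      } else if ("flags".equals(name)) {
        token.setFlags(Integer.parseInt(xml.getElementText()));
      }
      // <payload> (only <hex> in practice) is omitted in this sketch.
    }
    return token;
  }
}
{code}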

I came up with this patch because I have a rather elaborate tokenization scheme using ShingleMatrixFilter. My current solution is to pass a base64-encoded serialized object as the field value and use a custom Analyzer that assembles and tokenizes the entity object passed down in the field value. However, the tokenization is rather expensive (especially during the initial bulk import of my zillions of documents), so I'd rather do this on my clients, as I've got plenty of those but only one Solr.
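
In that setup the clients run the expensive analysis themselves and marshal the resulting TokenStream into the <tokens> XML this patch understands, sending that XML as the field value. A rough sketch of that client-side step is below; it is not the static marshaller included in the patch, it assumes the Lucene 2.4-era next(Token) API, and it skips flags and payloads for brevity.

{code:java}
import java.io.StringWriter;

import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

/** Sketch: run analysis on the client and marshal the TokenStream into the
 *  <tokens>/<token> XML format described by the XSD below. */
public class ClientSideMarshallerSketch {

  public static String marshal(TokenStream stream) throws Exception {
    StringWriter out = new StringWriter();
    XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
    xml.writeStartElement("tokens");
    final Token reusableToken = new Token();
    for (Token token = stream.next(reusableToken); token != null;
         token = stream.next(reusableToken)) {
      xml.writeStartElement("token");
      writeElement(xml, "positionIncrement", String.valueOf(token.getPositionIncrement()));
      writeElement(xml, "term", token.term());
      writeElement(xml, "type", token.type());
      writeElement(xml, "startOffset", String.valueOf(token.startOffset()));
      writeElement(xml, "endOffset", String.valueOf(token.endOffset()));
      xml.writeEndElement(); // </token>
    }
    xml.writeEndElement(); // </tokens>
    xml.flush();
    xml.close();
    return out.toString();
  }

  private static void writeElement(XMLStreamWriter xml, String name, String text)
      throws XMLStreamException {
    xml.writeStartElement(name);
    xml.writeCharacters(text);
    xml.writeEndElement();
  }
}
{code}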


> PreAnalyzed field analyzer
> --------------------------
>
>                 Key: SOLR-1020
>                 URL: https://issues.apache.org/jira/browse/SOLR-1020
>             Project: Solr
>          Issue Type: New Feature
>          Components: Analysis
>    Affects Versions: 1.3
>            Reporter: Karl Wettin
>            Priority: Minor
>         Attachments: SOLR-1020.txt
>
>
> An Analyzer that produces a TokenStream based on XML input containing a marshalled TokenStream. It also contains a static TokenStream XML marshaller.
> I kind of pulled this out of my pocket without testing it in a real environment, in order to get some comments on the solution before I add it to my project, so consider it a beta patch.
> It uses the JSR-173 (StAX) XMLStream API, which is available in Java 1.6, compatible with Java 1.5, and downloadable from https://sjsxp.dev.java.net/
> XSD:
> {code:xml}
> <?xml version="1.0" encoding="UTF-8"?>
> <xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified"
>            xmlns:xs="http://www.w3.org/2001/XMLSchema">
>     <xs:element name="tokens" type="tokensType"/>
>     <xs:complexType name="tokensType">
>         <xs:sequence>
>             <xs:element type="tokenType" name="token" maxOccurs="unbounded"/>
>         </xs:sequence>
>     </xs:complexType>
>     <xs:complexType name="tokenType">
>         <xs:sequence>
>             <xs:element type="xs:int" name="positionIncrement" maxOccurs="1"/>
>             <xs:element type="xs:string" name="term" minOccurs="1" maxOccurs="1"/>
>             <xs:element type="xs:string" name="type" maxOccurs="1"/>
>             <xs:element type="xs:int" name="startOffset" maxOccurs="1"/>
>             <xs:element type="xs:int" name="endOffset" maxOccurs="1"/>
>             <xs:element type="xs:int" name="flags" maxOccurs="1"/>
>             <xs:element type="payloadType" name="payload" maxOccurs="1"/>
>         </xs:sequence>
>     </xs:complexType>
>     <xs:complexType name="payloadType">
>         <xs:choice maxOccurs="1" minOccurs="1">
>             <xs:element type="bytesType" name="bytes"/>
>             <xs:element type="xs:string" name="hex"/>
>             <xs:element type="xs:string" name="base64"/>
>         </xs:choice>
>     </xs:complexType>
>     <xs:complexType name="bytesType">
>         <xs:sequence>
>             <xs:element type="xs:byte" name="byte" maxOccurs="unbounded" minOccurs="1"/>
>         </xs:sequence>
>     </xs:complexType>
> </xs:schema>
> {code}
> Even though I've added a couple of variants for how to handle a Payload in the XSD, only <hex> is supported.
> Example XML:
> {code:xml}
> <tokens>
>   <token>
>     <positionIncrement>1</positionIncrement>
>     <term>term</term>
>     <type>type</type>
>     <startOffset>0</startOffset>
>     <endOffset>3</endOffset>
>     <flags>65535</flags>
>     <payload><hex>fffefd</hex></payload>
>   </token>
> </tokens>
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.