Posted to solr-user@lucene.apache.org by Doğacan Güney <do...@gmail.com> on 2007/07/17 13:55:56 UTC

Passing arguments to analyzers

Hi all,

Is there a way to pass arguments to analyzers per document? Let's say
that I have a field "foo" which is tokenized by WhitespaceTokenizer
and then filtered by MyCustomStemmingFilter. MyCustomStemmingFilter
can stem more than one language but (obviously) it needs to know the
language of the document it is working on. So what I need is to
specify the language per document (actually per field).

Here is an example:
<doc>
   <field name="....
    .....
    <field name="foo" lang="en">My spam egg bars baz.</field>
</doc>

Is something like this possible with Solr?

-- 
Doğacan Güney

Re: Passing arguments to analyzers

Posted by Doğacan Güney <do...@gmail.com>.
On 7/17/07, Yonik Seeley <yo...@apache.org> wrote:
> On 7/17/07, Doğacan Güney <do...@gmail.com> wrote:
> > Hi,
> >
> > On 7/17/07, Yonik Seeley <yo...@apache.org> wrote:
> > > On 7/17/07, Doğacan Güney <do...@gmail.com> wrote:
> > > > Hi all,
> > > >
> > > > Is there a way to pass arguments to analyzers per document? Let's say
> > > > that I have a field "foo" which is tokenized by WhitespaceTokenizer
> > > > and then filtered by MyCustomStemmingFilter. MyCustomStemmingFilter
> > > > can stem more than one language but (obviously) it needs to know the
> > > > language of the document it is working on. So what I need is to
> > > > specify the language per document (actually per field).
> > > >
> > > > Here is an example:
> > > > <doc>
> > > >    <field name="....
> > > >     .....
> > > >     <field name="foo" lang="en">My spam egg bars baz.</field>
> > > > </doc>
> > > >
> > > > Is something like this possible with Solr?
> > >
> > > You can pass extra args to a factory in the field-type definition, but
> > > that means you would need a separate field-type per language.
> >
> > Thanks for the answer.
> >
> > Your suggestion would work for this particular use case, but IMHO
> > there are other use cases out there that can benefit (for example, one
> > may process the whole document and add parameters for each field based
> > on document-level analysis) from this.
> >
> > Would this be useful feature for Solr? I would actually like to work
> > on it if others consider this as a useful add-on. It seems simple to
> > accomplish and it would probably be a good introduction to Solr
> > internals.
>
> wrt passing more info to the analyzer at runtime to alter its
> behavior: analyzers are singletons per field-type, and
> Analyzer.tokenStream(String fieldName, Reader reader) is called to
> analyze a particular value.  There isn't really a good place to pass
> in extra info.
>
> During XML parsing, we *could* build up a Map of the parameters we
> don't know about, but then the question is what to do with them.  One
> hackish solution would be to store them in a thread-local where your
> analyzer could check it.  Perhaps a custom request processor could do
> that task.
>
> It seems there does need to be some kind of framework more aligned
> with parsing documents (word docs, pdf, etc), for adding metadata to
> fields at runtime (how does UIMA or Tika fit into this?), and for
> mapping the fields+metadata to Solr/Lucene document fields.

I opened SOLR-313 for this.

>
> -Yonik
>


-- 
Doğacan Güney

Re: Passing arguments to analyzers

Posted by Yonik Seeley <yo...@apache.org>.
On 7/17/07, Doğacan Güney <do...@gmail.com> wrote:
> Hi,
>
> On 7/17/07, Yonik Seeley <yo...@apache.org> wrote:
> > On 7/17/07, Doğacan Güney <do...@gmail.com> wrote:
> > > Hi all,
> > >
> > > Is there a way to pass arguments to analyzers per document? Let's say
> > > that I have a field "foo" which is tokenized by WhitespaceTokenizer
> > > and then filtered by MyCustomStemmingFilter. MyCustomStemmingFilter
> > > can stem more than one language but (obviously) it needs to know the
> > > language of the document it is working on. So what I need is to
> > > specify the language per document (actually per field).
> > >
> > > Here is an example:
> > > <doc>
> > >    <field name="....
> > >     .....
> > >     <field name="foo" lang="en">My spam egg bars baz.</field>
> > > </doc>
> > >
> > > Is something like this possible with Solr?
> >
> > You can pass extra args to a factory in the field-type definition, but
> > that means you would need a separate field-type per language.
>
> Thanks for the answer.
>
> Your suggestion would work for this particular use case, but IMHO
> there are other use cases out there that can benefit (for example, one
> may process the whole document and add parameters for each field based
> on document-level analysis) from this.
>
> Would this be useful feature for Solr? I would actually like to work
> on it if others consider this as a useful add-on. It seems simple to
> accomplish and it would probably be a good introduction to Solr
> internals.

wrt passing more info to the analyzer at runtime to alter its
behavior: analyzers are singletons per field-type, and
Analyzer.tokenStream(String fieldName, Reader reader) is called to
analyze a particular value.  There isn't really a good place to pass
in extra info.

During XML parsing, we *could* build up a Map of the parameters we
don't know about, but then the question is what to do with them.  One
hackish solution would be to store them in a thread-local where your
analyzer could check it.  Perhaps a custom request processor could do
that task.
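
A minimal sketch of the thread-local idea, not a Solr implementation.
All class names here (FieldParams, LangAwareStemmer,
ThreadLocalParamsDemo) are hypothetical, the "stemmer" is a toy that
only strips a trailing "s", and the sketch assumes analysis runs on
the same thread that set the parameters:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical holder for per-field parameters; not part of Solr or Lucene.
final class FieldParams {
    private static final ThreadLocal<Map<String, String>> CURRENT =
        ThreadLocal.withInitial(HashMap::new);

    private FieldParams() {}

    // Called by whatever parses the update request, before the field is analyzed.
    static void set(Map<String, String> params) {
        CURRENT.set(new HashMap<>(params));
    }

    static String get(String key, String fallback) {
        return CURRENT.get().getOrDefault(key, fallback);
    }

    // Called after analysis so values do not leak into the next field
    // handled by the same thread.
    static void clear() {
        CURRENT.remove();
    }
}

// Stand-in for a shared (singleton) analysis component. It has no
// per-call argument for the language, so it reads the thread-local.
final class LangAwareStemmer {
    String stem(String token) {
        String lang = FieldParams.get("lang", "none");
        // Toy rule for illustration only: strip a trailing "s" for English.
        if ("en".equals(lang) && token.length() > 3 && token.endsWith("s")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }
}

class ThreadLocalParamsDemo {
    // Simulates analyzing one field value together with its parameters.
    static String analyze(LangAwareStemmer stemmer, Map<String, String> params, String text) {
        FieldParams.set(params);
        try {
            StringBuilder out = new StringBuilder();
            for (String token : text.split("\\s+")) {
                if (out.length() > 0) out.append(' ');
                out.append(stemmer.stem(token));
            }
            return out.toString();
        } finally {
            FieldParams.clear();
        }
    }

    public static void main(String[] args) {
        LangAwareStemmer shared = new LangAwareStemmer(); // one instance for all documents
        Map<String, String> en = new HashMap<>();
        en.put("lang", "en");
        System.out.println(analyze(shared, en, "My spam egg bars baz."));              // My spam egg bar baz.
        System.out.println(analyze(shared, new HashMap<>(), "My spam egg bars baz.")); // My spam egg bars baz.
    }
}
```

The try/finally around the analysis is the important part: without
the clear() call, a pooled thread would carry one document's
parameters over to the next.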

It seems there does need to be some kind of framework more aligned
with parsing documents (word docs, pdf, etc), for adding metadata to
fields at runtime (how does UIMA or Tika fit into this?), and for
mapping the fields+metadata to Solr/Lucene document fields.

-Yonik

Re: Passing arguments to analyzers

Posted by Doğacan Güney <do...@gmail.com>.
Hi,

On 7/17/07, Yonik Seeley <yo...@apache.org> wrote:
> On 7/17/07, Doğacan Güney <do...@gmail.com> wrote:
> > Hi all,
> >
> > Is there a way to pass arguments to analyzers per document? Let's say
> > that I have a field "foo" which is tokenized by WhitespaceTokenizer
> > and then filtered by MyCustomStemmingFilter. MyCustomStemmingFilter
> > can stem more than one language but (obviously) it needs to know the
> > language of the document it is working on. So what I need is to
> > specify the language per document (actually per field).
> >
> > Here is an example:
> > <doc>
> >    <field name="....
> >     .....
> >     <field name="foo" lang="en">My spam egg bars baz.</field>
> > </doc>
> >
> > Is something like this possible with Solr?
>
> You can pass extra args to a factory in the field-type definition, but
> that means you would need a separate field-type per language.

Thanks for the answer.

Your suggestion would work for this particular use case, but IMHO
other use cases could benefit from this as well (for example, one
may process the whole document and add parameters for each field based
on document-level analysis). Also, again IMHO, per-field
parameters are more flexible.

Would this be a useful feature for Solr? I would actually like to work
on it if others consider it a useful add-on. It seems simple to
accomplish and it would probably be a good introduction to Solr
internals.

>
> -Yonik
>


-- 
Doğacan Güney

Re: Passing arguments to analyzers

Posted by Yonik Seeley <yo...@apache.org>.
On 7/17/07, Doğacan Güney <do...@gmail.com> wrote:
> Hi all,
>
> Is there a way to pass arguments to analyzers per document? Let's say
> that I have a field "foo" which is tokenized by WhitespaceTokenizer
> and then filtered by MyCustomStemmingFilter. MyCustomStemmingFilter
> can stem more than one language but (obviously) it needs to know the
> language of the document it is working on. So what I need is to
> specify the language per document (actually per field).
>
> Here is an example:
> <doc>
>    <field name="....
>     .....
>     <field name="foo" lang="en">My spam egg bars baz.</field>
> </doc>
>
> Is something like this possible with Solr?

You can pass extra args to a factory in the field-type definition, but
that means you would need a separate field-type per language.
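
A rough sketch of why that leads to one field-type per language,
using plain Java stand-ins rather than the real Solr factory classes.
StemmingFilterFactorySketch, its init(Map) method, the "language"
argument and the "text_<lang>" type names are all assumptions made
for illustration:

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a filter factory; not the Solr API.
final class StemmingFilterFactorySketch {
    private String language = "none";

    // Receives the extra arguments given in a field-type definition.
    // Called once at configuration time, not once per document.
    void init(Map<String, String> args) {
        language = args.getOrDefault("language", "none");
    }

    String language() {
        return language;
    }
}

class FieldTypeArgsDemo {
    // Builds one configured factory per field type. Since the language is
    // fixed at init time, every language needs its own field type.
    static Map<String, StemmingFilterFactorySketch> buildFieldTypes(String... languages) {
        Map<String, StemmingFilterFactorySketch> types = new LinkedHashMap<>();
        for (String lang : languages) {
            Map<String, String> args = new HashMap<>();
            args.put("language", lang);
            StemmingFilterFactorySketch factory = new StemmingFilterFactorySketch();
            factory.init(args);
            types.put("text_" + lang, factory);
        }
        return types;
    }

    public static void main(String[] args) {
        Map<String, StemmingFilterFactorySketch> types = buildFieldTypes("en", "tr");
        for (Map.Entry<String, StemmingFilterFactorySketch> e : types.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().language());
        }
    }
}
```

With a setup like this, the document would carry the language in the
choice of field (say, one field per language mapped to the matching
type) instead of in a per-field attribute.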

-Yonik