Posted to solr-user@lucene.apache.org by Greg Smith <au...@gmail.com> on 2010/11/30 10:56:31 UTC

Creating Email Token Filter

Hi,

I have written a plugin to filter on email types and keep only those tokens.
When I run it on the analysis page in the admin, it all works fine.

But when I use the data import handler to import the data and set the field
type, it doesn't remove the other tokens and keeps the field in its original
form.

I have set the query and index analyzers to use the standard tokenizer
factory and my custom email filter only.

What could be causing this issue?

Thanks

Greg

Re: Creating Email Token Filter

Posted by Erick Erickson <er...@gmail.com>.
See below. If this still doesn't make sense, could you show us some
examples?

Best
Erick

On Tue, Nov 30, 2010 at 8:33 AM, Greg Smith <au...@gmail.com> wrote:

> Bernd,
>
> Looking at the search results, the field is populated with all of the
> information regardless of whether there was an email contained in the
> contents.

Right here is what Bernd was talking about. What's returned is the stored,
verbatim text that was input. It is a literal copy. It has not been analyzed,
filtered, or otherwise manipulated. Consider the poor user if this wasn't the
case. You input "The Party is going swimmingly"; would you really want the
user to see "parti go swim"? So the returned data is the literal input.

Which has nothing to do with what's searched. Searching is done against the
analyzed text.
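
To make the stored-versus-indexed split concrete, here is a minimal toy sketch in Python. It is not Solr code: the `ToyIndex` class, the `analyze` function, and the simplistic email pattern are all hypothetical stand-ins, and a whitespace tokenizer is assumed in place of the standard tokenizer. It only illustrates that results come from the stored copy while matching uses the analyzed terms.

```python
import re

# Hypothetical stand-in for the custom email filter: keep only tokens that
# look like email addresses. The pattern is deliberately simplistic.
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")


def analyze(text):
    """Toy analysis chain: whitespace tokenizer, lowercase, email-only filter."""
    tokens = (t.strip(".,;:!?()<>\"'").lower() for t in text.split())
    return [t for t in tokens if EMAIL_RE.match(t)]


class ToyIndex:
    """Keeps the stored value and the indexed terms separately."""

    def __init__(self):
        self.stored = {}    # doc id -> verbatim input (what results show)
        self.postings = {}  # term -> set of doc ids (what search uses)

    def add(self, doc_id, text):
        self.stored[doc_id] = text
        for term in analyze(text):
            self.postings.setdefault(term, set()).add(doc_id)

    def search(self, query):
        hits = set()
        for term in analyze(query):
            hits |= self.postings.get(term, set())
        # Results are built from the stored copy, never from the terms.
        return [self.stored[d] for d in sorted(hits)]
```

In this sketch, a document such as "Contact Greg at greg@example.com about the party." is indexed under the single term `greg@example.com`, yet a matching search still returns the whole original sentence, which is the behaviour Greg is seeing.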


> Would the analysers and tokens be handled differently if using a copy
> field?

They're not. A literal copy of the input is sent to the copy field and the
analysis stack you've defined for that field is used. Do note that the raw
data is sent to the copy field, not the analyzed stream.
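
A rough illustration of that ordering, again as a toy Python sketch and not Solr's implementation: the function names, field names, and the two analysis chains are hypothetical. The point is only that the destination field receives the raw string and then runs its own chain on it.

```python
def keep_emails(text):
    """Hypothetical analysis chain for the copy field: email tokens only."""
    return [t.lower() for t in text.split() if "@" in t]


def keep_words(text):
    """Hypothetical analysis chain for the source field: non-email words."""
    return [t.lower() for t in text.split() if "@" not in t]


def index_document(raw, chains, copy_fields):
    """Return {field: (raw_input, indexed_terms)}.

    chains maps a field name to its analysis function; copy_fields maps a
    source field to the destination fields that receive its raw input.
    """
    values = dict(raw)
    for source, dests in copy_fields.items():
        for dest in dests:
            # The copy happens before analysis: dest gets the raw string,
            # not the terms produced by the source field's chain.
            values[dest] = raw[source]
    return {field: (value, chains[field](value)) for field, value in values.items()}
```

Here the `emails` field would end up with the same raw input as `body`, but with different indexed terms, because each field applies its own chain to the unanalyzed text.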

Try looking at the solr/admin/schema.jsp page (schema browser) to see the
terms, which are the analyzed form of your input for your fields. You might
get some additional mileage out of TermsComponent, see:
http://wiki.apache.org/solr/TermsComponent
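
As a small sketch of how such a terms request might be built, the Python below assembles a URL using the TermsComponent parameters `terms`, `terms.fl`, and `terms.limit`. The host, port, and field name are assumptions for illustration, and it assumes a `/terms` request handler is configured in solrconfig.xml, which may not be the case in every setup.

```python
from urllib.parse import urlencode


def terms_url(base_url, field, limit=20):
    """Build a TermsComponent request URL listing indexed terms for a field."""
    params = {
        "terms": "true",       # enable the component
        "terms.fl": field,     # field whose indexed terms should be listed
        "terms.limit": limit,  # maximum number of terms to return
    }
    return f"{base_url}/terms?{urlencode(params)}"


# Hypothetical host and field name; adjust to your own installation.
print(terms_url("http://localhost:8983/solr", "email_only"))
```

If the email filter works at index time, the terms returned for the field should be email addresses only, even though search results still show the full stored text.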


Re: Creating Email Token Filter

Posted by Greg Smith <au...@gmail.com>.
Bernd,

Looking at the search results, the field is populated with all of the
information regardless of whether there was an email contained in the
contents.

Would the analysers and tokens be handled differently if using a copy
field?

Thanks

On 30 November 2010 10:54, Bernd Fehling <be...@uni-bielefeld.de> wrote:

>
> On 30.11.2010 10:56, Greg Smith wrote:
> > Hi,
> >
> > I have written a plugin to filter on email types and keep only those
> > tokens. When I run it on the analysis page in the admin, it all works
> > fine.
> >
> > But when I use the data import handler to import the data and set the
> > field type, it doesn't remove the other tokens and keeps the field in its
> > original form.
> >
> > I have set the query and index analyzers to use the standard tokenizer
> > factory and my custom email filter only.
> >
> > What could be causing this issue?
> >
>
> It sounds like the misunderstanding I had until the end of last week about
> indexing and storing in Solr/Lucene.
> I also had several tokenizers and filters and thought they weren't working
> except on the admin analysis page.
> As a matter of fact, if they work on the admin analysis page then they work
> :-)
> But you can't see it on the search result page, because the search result
> page always displays the original stored value, _not_ the tokenized or
> filtered indexed value.
> The tokenized/filtered content is indexed, but that is not shown on the
> result page.
> Use the Schema Browser in the admin to check what the indexed content of
> your tokenized/filtered field is.
>
> Best regards
> Bernd
>

Re: Creating Email Token Filter

Posted by Bernd Fehling <be...@uni-bielefeld.de>.
On 30.11.2010 10:56, Greg Smith wrote:
> Hi,
>
> I have written a plugin to filter on email types and keep only those
> tokens. When I run it on the analysis page in the admin, it all works fine.
>
> But when I use the data import handler to import the data and set the field
> type, it doesn't remove the other tokens and keeps the field in its
> original form.
>
> I have set the query and index analyzers to use the standard tokenizer
> factory and my custom email filter only.
>
> What could be causing this issue?
> 

It sounds like the misunderstanding I had until the end of last week about
indexing and storing in Solr/Lucene.
I also had several tokenizers and filters and thought they weren't working
except on the admin analysis page.
As a matter of fact, if they work on the admin analysis page then they work :-)
But you can't see it on the search result page, because the search result
page always displays the original stored value, _not_ the tokenized or
filtered indexed value.
The tokenized/filtered content is indexed, but that is not shown on the
result page.
Use the Schema Browser in the admin to check what the indexed content of your
tokenized/filtered field is.

Best regards
Bernd