Posted to solr-user@lucene.apache.org by Michael _ <so...@gmail.com> on 2009/07/10 14:26:40 UTC

Modifying a stored field after analyzing it?

Hello,
I've got a stored, indexed field that contains some actual text, and some
metainfo, like this:

   one two three four [METAINFO] oneprime twoprime threeprime fourprime

I have written a Tokenizer that skips past the [METAINFO] marker and uses
the last four words as the tokens for the field, mapping to the first four
words.  E.g. "twoprime" is the second token, with startposition=4 and
endposition=8.
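The mapping that Tokenizer performs could be sketched in plain Java like this (class and method names here are illustrative, not my actual Tokenizer code): each word after the marker becomes a token, but carries the character offsets of the corresponding word *before* the marker.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch (not the real Tokenizer): emit the words after the
// marker as tokens, each tagged with the start/end character offsets of the
// corresponding word before the marker, so highlighting lands on the
// visible text.
public class MetaTokenSketch {
    static final String MARKER = " [METAINFO] ";

    /** Returns entries like "twoprime:4-7" (token:startOffset-endOffset). */
    public static List<String> mapTokens(String field) {
        int at = field.indexOf(MARKER);
        String[] plain = field.substring(0, at).split(" ");
        String[] prime = field.substring(at + MARKER.length()).split(" ");
        List<String> tokens = new ArrayList<String>();
        int pos = 0;
        for (int i = 0; i < prime.length && i < plain.length; i++) {
            int start = pos;
            int end = pos + plain[i].length();
            tokens.add(prime[i] + ":" + start + "-" + end);
            pos = end + 1; // skip the single space separator
        }
        return tokens;
    }
}
```

With the example input, "twoprime" carries the offsets of "two", which is what lets the highlighter mark up the original word in the stored text.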

When someone searches for "twoprime", therefore, they get back a highlighted
result like

   one <em>two</em> three ...

This is great and serves my needs, but I hate that I'm storing all that
METAINFO uselessly (there's actually a good deal more than in this
simplified example).  After I've used it to make my tokens, I'd really like
to convert the stored field to just

   one two three four

and store that.

I thought about using an UpdateRequestProcessor to do this, but that happens
*before* the Analyzers run, so if I strip the [METAINFO] there I can't use
it to build my tokens.  I also thought about sending the data in as two
fields, like

   f1: one two three four
   f1_meta: oneprime twoprime threeprime fourprime

but I can't figure out a way for f1's analyzer to grab the stream from
f1_meta.

Is there some clever way that I'm missing to build my token stream outside
of Solr, and store just the original text and index my token stream?

Thanks in advance!

Re: Modifying a stored field after analyzing it?

Posted by solrcoder <so...@gmail.com>.

Shalin Shekhar Mangar wrote:
> 
> Can't you have two fields like this?
> 
> f1 (indexed, not stored) -> one two three four [METAINFO] oneprime
>   twoprime threeprime fourprime
> f2 (not indexed, stored) -> one two three four
> 

Perhaps I don't understand highlighting, but won't that prevent snippets
from returning correctly?

E.g. if someone searches for [twoprime], and f1 is not stored, then there
will be a match on the token "twoprime", but no way to correlate that with
the word "two" in a snippet result.  Or is there?

I have looked at the Highlighter code some and I see that it gets fragments
from all fields... so maybe there's something more complicated going on that
will cause it to correctly return

one <em>two</em> three four

from f2?
-- 
View this message in context: http://www.nabble.com/Modifying-a-stored-field-after-analyzing-it--tp24426623p24427303.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Modifying a stored field after analyzing it?

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Fri, Jul 10, 2009 at 5:56 PM, Michael _ <so...@gmail.com> wrote:

> Hello,
> I've got a stored, indexed field that contains some actual text, and some
> metainfo, like this:
>
>   one two three four [METAINFO] oneprime twoprime threeprime fourprime
>
> I have written a Tokenizer that skips past the [METAINFO] marker and uses
> the last four words as the tokens for the field, mapping to the first four
> words.  E.g. "twoprime" is the second token, with startposition=4 and
> endposition=8.
>
> When someone searches for "twoprime", therefore, they get back a
> highlighted
> result like
>
>   one <em>two</em> three ...
>
> This is great and serves my needs, but I hate that I'm storing all that
> METAINFO uselessly (there's actually a good deal more than in this
> simplified example).  After I've used it to make my tokens, I'd really like
> to convert the stored field to just
>
>   one two three four
>
> and store that.
>
> I thought about using an UpdateRequestProcessor to do this, but that
> happens
> *before* the Analyzers run, so if I strip the [METAINFO] there I can't use
> it to build my tokens.  I also thought about sending the data in as two
> fields, like
>
>   f1: one two three four
>   f1_meta: oneprime twoprime threeprime fourprime
>
> but I can't figure out a way for f1's analyzer to grab the stream from
> f1_meta.
>
> Is there some clever way that I'm missing to build my token stream outside
> of Solr, and store just the original text and index my token stream?
>

Can't you have two fields like this?

f1 (indexed, not stored) -> one two three four [METAINFO] oneprime twoprime
threeprime fourprime
f2 (not indexed, stored) -> one two three four
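The feed-side transform is simple enough — a sketch, assuming the client strips the marker before populating f2 (names here are illustrative):

```java
// Sketch of the client-side split: send the full string to f1 for indexing,
// and only the text before the marker to f2 for storage/display.
public class FieldSplitter {
    static final String MARKER = " [METAINFO] ";

    /** The value to store in f2: everything before the marker. */
    public static String visibleText(String raw) {
        int at = raw.indexOf(MARKER);
        return at < 0 ? raw : raw.substring(0, at);
    }
}
```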

-- 
Regards,
Shalin Shekhar Mangar.

Re: Modifying a stored field after analyzing it?

Posted by solrcoder <so...@gmail.com>.

markrmiller wrote:
> 
> Yonik's patch makes it so that you can supply the TokenStream straight to
> the field and still store an *independent* text value in a stored field.
> When building the Lucene Document, when adding the field, you would add
> the
> raw TokenStream and then use setValue to set the stored text.
> 

Ah! I got it.  That's great.  Thanks for the explanation.

So all that's left is to sit back and hope this makes it into Solr 1.4 :)
-- 
View this message in context: http://www.nabble.com/Modifying-a-stored-field-after-analyzing-it--tp24426623p24500495.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Modifying a stored field after analyzing it?

Posted by Mark Miller <ma...@gmail.com>.
On Fri, Jul 10, 2009 at 3:42 PM, solrcoder <so...@gmail.com> wrote:

>
>
> markrmiller wrote:
> >
> > When you specify a custom UpdateProcessor chain, you will normally make
> > the
> > RunUpdateProcessor the last processor in the chain, as it will add the
> doc
> > to Solr.
> > Rather than using the built in RunUpdateProcessor though, you could
> simply
> > specify your own UpdateProcessor as the last one.
> >
>
> So, to make sure I understand you:
>
> 1) As of today, if I were to drop in a custom UpdateRequestProcessor that
> modeled RunUpdateProcessor but did some Document modification, it wouldn't
> help, because today Document fields can't support stored form tokenizing.
> Modifying the fields would just strip the data that the tokenizer would
> need
> to index properly.
>
> 2) The patch Yonik submitted, which I read and poorly understood out of
> context, will allow tokenization of the stored form in addition to the
> indexed form, so that from an input text "A" I can produce stored form "B"
> and indexed form "C".
>
> Yes?
>
> Again, I didn't understand the patch well, but it looked to me like it only
> provided the ability to say "the tokenizer I'm using on the indexed form
> should be used on the stored form as well."  However, I'll actually need
> *separate* tokenization -- the field
>
>   one two three four [MARKER] oneprime twoprime threeprime fourprime
>
> essentially needs the first part stripped for indexing, and the second part
> stripped for storing.  Once Yonik's patch goes live, how would I tell my
> tokenizer to behave differently for the stored form vs the indexed form?
>
> I'm sure I'm missing something; sorry for the confusion.
> --
> View this message in context:
> http://www.nabble.com/Modifying-a-stored-field-after-analyzing-it--tp24426623p24429917.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>
Yonik's patch makes it so that you can supply the TokenStream straight to
the field and still store an *independent* text value in a stored field.
When building the Lucene Document, when adding the field, you would add the
raw TokenStream and then use setValue to set the stored text.
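In code, that would look roughly like the following — a sketch only, since the exact API depends on the version of LUCENE-1699 that lands, and `doc` and `myPrimeTokenStream` are assumed to exist:

```java
// Sketch against post-LUCENE-1699 Lucene (API details may differ).
// The stored value and the indexed token stream are set independently.
Field f = new Field("f1", "one two three four",        // stored text only
                    Field.Store.YES, Field.Index.ANALYZED);
f.setTokenStream(myPrimeTokenStream); // tokens built outside the analyzer
doc.add(f);
```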

-- 
-- 
- Mark

http://www.lucidimagination.com

Re: Modifying a stored field after analyzing it?

Posted by solrcoder <so...@gmail.com>.

markrmiller wrote:
> 
> When you specify a custom UpdateProcessor chain, you will normally make
> the
> RunUpdateProcessor the last processor in the chain, as it will add the doc
> to Solr.
> Rather than using the built in RunUpdateProcessor though, you could simply
> specify your own UpdateProcessor as the last one.
> 

So, to make sure I understand you:

1) As of today, if I were to drop in a custom UpdateRequestProcessor that
modeled RunUpdateProcessor but did some Document modification, it wouldn't
help, because today Document fields can't support stored form tokenizing. 
Modifying the fields would just strip the data that the tokenizer would need
to index properly.

2) The patch Yonik submitted, which I read and poorly understood out of
context, will allow tokenization of the stored form in addition to the
indexed form, so that from an input text "A" I can produce stored form "B"
and indexed form "C".

Yes?

Again, I didn't understand the patch well, but it looked to me like it only
provided the ability to say "the tokenizer I'm using on the indexed form
should be used on the stored form as well."  However, I'll actually need
*separate* tokenization -- the field

   one two three four [MARKER] oneprime twoprime threeprime fourprime

essentially needs the first part stripped for indexing, and the second part
stripped for storing.  Once Yonik's patch goes live, how would I tell my
tokenizer to behave differently for the stored form vs the indexed form?

I'm sure I'm missing something; sorry for the confusion.
-- 
View this message in context: http://www.nabble.com/Modifying-a-stored-field-after-analyzing-it--tp24426623p24429917.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Modifying a stored field after analyzing it?

Posted by Mark Miller <ma...@gmail.com>.
On Fri, Jul 10, 2009 at 2:02 PM, solrcoder <so...@gmail.com> wrote:

>
>
> markrmiller wrote:
> >
> > Coming soon. First step was here:
> > http://issues.apache.org/jira/browse/LUCENE-1699
> > Trunk doesn't have that version of Lucene yet though (I believe that's
> > still the case).
> >
> > Replacing the RunUpdateProcessor gives you full control of the Lucene
> > document creation.
> >
>
> Is "replacing the RunUpdateProcessor" something *I* would do with a custom
> subclass, or something that the devs are merging into trunk?  I couldn't
> find many docs about it on the web.


When you specify a custom UpdateProcessor chain, you will normally make the
RunUpdateProcessor the last processor in the chain, as it will add the doc
to Solr.
Rather than using the built in RunUpdateProcessor though, you could simply
specify your own UpdateProcessor as the last one. Take a look at the
RunUpdateProcessor code - it's pretty simple. On update it does:

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    cmd.doc = DocumentBuilder.toDocument(cmd.getSolrInputDocument(), req.getSchema());
    updateHandler.addDoc(cmd);
    super.processAdd(cmd);
  }

That is where the Lucene Document is created, and so you can use this plugin
hook (a custom update chain) to create it however you want (though you
obviously need to model what you do on DocumentBuilder.toDocument).
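Registering such a chain in solrconfig.xml looks roughly like this (the factory class name is hypothetical):

```xml
<!-- solrconfig.xml: a custom chain whose last processor builds the Document
     itself instead of delegating to RunUpdateProcessorFactory -->
<updateRequestProcessorChain name="customDocChain">
  <processor class="com.example.CustomDocBuildingProcessorFactory"/>
</updateRequestProcessorChain>
```

The chain is then selected per request (or made the default), and your custom factory's processor replaces RunUpdateProcessor as the step that calls updateHandler.addDoc.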


>
>
> Also, any idea if "soon" means in the Solr 1.4 release?


Dunno - but I think so. If all it needs is the Lucene update, then definitely -
and I think you can probably do it with just that. But there may still be a
gotcha to resolve on the Solr end - I'm not 100% sure at the moment.


>
>
> Thanks for the heads up!
> --
> View this message in context:
> http://www.nabble.com/Modifying-a-stored-field-after-analyzing-it--tp24426623p24428208.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


-- 
-- 
- Mark

http://www.lucidimagination.com

Re: Modifying a stored field after analyzing it?

Posted by solrcoder <so...@gmail.com>.

markrmiller wrote:
> 
> Coming soon. First step was here:
> http://issues.apache.org/jira/browse/LUCENE-1699
> Trunk doesn't have that version of Lucene yet though (I believe that's
> still the case).
> 
> Replacing the RunUpdateProcessor gives you full control of the Lucene
> document creation.
> 

Is "replacing the RunUpdateProcessor" something *I* would do with a custom
subclass, or something that the devs are merging into trunk?  I couldn't
find many docs about it on the web.  

Also, any idea if "soon" means in the Solr 1.4 release?

Thanks for the heads up!
-- 
View this message in context: http://www.nabble.com/Modifying-a-stored-field-after-analyzing-it--tp24426623p24428208.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Modifying a stored field after analyzing it?

Posted by Mark Miller <ma...@gmail.com>.
>
> Is there some clever way that I'm missing to build my token stream outside
> of Solr, and store just the original text and index my token stream?
>
>
Coming soon. First step was here:
http://issues.apache.org/jira/browse/LUCENE-1699
Trunk doesn't have that version of Lucene yet though (I believe that's still
the case).

Replacing the RunUpdateProcessor gives you full control of the Lucene
document creation.

-- 
-- 
- Mark

http://www.lucidimagination.com