Posted to solr-user@lucene.apache.org by Fuu <ju...@gmail.com> on 2012/08/16 15:54:17 UTC

How to index multivalued field tokens by their attached metadata?

Hello all Solrians.

I'm fairly new to Solr, having only played with it for about a month now.
I'm working with the Solr 4.0.0-Alpha release, trying to figure out a proper
approach to an indexing problem, but the methods I've come up with are not
panning out. Below I describe the problem and my three attempts at solving
it. I hope someone here has had similar issues and solved them, or can tell
me that my current approaches are no good. :)

Problem: I have a dataset that consists of email-type documents. From these
documents I need to extract certain tokens, attach meta information to each
token, and then make the tokens searchable based on that attached meta
information. If it works, I could search the index for tokens that appeared
in a document created within a certain date range, or filter on any other
metadata attached to the token in the same way.

Basically: for each document on disk => X extracted token-based documents
in the index.
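To make that mapping concrete, here is a minimal sketch in Python; the
token pattern and the metadata field names (created, source_id) are made up
for illustration, not my real extraction rule:

```python
import re

# Hypothetical extraction rule; the real pattern is more involved.
TOKEN_PATTERN = re.compile(r"#(\w+)")

def split_into_token_docs(source_doc):
    """For one source document, yield one index document per extracted
    token, carrying along the metadata I want to search by."""
    for token in TOKEN_PATTERN.findall(source_doc["body"]):
        yield {
            "id": "%s_%s" % (source_doc["id"], token),
            "token": token,
            "created": source_doc["created"],  # metadata to filter on later
            "source_id": source_doc["id"],
        }

doc = {"id": "mail42", "created": "2012-08-16", "body": "see #foo and #bar"}
token_docs = list(split_into_token_docs(doc))
print(token_docs)
```

So one document on disk with two matching tokens becomes two searchable
documents in the index.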

Attempt one: As a starting point I first used a PatternTokenizer to get the
tokens that I want, so each indexed document would now have a multivalued
field of tokens. I then wrote a TokenFilter that attached the metadata to
each token as a payload. I tried searching by payload and discovered it only
worked if I used the token itself as the search parameter. Apparently
searching by keywords inside a token's payload is not implemented yet?

Attempt two: I read about UpdateRequestProcessors and processor chains, and
tried writing a processor that would take in a document, check whether it
has a field containing my tokens (extracted using the TokenFilter from the
first approach), and then hand out each token as a separate document to the
next processor. I couldn't figure out how to do this; apparently once you
call super.processAdd() it moves on to the next incoming document, rather
than letting me insert a new document for each token of the current one.
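In other words, the behaviour I was after is roughly this (illustrated in
Python with a stub standing in for the next processor in the chain; the
real API is Solr's Java UpdateRequestProcessor, and the field names are
just placeholders):

```python
class RecordingProcessor:
    """Stand-in for the next UpdateRequestProcessor in the chain;
    it simply records every document it is asked to add."""
    def __init__(self):
        self.added = []

    def process_add(self, doc):
        self.added.append(doc)

def process_add_splitting(doc, next_processor):
    """What I wanted my custom processor to do: instead of forwarding
    the incoming document once, forward one child document per token."""
    for token in doc.get("tokens", []):
        next_processor.process_add({
            "id": "%s_%s" % (doc["id"], token),
            "token": token,
            "created": doc["created"],
        })

nxt = RecordingProcessor()
process_add_splitting(
    {"id": "m1", "created": "2012-08-16", "tokens": ["foo", "bar"]}, nxt)
print(len(nxt.added))  # one add per token
```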

Attempt 3: Use a Lucene IndexWriter directly from the custom
UpdateRequestProcessor to write the created meta-token documents to a
separate index. As a concept it should work, but how would this second
index conform to its Solr schema if I write data to the index directly? I
assume I would configure it as a second core with its own schema and search
parameters. Could Solr still query that index normally?

As you can see, I'm a little bit at a loss on how to implement this. Are
all of the above approaches bad? Have I misunderstood one of them, and it
should actually work? I could go back to basics and write a document
processor in Python to do all the parsing, pattern matching, and token
extraction outside of Solr and just feed Solr the documents to index, but
this seems like something Solr should do and I'm just not seeing The Right
Way.

Regards,
Juha
 





--
View this message in context: http://lucene.472066.n3.nabble.com/How-to-index-multivalued-field-tokens-by-their-attached-metadata-tp4001627.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: How to index multivalued field tokens by their attached metadata?

Posted by Fuu <ju...@gmail.com>.
After pondering it for a while I decided to take the advice and write the
processing as a separate program. It will probably be easier to pre-format
the data with a scripting language anyway.

Thank you for taking your time to reply. :)

- Fuu



--
View this message in context: http://lucene.472066.n3.nabble.com/How-to-index-multivalued-field-tokens-by-their-attached-metadata-tp4001627p4002163.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: How to index multivalued field tokens by their attached metadata?

Posted by Jack Krupansky <ja...@basetechnology.com>.
It would help if you could give a simple, contrived example of the kind of 
input you want to process, which tokens you consider special, their metadata, 
and the Solr documents you hope to generate, along with a summarized schema 
definition. But keep it simple at first.

That said, as a general rule, any "complex processing" is best done upstream 
from Solr. Sure, you can do some interesting processing in an update 
handler, including with scripts now, but just don't go overboard.

In fact, you might be better off first writing your processing as a 
standalone program that feeds Solr XML documents to Solr, getting that 
working, and THEN deciding whether and how that processing might be 
integrated more tightly with Solr.

But, start by defining the inputs, processing, and output as requested 
above.
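For example, the feeding side of such a standalone program can be quite
small. Here is a sketch in Python using only the standard library; the
field names are placeholders:

```python
import xml.etree.ElementTree as ET

def to_solr_add_xml(docs):
    """Build a Solr XML update message (<add><doc><field .../>...</add>)
    for a batch of documents, each given as a dict of field -> value."""
    add = ET.Element("add")
    for d in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in d.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")

payload = to_solr_add_xml([{"id": "m1_foo", "token": "foo"}])
print(payload)
```

The resulting string can then be POSTed to the core's /update handler.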

-- Jack Krupansky

-----Original Message----- 
From: Fuu
Sent: Thursday, August 16, 2012 9:54 AM
To: solr-user@lucene.apache.org
Subject: How to index multivalued field tokens by their attached metadata?

[quoted message trimmed]