Posted to solr-user@lucene.apache.org by Ryan McKinley <ry...@gmail.com> on 2007/01/22 03:17:36 UTC

Split one string into many fields

Is there an easy way to split a string into a multi-valued field on the server?

given:
<add>
 <doc>
  <field name="subject">subject1; subject2; subject- 3</field>
 </doc>
</add>

I would like:
<add>
 <doc>
  <field name="subject">subject1</field>
  <field name="subject">subject2</field>
  <field name="subject">subject- 3</field>
 </doc>
</add>

Thanks for any pointers

ryan
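
The question above asks for a server-side split. As a point of comparison only, here is a minimal client-side sketch that does the split before the document is sent. It assumes the values are separated by ";" and that surrounding whitespace should be trimmed. The helper names split_values and to_add_xml are made up for illustration and are not part of Solr.

```python
from xml.sax.saxutils import escape, quoteattr


def split_values(raw, delimiter=";"):
    """Split a delimited string into trimmed, non-empty values."""
    return [part.strip() for part in raw.split(delimiter) if part.strip()]


def to_add_xml(field_name, raw, delimiter=";"):
    """Build an <add><doc> message with one <field> element per value."""
    fields = "\n".join(
        "  <field name=%s>%s</field>" % (quoteattr(field_name), escape(value))
        for value in split_values(raw, delimiter)
    )
    return "<add>\n <doc>\n%s\n </doc>\n</add>" % fields


# Hypothetical usage with the example string from the post.
print(to_add_xml("subject", "subject1; subject2; subject- 3"))
```

Splitting on the client keeps the server untouched, at the cost of every client having to agree on the delimiter.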

Re: Split one string into many fields

Posted by Ryan McKinley <ry...@gmail.com>.

looks like we won't save the discussion for later :)


>
> At this point though, I can't for the life of me remember what Ryan said to
> convince me that it made sense to have a DocumentParser concept that
> UpdateHandlers could delegate to -- as opposed to the UpdateHandler doing
> it directly :)
>

We were discussing a handler that crawls an svn repository and another
that may accept a single file.  They should be able to share the logic
of parsing a single ContentStream into a Document.

Essentially, I was suggesting making a standard DocumentHandler
framework (like the one in LIA that gets pointed to at least once a
week for people wondering how to parse XML/PDF/TXT/etc into a Lucene
Document)
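
A rough sketch of that sharing idea, for illustration only: two handlers delegate to one parsing step chosen by content type. Every name here (TextParser, PARSERS, parse_stream, handle_single_file, handle_repository) is hypothetical and not a Solr or Lucene API, and a plain dict stands in for a Document.

```python
class TextParser:
    """Hypothetical parser: turns one raw content stream into a field dict."""

    content_type = "text/plain"

    def parse(self, name, data):
        return {"id": name, "text": data.decode("utf-8")}


# Hypothetical registry mapping a content type to the parser that handles it.
PARSERS = {TextParser.content_type: TextParser()}


def parse_stream(name, content_type, data):
    """Shared parsing step that every handler delegates to."""
    parser = PARSERS.get(content_type)
    if parser is None:
        raise ValueError("no parser registered for %s" % content_type)
    return parser.parse(name, data)


def handle_single_file(name, content_type, data):
    """Handler that accepts one uploaded file."""
    return [parse_stream(name, content_type, data)]


def handle_repository(files):
    """Handler that walks many files, e.g. the result of crawling a repository.

    `files` is an iterable of (name, content_type, data) tuples.
    """
    return [parse_stream(name, ctype, data) for name, ctype, data in files]
```

Adding a PDF or XML parser would then mean registering one more entry, with neither handler changing.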

With SOLR-104, this will be straightforward to implement.  I totally
agree it probably belongs in a 'tools' or 'plugins' directory along
with other things that are useful, but not the focus of Solr.

Re: Split one string into many fields

Posted by Chris Hostetter <ho...@fucit.org>.

: > ...When we get to it, I'd like to hear why it (things like PDF parsing)
: > should be inside Solr rather than outside using our update interfaces....
:
: Same here.

I wouldn't say that I think it *should* be inside of Solr, just that it
*could* be inside of Solr.  The use case I imagine is an operation in
which multiple clients all want to index PDF files according to some
custom rules that map pieces of the files to fields in your schema ... if
they have to send Solr XML data listing all the field=value pairs, then
they all have to not only load the same PDF parsing library, but also
share the same biz logic needed to understand what kinds of Solr XML
documents to produce and send to the server.

If you let people write their own PDFMUpdateHandler then all of those
clients can POST (or upload, or refer to via URL) the raw PDF file, and
the extraction logic is in one place.


At this point though, I can't for the life of me remember what Ryan said to
convince me that it made sense to have a DocumentParser concept that
UpdateHandlers could delegate to -- as opposed to the UpdateHandler doing
it directly :)

: I haven't had time to follow the recent (rich) design discussions
: about this stuff, but if I was designing this, I'd put all the
: document processing code in a separate module (separate servlet?) and

never fret ... I too want to keep Solr lean.  The idea (in my mind anyway)
is that there are very few out of the box UpdateHandlers (one for XML, one
for CSV, probably one for JDBC) but that there could be lots of contrib
style Updaters that know how to deal with different exotic document types,
which users could load if they wanted to.




-Hoss