Posted to solr-user@lucene.apache.org by Ryan McKinley <ry...@gmail.com> on 2007/01/22 03:17:36 UTC

Split one string into many fields

Is there any easy way to split a string into a multi-field on the server:

given:
<add>
<doc>
 <field name="subject">subject1; subject2; subject- 3</field>
</doc>
</add>

I would like:
<add>
<doc>
 <field name="subject">subject1</field>
 <field name="subject">subject2</field>
 <field name="subject">subject- 3</field>
</doc>
</add>

Thanks for any pointers

ryan

Re: Split one string into many fields

Posted by Ryan McKinley <ry...@gmail.com>.
looks like we won't save the discussion for later :)


>
> At this point though, I can't for the life of me remember what Ryan said to
> convince me that it made sense to have a DocumentParser concept that
> UpdateHandlers could delegate to -- as opposed to the UpdateHandler doing
> it directly :)
>

We were discussing a handler that crawls an svn repository and another
that may accept a single file.  They should be able to share the logic
of parsing a single ContentStream into a Document.

Essentially, I was suggesting making a standard DocumentHandler
framework (like the one in LIA that gets pointed to at least once a
week for people wondering how to parse XML/PDF/TXT/etc. into a Lucene
Document).

With SOLR-104, this will be straightforward to implement.  I totally
agree it probably belongs in a 'tools' or 'plugins' directory along
with other things that are useful, but not the focus of solr.

Re: Split one string into many fields

Posted by Chris Hostetter <ho...@fucit.org>.
: > ...When we get to it, I'd like to hear why it (things like PDF parsing)
: > should be inside Solr rather than outside using our update interfaces....
:
: Same here.

I wouldn't say that I think it *should* be inside of Solr, just that it
*could* be inside of Solr.  The use case I imagine is an operation in
which multiple clients all want to index PDF files according to some
custom rules that map pieces of the files to fields in your schema ...
if they have to send Solr XML data listing all the field=value pairs,
then they all have to not only load the same PDF parsing library, but
also share the same biz logic for deciding what kinds of Solr XML
documents to produce and send to the server.

If you let people write their own PDFMUpdateHandler then all of those
clients can POST (or upload, refer via URL) the raw PDF file, and the
extraction logic is in one place.


At this point though, I can't for the life of me remember what Ryan said to
convince me that it made sense to have a DocumentParser concept that
UpdateHandlers could delegate to -- as opposed to the UpdateHandler doing
it directly :)

: I haven't had time to follow the recent (rich) design discussions
: about this stuff, but if I was designing this, I'd put all the
: document processing code in a separate module (separate servlet?) and

never fret ... I too want to keep Solr lean.  The idea (in my mind anyway)
is that there are very few out of the box UpdateHandlers (one for XML, one
for CSV, probably one for JDBC) but that there could be lots of contrib
style Updaters that know how to deal with different exotic document types,
which users could load if they wanted to.




-Hoss


Re: Split one string into many fields

Posted by Bertrand Delacretaz <bd...@apache.org>.
On 1/22/07, Yonik Seeley <yo...@apache.org> wrote:
> ...When we get to it, I'd like to hear why it (things like PDF parsing)
> should be inside Solr rather than outside using our update interfaces....

Same here.

I haven't had time to follow the recent (rich) design discussions
about this stuff, but if I was designing this, I'd put all the
document processing code in a separate module (separate servlet?) and
keep the Solr core lean and mean, with as thin an interface as
possible.

-Bertrand

Re: Split one string into many fields

Posted by Yonik Seeley <yo...@apache.org>.
On 1/21/07, Ryan McKinley <ry...@gmail.com> wrote:
> Deep within the "Update Plugin" discussion, Hoss and I agreed that
> adding an interface and registry for DocumentParsers is a good idea:
>
> interface SolrDocumentParser
> {
>    Document parse(ContentStream content);
> }
>
> SolrDocumentParser parser = core.getDocumentParser("text/html");
>
> This would let update plugins share (pluggable) logic for how to
> convert a single stream into a single document...  this is more than
> we are talking about doing now, but something (else) to keep in mind.

Yes, please, for another day... ;-)

It would be interesting to explore what we could share with Nutch
too... they're in the business of doc parsing.

When we get to it, I'd like to hear why it (things like PDF parsing)
should be inside Solr rather than outside using our update interfaces.

-Yonik

Re: Split one string into many fields

Posted by Ryan McKinley <ry...@gmail.com>.
> >
> > In the case I'm looking at, it would be cleaner and more safe to have
> > it on the server side...
>
> Safer? It precludes adding a subject with a ';' in it...
>

well, in *this* case it is :)

>
> An aside: your need sounds like it's part of that much bigger issue of
> processing documents and splitting them up into multiple fields, or at
> least processing certain fields in a way that can add other fields.

Yes, it is.  I'm working with data that is almost structured, but I'd
like to have some level of validation and reprocessing before sticking
it in solr.  I'll use SOLR-104 as that seems like the right thing.


> I'm not sure what a general solution would look like in that case.
> For example, you might have a field called "mail-headers", and want
> that split up into multiple fields.
>
> Another longer term thing to keep our eye on is UIMA (added to the
> Apache incubator not that long ago).
>

Deep within the "Update Plugin" discussion, Hoss and I agreed that
adding an interface and registry for DocumentParsers is a good idea:

interface SolrDocumentParser
{
   Document parse(ContentStream content);
}

SolrDocumentParser parser = core.getDocumentParser("text/html");

This would let update plugins share (pluggable) logic for how to
convert a single stream into a single document...  this is more than
we are talking about doing now, but something (else) to keep in mind.
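
As a rough illustration of the parser-plus-registry idea above, here is a
minimal, self-contained Java sketch. It is not Solr code: ContentStream and
Document are replaced by plain String and Map stand-ins so that it runs
without Solr or Lucene on the classpath, and the names DocumentParserSketch,
ParserRegistrySketch, register, getDocumentParser and withPlainTextParser
are all assumptions made up for this example.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the proposed SolrDocumentParser interface.
// A String plays the role of ContentStream, a Map the role of Document.
interface DocumentParserSketch {
    Map<String, String> parse(String content);
}

// Hypothetical registry keyed by content type, so that several update
// plugins can share one parser per type.
class ParserRegistrySketch {
    private final Map<String, DocumentParserSketch> parsers = new HashMap<>();

    void register(String contentType, DocumentParserSketch parser) {
        parsers.put(contentType, parser);
    }

    DocumentParserSketch getDocumentParser(String contentType) {
        DocumentParserSketch parser = parsers.get(contentType);
        if (parser == null) {
            throw new IllegalArgumentException("No parser for " + contentType);
        }
        return parser;
    }

    // Builds a registry with one toy parser that stores the whole stream
    // in a single "text" field.
    static ParserRegistrySketch withPlainTextParser() {
        ParserRegistrySketch registry = new ParserRegistrySketch();
        registry.register("text/plain", content -> {
            Map<String, String> doc = new LinkedHashMap<>();
            doc.put("text", content.trim());
            return doc;
        });
        return registry;
    }

    public static void main(String[] args) {
        ParserRegistrySketch registry = withPlainTextParser();
        System.out.println(registry.getDocumentParser("text/plain").parse(" hello "));
    }
}
```

The point of the registry is only that an svn-crawling handler and a
single-file handler could both look up the same parser by content type
instead of each carrying its own copy of the parsing logic.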

Re: Split one string into many fields

Posted by Yonik Seeley <yo...@apache.org>.
On 1/21/07, Ryan McKinley <ry...@gmail.com> wrote:
> > >
> > > I want something that is equivalent to splitting the string on the
> > > client side and filling multiple *fields* not just tokens.
> >
> > Oh, I was talking about indexing only.
> >
>
> aaah.
>
> > Why is it that multiple fields are needed?  Multiple tokens are
> > indistinguishable from multiple fields during search.
> >
>
> When the app displays search results, it shows a list of subjects.
> (from the returned doc list).  That should be split properly.
> (Ideally without knowledge of the schema)
>
>
> > Actually splitting things into different fields normally happens in
> > the client (outside Solr), or in a specialized handler (like CSV, SQL,
> > etc).
> >
>
> In the case I'm looking at, it would be cleaner and more safe to have
> it on the server side...

Safer? It precludes adding a subject with a ';' in it...

Solr currently assumes your data is structured.  Lucene does too... an
analyzer in lucene can't create more fields or take info from one
field and add it to another.

An aside: your need sounds like it's part of that much bigger issue of
processing documents and splitting them up into multiple fields, or at
least processing certain fields in a way that can add other fields.
I'm not sure what a general solution would look like in that case.
For example, you might have a field called "mail-headers", and want
that split up into multiple fields.

Another longer term thing to keep our eye on is UIMA (added to the
Apache incubator not that long ago).

-Yonik

Re: Split one string into many fields

Posted by Ryan McKinley <ry...@gmail.com>.
> >
> > I want something that is equivalent to splitting the string on the
> > client side and filling multiple *fields* not just tokens.
>
> Oh, I was talking about indexing only.
>

aaah.

> Why is it that multiple fields are needed?  Multiple tokens are
> indistinguishable from multiple fields during search.
>

When the app displays search results, it shows a list of subjects.
(from the returned doc list).  That should be split properly.
(Ideally without knowledge of the schema)


> Actually splitting things into different fields normally happens in
> the client (outside Solr), or in a specialized handler (like CSV, SQL,
> etc).
>

In the case I'm looking at, it would be cleaner and more safe to have
it on the server side...

I guess I have to wait for the "Update Plugins" discussion to wind down!

Re: Split one string into many fields

Posted by Yonik Seeley <yo...@apache.org>.
On 1/21/07, Ryan McKinley <ry...@gmail.com> wrote:
> Maybe the name is wrong, but it is something to tell the updateHandler
> to use the tokenizer and filters (normally used for analysis) to
> convert the single field into many fields.
>
> I want something that is equivalent to splitting the string on the
> client side and filling multiple *fields* not just tokens.

Oh, I was talking about indexing only.

Why is it that multiple fields are needed?  Multiple tokens are
indistinguishable from multiple fields during search.

Actually splitting things into different fields normally happens in
the client (outside Solr), or in a specialized handler (like CSV, SQL,
etc).
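
For the client-side route, a minimal sketch in plain Java might look like
the following. It assumes the field name is a simple identifier and that
';' never appears inside a subject; the class and method names
(SolrAddXmlSketch, toAddXml, escape) are invented for this example and are
not part of any Solr client library.

```java
class SolrAddXmlSketch {
    // Escapes the characters that are unsafe in XML text and attributes.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Splits raw on separatorRegex and emits one <field> per non-empty,
    // trimmed value inside a single <add><doc> message.
    static String toAddXml(String fieldName, String raw, String separatorRegex) {
        StringBuilder xml = new StringBuilder("<add>\n<doc>\n");
        for (String piece : raw.split(separatorRegex)) {
            String value = piece.trim();
            if (value.isEmpty()) {
                continue;
            }
            xml.append(" <field name=\"").append(escape(fieldName)).append("\">")
               .append(escape(value)).append("</field>\n");
        }
        return xml.append("</doc>\n</add>").toString();
    }

    public static void main(String[] args) {
        System.out.println(toAddXml("subject", "subject1; subject2; subject- 3", ";"));
    }
}
```

Because each value is sent as its own field, the stored values come back
already split in the returned doc list.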

-Yonik

Re: Split one string into many fields

Posted by Ryan McKinley <ry...@gmail.com>.
On 1/21/07, Yonik Seeley <yo...@apache.org> wrote:
> On 1/21/07, Ryan McKinley <ry...@gmail.com> wrote:
> > Are you suggesting something like this:
> >
> >
> >     <fieldtype name="splitField" class="solr.TextField"
> > sortMissingLast="true" omitNorms="true">
> >       <multi-field>
> >         <tokenizer class="solr.RegexTokenizerFactory" pattern=";" />
> >         <filter class="solr.TrimFilterFactory" />
> >       </multi-field>
> >       <analyzer>
> >         ...
> >       </analyzer>
> >     </fieldtype>
>
> Exactly, except for that <multi-field> bit... what's that?
>

Maybe the name is wrong, but it is something to tell the updateHandler
to use the tokenizer and filters (normally used for analysis) to
convert the single field into many fields.

I want something that is equivalent to splitting the string on the
client side and filling multiple *fields* not just tokens.

or are you suggesting:

<fieldtype name="splitField" class="solr.TextField"
sortMissingLast="true" omitNorms="true">
      <tokenizer class="solr.RegexTokenizerFactory" pattern=";" />
      <filter class="solr.TrimFilterFactory" />

      <analyzer>
        ...
      </analyzer>
    </fieldtype>

Re: Split one string into many fields

Posted by Yonik Seeley <yo...@apache.org>.
On 1/21/07, Ryan McKinley <ry...@gmail.com> wrote:
> Are you suggesting something like this:
>
>
>     <fieldtype name="splitField" class="solr.TextField"
> sortMissingLast="true" omitNorms="true">
>       <multi-field>
>         <tokenizer class="solr.RegexTokenizerFactory" pattern=";" />
>         <filter class="solr.TrimFilterFactory" />
>       </multi-field>
>       <analyzer>
>         ...
>       </analyzer>
>     </fieldtype>

Exactly, except for that <multi-field> bit... what's that?

-Yonik

Re: Split one string into many fields

Posted by Ryan McKinley <ry...@gmail.com>.
Are you suggesting something like this:


    <fieldtype name="splitField" class="solr.TextField"
sortMissingLast="true" omitNorms="true">
      <multi-field>
        <tokenizer class="solr.RegexTokenizerFactory" pattern=";" />
        <filter class="solr.TrimFilterFactory" />
      </multi-field>
      <analyzer>
        ...
      </analyzer>
    </fieldtype>



On 1/21/07, Yonik Seeley <yo...@apache.org> wrote:
> On 1/21/07, Ryan McKinley <ry...@gmail.com> wrote:
> > Is there any easy way to split a string into a multi-field on the server:
>
> From an indexing perspective, yes... just assign a tokenizer that splits on ';'
> I don't think we currently have such a configurable Tokenizer though.
> The (hypothetical) tokenizer could even add a positionIncrement,
> emulating multiple fields exactly from the indexing perspective.  Then
> you could follow it with the newly added TrimFilter to trim
> whitespace.
>
> From the stored field perspective, you get back what you put in.
>
> To be nice and general, perhaps it could be regex based like String.split()
>
> -Yonik
>
> > given:
> > <add>
> > <doc>
> >  <field name="subject">subject1; subject2; subject- 3</field>
> > </doc>
> > </add>
> >
> > I would like:
> > <add>
> > <doc>
> >  <field name="subject">subject1</field>
> >  <field name="subject">subject2</field>
> >  <field name="subject">subject- 3</field>
> > </doc>
> > </add>
> >
> > Thanks for any pointers
> >
> > ryan
> >
>

Re: Split one string into many fields

Posted by Yonik Seeley <yo...@apache.org>.
On 1/21/07, Ryan McKinley <ry...@gmail.com> wrote:
> Is there any easy way to split a string into a multi-field on the server:

From an indexing perspective, yes... just assign a tokenizer that splits on ';'
I don't think we currently have such a configurable Tokenizer though.
The (hypothetical) tokenizer could even add a positionIncrement,
emulating multiple fields exactly from the indexing perspective.  Then
you could follow it with the newly added TrimFilter to trim
whitespace.

From the stored field perspective, you get back what you put in.

To be nice and general, perhaps it could be regex based like String.split()
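
To make the String.split() comparison concrete, here is a small plain-Java
sketch of the splitting behaviour being discussed. It only shows the regex
split plus trim on a string; it is not a Lucene Tokenizer, and the class and
method names (RegexSplitSketch, splitValues) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class RegexSplitSketch {
    // Splits raw on the given regex, trims each piece, and drops pieces
    // that end up empty (for example from a leading or doubled separator).
    static List<String> splitValues(String raw, String separatorRegex) {
        List<String> values = new ArrayList<>();
        for (String piece : raw.split(separatorRegex)) {
            String value = piece.trim();
            if (!value.isEmpty()) {
                values.add(value);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // The separator regex also swallows whitespace around each ';'.
        // Prints: [subject1, subject2, subject- 3]
        System.out.println(splitValues("subject1; subject2; subject- 3", "\\s*;\\s*"));
    }
}
```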

-Yonik

> given:
> <add>
> <doc>
>  <field name="subject">subject1; subject2; subject- 3</field>
> </doc>
> </add>
>
> I would like:
> <add>
> <doc>
>  <field name="subject">subject1</field>
>  <field name="subject">subject2</field>
>  <field name="subject">subject- 3</field>
> </doc>
> </add>
>
> Thanks for any pointers
>
> ryan
>