Posted to solr-user@lucene.apache.org by Tod <li...@gmail.com> on 2011/06/27 17:13:16 UTC

Default schema - 'keywords' not multivalued

This was a little curious to me and I wondered what the thought process 
was behind it before I decide to change it.


Thanks - Tod

Re: Default schema - 'keywords' not multivalued

Posted by Chris Hostetter <ho...@fucit.org>.
: The problem with TikaEntityProcessor is this installation is still running
: v1.4.1 so I'll need to upgrade.
: 
: Any short and sweet instructions for upgrading to 3.2?  I have a pretty
: straightforward Tomcat install, would just dropping in the new war suffice?

It should be fairly straightforward; check the instructions in
CHANGES.txt for any potential gotchas.

I posted a writeup a while back on upgrading from 1.4 to 3.1 from a user
perspective...

http://www.lucidimagination.com/blog/2011/04/01/solr-powered-isfdb-part-8/



-Hoss

Re: Default schema - 'keywords' not multivalued

Posted by Tod <li...@gmail.com>.
On 06/28/2011 12:04 PM, Chris Hostetter wrote:
>
> : I'm streaming over the document content (presumably via Tika) and it's
> : gathering the document's metadata which includes the keywords metadata field.
> : Since I'm also passing that field from the DB to the REST call as a list (as
> : you suggested) there is a collision because the keywords field is single
> : valued.
> :
> : I can change this behavior using a copy field.  What I wanted to know is if
> : there was a specific reason the default schema defined a field like keywords
> : single valued so I could make sure I wasn't missing something before I changed
> : things.
>
> That file is just an example, you're absolutely free to change it to meet
> your use case.
>
> I'm not very familiar with Tika, but based on the comment in the example
> config...
>
>     <!-- Common metadata fields, named specifically to match up with
>       SolrCell metadata when parsing rich documents such as Word, PDF.
>       Some fields are multiValued only because Tika currently may return
>       multiple values for them.
>     -->
>
> ...I suspect it was intentional that the field is *not* multiValued (I
> guess Tika always returns a single delimited value?), but if you have
> multiple discrete values you want to send for your DB-backed data, there is
> no downside to changing that.
>
> : While I'm at it, I'd REALLY like to know how to use DIH to index the metadata
> : from the database while simultaneously streaming over the document content and
> : indexing it.  I've never quite figured it out yet but I have to believe it is
> : a possibility.
>
> There's a TikaEntityProcessor that can be used to have Tika crunch the
> data that comes from an "entity" and extract out specific fields, and it
> can be used in combination with a JdbcDataSource and a BinFileDataSource
> so that a field in your db data specifies the name of a file on disk to
> use as the TikaEntity -- but I've personally never tried it.
>
> Here's a simple example someone posted last year that they got working...
>
> http://lucene.472066.n3.nabble.com/TikaEntityProcessor-not-working-td856965.html
>
>
>
> -Hoss
>

Thanks Hoss, I'll just change the schema then.

The problem with TikaEntityProcessor is this installation is still 
running v1.4.1 so I'll need to upgrade.

Any short and sweet instructions for upgrading to 3.2?  I have a pretty
straightforward Tomcat install; would just dropping in the new war suffice?


- Tod

Re: Default schema - 'keywords' not multivalued

Posted by Chris Hostetter <ho...@fucit.org>.
: I'm streaming over the document content (presumably via Tika) and it's
: gathering the document's metadata which includes the keywords metadata field.
: Since I'm also passing that field from the DB to the REST call as a list (as
: you suggested) there is a collision because the keywords field is single
: valued.
: 
: I can change this behavior using a copy field.  What I wanted to know is if
: there was a specific reason the default schema defined a field like keywords
: single valued so I could make sure I wasn't missing something before I changed
: things.

That file is just an example, you're absolutely free to change it to meet 
your use case.

I'm not very familiar with Tika, but based on the comment in the example 
config...

   <!-- Common metadata fields, named specifically to match up with
     SolrCell metadata when parsing rich documents such as Word, PDF.
     Some fields are multiValued only because Tika currently may return
     multiple values for them.
   -->

...I suspect it was intentional that the field is *not* multiValued (I
guess Tika always returns a single delimited value?), but if you have
multiple discrete values you want to send for your DB-backed data, there is
no downside to changing that.

: While I'm at it, I'd REALLY like to know how to use DIH to index the metadata
: from the database while simultaneously streaming over the document content and
: indexing it.  I've never quite figured it out yet but I have to believe it is
: a possibility.

There's a TikaEntityProcessor that can be used to have Tika crunch the 
data that comes from an "entity" and extract out specific fields, and it 
can be used in combination with a JdbcDataSource and a BinFileDataSource 
so that a field in your db data specifies the name of a file on disk to 
use as the TikaEntity -- but I've personally never tried it.

Here's a simple example someone posted last year that they got working...

http://lucene.472066.n3.nabble.com/TikaEntityProcessor-not-working-td856965.html



-Hoss

Re: Default schema - 'keywords' not multivalued

Posted by Tod <li...@gmail.com>.
On 06/27/2011 11:23 AM, lee carroll wrote:
> Hi Tod,
> A list of keywords would be fine in a non-multi-valued field:
>
> keywords : "xxx yyy sss aaa qqqq"
>
> A multi-valued field would allow you to repeat the field when indexing:
>
> keywords: "xxx"
> keywords: "yyy"
> keywords: "sss"
> etc


Thanks Lee. The problem is I'm manually pushing a document (via
stream.url) and its metadata from a database with the Solr 
/update/extract REST service, HTTP GET, using Perl.

I'm streaming over the document content (presumably via Tika) and it's
gathering the document's metadata which includes the keywords metadata 
field.  Since I'm also passing that field from the DB to the REST call 
as a list (as you suggested) there is a collision because the keywords 
field is single valued.
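
A minimal sketch of the kind of request described above, written in Python
rather than Perl for brevity. The host, port, document URL, id and keyword
values are placeholders invented for illustration; the sketch assumes the
Solr Cell parameters stream.url and literal.<fieldname>, with one
literal.keywords parameter per keyword taken from the database.

```python
from urllib.parse import urlencode, parse_qs, urlsplit


def build_extract_url(solr_base, doc_url, doc_id, keywords):
    """Build a GET URL for the /update/extract handler (Solr Cell)."""
    params = [
        ("stream.url", doc_url),  # Solr fetches the document content itself
        ("literal.id", doc_id),   # literal.* values are passed through as given
        ("commit", "true"),
    ]
    # Repeat literal.keywords once per keyword coming from the database.
    params += [("literal.keywords", kw) for kw in keywords]
    return solr_base.rstrip("/") + "/update/extract?" + urlencode(params)


# Placeholder values, not taken from the real installation.
url = build_extract_url(
    "http://localhost:8080/solr",
    "http://example.com/docs/report.pdf",
    "doc-1",
    ["xxx", "yyy", "sss"],
)
print(url)
# Show the repeated parameter as it would arrive on the server side.
print(parse_qs(urlsplit(url).query)["literal.keywords"])
```

Each literal.keywords value, plus whatever keywords metadata is extracted
from the document itself, ends up targeting the same keywords field, which
is where the single-valued definition gets in the way.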

I can change this behavior using a copy field.  What I wanted to know is 
if there was a specific reason the default schema defined a field like 
keywords single valued so I could make sure I wasn't missing something 
before I changed things.

While I'm at it, I'd REALLY like to know how to use DIH to index the 
metadata from the database while simultaneously streaming over the 
document content and indexing it.  I've never quite figured it out yet 
but I have to believe it is a possibility.


- Tod

Re: Default schema - 'keywords' not multivalued

Posted by lee carroll <le...@googlemail.com>.
Hi Tod,
A list of keywords would be fine in a non-multi-valued field:

keywords : "xxx yyy sss aaa qqqq"

A multi-valued field would allow you to repeat the field when indexing:

keywords: "xxx"
keywords: "yyy"
keywords: "sss"
etc
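
As a rough illustration of the difference, here is a Python sketch that
builds both shapes as Solr XML update messages. The ids and the helper name
solr_doc are made up for the example; the only point is that the
multi-valued form repeats the <field name="keywords"> element once per
value, while the single-valued form carries one delimited string.

```python
import xml.etree.ElementTree as ET


def solr_doc(doc_id, keyword_values):
    """Return an <add><doc> message with one keywords field per value."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    ET.SubElement(doc, "field", name="id").text = doc_id
    for value in keyword_values:
        ET.SubElement(doc, "field", name="keywords").text = value
    return add


# Single-valued style: one field holding a space-delimited list.
single = solr_doc("doc-1", ["xxx yyy sss aaa qqqq"])
# Multi-valued style: the field is repeated, one value each.
multi = solr_doc("doc-2", ["xxx", "yyy", "sss"])

print(ET.tostring(single, encoding="unicode"))
print(ET.tostring(multi, encoding="unicode"))
```

The second message is only accepted if keywords is declared multiValued in
the schema.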


On 27 June 2011 16:13, Tod <li...@gmail.com> wrote:
> This was a little curious to me and I wondered what the thought process was
> behind it before I decide to change it.
>
>
> Thanks - Tod
>