Posted to solr-user@lucene.apache.org by Tod <li...@gmail.com> on 2010/10/28 17:53:44 UTC

Overriding Tika's field processing

I'm reading my document data from a CMS and indexing it using calls to 
curl.  The curl call includes 'stream.url' so Tika will also index the 
actual document pointed to by the CMS' stored url.  This works fine.

On the presentation side I have a dropdown with the titles of all the 
indexed documents; when a user clicks one, it opens in a new window.  
Using JavaScript, I've been parsing the JSON returned from Solr to 
create the dropdown.  The problem is I can't get the titles sorted 
alphabetically.

If I use a facet.sort on the title field I get back ALL the sorted 
titles in the facet block, but that doesn't include the associated 
URLs.  A sorted query won't work because title is a multivalued field.
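
To make the pairing problem concrete, here is a minimal sketch of sorting on the client instead, so each title stays attached to its URL. It assumes the standard Solr JSON response layout (response.docs); the field name 'url', the sample documents, and the choice to display the first title value are assumptions for illustration, not something confirmed in this thread.

```python
import json


def sorted_title_url_pairs(solr_json, title_field="title", url_field="url"):
    """Return (title, url) pairs sorted alphabetically by title, ignoring case."""
    docs = json.loads(solr_json)["response"]["docs"]
    pairs = []
    for doc in docs:
        titles = doc.get(title_field) or []
        # A multivalued field comes back as a list, a single-valued one as a string.
        if isinstance(titles, str):
            titles = [titles]
        if not titles or url_field not in doc:
            continue  # nothing usable for the dropdown
        # Assumption: the first value is the one to display.
        pairs.append((titles[0], doc[url_field]))
    return sorted(pairs, key=lambda pair: pair[0].casefold())


# Hypothetical response body, shaped like Solr's wt=json output.
sample = json.dumps({"response": {"numFound": 3, "start": 0, "docs": [
    {"title": ["zebra report", "Untitled"], "url": "http://cms.example/z.pdf"},
    {"title": ["Annual Plan"], "url": "http://cms.example/a.doc"},
    {"title": "budget", "url": "http://cms.example/b.xls"},
]}})

for title, url in sorted_title_url_pairs(sample):
    print(title, url)
```

This only helps if the value you want to show is reliably the first one in the list, which is exactly what the multivalued title makes uncertain.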

The one option I can think of is to make the title single-valued so that 
I have a one-to-one relationship to the returned URL.  To do that I'd 
need to be able to *not* index the Tika-returned values.

If I read the docs right, I could use 'literal.title' in the curl call 
to limit what would be included in the index from Tika.  That doesn't 
seem to be working: a test facet query returns more titles than I have 
in the CMS.
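
For reference, the kind of request being described can be sketched as below. It only builds the URL that would be handed to curl. The base URL, the id 'cms-42', the title, and the document URL are hypothetical placeholders; 'stream.url', 'literal.<fieldname>' and 'commit' are parameters of Solr's extracting handler at /update/extract. Whether the literal value replaces Tika's extracted title or is added next to it is not settled here and may depend on the Solr version and the schema.

```python
from urllib.parse import urlencode


def build_extract_url(solr_base, doc_id, title, document_url):
    """Build an /update/extract URL that supplies literal field values."""
    params = [
        ("literal.id", doc_id),        # unique key supplied by the CMS (assumed field name)
        ("literal.title", title),      # title supplied by the CMS
        ("stream.url", document_url),  # document that Tika will fetch and parse
        ("commit", "true"),
    ]
    return solr_base.rstrip("/") + "/update/extract?" + urlencode(params)


# All values below are placeholders for illustration.
url = build_extract_url(
    "http://localhost:8983/solr",
    "cms-42",
    "Annual Plan",
    "http://cms.example/docs/plan.pdf",
)
print(url)
```

Building the query string with urlencode keeps the nested URL in 'stream.url' properly escaped, which is easy to get wrong when assembling the curl command by hand.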

Am I understanding the 'literal.title' processing correctly?  Does 
anybody have experience/suggestions on how to handle this?


Thanks - Tod

Re: Overriding Tika's field processing

Posted by Lance Norskog <go...@gmail.com>.

If you change 'title' to be single-valued, the ExtractingRequestHandler
may or may not override it. I remember a go-round on this problem. But
the handler has code that explicitly checks for single-valued vs.
multi-valued fields.
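
One rough way to see what actually ended up in the index is to look for documents whose title holds more than one value. The sketch below works on the 'docs' list from a Solr JSON response; the field names 'id' and 'title' and the sample data are assumptions for illustration.

```python
def docs_with_extra_titles(docs, title_field="title", id_field="id"):
    """Map document id -> title values for docs that carry more than one title."""
    extra = {}
    for doc in docs:
        value = doc.get(title_field, [])
        # Normalise: a single-valued field is returned as a plain string.
        values = [value] if isinstance(value, str) else list(value)
        if len(values) > 1:
            extra[doc.get(id_field)] = values
    return extra


# Hypothetical documents, as they might appear under response.docs.
docs = [
    {"id": "cms-1", "title": ["Annual Plan", "Microsoft Word - plan.doc"]},
    {"id": "cms-2", "title": ["Budget"]},
    {"id": "cms-3", "title": "Minutes"},
]

for doc_id, titles in docs_with_extra_titles(docs).items():
    print(doc_id, titles)
```

If documents show up here, the extracted title is being stored in addition to the supplied one rather than being replaced by it.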

And this may all be different in different Solr versions. The
DataImportHandler has Tika support in 3.x and trunk, and the DIH gives
a lot more control over which field gets which value.

-- 
Lance Norskog
goksron@gmail.com