Posted to solr-user@lucene.apache.org by Rodolico Piero <p....@vitrociset.it> on 2009/12/02 10:45:18 UTC

Indexing file content with custom field

Hi,

I need to index the contents of a file (doc, pdf, etc.) together with a set of
custom metadata specified in XML, like a standard Solr request.
From the documentation I see that I can extract the contents of a file with the
"/update/extract" request (Tika) and index the metadata with a second
"/update" request by posting the XML. How can I do it all in a single
request (without using curl, but using a Java HTTP library or SolrJ)? For
example (although I know this is not correct):

<add>
  <doc>
    <field name="id"></field>
    <field name="myfield-1"></field>
    <field name="myfield-n"></field>
    <field name="content">content of the extracted file (text)</field>
  </doc>
</add>

So I can search either by the metadata or by full text on the content.
Sorry for my English ...

Thanks a lot.

 

Piero

 


Re: Indexing file content with custom field

Posted by Sascha Szott <sz...@zib.de>.
Piero,

it sounds like you're looking for an integration of Solr Cell and Solr's DIH 
facility -- a feature that isn't implemented yet (but the issue is 
already being addressed in SOLR-1358).

As a workaround, you could store the extracted contents in plain text 
files (either by using Solr Cell or by using Apache Tika directly, which 
is what Solr Cell uses under the hood). Afterwards, you could use DIH's 
XPathEntityProcessor (to read the metadata in your XML files) in 
conjunction with DIH's PlainTextEntityProcessor (to read the previously 
created text files). A sketch of the extraction step follows below.
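
To give you an idea, the extraction step could look roughly like the
following (untested sketch using Tika's AutoDetectParser; the file names
are just placeholders and the exact parse() signature varies a bit
between Tika versions):

import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.InputStream;
import java.io.Writer;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class ExtractToTextFile {
    public static void main(String[] args) throws Exception {
        File input = new File("mydoc.pdf");   // binary source document (placeholder name)
        File output = new File("mydoc.txt");  // plain text file for PlainTextEntityProcessor

        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
        Metadata metadata = new Metadata();

        InputStream in = new FileInputStream(input);
        try {
            parser.parse(in, handler, metadata, new ParseContext());
        } finally {
            in.close();
        }

        // write the extracted body text next to the original file
        Writer out = new FileWriter(output);
        try {
            out.write(handler.toString());
        } finally {
            out.close();
        }
    }
}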

Another workaround would be to pass the metadata as literal 
parameters along with the /update/extract request, as described in [1]. 
This would require a small program that parses your XML metadata files 
and constructs and sends the appropriate POST requests; see the SolrJ 
sketch below.
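
With SolrJ, that second workaround might look roughly like this (untested
sketch against the SolrJ 1.4-era API; CommonsHttpSolrServer, the addFile()
signature and the field names are assumptions you would need to adapt to
your version, URL and schema):

import java.io.File;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class ExtractWithLiterals {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        ContentStreamUpdateRequest req =
            new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File("mydoc.pdf")); // the binary document to extract
                                            // (newer SolrJ versions also want a content type here)

        // your custom metadata, sent as literal.* parameters
        // (field names are just placeholders taken from your example)
        req.setParam("literal.id", "doc-1");
        req.setParam("literal.myfield-1", "value 1");
        req.setParam("literal.myfield-n", "value n");

        // map Tika's extracted body to your "content" field
        req.setParam("fmap.content", "content");

        req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
        server.request(req);
    }
}

That way the extracted text and your metadata end up in the same Solr
document within a single request, and you can search either by the
metadata fields or by full text on the content.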

Best,
Sascha

[1] http://wiki.apache.org/solr/ExtractingRequestHandler#Literals
