Posted to user@tika.apache.org by sam k <sa...@gmail.com> on 2022/11/03 18:08:18 UTC
Sending custom fields with SolrEmitter
Hi,
I'm running a Tika server with HttpFetcher and SolrEmitter, and it works
great.
When asking Tika to send documents to Solr, I can specify the document id
as the "emitKey" parameter:
curl -X POST -H "Content-Type: application/json" \
  -d '{"fetcher":"http", "fetchKey":"<URL>", "emitter":"solr",
       "emitKey":"<Document Id>"}' \
  http://tika.server
Is there a way to specify more custom fields for the Solr document being
submitted, like:
curl -X POST -H "Content-Type: application/json" \
  -d '{"fetcher":"http", "fetchKey":"<URL>", "emitter":"solr",
       "emitKey":"<Document Id>", "anotherSolrField":"<Value>",
       "yetAnotherSolrField":"<Value>"}' \
  http://tika.server
We would like to set around 10 custom fields in each Solr document, such as
the id of the user who created the PDF/Word document, etc., so the values
for these Solr fields would differ from one Solr document to the next.
Thanks,
-Sam
Re: Sending custom fields with SolrEmitter
Posted by Tim Allison <ta...@apache.org>.
Exactly right. The injected fields go through as-is, with no metadata filter
applied. The metadata filter is applied before these are added (IIRC...?).
We still need to improve our documentation...
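If the injected fields do bypass the metadata filter, as this exchange suggests, one workaround is to rename them on the client before they go into the request's "metadata" object. A minimal Python sketch of that idea; rename_fields and the field names in the mapping are hypothetical and not part of Tika:

```python
def rename_fields(metadata, mapping):
    """Return a copy of metadata with keys renamed according to mapping.

    Keys that have no entry in mapping are kept unchanged.
    """
    return {mapping.get(name, name): value for name, value in metadata.items()}


# Hypothetical names: application-side field -> Solr field.
mapping = {"createdBy": "created_by_s"}
metadata = {"createdBy": "user-42", "m3": "v4"}

renamed = rename_fields(metadata, mapping)
print(renamed)
```

The renamed dictionary can then be used as the value of "metadata" in the request body.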
On Fri, Nov 4, 2022 at 4:30 PM sam k <sa...@gmail.com> wrote:
> Thanks Tim, this is great!
>
> I was experimenting to see whether
> org.apache.tika.metadata.filter.FieldNameMappingFilter in tika-config.xml
> can also be used to rename those custom metadata fields, but it seems to
> let them through without renaming. I'm not sure it would be a very useful
> feature anyhow.
>
> Thanks,
> -Sam
Re: Sending custom fields with SolrEmitter
Posted by sam k <sa...@gmail.com>.
Thanks Tim, this is great!
I was experimenting to see whether
org.apache.tika.metadata.filter.FieldNameMappingFilter in tika-config.xml
can also be used to rename those custom metadata fields, but it seems to
let them through without renaming. I'm not sure it would be a very useful
feature anyhow.
Thanks,
-Sam
On Thu, Nov 3, 2022 at 12:52 PM Tim Allison <ta...@apache.org> wrote:
> Yes. We need to do a better job of documenting this. To inject
> custom/external metadata, do something like this:
>
> {
>   "emitKey": "emitKey1",
>   "emitter": "my_emitter",
>   "fetchKey": "fetchKey1",
>   "fetcher": "my_fetcher",
>   "handlerConfig": {
>     "maxEmbeddedResources": 10,
>     "parseMode": "concatenate",
>     "type": "xml",
>     "writeLimit": 10000
>   },
>   "id": "my_id",
>   "metadata": {
>     "m1": [
>       "v1",
>       "v1"
>     ],
>     "m2": [
>       "v2",
>       "v3"
>     ],
>     "m3": "v4"
>   },
>   "onParseException": "skip"
> }
Re: Sending custom fields with SolrEmitter
Posted by Tim Allison <ta...@apache.org>.
Yes. We need to do a better job of documenting this. To inject
custom/external metadata, do something like this:
{
  "emitKey": "emitKey1",
  "emitter": "my_emitter",
  "fetchKey": "fetchKey1",
  "fetcher": "my_fetcher",
  "handlerConfig": {
    "maxEmbeddedResources": 10,
    "parseMode": "concatenate",
    "type": "xml",
    "writeLimit": 10000
  },
  "id": "my_id",
  "metadata": {
    "m1": [
      "v1",
      "v1"
    ],
    "m2": [
      "v2",
      "v3"
    ],
    "m3": "v4"
  },
  "onParseException": "skip"
}
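Applied to the request from the original question, the per-document Solr fields would go under a "metadata" object rather than at the top level of the request. Below is a minimal Python sketch that builds such a request body. It assumes the "http" and "solr" fetcher/emitter names from the question; build_request and the field names created_by and tags are hypothetical, chosen only for illustration.

```python
import json


def build_request(fetch_key, emit_key, custom_fields):
    """Build a fetch/emit request body carrying user-supplied metadata.

    custom_fields maps a field name to a string or a list of strings,
    mirroring the "m3" and "m1"/"m2" shapes in the example above.
    """
    return {
        "fetcher": "http",   # fetcher name taken from the original question
        "fetchKey": fetch_key,
        "emitter": "solr",   # emitter name taken from the original question
        "emitKey": emit_key,
        "metadata": dict(custom_fields),
    }


# Hypothetical per-document values; in practice these differ per document.
request = build_request(
    "<URL>",
    "<Document Id>",
    {"created_by": "user-42", "tags": ["pdf", "invoice"]},
)
print(json.dumps(request, indent=2))
```

The printed JSON can be passed to curl with -d, in place of the inline JSON in the original command.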
On Thu, Nov 3, 2022 at 2:08 PM sam k <sa...@gmail.com> wrote:
> Hi,
>
> I'm running a Tika server with HttpFetcher and SolrEmitter, and it works
> great.
>
> When asking Tika to send documents to Solr, I can specify the document id
> as "emitKey" parameter:
>
> curl -X POST -H "Content-Type: application/json" -d '{"fetcher":"http",
> "fetchKey":"<URL>", "emitter":"solr", "emitKey":"<Document Id>"}'
> http://tika.server
>
> Is there a way to specify more custom fields for the Solr document being
> submitted, like:
>
> curl -X POST -H "Content-Type: application/json" -d '{"fetcher":"http",
> "fetchKey":"<URL>", "emitter":"solr", "emitKey":"<Document Id>",
> "anotherSolrField":"<Value>", "yetAnotherSolrField":"<Value>"}'
> http://tika.server
>
> We would like to set around 10 custom fields in each Solr document, such
> as the id of the user who created the PDF/Word, etc, so the values for the
> Solr fields would be different for each Solr document.
>
> Thanks,
> -Sam
>