Posted to user@tika.apache.org by sam k <sa...@gmail.com> on 2022/11/03 18:08:18 UTC

Sending custom fields with SolrEmitter

Hi,

I'm running a Tika server with HttpFetcher and SolrEmitter, and it works
great.

When asking Tika to send documents to Solr, I can specify the document id
as the "emitKey" parameter:

curl -X POST -H "Content-Type: application/json" \
  -d '{"fetcher":"http", "fetchKey":"<URL>", "emitter":"solr", "emitKey":"<Document Id>"}' \
  http://tika.server

Is there a way to specify more custom fields for the Solr document being
submitted, like:

curl -X POST -H "Content-Type: application/json" \
  -d '{"fetcher":"http", "fetchKey":"<URL>", "emitter":"solr", "emitKey":"<Document Id>", "anotherSolrField":"<Value>", "yetAnotherSolrField":"<Value>"}' \
  http://tika.server

We would like to set around 10 custom fields in each Solr document, such as
the id of the user who created the PDF/Word file, so the values of those
fields would differ from one Solr document to the next.

Thanks,
-Sam

Re: Sending custom fields with SolrEmitter

Posted by Tim Allison <ta...@apache.org>.
Exactly right. The injected fields go through as-is, with no metadata filter
applied to them: the metadata filter runs before they are added (IIRC...?).

Still need to improve our documentation...
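
If the injected fields do bypass the metadata filter, one workaround is to
rename them on the client before building the request. Below is a minimal
Python sketch under that assumption. rename_fields and the Solr field names
in the mapping are hypothetical, made up for illustration; nothing here is a
Tika API.

```python
def rename_fields(metadata, mapping):
    """Return a copy of metadata with keys renamed according to mapping.

    Keys that do not appear in mapping are kept unchanged.
    rename_fields is a hypothetical helper, not part of Tika.
    """
    renamed = {}
    for name, value in metadata.items():
        target = mapping.get(name, name)
        if target in renamed:
            # Two source fields would end up in the same target field.
            raise ValueError(f"two fields map to {target!r}")
        renamed[target] = value
    return renamed


if __name__ == "__main__":
    user_metadata = {"m1": ["v1", "v1"], "m3": "v4"}
    # Hypothetical Solr field names, for illustration only.
    solr_names = {"m1": "tags_ss", "m3": "created_by_s"}
    print(rename_fields(user_metadata, solr_names))
```

The renamed dictionary would then go into the "metadata" object of the
request, so no server-side mapping is needed for those fields.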

On Fri, Nov 4, 2022 at 4:30 PM sam k <sa...@gmail.com> wrote:

> Thanks Tim, this is great!
>
> I was experimenting to see whether
> org.apache.tika.metadata.filter.FieldNameMappingFilter in tika-config.xml
> can also be used to rename those custom metadata fields, but it seems to
> let them through without renaming. Not sure it would be a very useful
> feature anyway.
>
> Thanks,
> -Sam

Re: Sending custom fields with SolrEmitter

Posted by sam k <sa...@gmail.com>.
Thanks Tim, this is great!

I was experimenting to see whether
org.apache.tika.metadata.filter.FieldNameMappingFilter in tika-config.xml
can also be used to rename those custom metadata fields, but it seems to
let them through without renaming. Not sure it would be a very useful
feature anyway.

Thanks,
-Sam

On Thu, Nov 3, 2022 at 12:52 PM Tim Allison <ta...@apache.org> wrote:

> Yes.  We need to do a better job of documenting this. To inject
> custom/external metadata, do something like this:
>
> {
>     "emitKey": "emitKey1",
>     "emitter": "my_emitter",
>     "fetchKey": "fetchKey1",
>     "fetcher": "my_fetcher",
>     "handlerConfig": {
>         "maxEmbeddedResources": 10,
>         "parseMode": "concatenate",
>         "type": "xml",
>         "writeLimit": 10000
>     },
>     "id": "my_id",
>     "metadata": {
>         "m1": [
>             "v1",
>             "v1"
>         ],
>         "m2": [
>             "v2",
>             "v3"
>         ],
>         "m3": "v4"
>     },
>     "onParseException": "skip"
> }

Re: Sending custom fields with SolrEmitter

Posted by Tim Allison <ta...@apache.org>.
Yes.  We need to do a better job of documenting this. To inject
custom/external metadata, do something like this:

{
    "emitKey": "emitKey1",
    "emitter": "my_emitter",
    "fetchKey": "fetchKey1",
    "fetcher": "my_fetcher",
    "handlerConfig": {
        "maxEmbeddedResources": 10,
        "parseMode": "concatenate",
        "type": "xml",
        "writeLimit": 10000
    },
    "id": "my_id",
    "metadata": {
        "m1": [
            "v1",
            "v1"
        ],
        "m2": [
            "v2",
            "v3"
        ],
        "m3": "v4"
    },
    "onParseException": "skip"
}
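
Applied to the original question, a request body carrying per-document
fields might be assembled roughly like this. It is a minimal Python sketch,
not taken from the Tika documentation: build_request and the field names
created_by and tags are hypothetical, while the "http" and "solr" names come
from the curl example in the first message. It only builds and prints the
JSON body; how it is POSTed depends on the server setup.

```python
import json


def build_request(fetch_url, doc_id, custom_fields):
    """Build a fetch/emit request body with per-document metadata.

    build_request is a hypothetical helper, not part of Tika.
    """
    metadata = {}
    for name, value in custom_fields.items():
        # Mirror the example above: a list for multi-valued fields,
        # a bare string for a single value.
        if isinstance(value, (list, tuple)):
            metadata[name] = [str(v) for v in value]
        else:
            metadata[name] = str(value)
    return {
        "fetcher": "http",      # fetcher name from the original curl call
        "fetchKey": fetch_url,
        "emitter": "solr",      # emitter name from the original curl call
        "emitKey": doc_id,
        "metadata": metadata,
    }


if __name__ == "__main__":
    body = build_request(
        "https://example.com/report.pdf",
        "doc-42",
        {"created_by": "user-17", "tags": ["finance", "q3"]},
    )
    print(json.dumps(body, indent=2))
```

Keeping the custom fields in their own dictionary makes it easy to vary them
per document while the fetcher and emitter settings stay fixed.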
