Posted to user@tika.apache.org by Chetan Bikire <ch...@gmail.com> on 2022/11/08 10:37:32 UTC

Re: Apache Tika Server Relationship

Yes, we get this if there are multiple eml files embedded, i.e. eml1/eml2/eml3.
But in the case of the message file format (.msg), we do not get the name of the
file, and we get an exception like "unsupported attachment chunk property will
be ignored".

Ex: (msg1/msg2/embed.txt)

Embedded_Resource_Path :
"/__substg1.0.3701000D/__substg1.0.3701000D/embed.txt"

Also, in the scenario where the main file is an .eml and all embedded files are
.msg (eml/msg1/msg2/msg3), the Tika content of the main eml file includes the
content of all the .msg files as well, instead of treating these msg files as
separate embedded files.

Please assist
Thank you.



On Mon, Oct 24, 2022, 20:06 Tim Allison <ta...@apache.org> wrote:

> X-TIKA:embedded_resource_path
>
> For example, "/embed1.zip/embed2.zip/embed2a.txt", says that there's a zip
> file (embed1.zip) embedded in the main file that contains another zip file
> (embed2.zip), which in turn contains a text file (embed2a.txt).
>
> On Mon, Oct 24, 2022 at 10:13 AM Chetan Bikire <ch...@gmail.com>
> wrote:
>
>> Hi Tim,
>>
>> Thank you for your response.
>> Yes, I am using the /rmeta/form endpoint and I am getting info on embedded
>> files separately, but I am not getting information on which parent each
>> embedded file belongs to, so that I can track the chain of multilevel
>> embedded files.
>> Is there any metadata property that tells us this?
>>
>> On Sat, Oct 22, 2022, 16:06 Tim Allison <ta...@apache.org> wrote:
>>
>>> 1) If you're using the /tika endpoint, embedded files are marked up as
>>> such in the xhtml output with div tags.  If you want full info on embedded
>>> files, I'd strongly encourage using the /rmeta endpoint.
>>>
>>> 2) We don't offer content marked up with json, but we do offer a text
>>> option, which can be returned in the X-Tika-Content tag in the json output.
>>> See https://cwiki.apache.org/confluence/display/TIKA/TikaServer for
>>> details on how to request text.
>>>
>>> This might also be useful:
>>> https://cwiki.apache.org/confluence/display/TIKA/TikaServerEndpointsCompared
>>>
>>>
>>> On Fri, Oct 21, 2022 at 11:12 PM Chetan Bikire <ch...@gmail.com>
>>> wrote:
>>>
>>>> 1) How does Tika server maintain the parent-child relationship between the
>>>> main document and its embedded documents (e.g. an email with multiple
>>>> attachments) after parsing? Is there any property or tag from which we
>>>> can determine these relationships?
>>>>
>>>> 2) After parsing any document we get all tags in JSON format
>>>> except the *X-Tika-Content* tag, which is in HTML format. Is there any
>>>> way to get this in JSON format?
>>>>
>>>> Please Assist.
>>>> Thank You
>>>>
>>>

Re: Apache Tika Server Relationship

Posted by Tim Allison <ta...@apache.org>.
>But in the case of the message file format (.msg), we do not get the name of the file, and we get an exception like "unsupported attachment chunk property will be ignored".

Is this an exception or a logged message?

>Embedded_Resource_Path : "/__substg1.0.3701000D/__substg1.0.3701000D/embed.txt"
This looks basically right to me given how embedded files are stored
in msg files.  Is this what you expect?  If not, what do you expect?
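
A minimal sketch of how the parent of each embedded file could be
recovered from the /rmeta output, assuming each embedded document carries
the X-TIKA:embedded_resource_path key and the main document does not. The
function name parent_paths and the sample records are hypothetical; the
msg path is the one reported in this thread.

```python
# Metadata key named earlier in this thread.
PATH_KEY = "X-TIKA:embedded_resource_path"


def parent_paths(metadata_list):
    """Map each embedded resource path to its parent's path.

    An empty string stands for the container (main) document.
    """
    parents = {}
    for metadata in metadata_list:
        path = metadata.get(PATH_KEY)
        if not path:
            # Assumption: the container document has no embedded path.
            continue
        # Everything before the last "/" is the parent's path.
        parents[path] = path.rsplit("/", 1)[0]
    return parents


# Hypothetical /rmeta-style records, trimmed to the one key used here.
sample = [
    {},  # the main .msg file
    {PATH_KEY: "/__substg1.0.3701000D"},
    {PATH_KEY: "/__substg1.0.3701000D/__substg1.0.3701000D"},
    {PATH_KEY: "/__substg1.0.3701000D/__substg1.0.3701000D/embed.txt"},
]

for child, parent in parent_paths(sample).items():
    print(child, "<-", parent or "(main document)")
```

Splitting on the last "/" is only a heuristic: it would go wrong for an
attachment whose own name contains a "/".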

>Also, in the scenario where the main file is an .eml and all embedded files are .msg (eml/msg1/msg2/msg3), the Tika content of the main eml file includes the content of all the .msg files as well, instead of treating these msg files as separate embedded files.

This sounds bad.  Any chance you can share an example, even if
privately?  We can't do much without an example to work with.

Thank you and sorry for my delay.

On Wed, Nov 9, 2022 at 8:40 AM Chetan Bikire <ch...@gmail.com> wrote:
>
> Please let me know if I am missing something.
>
> Any feedback here is welcome.
>
> Thanks & Regards
> Chetan
>
>

Re: Apache Tika Server Relationship

Posted by Chetan Bikire <ch...@gmail.com>.
Please let me know if I am missing something.

Any feedback here is welcome.

Thanks & Regards
Chetan

