Posted to dev@tika.apache.org by "George L. Yermulnik (JIRA)" <ji...@apache.org> on 2016/06/09 15:40:21 UTC

[jira] [Comment Edited] (TIKA-2001) Parsing XML outputs empty string

    [ https://issues.apache.org/jira/browse/TIKA-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322715#comment-15322715 ] 

George L. Yermulnik edited comment on TIKA-2001 at 6/9/16 3:39 PM:
-------------------------------------------------------------------

{quote}By default Tika only extracts the text between XML tags, not things like attribute values. Since all the content in this XML file is in the attributes, nothing gets extracted.{quote}
Oh, I see! I'm new to Tika and didn't know that.
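
To illustrate the distinction quoted above, here is a minimal sketch that uses the JDK's built-in SAX parser rather than Tika itself. It is not Tika's implementation; the class name `AttributeTextDemo` and its helper methods are hypothetical, and the sample document is an abbreviated version of the reporter's file. It collects the character data between tags separately from the attribute values, so you can see that a document whose content lives only in attributes yields no element text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: plain JDK SAX, no Tika dependency.
public class AttributeTextDemo {

    // Runs a SAX parse over an in-memory XML string with the given handler.
    private static void parse(String xml, DefaultHandler handler) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Collects only the character data found between tags.
    static String elementText(String xml) {
        StringBuilder text = new StringBuilder();
        parse(xml, new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        return text.toString().trim();
    }

    // Collects every attribute as "element/@name=value".
    static List<String> attributeValues(String xml) {
        List<String> values = new ArrayList<>();
        parse(xml, new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                for (int i = 0; i < attrs.getLength(); i++) {
                    values.add(qName + "/@" + attrs.getQName(i)
                            + "=" + attrs.getValue(i));
                }
            }
        });
        return values;
    }

    public static void main(String[] args) {
        // Abbreviated version of the file from the report: attributes only.
        String xml = "<spocosy version=\"1.0\">"
                + "<lineup id=\"0\" shirt_number=\"0\" del=\"no\"/>"
                + "</spocosy>";
        System.out.println("element text: '" + elementText(xml) + "'");
        attributeValues(xml).forEach(System.out::println);
    }
}
```

For the sample document, `elementText` returns an empty string because there is no text between the tags, while `attributeValues` returns one entry per attribute. Whether and how Tika should expose attribute values is the open question in this issue; the sketch only shows where the data sits.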

{quote}What kind of output would make sense in this case?{quote}
In my case, the second variant would be preferable. But I'm not sure whether that's what Tika is intended to deal with.


> Parsing XML outputs empty string
> --------------------------------
>
>                 Key: TIKA-2001
>                 URL: https://issues.apache.org/jira/browse/TIKA-2001
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.11, 1.12, 1.13
>            Reporter: George L. Yermulnik
>            Priority: Minor
>
> I can't get Tika to parse my XML files:
> {code}
> root@spring:/tmp# java -version
> java version "1.8.0_91"
> Java(TM) SE Runtime Environment (build 1.8.0_91-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 25.91-b14, mixed mode)
> root@spring:/tmp# cat /tmp/xml/5751061032fbd-7148.xml
> <?xml version="1.0" encoding="UTF-8"?>
> <spocosy version="1.0"><subscription-update subscriptionid="0" requestid="0" last_push="2016-06-03 06:21:34" current_push="2016-06-03 06:21:37" exec="0.002"><lineup id="0" event_participantsFK="0" participantFK="0" lineup_typeFK="0" shirt_number="0" pos="0" enet_pos="0" n="0" ut="2016-06-03 06:21:37" del="no"/></subscription-update></spocosy>
> root@spring:/tmp# for i in 3 2 1; do
>     echo -n "tika-app-1.1${i}.jar: "
>     java -jar tika-app-1.1${i}.jar --text /tmp/xml/5751061032fbd-7148.xml
> done
> tika-app-1.13.jar:
> tika-app-1.12.jar:
> tika-app-1.11.jar:
> root@spring:/tmp#
> {code}
> I'd appreciate any help. Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)