Posted to user@tika.apache.org by Sebastian Nagel <wa...@googlemail.com> on 2021/11/11 12:21:25 UTC

DcXMLParser to parse XML files

Hi,

When is the Dublin Core XML parser used to parse XML files?
Does the DcXMLParser need to be enabled in the configuration?

There is a difference between 1.27 and 2.1.0:

$> java -jar tika-app-1.27.jar -J \
      https://news.haltonhills.halinet.on.ca/dc.xml \
   | jq '.[0]."dc:title"'
"Deaths"
$> java -jar tika-app-2.1.0.jar ...
null

$> java -jar tika-app-1.27.jar -J \
      https://news.haltonhills.halinet.on.ca/dc.xml \
   | jq '.[0]."X-Parsed-By"'
[
  "org.apache.tika.parser.DefaultParser",
  "org.apache.tika.parser.xml.DcXMLParser"
]
$> java -jar tika-app-2.1.0.jar -J \
      https://news.haltonhills.halinet.on.ca/dc.xml \
   | jq '.[0]."X-TIKA:Parsed-By"'
[
  "org.apache.tika.parser.DefaultParser",
  "org.apache.tika.parser.xml.XMLParser"
]


Thanks,
Sebastian

Re: DcXMLParser to parse XML files

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Tim,

Thanks!  For me the fix can wait until the next release.
The URL just happened to be in the sample I used to verify that
upgrading Tika in StormCrawler didn't break anything.
Of the 450 documents parsed by Tika, it was the only one
that looked like a possible regression.

Best,
Sebastian

On 11/16/21 16:49, Tim Allison wrote:
> Hi Seb,
> 
> I'm sorry for taking forever to reply.  That's a bug.  Now fixed:
> https://issues.apache.org/jira/browse/TIKA-3593
> 
> If you specify the DcXMLParser in your tika-config after the default
> parser, it _should_ be selected instead of the XMLParser.  Let me know
> if I can help with this temporary workaround.
> 
> Thank you for identifying this problem!
> 
> Cheers,
> 
>       Tim

Re: DcXMLParser to parse XML files

Posted by Tim Allison <ta...@apache.org>.
Hi Seb,

I'm sorry for taking forever to reply.  That's a bug.  Now fixed:
https://issues.apache.org/jira/browse/TIKA-3593

If you specify the DcXMLParser in your tika-config after the default
parser, it _should_ be selected instead of the XMLParser.  Let me know
if I can help with this temporary workaround.
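
A minimal sketch of what such a tika-config might look like. This is an
assumption based on the description above, not a tested configuration: it
lists the DcXMLParser after the DefaultParser so that it is preferred for
the XML media types both parsers handle. The file name tika-config.xml
used below is a placeholder.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only (untested assumption): the parser declared later is
     expected to win over XMLParser, which DefaultParser loads. -->
<properties>
  <parsers>
    <!-- Keep all default parsers for every other document type. -->
    <parser class="org.apache.tika.parser.DefaultParser"/>
    <!-- Declared after the default parser, per the workaround above. -->
    <parser class="org.apache.tika.parser.xml.DcXMLParser"/>
  </parsers>
</properties>
```

Assuming the file is saved as tika-config.xml, it could be passed to the
app with --config=tika-config.xml placed before the -J option. If the
workaround takes effect, "X-TIKA:Parsed-By" should list DcXMLParser
rather than XMLParser.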

Thank you for identifying this problem!

Cheers,

      Tim
