Posted to user@tika.apache.org by Vincent <vi...@openindex.io> on 2016/10/17 11:37:55 UTC

Error parsing PDFs

Hi all,

I am having trouble using Tika to parse some PDFs. I crawl them with
Nutch 1.11, using parse-tika. Some documents are parsed correctly,
but most are not, and the error isn't very clear to me:

org.apache.tika.metadata.PropertyTypeException: xmpMM:DocumentID : SIMPLE
         at org.apache.tika.metadata.Metadata.add(Metadata.java:338)
         at org.apache.tika.parser.image.xmp.JempboxExtractor.addMetadata(JempboxExtractor.java:199)
         at org.apache.tika.parser.image.xmp.JempboxExtractor.extractXMPMM(JempboxExtractor.java:145)
         at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:216)
         at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:136)
         at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:167)
         at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
         at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
         at java.lang.Thread.run(Thread.java:745)

I tested the document with PDFBox ExtractText, and it works fine.

An example of a failing document is:

https://gemeente.groningen.nl/system/files/1._jaarstukken_groninger_archieven_br_raad.pdf 


Any suggestions?

Thanks in advance!
Vincent Slot

Re: Error parsing PDFs

Posted by Julien Nioche <li...@gmail.com>.

Hi Tim

On 17 October 2016 at 16:02, Allison, Timothy B. <ta...@mitre.org> wrote:

> Hmmm…Thank you, Julien.  I’m trying to find the exact version of nutch’s
> TikaParser that would result in that stacktrace…I don’t see one where line
> 167 is the call to tika’s parser to parse the pdf…any recommendations?
>

Weird, me neither.


>
>
> I’m at a loss to figure out how Tika would be adding xmpMM:DocumentID
> more than once.  Any ideas?
>
>
>
> We could change our code to “set”, and if there are multiples, that would
> overwrite the earlier ids, but there really should only be one DocumentID.
>
>
>
> Also, any pointers to setting up nutch in Intellij aside from what Google
> returns?  Seems to be non-trivial.
>

No idea, sorry. I have never used IntelliJ in my life.

J.


-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble <http://twitter.com/digitalpebble>

RE: Error parsing PDFs

Posted by "Allison, Timothy B." <ta...@mitre.org>.

Hmm… Thank you, Julien. I’m trying to find the exact version of Nutch’s TikaParser that would result in that stack trace… I don’t see one where line 167 is the call to Tika’s parser to parse the PDF… any recommendations?

I’m at a loss to figure out how Tika would be adding xmpMM:DocumentID more than once.  Any ideas?

We could change our code to “set”, and if there are multiples, that would overwrite the earlier ids, but there really should only be one DocumentID.
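
The trade-off in switching from "add" to "set" can be sketched with a plain map standing in for the metadata store. This is an illustration under that assumption, not Tika's code; `SetSemanticsSketch`, `lastValueWins`, and the sample ID strings are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: a plain Map stands in for the metadata store.
class SetSemanticsSketch {
    // "set" semantics: each new value replaces the previous one, so the
    // call never fails, but earlier IDs are silently discarded.
    static String lastValueWins(String... documentIds) {
        Map<String, String> metadata = new HashMap<>();
        for (String id : documentIds) {
            metadata.put("xmpMM:DocumentID", id);
        }
        return metadata.get("xmpMM:DocumentID");
    }
}
```

If two IDs are passed in, only the last one survives, so this approach would hide a duplicate rather than explain where it comes from.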

Also, any pointers to setting up Nutch in IntelliJ aside from what Google returns?  It seems to be non-trivial.

Re: Error parsing PDFs

Posted by Julien Nioche <li...@gmail.com>.

The Metadata object is brand new for each document parsed; see
https://github.com/apache/nutch/blob/master/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java#L102


RE: Error parsing PDFs

Posted by "Allison, Timothy B." <ta...@mitre.org>.

New feature. :)   We didn't extract xmpMM in 1.11.

Thank you for sharing a test file! I'm not able to reproduce this with Tika trunk.

The error means that a value for xmpMM:DocumentID was already set in the Metadata object, and you're trying to add another value.  xmpMM:DocumentID is "SIMPLE" and only allows one value.
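
A minimal sketch of the behaviour described above, written as a self-contained toy model rather than Tika's real classes. `ToyMetadata`, `declareSimple`, `secondAddIsRejected`, and the sample ID strings are hypothetical names used only for illustration. The point is that `add` on a single-valued key fails once a value is present, whatever put the first value there.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model, NOT Tika's Metadata class: it only imitates the rule that a
// single-valued ("SIMPLE") key rejects a second add().
class ToyMetadata {
    private final Map<String, List<String>> values = new HashMap<>();
    private final Set<String> simpleKeys = new HashSet<>();

    // Mark a key as single-valued (hypothetical helper for this sketch).
    void declareSimple(String key) {
        simpleKeys.add(key);
    }

    // Append a value; refuse if the key is single-valued and already populated.
    void add(String key, String value) {
        List<String> existing = values.computeIfAbsent(key, k -> new ArrayList<>());
        if (simpleKeys.contains(key) && !existing.isEmpty()) {
            throw new IllegalStateException(key + " : SIMPLE");
        }
        existing.add(value);
    }

    // Return the stored values for a key, or an empty list if there are none.
    List<String> get(String key) {
        return values.getOrDefault(key, new ArrayList<>());
    }

    // Returns true if adding the same single-valued key twice is rejected.
    static boolean secondAddIsRejected() {
        ToyMetadata metadata = new ToyMetadata();
        metadata.declareSimple("xmpMM:DocumentID");
        metadata.add("xmpMM:DocumentID", "uuid:first"); // accepted
        try {
            metadata.add("xmpMM:DocumentID", "uuid:second"); // rejected
            return false;
        } catch (IllegalStateException expected) {
            return true;
        }
    }
}
```

Keys that are not declared single-valued accept repeated `add` calls in this model, which is why the failure only shows up for properties such as xmpMM:DocumentID.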

Is Nutch reusing the Metadata object, not clearing it, or prepopulating it with XMP metadata?  I'll take a look at Nutch.



Re: Error parsing PDFs

Posted by Vincent <vi...@openindex.io>.

Hi,

After some more testing, I found that this error does not occur for
this document in Tika 1.11. I forgot to mention in my last message that
I was using Tika 1.13. So is this perhaps a bug in the new Tika version?

Regards,

Vincent
