Posted to dev@pdfbox.apache.org by "Stefano Falconetti (JIRA)" <ji...@apache.org> on 2010/02/11 11:22:28 UTC
[jira] Created: (PDFBOX-617) Crash parsing pdf file
(http://media.opentur.it/WEB/CHANNELS/COCKTAILVIAGGI/CMS/PDF/Irlanda%202009%2028-51pag.pdf)
from Tika
Crash parsing pdf file (http://media.opentur.it/WEB/CHANNELS/COCKTAILVIAGGI/CMS/PDF/Irlanda%202009%2028-51pag.pdf) from Tika
----------------------------------------------------------------------------------------------------------------------------
Key: PDFBOX-617
URL: https://issues.apache.org/jira/browse/PDFBOX-617
Project: PDFBox
Issue Type: Bug
Components: Parsing
Affects Versions: 0.8.0-incubator
Environment: Linux debian: Linux 2.6.18-6-686 #1 SMP i686 GNU/Linux
java version "1.6.0_13"
Java(TM) SE Runtime Environment (build 1.6.0_13-b03)
Java HotSpot(TM) Client VM (build 11.3-b02, mixed mode, sharing)
Reporter: Stefano Falconetti
Priority: Critical
When parsing the file http://media.opentur.it/WEB/CHANNELS/COCKTAILVIAGGI/CMS/PDF/Irlanda%202009%2028-51pag.pdf, the call to Tika "parse" fails with the following stack trace:
java.io.IOException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@1578aab
at com.travelport.indexing.documentparser.GenericDocumentParserTikaImpl.parse(GenericDocumentParserTikaImpl.java:143)
at com.travelport.indexing.documentparser.GenericDocumentParserTikaImpl.main(GenericDocumentParserTikaImpl.java:306)
Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@1578aab
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:126)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:114)
at com.travelport.indexing.documentparser.GenericDocumentParserTikaImpl.parse(GenericDocumentParserTikaImpl.java:69)
... 1 more
Caused by: org.apache.pdfbox.exceptions.WrappedIOException
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:237)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:841)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:808)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:53)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
... 4 more
Caused by: java.util.NoSuchElementException
at java.util.AbstractList$Itr.next(AbstractList.java:350)
at org.apache.pdfbox.pdfparser.PDFXrefStreamParser.parse(PDFXrefStreamParser.java:115)
at org.apache.pdfbox.cos.COSDocument.parseXrefStreams(COSDocument.java:538)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:203)
... 8 more
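The trace above nests four exceptions, and the innermost "Caused by" (java.util.NoSuchElementException) is the one that points at the actual failure. As a minimal sketch, the innermost cause of such a chain can be extracted programmatically before logging. RootCause is a hypothetical helper name, not part of Tika or PDFBox, and the chain built in main only imitates the shape of the reported one using JDK types.

```java
import java.io.IOException;
import java.util.NoSuchElementException;

/** Hypothetical helper for illustration; not part of Tika or PDFBox. */
final class RootCause {

    /** Follows getCause() links until the innermost exception is reached. */
    static Throwable of(Throwable t) {
        Throwable current = t;
        // Stop when there is no further cause; the second check guards
        // against an exception that reports itself as its own cause.
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // A chain shaped like the one in the report, built from JDK types only.
        Throwable chain = new IOException("outer wrapper",
                new RuntimeException("middle wrapper",
                        new NoSuchElementException()));
        System.out.println(of(chain).getClass().getName());
    }
}
```

Logging the result of RootCause.of(e) next to the wrapping exception makes it easier to see which library the report belongs to.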
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PDFBOX-617) Crash parsing pdf file
(http://media.opentur.it/WEB/CHANNELS/COCKTAILVIAGGI/CMS/PDF/Irlanda%202009%2028-51pag.pdf)
from Tika
Posted by "Stefano Falconetti (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Stefano Falconetti updated PDFBOX-617:
--------------------------------------
Attachment: Portogallo2010.pdf
StatiUniti2010_1.pdf
[jira] Updated: (PDFBOX-617) Crash parsing pdf file
(http://media.opentur.it/WEB/CHANNELS/COCKTAILVIAGGI/CMS/PDF/Irlanda%202009%2028-51pag.pdf)
from Tika
Posted by "Stefano Falconetti (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Stefano Falconetti updated PDFBOX-617:
--------------------------------------
Attachment: Irlanda26-52pag.pdf
Same problem, same exception:
java.io.IOException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@1f4bcf7
at com.travelport.indexing.documentparser.GenericDocumentParserTikaImpl.parse(GenericDocumentParserTikaImpl.java:199)
at com.travelport.indexing.documentparser.GenericDocumentParserTikaImpl.main(GenericDocumentParserTikaImpl.java:453)
Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@1f4bcf7
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:126)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:114)
at com.travelport.indexing.documentparser.GenericDocumentParserTikaImpl.parse(GenericDocumentParserTikaImpl.java:107)
... 1 more
Caused by: org.apache.pdfbox.exceptions.WrappedIOException
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:237)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:841)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:808)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:53)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
... 4 more
Caused by: java.util.NoSuchElementException
at java.util.AbstractList$Itr.next(AbstractList.java:350)
at org.apache.pdfbox.pdfparser.PDFXrefStreamParser.parse(PDFXrefStreamParser.java:115)
at org.apache.pdfbox.cos.COSDocument.parseXrefStreams(COSDocument.java:538)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:203)
... 8 more
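In both traces the innermost frame is AbstractList$Itr.next, which throws NoSuchElementException when next() is called on an iterator that has no elements left. The sketch below reproduces that general pattern with JDK types only. It is not PDFBox source: what PDFXrefStreamParser does at line 115 is not shown in this thread, and the class name, method names, and the -1 sentinel are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Illustrative only; the names here are hypothetical and unrelated to PDFBox. */
final class IteratorGuard {

    /** Unguarded: throws NoSuchElementException if fewer than two elements remain. */
    static int sumNextTwoUnchecked(Iterator<Integer> it) {
        return it.next() + it.next();
    }

    /** Guarded: returns -1 (an arbitrary sentinel for this sketch) instead of throwing. */
    static int sumNextTwoChecked(Iterator<Integer> it) {
        if (!it.hasNext()) {
            return -1;
        }
        int first = it.next();
        if (!it.hasNext()) {
            return -1;
        }
        return first + it.next();
    }

    public static void main(String[] args) {
        // Only one element is available, so the second next() call fails.
        try {
            sumNextTwoUnchecked(Arrays.asList(1).iterator());
        } catch (NoSuchElementException e) {
            System.out.println("unchecked: " + e.getClass().getName());
        }
        System.out.println("checked: " + sumNextTwoChecked(Arrays.asList(1).iterator()));
    }
}
```

A parser that reads a fixed number of entries from a list built out of untrusted file data can fail this way when the file supplies fewer entries than expected.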
[jira] Resolved: (PDFBOX-617) Crash parsing pdf file
(http://media.opentur.it/WEB/CHANNELS/COCKTAILVIAGGI/CMS/PDF/Irlanda%202009%2028-51pag.pdf)
from Tika
Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting resolved PDFBOX-617.
----------------------------------
Resolution: Duplicate
Confirmed that this has already been fixed in PDFBox 1.1.0 (or earlier), thus resolving this as a duplicate of some earlier issue.
[jira] Commented: (PDFBOX-617) Crash parsing pdf file
(http://media.opentur.it/WEB/CHANNELS/COCKTAILVIAGGI/CMS/PDF/Irlanda%202009%2028-51pag.pdf)
from Tika
Posted by "Stefano Falconetti (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855456#action_12855456 ]
Stefano Falconetti commented on PDFBOX-617:
-------------------------------------------
Source of the method that crashes:
/**
 * Actually parse the file.
 * Uses java.io.*, org.xml.sax.ContentHandler, org.xml.sax.SAXException,
 * org.apache.tika.exception.TikaException, org.apache.tika.metadata.Metadata,
 * org.apache.tika.parser.Parser, org.apache.tika.parser.AutoDetectParser
 * and org.apache.tika.sax.BodyContentHandler.
 * @throws IOException if the document cannot be read or parsed
 */
@SuppressWarnings("deprecation")
//TODO move to the non deprecated call
public void parse() throws IOException {
    try {
        if (this.myFileToParse != null) {
            this.inputStream = new FileInputStream(this.myFileToParse);
        } else {
            this.inputStream = new ByteArrayInputStream(this.rawDocument.getContent());
        }
        Parser parser = new AutoDetectParser();
        ContentHandler textHandler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        this.documentTitle = null;
        this.documentKeywords = null;
        this.documentContent = "";
        //TODO Move to a non deprecated call
        //parser.parse(inputStream, this, metadata, parseContext);
        parser.parse(inputStream,
                     textHandler,
                     metadata);
        //Get all arrays first
        String[] tmpDocumentTitle = metadata.getValues(Metadata.TITLE);
        String[] tmpKeywords = metadata.getValues(Metadata.KEYWORDS);
        String[] tmpDocumentDescription = metadata.getValues(Metadata.DESCRIPTION);
        //###############################################################
        // Sequence of utility methods' calls must be:
        // stringArrayToString -> cleanUpExtraChars -> tokenizeDeTokenize
        //###############################################################
        //Take keywords both from page keywords and description
        int keywordsNum = 0;
        int descriptionWordsNum = 0;
        if (tmpKeywords != null) {
            keywordsNum = tmpKeywords.length;
        }
        if (tmpDocumentDescription != null) {
            descriptionWordsNum = tmpDocumentDescription.length;
        }
        int allKeywordsNum = keywordsNum + descriptionWordsNum;
        String[] tmpAllKeywords = null;
        //From title as last chance
        if ((allKeywordsNum == 0) &&
            (tmpDocumentTitle != null) &&
            (tmpDocumentTitle.length != 0)) {
            allKeywordsNum = 1;
            logger.warn("No meta information found, using title");
            tmpAllKeywords = new String[]{this.stringArrayToString(tmpDocumentTitle)};
        } else {
            tmpAllKeywords = new String[allKeywordsNum];
            //Copy only the arrays that exist; the null checks above
            //already allow either of them to be missing
            if (tmpKeywords != null) {
                System.arraycopy(tmpKeywords,
                                 0,
                                 tmpAllKeywords,
                                 0,
                                 keywordsNum);
            }
            if (tmpDocumentDescription != null) {
                System.arraycopy(tmpDocumentDescription,
                                 0,
                                 tmpAllKeywords,
                                 keywordsNum,
                                 descriptionWordsNum);
            }
        }
        //Fill in public getters
        this.documentTitle = this.stringArrayToString(tmpDocumentTitle);
        this.documentTitle = this.cleanUpExtraChars(this.documentTitle);
        this.documentTitle = this.tokenizeDeTokenize(this.documentTitle);
        this.documentKeywords = this.stringArrayToString(tmpAllKeywords);
        this.documentKeywords = this.cleanUpExtraChars(this.documentKeywords);
        //TODO if this value is needed (5), put it in the configuration file
        this.documentKeywords = this.tokenizeDeTokenize(this.documentKeywords, 5).toLowerCase();
        this.documentDescription = this.stringArrayToString(tmpDocumentDescription);
        this.documentDescription = this.cleanUpExtraChars(this.documentDescription);
        this.documentDescription = this.tokenizeDeTokenize(this.documentDescription);
        this.documentContent = this.cleanUpExtraChars(textHandler.toString().trim());
        //#####################################################
        //### Very special cases of very bad document found ###
        //#####################################################
        if ((this.documentTitle == null) ||
            (this.documentTitle.trim().equals(""))) {
            this.documentTitle = this.guessTitle(this.documentContent, this.rawDocument.getURL().getHost());
        }
        if ((this.documentKeywords == null) ||
            (this.documentKeywords.trim().equals(""))) {
            this.documentKeywords = this.guessKeywords(this.documentContent);
        }
        //##############################################
        //Semantic checks:
        //Checking if keywords are appropriate, as being present in content also.
        this.documentKeywords = this.contentKeywordsConsistencyCheck(this.documentKeywords,
                                                                     this.documentContent);
    } catch (FileNotFoundException fnfExc) {
        throw new IOException(fnfExc);
    } catch (SAXException sExc) {
        throw new IOException(sExc);
    } catch (TikaException tExc) {
        throw new IOException(tExc);
    } catch (Exception exc) {
        throw new IOException(exc);
    }
}
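Separately from the PDFBox crash, the method above combines the keyword and description arrays with System.arraycopy. A null-tolerant merge can be sketched as a small standalone helper. KeywordMerge and concat are hypothetical names that do not appear in the reporter's class, and the sketch assumes that a missing array should be treated as an empty one.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of the reporter's code. */
final class KeywordMerge {

    /** Concatenates two arrays in order, treating a null array as empty. */
    static String[] concat(String[] first, String[] second) {
        String[] a = (first == null) ? new String[0] : first;
        String[] b = (second == null) ? new String[0] : second;
        // copyOf allocates a new array of the combined length and copies a into it.
        String[] merged = Arrays.copyOf(a, a.length + b.length);
        System.arraycopy(b, 0, merged, a.length, b.length);
        return merged;
    }

    public static void main(String[] args) {
        String[] keywords = {"ireland", "travel"};
        System.out.println(Arrays.toString(concat(keywords, null)));
    }
}
```

With a helper like this, the caller needs no separate length bookkeeping, because the length of the result is always the sum of the lengths of the non-null inputs.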
[jira] Updated: (PDFBOX-617) Crash parsing pdf file
(http://media.opentur.it/WEB/CHANNELS/COCKTAILVIAGGI/CMS/PDF/Irlanda%202009%2028-51pag.pdf)
from Tika
Posted by "Stefano Falconetti (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Stefano Falconetti updated PDFBOX-617:
--------------------------------------
Attachment: Irlanda125pag.pdf
Hi, I found the site hosting the PDF files that cause the crash. The attached file is a different one, not the same as the original, but the bug occurs with it as well. If you like, you can try the attached file, which is also available at this link:
http://www.cocktailviaggi.it/cataloghi_pdf.cfm?pkalbero=377&pknodo=30584
[jira] Commented: (PDFBOX-617) Crash parsing pdf file
(http://media.opentur.it/WEB/CHANNELS/COCKTAILVIAGGI/CMS/PDF/Irlanda%202009%2028-51pag.pdf)
from Tika
Posted by "Stefano Falconetti (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857704#action_12857704 ]
Stefano Falconetti commented on PDFBOX-617:
-------------------------------------------
That was the dependency indicated by the Tika version I'm using. If 0.8.0 and 1.1.0 are fully compatible and Tika runs fine with the newer one, no problem, I will give it a try.
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (PDFBOX-617) Crash parsing pdf file
(http://media.opentur.it/WEB/CHANNELS/COCKTAILVIAGGI/CMS/PDF/Irlanda%202009%2028-51pag.pdf)
from Tika
Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857448#action_12857448 ]
Andreas Lehmkühler commented on PDFBOX-617:
-------------------------------------------
You are using an older version of PDFBox. Is it possible to use a more recent version such as 1.1.0? The Irlanda PDFs work fine with the current trunk version.
[jira] Commented: (PDFBOX-617) Crash parsing pdf file
(http://media.opentur.it/WEB/CHANNELS/COCKTAILVIAGGI/CMS/PDF/Irlanda%202009%2028-51pag.pdf)
from Tika
Posted by "Stefano Falconetti (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855444#action_12855444 ]
Stefano Falconetti commented on PDFBOX-617:
-------------------------------------------
I'm sorry, I don't have a copy. The problem occurred with several PDF files that looked like this one.
[jira] Commented: (PDFBOX-617) Crash parsing pdf file
(http://media.opentur.it/WEB/CHANNELS/COCKTAILVIAGGI/CMS/PDF/Irlanda%202009%2028-51pag.pdf)
from Tika
Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855339#action_12855339 ]
Jukka Zitting commented on PDFBOX-617:
--------------------------------------
The PDF doesn't seem to exist at the given URL anymore. Do you have a local copy of the document that you could share?