Posted to dev@pdfbox.apache.org by "Stefano Falconetti (JIRA)" <ji...@apache.org> on 2010/02/11 11:22:28 UTC

[jira] Created: (PDFBOX-617) Crash parsing pdf file (http://media.opentur.it/WEB/CHANNELS/COCKTAILVIAGGI/CMS/PDF/Irlanda%202009%2028-51pag.pdf) from Tika

Crash parsing pdf file (http://media.opentur.it/WEB/CHANNELS/COCKTAILVIAGGI/CMS/PDF/Irlanda%202009%2028-51pag.pdf) from Tika
----------------------------------------------------------------------------------------------------------------------------

                 Key: PDFBOX-617
                 URL: https://issues.apache.org/jira/browse/PDFBOX-617
             Project: PDFBox
          Issue Type: Bug
          Components: Parsing
    Affects Versions: 0.8.0-incubator
         Environment: Linux debian: Linux 2.6.18-6-686 #1 SMP i686 GNU/Linux 
java version "1.6.0_13"
Java(TM) SE Runtime Environment (build 1.6.0_13-b03)
Java HotSpot(TM) Client VM (build 11.3-b02, mixed mode, sharing)
            Reporter: Stefano Falconetti
            Priority: Critical


Parsing the file http://media.opentur.it/WEB/CHANNELS/COCKTAILVIAGGI/CMS/PDF/Irlanda%202009%2028-51pag.pdf with Tika fails in the "parse" call with the following stack trace:

java.io.IOException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@1578aab
	at com.travelport.indexing.documentparser.GenericDocumentParserTikaImpl.parse(GenericDocumentParserTikaImpl.java:143)
	at com.travelport.indexing.documentparser.GenericDocumentParserTikaImpl.main(GenericDocumentParserTikaImpl.java:306)
Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@1578aab
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:126)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:114)
	at com.travelport.indexing.documentparser.GenericDocumentParserTikaImpl.parse(GenericDocumentParserTikaImpl.java:69)
	... 1 more
Caused by: org.apache.pdfbox.exceptions.WrappedIOException
	at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:237)
	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:841)
	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:808)
	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:53)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
	... 4 more
Caused by: java.util.NoSuchElementException
	at java.util.AbstractList$Itr.next(AbstractList.java:350)
	at org.apache.pdfbox.pdfparser.PDFXrefStreamParser.parse(PDFXrefStreamParser.java:115)
	at org.apache.pdfbox.cos.COSDocument.parseXrefStreams(COSDocument.java:538)
	at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:203)
	... 8 more
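
The innermost "Caused by" entry above (java.util.NoSuchElementException) is the one raised inside PDFBox; the outer entries are wrappers. The following is a minimal sketch, using only JDK classes, of how calling code could walk such a wrapped chain to reach the innermost cause. The names RootCause and rootCause are hypothetical and are not part of Tika or PDFBox, and the chain built in main only imitates the shape of the trace.

```java
import java.io.IOException;
import java.util.NoSuchElementException;

class RootCause {
    /** Follows getCause() links until a throwable with no cause is reached. */
    public static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Hypothetical chain shaped like the trace above, built from JDK types only.
        Throwable inner = new NoSuchElementException();
        Throwable middle = new RuntimeException("wrapped by the parser", inner);
        Throwable outer = new IOException("wrapped by the caller", middle);
        System.out.println(rootCause(outer).getClass().getName());
    }
}
```

Logging the class name of the innermost cause next to the wrapper makes it easier to see which library actually failed.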


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PDFBOX-617) Crash parsing pdf file (http://media.opentur.it/WEB/CHANNELS/COCKTAILVIAGGI/CMS/PDF/Irlanda%202009%2028-51pag.pdf) from Tika

Posted by "Stefano Falconetti (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stefano Falconetti updated PDFBOX-617:
--------------------------------------

    Attachment: Portogallo2010.pdf
                StatiUniti2010_1.pdf



[jira] Updated: (PDFBOX-617) Crash parsing pdf file (http://media.opentur.it/WEB/CHANNELS/COCKTAILVIAGGI/CMS/PDF/Irlanda%202009%2028-51pag.pdf) from Tika

Posted by "Stefano Falconetti (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stefano Falconetti updated PDFBOX-617:
--------------------------------------

    Attachment: Irlanda26-52pag.pdf

Same problem, same exception:

java.io.IOException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@1f4bcf7
	at com.travelport.indexing.documentparser.GenericDocumentParserTikaImpl.parse(GenericDocumentParserTikaImpl.java:199)
	at com.travelport.indexing.documentparser.GenericDocumentParserTikaImpl.main(GenericDocumentParserTikaImpl.java:453)
Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@1f4bcf7
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:126)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:114)
	at com.travelport.indexing.documentparser.GenericDocumentParserTikaImpl.parse(GenericDocumentParserTikaImpl.java:107)
	... 1 more
Caused by: org.apache.pdfbox.exceptions.WrappedIOException
	at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:237)
	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:841)
	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:808)
	at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:53)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
	... 4 more
Caused by: java.util.NoSuchElementException
	at java.util.AbstractList$Itr.next(AbstractList.java:350)
	at org.apache.pdfbox.pdfparser.PDFXrefStreamParser.parse(PDFXrefStreamParser.java:115)
	at org.apache.pdfbox.cos.COSDocument.parseXrefStreams(COSDocument.java:538)
	at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:203)
	... 8 more




[jira] Resolved: (PDFBOX-617) Crash parsing pdf file (http://media.opentur.it/WEB/CHANNELS/COCKTAILVIAGGI/CMS/PDF/Irlanda%202009%2028-51pag.pdf) from Tika

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved PDFBOX-617.
----------------------------------

    Resolution: Duplicate

Confirmed that this has already been fixed in PDFBox 1.1.0 (or earlier), thus resolving this as a duplicate of some earlier issue.



[jira] Commented: (PDFBOX-617) Crash parsing pdf file (http://media.opentur.it/WEB/CHANNELS/COCKTAILVIAGGI/CMS/PDF/Irlanda%202009%2028-51pag.pdf) from Tika

Posted by "Stefano Falconetti (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855456#action_12855456 ] 

Stefano Falconetti commented on PDFBOX-617:
-------------------------------------------

Source of the method that crashes:


	/**
	 * Actually parse the file
	 * @throws IOException
	 */
	@SuppressWarnings("deprecation")
	//TODO move to the non deprecated call
	public void parse() throws IOException {
		
		try{
			if(this.myFileToParse != null){
			   this.inputStream = new FileInputStream(this.myFileToParse);
			}else{
				  this.inputStream = new ByteArrayInputStream(this.rawDocument.getContent());
			}
			
			Parser parser = new AutoDetectParser();
			ContentHandler textHandler = new BodyContentHandler();
			Metadata metadata = new Metadata();
			
			this.documentTitle = null;
			this.documentKeywords = null;
			this.documentContent = "";
			//TODO Move to a non deprecated call
			//parser.parse(inputStream, this, metadata, parseContext);
			parser.parse(inputStream, 
						 textHandler, 
						 metadata);
			//Get all arrays first
			String[] tmpDocumentTitle = metadata.getValues(Metadata.TITLE);
			String[] tmpKeywords = metadata.getValues(Metadata.KEYWORDS);
			String[] tmpDocumentDescription = metadata.getValues(Metadata.DESCRIPTION); 
			
			//###############################################################
			// Sequence of utility methods' calls must be: 
			// stringArrayToString -> cleanUpExtraChars -> tokenizeDeTokenize
			//###############################################################

			//Take keywords both from page keywords and description
			int keywordsNum = 0;
			int descriptionWordsNum = 0;
			
			if(tmpKeywords != null){
			   keywordsNum = tmpKeywords.length;
			}
			
			if(tmpDocumentDescription != null){
			   descriptionWordsNum = tmpDocumentDescription.length;
			}
			
			int allKeywordsNum = keywordsNum + descriptionWordsNum;
			
			String[] tmpAllKeywords = null;
			
			//From title as last chance
			if( (allKeywordsNum == 0) &&
				(tmpDocumentTitle != null) &&
				(tmpDocumentTitle.length != 0) ){
			   allKeywordsNum = 1;
			   logger.warn("No meta information found, using title");
			   tmpAllKeywords = new String[]{this.stringArrayToString(tmpDocumentTitle)};
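			}else{
				tmpAllKeywords = new String[allKeywordsNum];

				//Copy only the arrays that are present, reusing the null-safe lengths computed above
				if(tmpKeywords != null){
				   System.arraycopy(tmpKeywords, 0, tmpAllKeywords, 0, keywordsNum);
				}

				if(tmpDocumentDescription != null){
				   System.arraycopy(tmpDocumentDescription, 0, tmpAllKeywords, keywordsNum, descriptionWordsNum);
				}
			}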
			//Fill in public getters
			this.documentTitle = this.stringArrayToString(tmpDocumentTitle);
			this.documentTitle = this.cleanUpExtraChars(this.documentTitle);
			this.documentTitle = this.tokenizeDeTokenize(this.documentTitle);
			
			this.documentKeywords = this.stringArrayToString(tmpAllKeywords);
			this.documentKeywords = this.cleanUpExtraChars(this.documentKeywords);
			//TODO if this value is needed (5), put it in the configuration file
			this.documentKeywords = this.tokenizeDeTokenize(this.documentKeywords, 5).toLowerCase();
			
			this.documentDescription = this.stringArrayToString(tmpDocumentDescription);
			this.documentDescription = this.cleanUpExtraChars(this.documentDescription);
			this.documentDescription = this.tokenizeDeTokenize(this.documentDescription);
						
			this.documentContent = this.cleanUpExtraChars(textHandler.toString().trim());
						
			//#####################################################
			//### Very special cases of very bad document found ###
			//#####################################################
			if((this.documentTitle == null) ||
			   (this.documentTitle.trim().equals(""))){
				
				this.documentTitle = this.guessTitle(this.documentContent, this.rawDocument.getURL().getHost()); 	     												  
			}
			
			if((this.documentKeywords == null) ||
			   (this.documentKeywords.trim().equals(""))){
				this.documentKeywords = this.guessKeywords(this.documentContent);
			}
			//##############################################
			
			//Semantic checks:
			//Checking if keywords are appropriate, as being present in content also.
			this.documentKeywords = this.contentKeywordsConsistencyCheck(this.documentKeywords, 
																	     this.documentContent);
			
		}catch(FileNotFoundException fnfExc) {
			   throw new IOException(fnfExc);
		}catch(SAXException sExc) {
			   throw new IOException(sExc);
		}catch(TikaException tExc) {
			   throw new IOException(tExc);
		}catch(Exception exc) {
			   throw new IOException(exc);
		}
	}
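
The step in the method above that combines the keywords and description arrays can also be isolated into a small helper. This is a minimal sketch, assuming either array may be null (the method itself checks for that); the names KeywordMerge and concat are hypothetical and do not come from the original class.

```java
import java.util.Arrays;

class KeywordMerge {
    /** Concatenates two arrays in order, treating a null array as empty. */
    public static String[] concat(String[] first, String[] second) {
        String[] a = (first == null) ? new String[0] : first;
        String[] b = (second == null) ? new String[0] : second;
        String[] merged = new String[a.length + b.length];
        System.arraycopy(a, 0, merged, 0, a.length);
        // The second array starts where the first one ended.
        System.arraycopy(b, 0, merged, a.length, b.length);
        return merged;
    }

    public static void main(String[] args) {
        String[] keywords = null;                    // hypothetical: no keywords metadata
        String[] description = {"Irlanda", "2009"};  // hypothetical description values
        System.out.println(Arrays.toString(concat(keywords, description)));
    }
}
```

Keeping the null handling in one place avoids passing a null source array to System.arraycopy.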



[jira] Updated: (PDFBOX-617) Crash parsing pdf file (http://media.opentur.it/WEB/CHANNELS/COCKTAILVIAGGI/CMS/PDF/Irlanda%202009%2028-51pag.pdf) from Tika

Posted by "Stefano Falconetti (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stefano Falconetti updated PDFBOX-617:
--------------------------------------

    Attachment: Irlanda125pag.pdf

Hi, I found the site hosting the PDF files that cause the crash. The attached file is a different one from the file in the original report, but the bug occurs with it as well. If you like, you can try the attached file, which is also available at this link:

http://www.cocktailviaggi.it/cataloghi_pdf.cfm?pkalbero=377&pknodo=30584



[jira] Commented: (PDFBOX-617) Crash parsing pdf file (http://media.opentur.it/WEB/CHANNELS/COCKTAILVIAGGI/CMS/PDF/Irlanda%202009%2028-51pag.pdf) from Tika

Posted by "Stefano Falconetti (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857704#action_12857704 ] 

Stefano Falconetti commented on PDFBOX-617:
-------------------------------------------

That is the PDFBox version declared as a dependency by the Tika release I'm using. If 0.8.0 and 1.1.0 are fully compatible and Tika runs fine with the newer one, no problem, I will give it a try.


[jira] Commented: (PDFBOX-617) Crash parsing pdf file (http://media.opentur.it/WEB/CHANNELS/COCKTAILVIAGGI/CMS/PDF/Irlanda%202009%2028-51pag.pdf) from Tika

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857448#action_12857448 ] 

Andreas Lehmkühler commented on PDFBOX-617:
-------------------------------------------

You are using an older version of PDFBox. Is it possible to use a more recent version such as 1.1.0? The Irlanda PDFs work fine with the current trunk version.
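
To see which PDFBox version is actually loaded at runtime, something like the following may help. It is a minimal sketch, assuming the jar's manifest carries an Implementation-Version entry (it prints "unknown" otherwise). The names LibraryVersion and versionOf are hypothetical; org.apache.pdfbox.pdmodel.PDDocument is the class that appears in the stack trace.

```java
class LibraryVersion {
    /** Returns the Implementation-Version recorded for the class's package, or "unknown". */
    public static String versionOf(Class<?> type) {
        Package pkg = type.getPackage();
        String version = (pkg == null) ? null : pkg.getImplementationVersion();
        return (version == null) ? "unknown" : version;
    }

    public static void main(String[] args) throws Exception {
        // Defaults to a JDK class so the sketch runs without PDFBox on the classpath.
        // Pass org.apache.pdfbox.pdmodel.PDDocument as the argument to inspect PDFBox.
        String className = (args.length > 0) ? args[0] : "java.lang.String";
        System.out.println(className + " -> " + versionOf(Class.forName(className)));
    }
}
```

Running it with the same classpath as the application shows whether the old or the new jar wins when both are present.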


[jira] Commented: (PDFBOX-617) Crash parsing pdf file (http://media.opentur.it/WEB/CHANNELS/COCKTAILVIAGGI/CMS/PDF/Irlanda%202009%2028-51pag.pdf) from Tika

Posted by "Stefano Falconetti (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855444#action_12855444 ] 

Stefano Falconetti commented on PDFBOX-617:
-------------------------------------------

Sorry, I don't have a copy. The problem occurred with several PDF files that looked like this one.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PDFBOX-617) Crash parsing pdf file (http://media.opentur.it/WEB/CHANNELS/COCKTAILVIAGGI/CMS/PDF/Irlanda%202009%2028-51pag.pdf) from Tika

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855339#action_12855339 ] 

Jukka Zitting commented on PDFBOX-617:
--------------------------------------

The PDF doesn't seem to exist at the given URL anymore. Do you have a local copy of the document that you could share?

