Posted to commits@pdfbox.apache.org by ju...@apache.org on 2008/03/08 15:24:46 UTC
svn commit: r634972 [2/5] - in /incubator/pdfbox/trunk/migration: ./ fontbox/ jempbox/ pdfbox/

Added: incubator/pdfbox/trunk/migration/pdfbox/bugs.xml
URL: http://svn.apache.org/viewvc/incubator/pdfbox/trunk/migration/pdfbox/bugs.xml?rev=634972&view=auto
==============================================================================
--- incubator/pdfbox/trunk/migration/pdfbox/bugs.xml (added)
+++ incubator/pdfbox/trunk/migration/pdfbox/bugs.xml Sat Mar  8 06:24:45 2008
@@ -0,0 +1,38684 @@
+<tracker version="1.0" xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://sourceforge.net/export/sf_tracker_export.xsd">
+	<artifact id="1901534">
+		<submitted_by>nobody</submitted_by>
+		<submitted_date>2008-02-25 09:10</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="519373">text extraction</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Ben Litchfield</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>java.lang.ClassCastException: org.pdfbox.cos.COSArray cannot</summary>
+		<detail>I'm working with pdfbox 0.7.3
+
+I'm extracting text from PDF files and it works fine, but I found a PDF file that crashes the extraction (file attached).
+
+The code I wrote is:
+
+stream = new FileInputStream(file);
+pdfDocument = PDDocument.load(stream);
+if (pdfDocument.isEncrypted()) {
+    pdfDocument.decrypt("");
+}
+StringWriter writer = new StringWriter();
+PDFTextStripper stripper = new PDFTextStripper();
+stripper.writeText(pdfDocument, writer);
+contents = writer.getBuffer().toString();
+
+
+When trying to extract text from this file I'm getting the following exception:
+
+
+java.lang.ClassCastException: org.pdfbox.cos.COSArray cannot be cast to org.pdfbox.cos.COSDictionary
+        at org.pdfbox.filter.FlateFilter.decode(FlateFilter.java:70)
+        at org.pdfbox.cos.COSStream.doDecode(COSStream.java:319)
+        at org.pdfbox.cos.COSStream.doDecode(COSStream.java:261)
+        at org.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:173)
+        at org.pdfbox.pdfparser.PDFStreamParser.&lt;init&gt;(PDFStreamParser.java:91)
+        at org.pdfbox.cos.COSStream.getStreamTokens(COSStream.java:135)
+        at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:189)
+        at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:160)
+        at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:355)
+        at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:268)
+        at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:220)
+
+
+Thanks
+
+german.gf@gmail.com
+</detail>
+		<existingfiles>
+			<file>
+				<id>267915</id>
+				<name>attachment.pdf</name>
+				<description>PDF file for which text extraction fails</description>
+				<filesize>85947</filesize>
+				<filetype>application/pdf</filetype>
+				<adddate>1203959458</adddate>
+				<submitted_by>100</submitted_by>
+			</file>
+		</existingfiles>
+		<change_log>
+			<item>
+				<field>File Added</field>
+				<old_value>267915: attachment.pdf</old_value>
+				<date>1203959459</date>
+				<by>nobody</by>
+			</item>
+		</change_log>
+	</artifact>
+	<artifact id="1900699">
+		<submitted_by>nobody</submitted_by>
+		<submitted_date>2008-02-23 21:33</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="519373">text extraction</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Ben Litchfield</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>JVM crashes in Unix AIX </summary>
+		<detail>The JVM crashes on AIX whenever we try to extract text from PDF files.</detail>
+	</artifact>
+	<artifact id="1894664">
+		<submitted_by>nobody</submitted_by>
+		<submitted_date>2008-02-15 13:28</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="100">None</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Nobody/Anonymous</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>Exception when running PDTToImage class</summary>
+		<detail>When running the PDFToImage class (version 1.6 included in PDFBox-0.7.4-dev-20080215 build), I'm getting the following exception:
+java.awt.geom.IllegalPathStateException: missing initial moveto in path definition
+
+This is being thrown from the following method:
+PDFStreamEngine.processSubStream(PDPage, PDResources, COSStream).
+The stack trace is:
+PageDrawer(PDFStreamEngine).processSubStream(PDPage, PDResources, COSStream) line: 230
+PageDrawer(PDFStreamEngine).processStream(PDPage, PDResources, COSStream) line: 177
+PageDrawer.drawPage(Graphics, PDPage, Dimension) line: 103
+PDPage.convertToImage() line: 685
+PDFImageWriter.WriteImage(PDDocument, String, String, int, int, String) line: 100
+PDFToImage.main(String[]) line: 179
+
+Thanks,
+
+Dave Belt
+dbelt@erac.com</detail>
+		<follow_ups>
+			<item>
+				<date>1203555378</date>
+				<sender>nobody</sender>
+				<text>Logged In: NO 
+
+I've done some additional debug tracing, and found that the exception is occurring when subPath.closePath() is called:
+  FillNonZeroRule.process(PDFOperator, List) line: 74
+  PageDrawer(PDFStreamEngine).processOperator(PDFOperator, List) line: 497
+  PageDrawer(PDFStreamEngine).processSubStream(PDPage, PDResources, COSStream) line: 218
+  PageDrawer(PDFStreamEngine).processStream(PDPage, PDResources, COSStream) line: 177
+  PageDrawer.drawPage(Graphics, PDPage, Dimension) line: 103
+  PDPage.convertToImage() line: 685
+  PDFImageWriter.WriteImage(PDDocument, String, String, int, int, String) line: 100
+  PDFToImage.main(String[]) line: 179
+
+At the time of the exception, variable subPath contains the following values:
+subPath= GeneralPath  (id=90)
+	numCoords= 0
+	numTypes= 0
+	pointCoords= float[40]  (id=96)
+		[0]= 79.68
+		[1]= 309.36
+		[2]= 486.4
+		[3]= 309.36
+		[4]= 486.4
+		[5]= 323.56
+		[6]= 79.68
+		[7]= 323.56
+		[8]= 79.68
+		[9]= 309.36
+		[10]= 0.0
+		[11]= 0.0
+		[12]= 0.0
+		[13]= 0.0
+		[14]= 0.0
+		[15]= 0.0
+		[16]= 0.0
+		[17]= 0.0
+		[18]= 0.0
+		[19]= 0.0
+		[20]= 0.0
+		[21]= 0.0
+		[22]= 0.0
+		[23]= 0.0
+		[24]= 0.0
+		[25]= 0.0
+		[26]= 0.0
+		[27]= 0.0
+		[28]= 0.0
+		[29]= 0.0
+		[30]= 0.0
+		[31]= 0.0
+		[32]= 0.0
+		[33]= 0.0
+		[34]= 0.0
+		[35]= 0.0
+		[36]= 0.0
+		[37]= 0.0
+		[38]= 0.0
+		[39]= 0.0
+	pointTypes= byte[20]  (id=97)
+		[0]= 0
+		[1]= 1
+		[2]= 1
+		[3]= 1
+		[4]= 1
+		[5]= 4
+		[6]= 0
+		[7]= 0
+		[8]= 0
+		[9]= 0
+		[10]= 0
+		[11]= 0
+		[12]= 0
+		[13]= 0
+		[14]= 0
+		[15]= 0
+		[16]= 0
+		[17]= 0
+		[18]= 0
+		[19]= 0
+	windingRule= 1
+
+I hope this helps!</text>
+			</item>
+			<item>
+				<date>1203110990</date>
+				<sender>nobody</sender>
+				<text>Logged In: NO 
+
+Sorry, the class name in the summary should have been PDFToImage.</text>
+			</item>
+		</follow_ups>
+	</artifact>
+	<artifact id="1894224">
+		<submitted_by>nobody</submitted_by>
+		<submitted_date>2008-02-14 20:25</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="775780">PDModel.AcroForm</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Ben Litchfield</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>Alters the security properties of the Acro form when saving</summary>
+		<detail>Saving an AcroForm after modifying it is broken: filling in the form fields and then saving the PDF document alters the original security properties of the form.</detail>
+	</artifact>
+	<artifact id="1893204">
+		<submitted_by>nobody</submitted_by>
+		<submitted_date>2008-02-13 16:00</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="566875">PDModel</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Ben Litchfield</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>Trying to assign an action to a bookmark</summary>
+		<detail>I need to be able to assign an action to a bookmark, to open an external file.
+I am having a difficult time trying to figure this out, so any assistance would be greatly appreciated.
+
+Here is the section of code trying to accomplish this:
+Case "Master Agenda"
+
+        'Get the list of pages in the document
+        Dim pages As ArrayList = PdfDoc.getDocumentCatalog.getAllPages()
+
+
+        Dim i, pageNumber As Integer
+        pageNumber = 0
+        Dim row As DataRow = Nothing
+        Dim ParentChild As String = String.Empty
+        Dim bookmarkTitle As String = String.Empty
+        Dim pdfFileName As String = String.Empty
+        'loop thru the bookmark datatable and add bookmarks to the document accordingly
+        For i = 0 To rowCount - 1 Step 1
+            ' Dim rowArrayData() As String
+            'Read the row's data
+            row = bookmarkTable.Rows(i)
+            bookmarkTitle = row.Item("bookmarkTitle")
+
+            'Get the page at pageNumber from pages list
+            Dim page As PDPage = CType(pages.get(pageNumber), PDPage)
+            Dim dest As PDPageFitWidthDestination = New PDPageFitWidthDestination()
+            Dim externalFile As New interactive.action.type.PDWindowsLaunchParams()
+            externalFile.setFilename("C:\DBPRTestData\FINAL ACTION\Processed\FINAL ACTION-APRIL_2008 - Case # 0000000000100000000.PDF")
+            externalFile.setOperation("open")
+            '            externalFile.setF("C:\DBPRTestData\FINAL ACTION\Processed\FINAL ACTION-APRIL_2008 - Case # 0000000000100000000.PDF")
+            'externalFile.setExecuteParam(   shouldOpenInNewWindow()
+            'externalFile.setType("open")
+
+
+            dest.setPage(page)
+
+            'Determine the bookmark type (parent or child)
+            ' In this case:
+            '   Parent bookmark type is an Agenda Item
+            '   Child bookmark type are Supporting Documents
+
+
+            bookmark = New PDOutlineItem
+            bookmark.setDestination(dest)
+            bookmark.setTitle(bookmarkTitle.Trim)
+            Dim openAction As New interactive.action.type.PDActionLaunch
+            openAction.setType("open")
+            'Dim openActionStr As String = 
+            bookmark.setAction(openAction)
+
+
+            'Add this child bookmark to the parent's bookmark
+            pagesOutline.appendChild(bookmark)
+
+
+        Next i
+
+End Select
+'Expand the bookmark tree
+pagesOutline.openNode()
+outline.openNode()
+'Save the document to a file
+PdfDoc.save(outFile)
+result = True</detail>
+		<follow_ups>
+			<item>
+				<date>1202965991</date>
+				<sender>maleman2045</sender>
+				<text>Logged In: YES 
+user_id=2008850
+Originator: NO
+
+Sorry, I posted this entry before I created my account.
+I'll be monitoring this for a response. 
+Thanks!</text>
+			</item>
+		</follow_ups>
+	</artifact>
+	<artifact id="1891856">
+		<submitted_by>nobody</submitted_by>
+		<submitted_date>2008-02-12 03:00</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="520288">lucene</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Ben Litchfield</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>pdf page extraction</summary>
+		<detail>Hi, 
+I went through this link
+http://studentclub.ro/lucians_weblog/archive/2007/03/22/read-from-a-pdf-file-using-c.aspx
+and found the solution for reading text from a PDF. This one is really great. 
+
+I am reading PDFs using this functionality and it works fine, but now I want to extract the text of a single page. Is there a method in PDFBox that returns the text for a given page number, so that I don't have to read the whole PDF unnecessarily?
+
+You can respond to me at singh.devendra@yahoo.com
+
+Thanks 
+Devendra</detail>
+		<follow_ups>
+			<item>
+				<date>1202814086</date>
+				<sender>nobody</sender>
+				<text>Logged In: NO 
+
+And I am using C#.</text>
+			</item>
+		</follow_ups>
+	</artifact>
+	<artifact id="1889131">
+		<submitted_by>bwingenroth</submitted_by>
+		<submitted_date>2008-02-07 14:27</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="520300">parsing</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Ben Litchfield</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>Patch for two bugs: 1 in PDDocument.load 1 in PDFMergerUtili</summary>
+		<detail>I've come across two bugs that the attached patch addresses.  I don't claim to fully understand these bugs, nor do I know whether my patch fixes the root issues -- I only know that using this patch allows me to continue using PDFBox.
+
+I came across a ClassCastException in PDDocument.load.  It actually manifests as an IOWrapperException caused by the ClassCastException.  In getObjectsByType inside the file COSDocument.java, you grab the COSDictionary and attempt to cast each item as a COSName.  I've found a PDF that opens fine in Acrobat but apparently contains a COSObject inside that dictionary, which throws the ClassCastException.  My patch for COSDocument.java wraps the cast in a try block and silently discards the ClassCastException.  This looks safe, since the code seems to want to match a COSName against a type passed into the function.  If the item isn't castable as a COSName, then I assume it could never match anyway, so my patch ignores it.
+
+The second issue -- in PDFMergerUtility.java -- is that srcNums can be null.  
+
+COSArray srcNums = (COSArray)srcLabels.getDictionaryObject( COSName.getPDFName( "Nums" ) );
+
+My patch tests srcNums to make sure it isn't null before iterating over it.
+
+Again, I don't know that I've addressed the underlying causes behind these issues, but I have successfully patched the nightly build from 20080205 with these changes and haven't noticed any ill effects.</detail>
+		<existingfiles>
+			<file>
+				<id>265468</id>
+				<name>PDFBox-0.7.4-dev-20080205.patch</name>
+				<description>Patch for bugs in PDDocument.load and PDFMergerUtility</description>
+				<filesize>2757</filesize>
+				<filetype>application/octet-stream</filetype>
+				<adddate>1202423259</adddate>
+				<submitted_by>843437</submitted_by>
+			</file>
+		</existingfiles>
+		<change_log>
+			<item>
+				<field>File Added</field>
+				<old_value>265468: PDFBox-0.7.4-dev-20080205.patch</old_value>
+				<date>1202423259</date>
+				<by>bwingenroth</by>
+			</item>
+		</change_log>
+	</artifact>
+	<artifact id="1888826">
+		<submitted_by>nobody</submitted_by>
+		<submitted_date>2008-02-07 06:56</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="100">None</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Nobody/Anonymous</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>Letter is shown twice</summary>
+		<detail>In the attached file, the letter 'H' after the number '07005072' is also (incorrectly) shown in the upper left corner.
+
+Sune Lindhe
+lindhe@regionh.dk</detail>
+		<existingfiles>
+			<file>
+				<id>265427</id>
+				<name>test.pdf</name>
+				<description>File illustrating the bug</description>
+				<filesize>4842</filesize>
+				<filetype>application/pdf</filetype>
+				<adddate>1202396179</adddate>
+				<submitted_by>100</submitted_by>
+			</file>
+		</existingfiles>
+		<change_log>
+			<item>
+				<field>File Added</field>
+				<old_value>265427: test.pdf</old_value>
+				<date>1202396179</date>
+				<by>nobody</by>
+			</item>
+		</change_log>
+	</artifact>
+	<artifact id="1887932">
+		<submitted_by>nobody</submitted_by>
+		<submitted_date>2008-02-06 06:30</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="520288">lucene</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Ben Litchfield</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>Version incompatibility with Lucene?</summary>
+		<detail>It seems that the code that does text extraction for Lucene is not compatible with the later versions of Lucene.
+
+I had to revert all the way back to Lucene 2.0.0 from 2006 in order for PDFBox to be able to parse PDF documents for Lucene.
+
+Manual recompilation of the PDFBox source is not really an option due to a large number of dependencies which we don't have the resources to keep track of.
+
+A recompilation and release compatible with Lucene 2.3.0 would be much appreciated!
+
+knutf@firstventura.com</detail>
+	</artifact>
+	<artifact id="1865612">
+		<submitted_by>jalanthomas</submitted_by>
+		<submitted_date>2008-01-06 19:07</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="100">None</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Nobody/Anonymous</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>showCharacter Javadoc unclear and Method title ambiguous</summary>
+		<detail>   The JavaDoc for the showCharacter method of the PDFStreamEngine is inadequate.  The method name is misleading for a general event interface.  The documentation is not clear that the processStream method causes showCharacter to be invoked.  
+
+
+</detail>
+	</artifact>
+	<artifact id="1844126">
+		<submitted_by>kostia76</submitted_by>
+		<submitted_date>2007-12-04 07:04</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="817859">PDFReader</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Ben Litchfield</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>Watermarks aren't shown correctly</summary>
+		<detail>When a PDF file has a gray watermark, it isn't shown correctly.</detail>
+		<existingfiles>
+			<file>
+				<id>257215</id>
+				<name>gws_wechselgeldzuteilung_wasserzeichen.pdf</name>
+				<description>PDF with Watermark stamped</description>
+				<filesize>21921</filesize>
+				<filetype>application/pdf</filetype>
+				<adddate>1196780677</adddate>
+				<submitted_by>1792375</submitted_by>
+			</file>
+		</existingfiles>
+		<change_log>
+			<item>
+				<field>File Added</field>
+				<old_value>257215: gws_wechselgeldzuteilung_wasserzeichen.pdf</old_value>
+				<date>1196780677</date>
+				<by>kostia76</by>
+			</item>
+		</change_log>
+	</artifact>
+	<artifact id="1843491">
+		<submitted_by>michel-onoff</submitted_by>
+		<submitted_date>2007-12-03 09:01</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="100">None</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Nobody/Anonymous</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>ExtractImages (hence PDXObjectImage) produces wrong PNGs.</summary>
+		<detail>ExtractImages (and hence PDXObjectImage) sometimes produces wrong PNGs.
+
+When getRGBImage() is called on a PDXObjectImage, a faulty BufferedImage is created.
+
+The same file is opened and displayed correctly with other tools like Acrobat Reader and GSView, so the file is not corrupt.
+
+I attach the .pdf. You can view its real content with Acrobat Reader and try to extract the image by running
+
+java -cp PDFBox-&lt;version&gt;.jar org.pdfbox.ExtractImages &lt;file.pdf&gt;
+
+I tried both the release 0.7.3 and the last nightly build of 20071203
+
+</detail>
+		<follow_ups>
+			<item>
+				<date>1203010528</date>
+				<sender>michel-onoff</sender>
+				<text>Logged In: YES 
+user_id=1951253
+Originator: YES
+
+I retried again with the newest nightly build and the attached pdf100.pdf file.
+The extracted image is still wrong.
+
+The pdf contains a black&amp;white image compressed with Flate and ASCII85 and no predictor.
+
+Every other viewer I tried (including AcroRead, GSView, Cabaret, jPedal) has no trouble.
+
+I have all jai libraries installed as well.
+
+Any idea?
+
+This is the command I use
+java -cp PDFBox-0.7.4-dev-20080214.jar org.pdfbox.ExtractImages pdf100.pdf
+
+It produces a PNG with wrong content.
+</text>
+			</item>
+			<item>
+				<date>1200327240</date>
+				<sender>michel-onoff</sender>
+				<text>Logged In: YES 
+user_id=1951253
+Originator: YES
+
+I retried with the newest nightly build and the attached pdf100.pdf file. The extracted image is still wrong.
+Any idea?
+
+java -cp PDFBox-0.7.4-dev-20080113.jar org.pdfbox.ExtractImages pdf100.pdf
+
+
+</text>
+			</item>
+		</follow_ups>
+		<existingfiles>
+			<file>
+				<id>257078</id>
+				<name>pdf100.pdf</name>
+				<description></description>
+				<filesize>11814</filesize>
+				<filetype>application/save-as</filetype>
+				<adddate>1196701301</adddate>
+				<submitted_by>1951253</submitted_by>
+			</file>
+		</existingfiles>
+		<change_log>
+			<item>
+				<field>File Added</field>
+				<old_value>257078: pdf100.pdf</old_value>
+				<date>1196701301</date>
+				<by>michel-onoff</by>
+			</item>
+		</change_log>
+	</artifact>
+	<artifact id="1838989">
+		<submitted_by>jeunes2</submitted_by>
+		<submitted_date>2007-11-26 13:08</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="100">None</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Nobody/Anonymous</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>Argument parsing in ConvertColorspace broken</summary>
+		<detail>The argument parsing in ConvertColorspace is broken; arguments for the -password and -toColorspace options are handled as files. Changing the "if"s in lines 312 and 321 to "else if"s fixes this issue.</detail>
+	</artifact>
+	<artifact id="1835843">
+		<submitted_by>ibuzz</submitted_by>
+		<submitted_date>2007-11-21 07:25</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="519373">text extraction</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Ben Litchfield</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>TrueType and characterHorizontalDisplacement</summary>
+		<detail>PDFBox seems to have an issue with some PDF documents that use TrueType fonts with no width information stored in the font dictionary: 
+ 
+6 0 obj 
+&lt;&lt;/BaseFont /TimesNewRoman 
+/Subtype /TrueType 
+/Name /F0 
+/Encoding /WinAnsiEncoding 
+/Type /Font 
+&gt;&gt; 
+endobj 
+ 
+ 
+The value of "characterHorizontalDisplacement" in "showString()" (PDFTextStripper.java) is always 0. 
+ 
+Does PDFBox try to get the width information from the TrueType font program when this situation occurs? 
+ 
+If the answer is no, how can I get the width value from a TrueType font program? 
+ 
+Thanks! </detail>
+		<existingfiles>
+			<file>
+				<id>255309</id>
+				<name>CharDisplAndTTF.pdf</name>
+				<description></description>
+				<filesize>215054</filesize>
+				<filetype>application/pdf</filetype>
+				<adddate>1195658751</adddate>
+				<submitted_by>964306</submitted_by>
+			</file>
+		</existingfiles>
+		<change_log>
+			<item>
+				<field>File Added</field>
+				<old_value>255309: CharDisplAndTTF.pdf</old_value>
+				<date>1195658751</date>
+				<by>ibuzz</by>
+			</item>
+		</change_log>
+	</artifact>
+	<artifact id="1835664">
+		<submitted_by>nobody</submitted_by>
+		<submitted_date>2007-11-21 02:23</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="519373">text extraction</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Ben Litchfield</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>expected='/' actual='e'-101 </summary>
+		<detail>I'm getting following error during text extraction:
+
+expected='/' actual='e'-101 org.pdfbox.io.PushBackInputStream@172dc32</detail>
+		<follow_ups>
+			<item>
+				<date>1199092568</date>
+				<sender>nobody</sender>
+				<text>Logged In: NO 
+
+Is anybody working on this bug?</text>
+			</item>
+		</follow_ups>
+	</artifact>
+	<artifact id="1834581">
+		<submitted_by>feltonfee</submitted_by>
+		<submitted_date>2007-11-19 07:09</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="100">None</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Nobody/Anonymous</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>One rectangle missing when converting PDF to image</summary>
+		<detail>Same as forum issue "Box missing, LinePath not properly reset ?", id: 267205
+
+I got the latest code from CVS.
+Each time there is an RE in the PDF, org.pdfbox.util.operator.pagedrawer.AppendRectangleToPath
+calls drawer.getLinePath().reset() to empty the path.
+That causes a missing box for the following PDF content:
+
+RE 
+RE 
+S 
+
+When I convert to an image, only one box is shown; in Acrobat Reader there are two boxes. Please check the attached PDF.
+
+Thanks!
+
+
+ </detail>
+		<existingfiles>
+			<file>
+				<id>254961</id>
+				<name>2boxes.pdf</name>
+				<description>PDF with 2 boxes</description>
+				<filesize>10774</filesize>
+				<filetype>application/pdf</filetype>
+				<adddate>1195484970</adddate>
+				<submitted_by>623314</submitted_by>
+			</file>
+		</existingfiles>
+		<change_log>
+			<item>
+				<field>File Added</field>
+				<old_value>254961: 2boxes.pdf</old_value>
+				<date>1195484970</date>
+				<by>feltonfee</by>
+			</item>
+		</change_log>
+	</artifact>
+	<artifact id="1833244">
+		<submitted_by>phenderson</submitted_by>
+		<submitted_date>2007-11-16 08:23</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="566875">PDModel</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Ben Litchfield</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>Images with transparency are not rendered correctly</summary>
+		<detail>PDF documents that contain images with transparent regions are not rendered correctly. The transparent areas are painted black.
+
+Attached is an example pdf with text behind an image, generated via OpenOffice.
+
+Cheers
+
+Peter Henderson</detail>
+		<existingfiles>
+			<file>
+				<id>254618</id>
+				<name>test.pdf</name>
+				<description>example pdf</description>
+				<filesize>196955</filesize>
+				<filetype>application/pdf</filetype>
+				<adddate>1195230219</adddate>
+				<submitted_by>764374</submitted_by>
+			</file>
+		</existingfiles>
+		<change_log>
+			<item>
+				<field>File Added</field>
+				<old_value>254618: test.pdf</old_value>
+				<date>1195230220</date>
+				<by>phenderson</by>
+			</item>
+		</change_log>
+	</artifact>
+	<artifact id="1827099">
+		<submitted_by>ibuzz</submitted_by>
+		<submitted_date>2007-11-06 11:53</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="519373">text extraction</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Ben Litchfield</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>Don't want gibberish characters when extracting text.</summary>
+		<detail>Hi,
+
+I'm not sure if it's a bug, but I don't really like getting gibberish characters when extracting text that uses a custom encoding.  It's useful when debugging, but in a real-world situation we don't necessarily want to extract and see them.
+
+It would be nice if the method "font.encode( string, i, codeLength );" had an option to return nothing (if the font uses a custom encoding).
+
+Another way would be to give us a flag to be able to identify easily custom encoding for a font (in PDFont?).
+
+Thanks!
+
+
+</detail>
+	</artifact>
+	<artifact id="1824739">
+		<submitted_by>gdeshors</submitted_by>
+		<submitted_date>2007-11-02 09:12</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="556660">writing</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Ben Litchfield</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>PDF printing or conversion : lines are too thick - SOLVED ?</summary>
+		<detail>I'm starting to use PDFBox, and I find it a great tool. 
+The problem I found is the following: when printing a PDF or converting it to an image, the table borders become too thick, which looks bad. This is probably the case for all lines drawn in the document.
+ 
+To illustrate this, I've uploaded screenshots of  
+- the original document : http://glcomp.free.fr/images/pdfbox/orig.png  
+- and the document rendered by PDFBox (printed via PDF995) : http://glcomp.free.fr/images/pdfbox/final.png 
+ 
+(note: I've put the PDF documents there as well: http://glcomp.free.fr/images/pdfbox/ ) 
+ 
+Steps to reproduce : 
+1/ take my document and convert it to image, or print it via PDFBox (let's say "PDFToImage -imageTypeJpg orig.pdf")
+2/ on produced image orig1.jpg the lines are thicker than on original document.
+
+--------- SOLUTION ?
+
+I seem to have found the problem, though it seems a little too obvious and I'm afraid I may be missing something... 
+In the class org.pdfbox.util.operator.pagedrawer.AppendRectangleToPath#process method, which actually draws the rectangles, the x2 and y2 coordinates have an apparently useless '+1' on lines 67 and 68. I didn't see a reason for this, and when I remove these it works much better, for all the documents I tried.
+
+I'm uploading a patch to be crystal clear about the changes I propose.
+
+Sincerely,
+Guillaume Deshors</detail>
+		<existingfiles>
+			<file>
+				<id>252546</id>
+				<name>patch_AppendRectangleToPath.txt</name>
+				<description>Patch to fix bug of thick lines</description>
+				<filesize>296</filesize>
+				<filetype>text/plain</filetype>
+				<adddate>1194019977</adddate>
+				<submitted_by>849198</submitted_by>
+			</file>
+		</existingfiles>
+		<change_log>
+			<item>
+				<field>File Added</field>
+				<old_value>252546: patch_AppendRectangleToPath.txt</old_value>
+				<date>1194019977</date>
+				<by>gdeshors</by>
+			</item>
+		</change_log>
+	</artifact>
+	<artifact id="1787501">
+		<submitted_by>gjniewenhuijse</submitted_by>
+		<submitted_date>2007-09-04 00:24</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="817859">PDFReader</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Ben Litchfield</assigned_to>
+		<priority id="9">9</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>layout print problem</summary>
+		<detail>When I print the attached file, some things are not printed correctly:
+
+- The gray box at the top
+- The fonts are printed bold, which is not right.
+
+Is there any solution for now, or for later? 
+
+When I open and print this file with Adobe Reader, everything is fine, but with PDFBox I get a layout problem.
+
+I used the newest PDFBox version (and also tested the nightly build).</detail>
+		<existingfiles>
+			<file>
+				<id>244104</id>
+				<name>orarrp.pdf</name>
+				<description>pdf with print problem</description>
+				<filesize>7871</filesize>
+				<filetype>application/pdf</filetype>
+				<adddate>1188890656</adddate>
+				<submitted_by>1873119</submitted_by>
+			</file>
+		</existingfiles>
+		<change_log>
+			<item>
+				<field>priority</field>
+				<old_value>5</old_value>
+				<date>1193821091</date>
+				<by>gjniewenhuijse</by>
+			</item>
+			<item>
+				<field>File Added</field>
+				<old_value>244104: orarrp.pdf</old_value>
+				<date>1188890659</date>
+				<by>gjniewenhuijse</by>
+			</item>
+		</change_log>
+	</artifact>
+	<artifact id="1822993">
+		<submitted_by>nobody</submitted_by>
+		<submitted_date>2007-10-30 12:41</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="100">None</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Nobody/Anonymous</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>PDF Split Query with VB.NET</summary>
+		<detail>Hi,
+I am working for xyz company. My task is to split a PDF file using VB.NET.
+
+I need working code. Any and all help would be highly appreciated. I tried a lot but couldn't succeed.
+
+I am using PDFBox 0.7.3 and VB.NET 2005.
+
+The problem: I have a PDF (pages 1 to 150) and I want to split it into 15 PDFs (1 to 10, 11 to 20, 21 to 30, ...).
+
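The page-range arithmetic for such a split can be sketched independently of any PDF library. The following is a minimal sketch in Java (not VB.NET or C#), and the class and method names are hypothetical; it only computes the 1-based, inclusive page ranges and does not call any PDFBox API.

```java
public class PageRanges {
    // Returns {first, last} page pairs, 1-based and inclusive, covering 1..pageCount.
    static int[][] chunk(int pageCount, int chunkSize) {
        // Round up so a final, shorter chunk is still included.
        int chunks = (pageCount + chunkSize - 1) / chunkSize;
        int[][] ranges = new int[chunks][2];
        for (int i = 0; i != chunks; i++) {
            ranges[i][0] = i * chunkSize + 1;
            ranges[i][1] = Math.min((i + 1) * chunkSize, pageCount);
        }
        return ranges;
    }

    public static void main(String[] args) {
        // 150 pages in chunks of 10 gives 15 ranges: 1 to 10, 11 to 20, and so on.
        for (int[] range : chunk(150, 10)) {
            System.out.println(range[0] + " to " + range[1]);
        }
    }
}
```

Each range would then be handed to whatever routine copies those pages into a new document.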
+Please, it's urgent. I am eagerly waiting for working code in VB.NET or C#.</detail>
+	</artifact>
+	<artifact id="1822452">
+		<submitted_by>nobody</submitted_by>
+		<submitted_date>2007-10-29 17:37</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="520300">parsing</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Ben Litchfield</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>Error getting pdf version</summary>
+		<detail>java.io.IOException: Error getting pdf version:java.lang.NumberFormatException: For input string: "-"
+	at org.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:166)
+	at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:707)
+	at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:691)
+	at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:633)
+	at test.pdfbox.pdfparser.TestPDFParser.test_exception_version1(TestPDFParser.java:112)
+	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
+	at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
+	at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
+	at java.lang.reflect.Method.invoke(Unknown Source)
+	at junit.framework.TestCase.runTest(TestCase.java:154)
+	at junit.framework.TestCase.runBare(TestCase.java:127)
+	at junit.framework.TestResult$1.protect(TestResult.java:106)
+	at junit.framework.TestResult.runProtected(TestResult.java:124)
+	at junit.framework.TestResult.run(TestResult.java:109)
+	at junit.framework.TestCase.run(TestCase.java:118)
+</detail>
+		<follow_ups>
+			<item>
+				<date>1201209735</date>
+				<sender>nobody</sender>
+				<text>Logged In: NO 
+
+Someone can put in a better, more thoughtful fix.
+Here is what I did to fix it.
+
+PDFParser.java:
+
+    public void parse() throws IOException
+    {
+        try
+        {
+            if ( raf == null )
+            {
+                checktmpDir();
+                document = new COSDocument( tempDirectory );
+            }
+            else
+            {
+                document = new COSDocument( raf );
+            }
+            setDocument( document );
+            findVersion();   // New method see below.
+            // Code to find version moved to method findVersion();
+            skipHeaderFillBytes();
+            Object nextObject;
+            [...]
+
+----
+
+    /**
+     * Attempt to find version in the following form %PDF-&lt;number&gt;&lt;0a|0d&gt;%
+     * @throws IOException
+     */
+    private void findVersion() throws IOException
+    {
+        String header = null;
+        // try 5 lines to get PDF Version.
+        for ( int i = 0; i &lt; 5; i++) {
+            header = readLine();
+            
+            //sometimes there are some garbage bytes in the header before the header
+            //actually starts, so lets try to find the header first.
+            int headerStart = header.indexOf( PDF_HEADER );
+
+            //greater than zero because if it is zero then
+            //there is no point of trimming            
+            if( headerStart &gt; 0 )
+            {
+                //trim off any leading characters
+                header = header.substring( headerStart, header.length() );
+            } else if (headerStart &lt; 0)
+                continue;  // Did not find the Header Go look at next line
+            
+            document.setHeaderString( header );  
+            try
+            {
+                float pdfVersion = Float.parseFloat( 
+                    header.substring( PDF_HEADER.length(), Math.min( header.length(), PDF_HEADER.length()+3) ) );
+                document.setVersion( pdfVersion );
+                return;  // Express return.
+            }
+            catch( NumberFormatException e )
+            {
+                throw new IOException( "Error getting pdf version: " + header + "\n" + e );
+            }            
+        }
+        throw new IOException( "Unable to find version");            
+    }
+----</text>
+			</item>
+			<item>
+				<date>1193778760</date>
+				<sender>nobody</sender>
+				<text>Logged In: NO 
+
+Debugged it with a hex dump on the submitted file 
+---
+It appears that the version header starts at offset 0x80 instead of on the first line.
+Adobe Reader 7.x appears to have skipped ahead to the version header and displayed the rest properly.
+
+So I think something needs to be done with PDFParser::parse() version checking.
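The kind of check suggested here can be sketched as a scan for the "%PDF-" marker over the leading bytes of the file. This is a minimal, standalone sketch under stated assumptions: the class and method names are hypothetical, it works on an in-memory byte array rather than on PDFParser's stream, and it is not the fix that went into PDFBox.

```java
import java.nio.charset.StandardCharsets;

public class HeaderScan {
    private static final String PDF_HEADER = "%PDF-";

    // Returns the byte offset of the first "%PDF-" marker, or -1 if there is none.
    static int findHeaderOffset(byte[] data) {
        // ISO-8859-1 maps every byte to exactly one char, so string indices equal byte offsets.
        String text = new String(data, StandardCharsets.ISO_8859_1);
        return text.indexOf(PDF_HEADER);
    }

    // Parses the three characters after the marker (for example "1.3") as the version.
    static float parseVersion(byte[] data, int headerOffset) {
        String text = new String(data, StandardCharsets.ISO_8859_1);
        int start = headerOffset + PDF_HEADER.length();
        return Float.parseFloat(text.substring(start, Math.min(text.length(), start + 3)));
    }

    public static void main(String[] args) {
        // Simulate a file with 0x80 bytes of junk before the real header.
        byte[] header = "%PDF-1.3\n".getBytes(StandardCharsets.ISO_8859_1);
        byte[] data = new byte[0x80 + header.length];
        System.arraycopy(header, 0, data, 0x80, header.length);

        int offset = findHeaderOffset(data);
        System.out.println("header offset: 0x" + Integer.toHexString(offset));
        System.out.println("version: " + parseVersion(data, offset));
    }
}
```

A real parser would also need to shift later byte offsets by the number of junk bytes skipped; the sketch ignores that.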
+
+00000000: 001f 3339 3339 202d 2057 4648 202d 2050  ..3939 - WFH - P
+00000010: 7265 7020 666f 2331 3533 3245 332e 7064  rep fo#1532E3.pd
+00000020: 6600 0000 0000 0000 0000 0000 0000 0000  f...............
+00000030: 0000 0000 0000 0000 0000 0000 0000 0000  ................
+00000040: 0000 0000 0000 0000 0000 0000 0000 0000  ................
+00000050: 0000 0000 0300 2100 0000 00c2 550d 05c2  ......!.....U...
+00000060: 550d 0500 0000 0000 0000 0000 0000 0000  U...............
+00000070: 0000 0000 0000 0000 0000 8181 af49 0000  .............I..
+00000080: 2550 4446 2d31 2e33 0a25 c4e5 f2e5 eba7  %PDF-1.3.%......
+00000090: f3a0 d0c4 c60a 3220 3020 6f62 6a0a 3c3c  ......2 0 obj.&lt;&lt;</text>
+			</item>
+			<item>
+				<date>1193704718</date>
+				<sender>nobody</sender>
+				<text>Logged In: NO 
+
+Tested on 0.7.2, 0.7.3, latest 0.7.4-2007-10-22</text>
+			</item>
+		</follow_ups>
+		<existingfiles>
+			<file>
+				<id>251894</id>
+				<name>exception_version1.pdf</name>
+				<description></description>
+				<filesize>196864</filesize>
+				<filetype>application/pdf</filetype>
+				<adddate>1193704672</adddate>
+				<submitted_by>100</submitted_by>
+			</file>
+		</existingfiles>
+		<change_log>
+			<item>
+				<field>File Added</field>
+				<old_value>251894: exception_version1.pdf</old_value>
+				<date>1193704672</date>
+				<by>nobody</by>
+			</item>
+		</change_log>
+	</artifact>
+	<artifact id="1819754">
+		<submitted_by>brettpowley</submitted_by>
+		<submitted_date>2007-10-24 23:59</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="519373">text extraction</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Ben Litchfield</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>PDFont.getStringWidth() returns incorrect values</summary>
+		<detail>For some text in some documents, getStringWidth() returns an incorrect value.  In some cases it returns zero, which is clearly not correct.  In others, it returns something that is too short.  An example of this follows:
+
+On the page, this text is part of text that reads "Cash flows from".  The text in question is delivered to flushText in PDFTextStripper as multiple TextPositions, and the ones below are those containing "w" and the next one containing "s fr".
+
+The first one looks like this:
+
+TextPosition: "w"  
+getX=62.824474 
+getWidth=6.731968 
+getWordSpacing=0.000000 
+getWidthOfSpace=2.224000 
+getXScale=1.000000
+glyphFactor=999.999939, getXScale=1.000000, getStringWidth=814.000000, calculatedFontWidth=0.814000 
+averageWidth=0.546769, 
+widthUsingSpaces=2.224000  
+widthUsingFont=0.546769
+
+Note that, according to getStringWidth(), the width of this text is 0.814, meaning it would end at 62.82 + 0.814 = 63.63.
+According to getWidth(), it ought to end at 62.82 + 6.73 = 69.55.
+
+When we look at the next chunk of text:
+
+TextPosition: "s fr" 
+getX=69.336563 getWidth=12.518410 
+
+we see that it does in fact start immediately after the previous one -- so the width from getStringWidth() for the first one was incorrect.
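The comparison above can be restated as a small worked calculation. This is only a sketch of the arithmetic, using the values from the dump (getX, getWidth, calculatedFontWidth, and the getX of the next chunk); the class and method names are hypothetical and no PDFBox call is involved.

```java
public class WidthCheck {
    // Signed gap between where a chunk ends (startX + width) and where the next chunk starts.
    static double gapToNext(double startX, double width, double nextStartX) {
        return nextStartX - (startX + width);
    }

    public static void main(String[] args) {
        double startX = 62.824474;      // getX of "w"
        double nextStartX = 69.336563;  // getX of "s fr"

        // Width derived from getStringWidth (calculatedFontWidth in the dump).
        double gapFromStringWidth = gapToNext(startX, 0.814, nextStartX);
        // Width reported by getWidth.
        double gapFromGetWidth = gapToNext(startX, 6.731968, nextStartX);

        System.out.println("gap using getStringWidth: " + gapFromStringWidth);
        System.out.println("gap using getWidth:       " + gapFromGetWidth);
    }
}
```

With the getStringWidth-based width the next chunk would start about 5.7 units after the "w" ends, while with getWidth the two chunks are adjacent to within about 0.22 units.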
+
+The font is a PDType1Font and its name appears to be "YOTPKO+HelveticaNeue-Bold*1".
+</detail>
+	</artifact>
+	<artifact id="1818588">
+		<submitted_by>astonishing1</submitted_by>
+		<submitted_date>2007-10-23 07:20</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="519373">text extraction</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Ben Litchfield</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>Extracting number show empty string</summary>
+		<detail>Hi,
+I want to extract a number that is 10 digits long and sits at a fixed place on each page of a PDF file.
+I used PrintTextLocations &amp; PDFTextStripper to extract that ID number from the PDFs.
+
+The PDF is in Arabic, but I only want to extract the number.
+The problem is that when the PrintTextLocations utility prints the number, it always misses one or two digits and inserts empty space in their place.
+
+Example 
+String[730.10004,116.75003 ft=Times-New-Roman+2 fs=200.0 xscale=0.05 height=5.000001 width=911.2002]    text: 16/10/2007
+String[775.7,32.75 ft=Times-New-Roman-Bold+1 fs=200.0 xscale=0.05 height=5.000001 width=933.4004]  text: RBKPI011
+String[786.15,116.75003 ft=Times-New-Roman-Bold+1 fs=200.0 xscale=0.05 height=5.000001 width=739.0] text:????? ?????
+String[375.85,89.10004 ft=Times-New-Roman-Bold+1 fs=240.0 xscale=0.05 height=6.000001 width=1057.6797] text:?????? - 004
+String[330.9,101.10004 ft=Times-New-Roman-Bold+1 fs=240.0 xscale=0.05 height=6.000001 width=3023.04]  text:??????  ??? ???? ????  -  1 4 58           (the number is 194758, 9 &amp; 7 is missing)  
+In the last line, some Arabic words are followed by the number 194758, but the 9 and 7 are missing.
+Similarly, since the big PDF file is generated daily, I parsed the new one, with the following result:
+String[329.75,101.10004 ft=Times-New-Roman-Bold+1 fs=240.0 xscale=0.05 height=6.000001 width=3068.6406]?????? ????  ??  ????  -  1 06 14     (the number is 1906914, 9 is missing)
+
+So the missing digits are not always the same.
+Can anyone help? Thanks in advance.
+
+
+
+</detail>
+		<existingfiles>
+			<file>
+				<id>251007</id>
+				<name>194758.pdf</name>
+				<description>pdf file to extract data using PrintTextLocations utility</description>
+				<filesize>103183</filesize>
+				<filetype>application/pdf</filetype>
+				<adddate>1193149258</adddate>
+				<submitted_by>1919558</submitted_by>
+			</file>
+		</existingfiles>
+		<change_log>
+			<item>
+				<field>File Added</field>
+				<old_value>251007: 194758.pdf</old_value>
+				<date>1193149258</date>
+				<by>astonishing1</by>
+			</item>
+		</change_log>
+	</artifact>
+	<artifact id="1808718">
+		<submitted_by>dougcook</submitted_by>
+		<submitted_date>2007-10-06 11:49</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="100">None</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Nobody/Anonymous</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>NullPointerException in trunk version</summary>
+		<detail>This is with the latest code version (0.7.4-dev).
+
+In doing text extraction of the file:
+
+http://www.crcv.com.au/resources/Grape%20and%20Wine%20Quality/Posters/Use%20of%20NIR%20to%20measure%20oxidation%20in%20white%20wine.pdf
+
+I get:
+java.lang.NullPointerException
+at org.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:95)
+at org.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:72)
+at org.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:136)
+at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:192)
+at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
+at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
+at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
+at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
+at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
+
+</detail>
+	</artifact>
+	<artifact id="1807062">
+		<submitted_by>bernie9998</submitted_by>
+		<submitted_date>2007-10-03 10:02</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="566875">PDModel</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Ben Litchfield</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>StandardSecurityHandler not setting currentAccessPermission</summary>
+		<detail>I've noticed that if I try to get the currentAccessPermission from a protected document, what I actually get is a null pointer.
+
+Looking through the source, it seems the issue is that StandardSecurityHandler never actually sets currentAccessPermission after it decrypts the document.  Thus, when PDDocument tries to get the currentAccessPermission, it gets the null that currentAccessPermission is initially set to, since securityHandler is not null.
+
+I believe that StandardSecurityHandler should set the currentAccessPermission with the permissions obtained in the encryptionDictionary.</detail>
+		<follow_ups>
+			<item>
+				<date>1191431242</date>
+				<sender>bernie9998</sender>
+				<text>Logged In: YES 
+user_id=1752908
+Originator: YES
+
+I am attaching a patch which sets currentAccessPermissions from the encryptionDictionary in the decrypt method.
+
+It also involves a new AccessPermission constructor that allows permissions to be set from an int rather than a byte array.
+
+I've tested this with a pdf document that had setCanPrint to false and found this to work in this case.
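Reading a permission out of an int can be sketched as follows. This is a minimal sketch, not the attached patch: the class and method names are hypothetical, and the only fact assumed is that the standard security handler in the PDF Reference numbers permission bits from 1, with bit 3 (value 4) being the print permission. The sample values -44 and -48 are arbitrary.

```java
import java.math.BigInteger;

public class PermissionBits {
    // PDF numbers permission bits from 1, so PDF bit n is zero-based bit n - 1.
    // BigInteger.testBit uses two's-complement semantics, so negative ints work as expected.
    static boolean isBitSet(int permissions, int pdfBitNumber) {
        return BigInteger.valueOf(permissions).testBit(pdfBitNumber - 1);
    }

    // Bit 3 (value 4) is the print permission in the standard security handler.
    static boolean canPrint(int permissions) {
        return isBitSet(permissions, 3);
    }

    public static void main(String[] args) {
        System.out.println("canPrint(-44): " + canPrint(-44));
        System.out.println("canPrint(-48): " + canPrint(-48));
    }
}
```

Here -44 has bit 3 set and -48 has it cleared, so the two calls report printing as allowed and denied respectively.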
+File Added: PDFBox-SetAcessPermissions.patch</text>
+			</item>
+		</follow_ups>
+		<existingfiles>
+			<file>
+				<id>248232</id>
+				<name>PDFBox-SetAcessPermissions.patch</name>
+				<description>patch that sets currentAccessPermission in StandardSecurityHandler</description>
+				<filesize>1372</filesize>
+				<filetype>text/x-diff</filetype>
+				<adddate>1191431242</adddate>
+				<submitted_by>1752908</submitted_by>
+			</file>
+		</existingfiles>
+		<change_log>
+			<item>
+				<field>File Added</field>
+				<old_value>248232: PDFBox-SetAcessPermissions.patch</old_value>
+				<date>1191431242</date>
+				<by>bernie9998</by>
+			</item>
+		</change_log>
+	</artifact>
+	<artifact id="1805929">
+		<submitted_by>tdonohue</submitted_by>
+		<submitted_date>2007-10-01 13:51</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="519373">text extraction</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Ben Litchfield</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>OutOfMemoryError for larger PDF text extraction</summary>
+		<detail>Hello,
+
+I'm using PDFBox 0.7.3, which is distributed with DSpace (www.dspace.org) version 1.4.2.   Currently, I'm running into OutOfMemoryError exceptions whenever I attempt text extraction from a few larger PDFs (&gt;10MB).  I've also just tried replacing PDFBox 0.7.3 with your latest nightly-build (from Oct 1), and the error still seems to be happening.
+
+My JVM options are currently set to:
+-Xmx1024M -Xms1024M -XX:NewRatio=2 -Dfile.encoding=UTF-8
+
+Here are a few of the problem PDFs:
+
+15MB PDF:
+https://test.ideals.uiuc.edu/bitstream/2142/2050/1/tr05.pdf
+
+13MB PDF:
+https://test.ideals.uiuc.edu/bitstream/2142/1936/1/RRE06.PDF
+
+Here's an example error stacktrace:
+
+Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
+        at java.util.HashMap.addEntry(HashMap.java:753)
+        at java.util.HashMap.put(HashMap.java:385)
+        at org.fontbox.cmap.CMap.addMapping(CMap.java:131)
+        at org.fontbox.cmap.CMapParser.parse(CMapParser.java:202)
+        at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:509)
+        at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:380)
+        at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:343)
+        at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
+        at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:497)
+        at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:218)
+        at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:177)
+        at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:339)
+        at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:263)
+        at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:219)
+        at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:152)
+        at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:114)
+        at org.dspace.app.mediafilter.MediaFilterManager.processBitstream(MediaFilterManager.java:602)
+        at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:513)
+        at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:461)
+        at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:428)
+        at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersCollection(MediaFilterManager.java:417)
+        at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:359)
+
+Finally, here's how the DSpace API is calling PDFBox:
+
+        PDFTextStripper pts = new PDFTextStripper();
+        PDFParser parser = null;
+        String extractedText = null;
+
+        try
+        {
+            parser = new PDFParser(source);
+            parser.parse();
+            extractedText = pts.getText(new PDDocument(parser.getDocument()));
+        }
+        finally
+        {
+            try
+            {
+                parser.getDocument().close();
+            }
+            catch(Exception e)
+            {
+               log.error("Error closing temporary PDF file: " + e.getMessage(), e);
+            }
+        }</detail>
+		<follow_ups>
+			<item>
+				<date>1191274549</date>
+				<sender>tdonohue</sender>
+				<text>Logged In: YES 
+user_id=1320825
+Originator: YES
+
+I neglected to mention both of these PDFs were initially image-based and were recently OCRed using Adobe Acrobat 8 Pro.  I'm not sure that would matter for PDFBox to perform text extraction, but it's another commonality between these PDFs.
+
+Thanks in advance for any help you can provide!
+
+- Tim</text>
+			</item>
+		</follow_ups>
+	</artifact>
+	<artifact id="1801070">
+		<submitted_by>poorani</submitted_by>
+		<submitted_date>2007-09-24 04:44</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="100">None</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Ben Litchfield</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>Expected to read '%%EOF' instead started reading 'e'</summary>
+		<detail>Hi,
+
+I am trying to search for a particular word in a PDF document. I am getting the following exception when I use the statement:
+PDDocument.load(fileName)
+
+Error :
+java.io.IOException: Error: Expected to read '%%EOF' instead started reading 'e'
+	at org.pdfbox.pdfparser.BaseParser.readExpectedString(BaseParser.java:1022)
+	at org.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:364)
+	at org.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:176)
+	at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:707)
+	at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:691)
+
+Can anybody help me with this?
+
+Thanks in advance.
+cheers,
+Poorni</detail>
+		<change_log>
+			<item>
+				<field>assigned_to</field>
+				<old_value>100</old_value>
+				<date>1190704171</date>
+				<by>poorani</by>
+			</item>
+		</change_log>
+	</artifact>
+	<artifact id="1800993">
+		<submitted_by>poorani</submitted_by>
+		<submitted_date>2007-09-24 02:24</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="100">None</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Nobody/Anonymous</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>Expected to read '%%EOF' instead started reading 'e'</summary>
+		<detail>Hi,
+
+I am trying to search for a particular word in a PDF document. I am getting the following exception when I use the statement:
+PDDocument.load(fileName)
+
+Error :
+java.io.IOException: Error: Expected to read '%%EOF' instead started reading 'e'
+	at org.pdfbox.pdfparser.BaseParser.readExpectedString(BaseParser.java:1022)
+	at org.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:364)
+	at org.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:176)
+	at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:707)
+	at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:691)
+
+Can anybody help me with this?
+
+Thanks in advance.
+cheers,
+Poorni</detail>
+	</artifact>
+	<artifact id="1797094">
+		<submitted_by>balvinder_dang</submitted_by>
+		<submitted_date>2007-09-18 07:08</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="519373">text extraction</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Ben Litchfield</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>Unknown encoding for 'H'</summary>
+		<detail>Unknown encoding for 'H'
+	at org.pdfbox.encoding.EncodingManager.getEncoding(EncodingManager.java:83)
+	at org.pdfbox.pdmodel.font.PDFont.getEncoding(PDFont.java:627)
+	at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:476)
+	at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:332)
+	at org.pdfbox.util.operator.ShowText.process(ShowText.java:66)
+	at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:494)
+	at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:207)
+	at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:160)
+	at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:355)
+	at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:268)
+	at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:220)
+	at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:140)</detail>
+	</artifact>
+	<artifact id="1797093">
+		<submitted_by>balvinder_dang</submitted_by>
+		<submitted_date>2007-09-18 07:07</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="519373">text extraction</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Ben Litchfield</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>Unknown encoding for 'UniJIS-UCS2-H'</summary>
+		<detail>Unknown encoding for 'UniJIS-UCS2-H'
+	at org.pdfbox.encoding.EncodingManager.getEncoding(EncodingManager.java:83)
+	at org.pdfbox.pdmodel.font.PDFont.getEncoding(PDFont.java:627)
+	at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:476)
+	at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:332)
+	at org.pdfbox.util.operator.ShowText.process(ShowText.java:66)
+	at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:494)
+	at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:207)
+	at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:160)
+	at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:355)
+	at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:268)
+	at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:220)
+	at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:140)</detail>
+	</artifact>
+	<artifact id="1797092">
+		<submitted_by>balvinder_dang</submitted_by>
+		<submitted_date>2007-09-18 07:07</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="519373">text extraction</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Ben Litchfield</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>Unknown encoding for 'KSCms-UHC-H'</summary>
+		<detail>Unknown encoding for 'KSCms-UHC-H'
+	at org.pdfbox.encoding.EncodingManager.getEncoding(EncodingManager.java:83)
+	at org.pdfbox.pdmodel.font.PDFont.getEncoding(PDFont.java:627)
+	at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:476)
+	at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:332)
+	at org.pdfbox.util.operator.ShowText.process(ShowText.java:66)
+	at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:494)
+	at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:207)
+	at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:160)
+	at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:355)
+	at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:268)
+	at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:220)
+	at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:140)</detail>
+	</artifact>
+	<artifact id="1797091">
+		<submitted_by>balvinder_dang</submitted_by>
+		<submitted_date>2007-09-18 07:03</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="519373">text extraction</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Ben Litchfield</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>java.io.IOException: Unknown encoding for 'H'</summary>
+		<detail>java.io.IOException: Unknown encoding for 'H'
+	at org.pdfbox.encoding.EncodingManager.getEncoding(EncodingManager.java:83)
+	at org.pdfbox.pdmodel.font.PDFont.getEncoding(PDFont.java:627)
+	at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:476)
+	at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:332)
+	at org.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:90)
+	at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:494)
+	at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:207)
+	at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:160)
+	at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:355)
+	at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:268)
+	at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:220)
+	at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:140)</detail>
+	</artifact>
+	<artifact id="1786901">
+		<submitted_by>nobody</submitted_by>
+		<submitted_date>2007-09-02 22:57</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="520300">parsing</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Ben Litchfield</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>IOException when parsing 1.3 PDF</summary>
+		<detail>JAVA CODE:PDDocument pdfDoc =  PDDocument.load("d:/pdf/1201198.PDF");
+
+PDFBOX VERSION:0.7.3
+
+PDF File version:1.3
+
+EXCEPTION:java.io.IOException: expected='endobj' firstReadAttempt='' secondReadAttempt='' org.pdfbox.io.PushBackInputStream@13e8d89
+	at org.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:502)
+	at org.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:176)
+	at ecwise.pdf.TPDFParse.loadPDF(TPDFParse.java:19)
+	at ecwise.pdf.TPDFParse.main(TPDFParse.java:28)
+
+But if the PDF file version is 1.4, it works well.
+
+Can anybody help me?
+
+Thanks.
+</detail>
+		<follow_ups>
+			<item>
+				<date>1188808560</date>
+				<sender>nobody</sender>
+				<text>Logged In: NO 
+
+What is the root cause of this exception? How can it be solved?</text>
+			</item>
+		</follow_ups>
+	</artifact>
+	<artifact id="1775060">
+		<submitted_by>nobody</submitted_by>
+		<submitted_date>2007-08-15 16:47</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="519373">text extraction</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Ben Litchfield</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>PDFTextStripper - unwanted removal of spaces between words</summary>
+		<detail>I have been trying to extract text from a PDF document, and the output is fine except that no spaces are left between many of the words in the original document.  Can you advise please?
+
+An example of the source document I am using can be found here:
+
+https://www.hcrregister.com/ReportDownload?key=8493-6026-4010-7237-5096
+
+Regards
+
+Geoff</detail>
+		<follow_ups>
+			<item>
+				<date>1194643202</date>
+				<sender>carlemac_2007</sender>
+				<text>Logged In: YES 
+user_id=1933815
+Originator: NO
+
+I have a similar problem, except multiple spaces are replaced by a single space...</text>
+			</item>
+			<item>
+				<date>1190385715</date>
+				<sender>kameroliefant</sender>
+				<text>Logged In: YES 
+user_id=1895791
+Originator: NO
+
+I have the same problem...</text>
+			</item>
+		</follow_ups>
+	</artifact>
+	<artifact id="1771748">
+		<submitted_by>php-king</submitted_by>
+		<submitted_date>2007-08-10 08:28</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="100">None</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Nobody/Anonymous</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>Printing fails</summary>
+		<detail>I tried printing the attached PDF document using the printSilent() Method and it gave me the following Error:
+
+org.pdfbox.exceptions.WrappedIOException: Font name not found
+        at org.pdfbox.pdmodel.font.PDTrueTypeFont.drawString(PDTrueTypeFont.java:382)
+        at org.pdfbox.pdfviewer.PageDrawer.showCharacter(PageDrawer.java:160)
+        at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:409)
+        at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
+        at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
+        at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
+        at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
+        at org.pdfbox.pdfviewer.PageDrawer.drawPage(PageDrawer.java:104)
+        at org.pdfbox.pdmodel.PDPage.print(PDPage.java:741)
+        at sun.print.RasterPrinterJob.printPage(Unknown Source)
+        at sun.print.RasterPrinterJob.print(Unknown Source)
+        at sun.print.RasterPrinterJob.print(Unknown Source)
+        at ch.inteco.wegas.xmpu.nodes.PrintNode.printDoc(PrintNode.java:173)
+        at ch.inteco.wegas.xmpu.nodes.PrintNode.printFile(PrintNode.java:102)
+        at ch.inteco.wegas.xmpu.nodes.PrintNode.write(PrintNode.java:79)
+        at ch.inteco.wegas.xmpu.nodes.XMPUNode.write(XMPUNode.java:87)
+        at ch.inteco.wegas.xmpu.XMPUParser.parseXmpu(XMPUParser.java:43)
+        at ch.inteco.wegas.xmpu.XMPU.main(XMPU.java:36)
+
+No text was printed, but the tables were.
+Thank you in advance.</detail>
+		<existingfiles>
+			<file>
+				<id>240722</id>
+				<name>Printing-Fonts.pdf</name>
+				<description>The Pdf-Document</description>
+				<filesize>22116</filesize>
+				<filetype>application/pdf</filetype>
+				<adddate>1186759735</adddate>
+				<submitted_by>1474996</submitted_by>
+			</file>
+		</existingfiles>
+		<change_log>
+			<item>
+				<field>File Added</field>
+				<old_value>240722: Printing-Fonts.pdf</old_value>
+				<date>1186759735</date>
+				<by>php-king</by>
+			</item>
+		</change_log>
+	</artifact>
+	<artifact id="1766623">
+		<submitted_by>tomas_vondra</submitted_by>
+		<submitted_date>2007-08-03 01:30</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="100">None</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Ben Litchfield</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>Incorrectly printed frames / rectangles</summary>
+		<detail>When printing a PDF using org.pdfbox.PrintPDF class, the text is printed correctly, but frames (rectangles around the text) are scaled up and shifted (moved to a different position). When printed from Acrobat Reader or other tools the PDF is printed correctly (as well as displayed).
+
+This was observed on PDFBox-0.7.3, running on Windows XP. The example of an incorrectly printed PDF (result.pdf) was produced using PDFCreator, but the same problems occur when printing to a printer.</detail>
+		<existingfiles>
+			<file>
+				<id>239732</id>
+				<name>examples.zip</name>
+				<description>example of an incorrectly printed PDF</description>
+				<filesize>222013</filesize>
+				<filetype>application/zip</filetype>
+				<adddate>1186129803</adddate>
+				<submitted_by>1847944</submitted_by>
+			</file>
+		</existingfiles>
+		<change_log>
+			<item>
+				<field>assigned_to</field>
+				<old_value>100</old_value>
+				<date>1186660118</date>
+				<by>tomas_vondra</by>
+			</item>
+			<item>
+				<field>File Added</field>
+				<old_value>239732: examples.zip</old_value>
+				<date>1186129803</date>
+				<by>tomas_vondra</by>
+			</item>
+		</change_log>
+	</artifact>
+	<artifact id="1770188">
+		<submitted_by>orthellojw</submitted_by>
+		<submitted_date>2007-08-08 11:01</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="519373">text extraction</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Ben Litchfield</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>Extreme memory usage while extracting text from one pdf</summary>
+		<detail>When I use a -Xmx value of 64m or 128m the following exception is thrown:
+
+Exception in thread "Thread-0" java.lang.OutOfMemoryError: Java heap space
+        at java.lang.StringCoding.set(StringCoding.java:53)
+        at java.lang.StringCoding.decode(StringCoding.java:171)
+        at java.lang.String.&lt;init&gt;(String.java:444)
+        at java.lang.String.&lt;init&gt;(String.java:516)
+        at org.fontbox.cmap.CMapParser.createStringFromBytes(CMapParser.java:467)
+        at org.fontbox.cmap.CMapParser.parse(CMapParser.java:201)
+        at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:509)
+        at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:380)
+        at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:325)
+        at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
+        at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
+        at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
+        at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
+        at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
+        at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
+        at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
+        at us.fed.nmcourt.common.pdfbox.NmdLucenePDFDocument.addContent(NmdLucenePDFDocument.java:433)
+        at us.fed.nmcourt.common.pdfbox.NmdLucenePDFDocument.convertDocument(NmdLucenePDFDocument.java:292)
+        at us.fed.nmcourt.drs.daemonmanager.handler.AbstractDaemonManagerHandler.writeIncomingDocumentsToIndex(AbstractDaemonManagerHandler.java:355)
+        at us.fed.nmcourt.drs.daemonmanager.handler.CaddHandler.handle(CaddHandler.java:112)
+        at us.fed.nmcourt.drs.daemonmanager.MainDaemonRunnable.run(MainDaemonRunnable.java:367)
+        at java.lang.Thread.run(Thread.java:619)
+
+When I use a -Xmx value of 256m it works.  Is there any way that you can reduce the memory requirement for extracting the text from this pdf?
+
+The pdf is only 1.1MB in size.
+
+Arhhhh.  The pdf is larger than 256KB so it cannot be attached below.  I'll email it to you if you give your email address.
+
+Thanks in advance,
+
+James
+jwilson@nmcourt.fed.us
+</detail>
+		<follow_ups>
+			<item>
+				<date>1187820562</date>
+				<sender>orthellojw</sender>
+				<text>Logged In: YES 
+user_id=1832000
+Originator: YES
+
+When using the server VM, I have to bump the -Xmx value up to 512m.</text>
+			</item>
+			<item>
+				<date>1186597602</date>
+				<sender>benlitchfield</sender>
+				<text>Logged In: YES 
+user_id=601708
+Originator: NO
+
+Yes, please upload (ftp.pdfbox.org) or email (ben@benlitchfield.com) me the PDF; just add a quick note here with the filename.
+
+Thanks,
+Ben</text>
+			</item>
+		</follow_ups>
+	</artifact>
+	<artifact id="1768715">
+		<submitted_by>dougcook</submitted_by>
+		<submitted_date>2007-08-06 10:16</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="100">None</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Nobody/Anonymous</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>Text extraction gives incorrect results for attached PDF</summary>
+		<detail>The attached PDF does not render correctly in 0.7.3 -- extracted text is simply a bunch of garbage characters.
+</detail>
+		<follow_ups>
+			<item>
+				<date>1194648291</date>
+				<sender>dougcook</sender>
+				<text>Logged In: YES 
+user_id=1851816
+Originator: YES
+
+Yes, text extraction fails, and I can't find snippets from this doc in either Google or Yahoo!, meaning that their text extraction libraries probably also fail on this doc.
+
+Yet it renders perfectly on screen -- I'm not sure whether this is simply a limitation of all the existing text extraction algorithms or whether the doc is constructed in a way that makes text extraction impossible.</text>
+			</item>
+			<item>
+				<date>1194643135</date>
+				<sender>carlemac_2007</sender>
+				<text>Logged In: YES 
+user_id=1933815
+Originator: NO
+
+Try the same file with free Adobe Acrobat Reader 8.1.1
+
+1. Save as text - the resulting text file has "unprintable" characters also.  
+
+A PDF construction error?</text>
+			</item>
+		</follow_ups>
+		<existingfiles>
+			<file>
+				<id>240109</id>
+				<name>11.pdf</name>
+				<description>Example for extraction problems</description>
+				<filesize>99627</filesize>
+				<filetype>application/pdf</filetype>
+				<adddate>1186420611</adddate>
+				<submitted_by>1851816</submitted_by>
+			</file>
+		</existingfiles>
+		<change_log>
+			<item>
+				<field>File Added</field>
+				<old_value>240109: 11.pdf</old_value>
+				<date>1186420611</date>
+				<by>dougcook</by>
+			</item>
+		</change_log>
+	</artifact>
+	<artifact id="1761581">
+		<submitted_by>nalundgaard</submitted_by>
+		<submitted_date>2007-07-26 14:11</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="519373">text extraction</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Ben Litchfield</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>Text Extraction strips 1 char when extracting a twin pair</summary>
+		<detail>See attached file. We found a bug in PDFBox where it appears to randomly delete 1 character of a twin pair of characters. 
+
+For example, we've noticed that what shows up as 1001 in a PDF file (in Acrobat) may become 101 in the text output of TextStripper.exe. This appears to happen to a large number of twin pairs, as evidenced by the attached file. 
+
+Note that the file was created using Microsoft Word 11.3.5 on Mac OS X using the "print to PDF" feature of Mac OS X 10.4.10. </detail>
+		<follow_ups>
+			<item>
+				<date>1193434271</date>
+				<sender>ibuzz</sender>
+				<text>Logged In: YES 
+user_id=964306
+Originator: NO
+
+I had the same issue with a PDF document created with Microsoft Word 2004 for Mac OS X. No problem with Word X.</text>
+			</item>
+		</follow_ups>
+		<existingfiles>
+			<file>
+				<id>238687</id>
+				<name>doublesTest.zip</name>
+				<description>This zip file contains a test PDF file and the text output from running ExtractText.exe on it, in version 0.7.2 and 0.7.3</description>
+				<filesize>40221</filesize>
+				<filetype>application/zip</filetype>
+				<adddate>1185484285</adddate>
+				<submitted_by>1854351</submitted_by>
+			</file>
+		</existingfiles>
+		<change_log>
+			<item>
+				<field>File Added</field>
+				<old_value>238687: doublesTest.zip</old_value>
+				<date>1185484286</date>
+				<by>nalundgaard</by>
+			</item>
+		</change_log>
+	</artifact>
+	<artifact id="1761570">
+		<submitted_by>nalundgaard</submitted_by>
+		<submitted_date>2007-07-26 13:51</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="519373">text extraction</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Ben Litchfield</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>Text Extraction strips 1 char when extracting a twin pair</summary>
+		<detail>See attached file. We found a bug in PDFBox where it appears to randomly delete 1 character of a twin pair of characters. 
+
+For example, we've noticed that what shows up as 1001 in a PDF file (in Acrobat) may become 101 in the text output of TextStripper.exe. This appears to happen to a large number of twin pairs, as evidenced by the attached file. 
+
+Note that the file was created using Microsoft Word 11.3.5 on Mac OS X using the "print to PDF" feature of Mac OS X 10.4.10. </detail>
+		<existingfiles>
+			<file>
+				<id>238686</id>
+				<name>doublesTest.zip</name>
+				<description>This zip file contains a test PDF file and the text output from running ExtractText.exe on it, in version 0.7.2 and 0.7.3</description>
+				<filesize>40221</filesize>
+				<filetype>application/zip</filetype>
+				<adddate>1185483093</adddate>
+				<submitted_by>1854351</submitted_by>
+			</file>
+		</existingfiles>
+		<change_log>
+			<item>
+				<field>File Added</field>
+				<old_value>238686: doublesTest.zip</old_value>
+				<date>1185483093</date>
+				<by>nalundgaard</by>
+			</item>
+		</change_log>
+	</artifact>
+	<artifact id="1755543">
+		<submitted_by>taywin</submitted_by>
+		<submitted_date>2007-07-17 08:52</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="520300">parsing</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Ben Litchfield</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>java.lang.NoSuchMethodError in fontbox</summary>
+		<detail>I have been trying to extract text content from some PDF files which were created by Omnipage CaptureSDK V12.7, but Java threw this exception. Is the problem in my files or in the fontbox library? Any comments would be appreciated. Thank you.
+
+Microsoft Windows XP [Version 5.1.2600]
+(C) Copyright 1985-2001 Microsoft Corp.
+
+C:\Documents and Settings\Owner&gt;cd\intern-search\mysearch
+
+C:\Intern-Search\mysearch&gt;jrun-lucene MyPDF2
+PDF File name (no extension): test
+
+ ------ Text Data ------
+
+Page number: 1
+Exception in thread "main" java.lang.NoSuchMethodError: org.fontbox.cmap.CMapParser.parse(Ljava/lang/String;Ljava/io/InputStream;)Lorg/fontbox/cmap/CMap;
+        at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:509)
+        at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:380)
+        at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:325)
+        at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
+        at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
+        at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
+        at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
+        at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
+        at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
+        at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
+        at MyPDF2.processDocument(MyPDF2.java:133)
+        at MyPDF2.main(MyPDF2.java:82)
+
+C:\Intern-Search\mysearch&gt;</detail>
+		<existingfiles>
+			<file>
+				<id>237366</id>
+				<name>test.pdf</name>
+				<description>Omnipage CaptureSDK V12.7 patent file</description>
+				<filesize>24097</filesize>
+				<filetype>application/pdf</filetype>
+				<adddate>1184687550</adddate>
+				<submitted_by>1823463</submitted_by>
+			</file>
+		</existingfiles>
+		<change_log>
+			<item>
+				<field>File Added</field>
+				<old_value>237366: test.pdf</old_value>
+				<date>1184687550</date>
+				<by>taywin</by>
+			</item>
+		</change_log>
+	</artifact>
+	<artifact id="1755412">
+		<submitted_by>nobody</submitted_by>
+		<submitted_date>2007-07-17 04:08</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="100">None</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Nobody/Anonymous</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>Null pointer exception in scanned PDF's</summary>
+		<detail>We are trying to extract text and images from PDFs. I have 2 PDFs which cause PDFBox to crash. I get a null pointer exception. The first PDF is a scanned document. The second PDF is a document that is surrounded by a table and contains images as well. These two documents are causing PDFBox to throw an exception. Does anyone have any suggestions? </detail>
+		<follow_ups>
+			<item>
+				<date>1197241293</date>
+				<sender>dgakhar</sender>
+				<text>Logged In: YES 
+user_id=1937173
+Originator: NO
+
+Hi, my name is Deepika and I am trying to resolve exceptions related to image extraction. It would be great if you could upload the files that caused PDFBox to break.
+
+Regards,
+Deepika</text>
+			</item>
+			<item>
+				<date>1188428865</date>
+				<sender>weiju</sender>
+				<text>Logged In: YES 
+user_id=186935
+Originator: NO
+
+I get the same problem, getting a
+
+java.lang.NullPointerException
+        at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:194)
+        at org.pdfbox.pdmodel.PDPageNode.getKids(PDPageNode.java:171)
+
+A quick peek in the source code shows me that the PDPageNode is created with 
+
+PDDocumentCatalog.java, line 208:
+public PDPageNode getPages()
+{
+  return new PDPageNode((COSDictionary)root.getDictionaryObject(COSName.PAGES));
+}
+
+But my root node dictionary only seems to contain the "Type" key, so the above source code probably leads to a PDPageNode with a null dictionary.
+</text>
+			</item>
+			<item>
+				<date>1188397337</date>
+				<sender>nobody</sender>
+				<text>Logged In: NO 
+
+I have found the same problem. The released version of the code gets a class cast exception. This appears to be fixed in the latest development build, but now it gets a NULL pointer exception.</text>
+			</item>
+		</follow_ups>
+	</artifact>
+	<artifact id="1753348">
+		<submitted_by>fuji4</submitted_by>
+		<submitted_date>2007-07-13 03:16</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="100">None</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Nobody/Anonymous</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>PDF to Image: Problem with text</summary>
+		<detail>I have some problems when converting PDF pages to images:
+The spaces between the letters are not correct!!!
+
+</detail>
+		<follow_ups>
+			<item>
+				<date>1194946565</date>
+				<sender>nobody</sender>
+				<text>Logged In: NO 
+
+Using org.pdfbox.pdfviewer.PageDrawer I noticed the same thing.
+I think the spaces are not wrong, but the text is too big. The next piece of text thereby overlaps the previous text, because the previous text runs too far.
+
+Looking in the PDFont classes, I noticed the line:
+g2d.drawString( string, (int)x, (int)y );
+By casting the float value to int, we lose precision. It makes sense, because not all the text runs too far.
+
+I suggest multiplying it by 1000 and then scaling it back or something, to prevent such loss of precision.
+
+By the way, drawString in PDType3Font is not overriding but overloading; I presume that was not done on purpose either?
+
+the_tzar@users.sourceforge.net (anonymous: cannot get to my password right now :) )</text>
+			</item>
+		</follow_ups>
+		<existingfiles>
+			<file>
+				<id>236851</id>
+				<name>demo.jpg</name>
+				<description>Demo Article</description>
+				<filesize>243495</filesize>
+				<filetype>image/jpeg</filetype>
+				<adddate>1184321819</adddate>
+				<submitted_by>1424621</submitted_by>
+			</file>
+		</existingfiles>
+		<change_log>
+			<item>
+				<field>File Added</field>
+				<old_value>236851: demo.jpg</old_value>
+				<date>1184321819</date>
+				<by>fuji4</by>
+			</item>
+		</change_log>
+	</artifact>
+	<artifact id="1745046">
+		<submitted_by>nobody</submitted_by>
+		<submitted_date>2007-06-28 13:10</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="100">None</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Nobody/Anonymous</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>PDF document renders incorrectly</summary>
+		<detail>The attached PDF is not rendered correctly with PDFBox 0.7.3. 
+
+If you give me some pointers, I could try to find out myself what is happening with this document, but currently I'm just lost in PDFBox code I do not understand.</detail>
+		<existingfiles>
+			<file>
+				<id>234982</id>
+				<name>DAYvid Demo2.pdf</name>
+				<description>PDF illustrating the problem</description>
+				<filesize>179270</filesize>
+				<filetype>application/pdf</filetype>
+				<adddate>1183061419</adddate>
+				<submitted_by>100</submitted_by>
+			</file>
+		</existingfiles>
+		<change_log>
+			<item>
+				<field>File Added</field>
+				<old_value>234982: DAYvid Demo2.pdf</old_value>
+				<date>1183061420</date>
+				<by>nobody</by>
+			</item>
+		</change_log>
+	</artifact>
+	<artifact id="1745009">
+		<submitted_by>jakep36</submitted_by>
+		<submitted_date>2007-06-28 11:30</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="100">None</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Nobody/Anonymous</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>overlay shifting the document slightly</summary>
+		<detail>I'm trying to use the overlay method to put a pre-printed form on top of print data converted to PDF. Overlay seems to shift my print data slightly down and to the left. Both the overlay PDF and the print data are 8.5 by 11. I was using 0.7.3 but switched to 0.7.4-dev-20070628, and it is still happening. Has anyone else experienced this issue, and have you discovered a workaround or fix? 
+
+With the sample files, I'm only trying to overlay the first page of overlay.pdf with the first page of the print file.
+ 
+code looks something like this: 
+newOverlay = PDDocument.new 
+newOverlay.importPage(overlayPdf.page(0)) 
+overlayer = Overlay.new 
+overlayer.overlay(newDocument,newOverlay) 
+@pdf.importPage(newOverlay.getDocumentCatalog().getAllPages()[0]) </detail>
+		<follow_ups>
+			<item>
+				<date>1183058653</date>
+				<sender>jakep36</sender>
+				<text>Logged In: YES 
+user_id=1768812
+Originator: YES
+
+File Added: Archive.zip</text>
+			</item>
+			<item>
+				<date>1183058653</date>
+				<sender>jakep36</sender>
+				<text>Logged In: YES 
+user_id=1768812
+Originator: YES
+
+File Added: Archive.zip</text>
+			</item>
+			<item>
+				<date>1183057286</date>
+				<sender>benlitchfield</sender>
+				<text>Logged In: YES 
+user_id=601708
+Originator: NO
+
+see issue_1745009.zip</text>
+			</item>
+		</follow_ups>
+		<existingfiles>
+			<file>
+				<id>234976</id>
+				<name>Archive.zip</name>
+				<description>elec_new.pdf is the overlay, print file, and my attempt to combine them.</description>
+				<filesize>86002</filesize>
+				<filetype>application/zip</filetype>
+				<adddate>1183058652</adddate>
+				<submitted_by>1768812</submitted_by>
+			</file>
+			<file>
+				<id>234975</id>
+				<name>Archive.zip</name>
+				<description>elec_new.pdf is the overlay, print file, and my attempt to combine them.</description>
+				<filesize>86002</filesize>
+				<filetype>application/zip</filetype>
+				<adddate>1183058649</adddate>
+				<submitted_by>1768812</submitted_by>
+			</file>
+		</existingfiles>
+		<change_log>
+			<item>
+				<field>File Added</field>
+				<old_value>234975: Archive.zip</old_value>
+				<date>1183058653</date>
+				<by>jakep36</by>
+			</item>
+			<item>
+				<field>File Added</field>
+				<old_value>234976: Archive.zip</old_value>
+				<date>1183058653</date>
+				<by>jakep36</by>
+			</item>
+			<item>
+				<field>File Deleted</field>
+				<old_value>234967: </old_value>
+				<date>1183056237</date>
+				<by>jakep36</by>
+			</item>
+			<item>
+				<field>File Added</field>
+				<old_value>234967: Archive.zip</old_value>
+				<date>1183055430</date>
+				<by>jakep36</by>
+			</item>
+		</change_log>
+	</artifact>
+	<artifact id="1744777">
+		<submitted_by>verraf</submitted_by>
+		<submitted_date>2007-06-28 05:48</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="519373">text extraction</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Ben Litchfield</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>Could not find font(COSName{F0}) in map={}</summary>
+		<detail>Hi,  
+While trying to extract text from a PDF file, I'm receiving this error:  
+java.io.IOException: Error: Could not find font(COSName{F0}) in map={} </detail>
+		<existingfiles>
+			<file>
+				<id>234915</id>
+				<name>0598647270004.pdf</name>
+				<description>Sample file</description>
+				<filesize>49218</filesize>
+				<filetype>application/pdf</filetype>
+				<adddate>1183034937</adddate>
+				<submitted_by>1820955</submitted_by>
+			</file>
+		</existingfiles>
+		<change_log>
+			<item>
+				<field>File Added</field>
+				<old_value>234915: 0598647270004.pdf</old_value>
+				<date>1183034937</date>
+				<by>verraf</by>
+			</item>
+		</change_log>
+	</artifact>
+	<artifact id="1735902">
+		<submitted_by>scop</submitted_by>
+		<submitted_date>2007-06-12 10:23</submitted_date>
+		<artifact_type id="552832">552832</artifact_type>
+		<category id="775780">PDModel.AcroForm</category>
+		<artifact_group id="100"></artifact_group>
+		<assigned_to>Ben Litchfield</assigned_to>
+		<priority id="5">5</priority>
+		<status>Open</status>
+		<resolution>None</resolution>
+		<summary>Character encoding/appearance issues when filling forms</summary>
+		<detail>When filling a text field with non-ASCII characters such as in my surname "Skyttä" and saving the document in a UTF-8 environment, something goes wrong with the appearance of the text.
+
+The value itself seems to be stored correctly, but when opening the doc, the appearance of "ä" is not that, but rather what results when UTF-8 is mistakenly treated as ISO-8859-1 (two garbage characters).
+
+PDAppearance uses the platform default encoding in quite a few places, which apparently has the potential to mess things up.  In particular, insertGeneratedAppearance() generates a PrintWriter from an OutputStream without specifying the encoding.  In fact, if I hack that to use ISO-8859-1, the appearance of my "ä" case is correct, but that obviously won't work with anything other than chars that are valid ISO-8859-1.
+
+In which char encoding should the value be written to the appearance stream (at end of insertGeneratedAppearance())?</detail>
+	</artifact>
+	<artifact id="1730509">

[... 36453 lines stripped ...]