Posted to commits@pdfbox.apache.org by ms...@apache.org on 2013/05/03 19:07:33 UTC

svn commit: r1478875 - in /pdfbox/cmssite/trunk: content/getting-started.mdtext content/ideas.mdtext templates/skeleton.html

Author: msahyoun
Date: Fri May  3 17:07:33 2013
New Revision: 1478875

URL: http://svn.apache.org/r1478875
Log:
add ideas page to CMS site

Added:
    pdfbox/cmssite/trunk/content/ideas.mdtext   (with props)
Removed:
    pdfbox/cmssite/trunk/content/getting-started.mdtext
Modified:
    pdfbox/cmssite/trunk/templates/skeleton.html

Added: pdfbox/cmssite/trunk/content/ideas.mdtext
URL: http://svn.apache.org/viewvc/pdfbox/cmssite/trunk/content/ideas.mdtext?rev=1478875&view=auto
==============================================================================
--- pdfbox/cmssite/trunk/content/ideas.mdtext (added)
+++ pdfbox/cmssite/trunk/content/ideas.mdtext Fri May  3 17:07:33 2013
@@ -0,0 +1,68 @@
+Title: Ideas
+
+There are several ideas to enhance PDFBox. These are outlined below together with
+comments and the releases they are planned for as soon as there is agreement to do the
+implementation.
+
+## Enhance type safety
+
+Enhance the type safety of PDFBox and add more generic collections and code cleanup.
+
+## Remove all deprecated methods ...
+
+## Handle large PDF files
+In addition to PDF parsing, PDFBox does not always handle large PDF files well, as some
+of the references are implemented as int instead of long.
+
+## Switch to Java 1.6
+
+## Break PDFBox into modules
+
+In order to support different use cases and provide a minimal toolset PDFBox should be 
+separated into different modules. This goes in line with rearranging some of the code,
+e.g. removing AWT from PDDocument.
+
+## Replace/enhance PDF parsing
+
+The old "classic" PDF parser in PDFBox is not in line with the PDF specification as it parses
+a PDF from top to bottom instead of respecting the XRef information. The NonSequentialParser
+improved that situation, but there is a need for a cleaner foundation broken into several levels:
+
+- io
+- tokenization
+- parsing according to structure
+- COS level document
+- PD level document
+
+In addition, handling documents which are not conforming shouldn't be part of the core parser
+but of an extensible approach, e.g. by adding hooks to allow for handling parsing exceptions.
+
+
+## Rearchitect the COS level objects
+
+The COS level objects need to be refactored to be in line with the new parser. In addition,
+method signatures, constructing ... should be made similar across the COS objects.
+
+## Parsing on demand
+
+Instead of always parsing the complete document, PDFs should be parsable on demand, making
+objects available only as they are needed to enhance performance and minimize memory footprint.
+
+This might be achieved by providing a layered approach where a base (non-caching) parser provides
+the on-demand parsing and a caching parser built on top caches objects for use cases where
+this is beneficial, e.g. rendering, debugging ...
+
+- The lexer would be the low-level component delivering tokens to the parser.
+  A sample implementation exists as part of PDFBOX-1000. The benefit would be a clean low-level
+  handling of tokens. The current implementation needs to be (slightly?) revised though.
+- The incremental (non-caching) parser would allow for page-by-page processing, moving forward
+  only, to support text extraction, merging, splitting … The benefit would be lower memory
+  consumption as well as potentially faster processing.
+- The caching parser would support applications such as PDFDebugger or PDFReader.
+
+## Handling of PDF versions
+The current implementation is a mix of PDF 1.4 and some ad hoc additions without a clear
+distinction of what is and is not supported. We could add some support for explicitly handling
+versions in PDFBox, e.g. by marking certain methods and properties with the PDF version they
+support. This could in addition be a good basis for PDF/A and other compliance checks.
+

Propchange: pdfbox/cmssite/trunk/content/ideas.mdtext
------------------------------------------------------------------------------
    svn:eol-style = native

Modified: pdfbox/cmssite/trunk/templates/skeleton.html
URL: http://svn.apache.org/viewvc/pdfbox/cmssite/trunk/templates/skeleton.html?rev=1478875&r1=1478874&r2=1478875&view=diff
==============================================================================
--- pdfbox/cmssite/trunk/templates/skeleton.html (original)
+++ pdfbox/cmssite/trunk/templates/skeleton.html Fri May  3 17:07:33 2013
@@ -124,9 +124,11 @@
                     </ul>
                 </li>
                 <li  class="nav-header">For Developers</li>
-                <li><a href="/building.html">
                     <i class="icon-chevron-right"></i>
                     Building PDFBox</a></li>
+                <li><a href="/ideas.html">
+                    <i class="icon-chevron-right"></i>
+                    Ideas</a></li>
                 <li><a href="/codingconventions.html">
                     <i class="icon-chevron-right"></i>
                     Coding Conventions</a></li>