Posted to dev@pdfbox.apache.org by "John Hewson (JIRA)" <ji...@apache.org> on 2014/06/17 22:07:14 UTC

[jira] [Closed] (PDFBOX-83) Processing horizontally first then horizontally

     [ https://issues.apache.org/jira/browse/PDFBOX-83?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson closed PDFBOX-83.
-----------------------------

    Resolution: Not a Problem

We already have this kind of sorting: see TextPositionComparator and PDFTextStripper#setSortByPosition.
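
For readers unfamiliar with position sorting, here is a minimal stand-alone sketch of the general idea: group text fragments into lines by their y coordinate, then order each line by x. It is not the TextPositionComparator code and does not use PDFBox at all. The Fragment class, the toLines method, the coordinates and the 2.0 tolerance are all made up for illustration, and y is assumed to grow downward from the top of the page. The sample strings are the ones from the request quoted below.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class PositionSortSketch {

    /** Hypothetical positioned text fragment; NOT PDFBox's TextPosition. */
    static final class Fragment {
        final String text;
        final double x; // left edge of the fragment
        final double y; // vertical position, assumed to grow downward

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Groups fragments into lines by y (within a tolerance), then orders each line by x. */
    static List<String> toLines(List<Fragment> fragments, double yTolerance) {
        // Sort a copy top to bottom so the caller's list is left untouched.
        List<Fragment> byY = new ArrayList<>(fragments);
        byY.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        // Start a new line whenever y moves more than the tolerance
        // away from the first fragment of the current line.
        List<List<Fragment>> lines = new ArrayList<>();
        double lineY = 0;
        for (Fragment f : byY) {
            if (lines.isEmpty() || f.y - lineY > yTolerance) {
                lines.add(new ArrayList<>());
                lineY = f.y;
            }
            lines.get(lines.size() - 1).add(f);
        }

        // Within each line, order left to right and join with single spaces.
        List<String> result = new ArrayList<>();
        for (List<Fragment> line : lines) {
            line.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            StringBuilder sb = new StringBuilder();
            for (Fragment f : line) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(f.text);
            }
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Fragments arrive column by column, as in the reported output.
        // All coordinates are invented for this example.
        List<Fragment> fragments = Arrays.asList(
            new Fragment("Centric CRM", 50, 100.0),
            new Fragment("(http://www.centriccrm.com)", 50, 112.0),
            new Fragment("Free To Use But", 200, 100.4),
            new Fragment("Restricted/Commercial", 200, 112.3),
            new Fragment("The Most Advanced Open", 350, 99.8),
            new Fragment("Source CRM Software.", 350, 111.9));

        for (String line : toLines(fragments, 2.0)) {
            System.out.println(line);
        }
    }
}
```

Grouping by y before sorting by x matters: a plain "sort by y, then by x" comparator would misorder fragments on the same visual line whose y values differ slightly. A fixed tolerance is the simplest possible choice here and would likely need tuning against font size in real documents.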

> Processing horizontally first then horizontally
> -----------------------------------------------
>
>                 Key: PDFBOX-83
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-83
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Text extraction
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552835&aid=1269072
> Originally submitted by tanvinguyen on 2005-08-24 13:11.
> I would like to see an implementation of coalescing 
> where all words are appended horizontally first, then 
> vertically.  If this feature is implemented properly, all the 
> fields of a table will be extracted and printed correctly, 
> as in the original PDF document.
> Sample: Page 2 of PDFBox References. All content of 
> column Project Name will be extracted before column 
> License.
> ===========
> Centric CRM
> (http://www.centriccrm.com)
> Free To Use But
> Restricted/Commercial
> The Most Advanced Open
> Source CRM Software.
> =============
> Thanks,
> -tan
> [attachment on SourceForge]
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552835&aid=1269072&file_id=146953
> HtmlOutputDev.h (text/plain), 8329 bytes
> This is the header file from PDFtoHTML
> [comment on SourceForge]
> Originally sent by tanvinguyen.
> Logged In: YES 
> user_id=683822
> I uploaded an RTF file converted from a PDF file using my 
> application developed in C++.
> [comment on SourceForge]
> Originally sent by tanvinguyen.
> Logged In: YES 
> user_id=683822
> Ben,
> Thanks for the quick response. Generally speaking, I highly 
> appreciate your effort in developing such a wonderful 
> open-source package.
> I am interested in developing a PDF to RTF converter.  Its 
> main features include keeping all text attributes such as 
> strikethrough, underline, font attributes, and spacing.  In the 
> past, I successfully developed an application in C++ using 
> the XPDF package and added code to do what I want.
> Now I would like to implement these features using PDFBox 
> to deploy the application in a J2EE environment.
> Here's the basic algorithm they use in XPDF.  First, they 
> build a linked list of string nodes. These string nodes contain 
> the x-y coordinates of text strings, like your TextPosition 
> instances; however, their string nodes also contain all 
> information about their coordinates, including the lower-left 
> X,Y and upper-right X,Y.  They call these xMin, yMin and 
> xMax, yMax.
> They store all these string nodes in y-major, then x, order.
> Then they coalesce and merge all string nodes with the 
> same Y-coordinate first. With this approach I was able to 
> extract and convert into RTF while maintaining the same 
> content and format as the PDF file.
> I am trying to figure out how to add extra information to your 
> TextPosition class, so that later on I will be able to traverse 
> along the major y-axis and build a list of these string nodes.
> If you can provide me with the information needed to obtain 
> all the coordinates or the position of a text string, I 
> think I will be able to implement these features. I will 
> contribute this code to your project.
> I uploaded a header file from XPDF, a sample PDF file which I 
> tried to convert and an RTF file.
> I am not trying to convert "TABLE" from PDF file.  I 
> understand that concept does not exist in PDF.
>  
> Thanks,
>  
> Tan V. Nguyen
> [comment on SourceForge]
> Originally sent by benlitchfield.
> Logged In: YES 
> user_id=601708
> Text in a PDF document is drawn at x/y locations, which 
> means there is no relationship between pieces of text drawn in 
> a column.  If you can propose an algorithm to determine columns 
> of text then I will implement it.  As a side note, there is no such 
> thing as a 'table' in a PDF document, only lines drawn between 
> two points and text drawn at x/y locations.  The only way 
> a 'column' of text could be determined is by analyzing lines on 
> the PDF document, which is not an easy thing to do.
> Ben Litchfield


