Posted to dev@pdfbox.apache.org by "George Van Treeck (JIRA)" <ji...@apache.org> on 2009/05/24 21:01:45 UTC
[jira] Issue Comment Edited: (PDFBOX-83) Processing horizontally first then horizontally

    [ https://issues.apache.org/jira/browse/PDFBOX-83?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712589#action_12712589 ] 

George Van Treeck edited comment on PDFBOX-83 at 5/24/09 12:00 PM:
-------------------------------------------------------------------

I just tried the latest version and ran into the issue described here (jumbled text in a PDF table). I think the following algorithm might fix the problem.

First, sort all text items into sets that share the same x coordinate; that is, assume that vertically adjacent text items with the same x coordinate all belong to one table cell. For each set, select a text item and locate a horizontally adjacent text item (same y coordinate). If the adjacent item belongs to another set of text items that all share an x coordinate, then it is part of a different table cell, so you should concatenate all the text items in the first set and then concatenate all the text items in the adjacent set.
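
To make the grouping above concrete, here is a minimal Java sketch. It is only an illustration under assumptions: `Item` is a hypothetical stand-in for a positioned text item (not a PDFBox class), y is assumed to increase down the page, and "same x coordinate" is loosened to "falls in the same tolerance-wide bucket". It also simplifies by treating each x-aligned set as a single cell.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

public class ColumnGrouper {

    /** Hypothetical positioned text item; a stand-in, not a PDFBox class. */
    public static final class Item {
        final double x;
        final double y;
        final String text;

        public Item(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /**
     * Groups items whose x coordinates fall in the same tolerance-wide bucket,
     * then concatenates each group from top to bottom.
     * Groups are returned from left to right.
     */
    public static List<String> concatenateByColumn(List<Item> items, double tolerance) {
        // TreeMap keeps the x buckets ordered from left to right.
        TreeMap<Long, List<Item>> sets = new TreeMap<>();
        for (Item item : items) {
            long bucket = Math.round(item.x / tolerance);
            sets.computeIfAbsent(bucket, k -> new ArrayList<>()).add(item);
        }

        List<String> cells = new ArrayList<>();
        for (List<Item> set : sets.values()) {
            // Assumption: smaller y means higher on the page.
            set.sort(Comparator.comparingDouble((Item i) -> i.y));
            StringBuilder cell = new StringBuilder();
            for (Item item : set) {
                if (cell.length() > 0) {
                    cell.append(' ');
                }
                cell.append(item.text);
            }
            cells.add(cell.toString());
        }
        return cells;
    }
}
```

Bucketing by rounding is a crude choice: two items that sit just either side of a bucket boundary end up in different sets, so a real implementation would probably need a smarter clustering step.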

      was (Author: treeck@yahoo.com):
    I just tried the latest version and ran into the issue described here (jumbled text in a PDF table). I think the following algorithm might fix the problem.

First, sort all text items into sets that share the same y coordinate; that is, assume that vertically adjacent text items with the same y coordinate all belong to one table cell. For each set, select a text item and locate a horizontally adjacent text item. If the adjacent item belongs to another set of text items that all share a y coordinate, then it is part of a different table cell, so you should concatenate all the text items in the first set and then concatenate all the text items in the adjacent set.
  
> Processing horizontally first then horizontally
> -----------------------------------------------
>
>                 Key: PDFBOX-83
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-83
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Text extraction
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552835&aid=1269072
> Originally submitted by tanvinguyen on 2005-08-24 13:11.
> I would like to see the implementation of coalescing
> where all words will be appended horizontally first, then
> vertically.  If this feature is implemented properly, all the
> fields of a table will be extracted and printed correctly,
> as in the original PDF document.
> Sample: Page 2 of PDFBox References. All content of
> column Project Name will be extracted before column
> License.
> ===========
> Centric CRM
> (http://www.centriccrm.com)
> Free To Use But
> Restricted/Commercial
> The Most Advanced Open
> Source CRM Software.
> =============
> Thanks,
> -tan
> [attachment on SourceForge]
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552835&aid=1269072&file_id=146953
> HtmlOutputDev.h (text/plain), 8329 bytes
> This is the header file from PDFtoHTML
> [comment on SourceForge]
> Originally sent by tanvinguyen.
> Logged In: YES 
> user_id=683822
> I uploaded an RTF file converted from a PDF file using my
> application developed in C++.
> [comment on SourceForge]
> Originally sent by tanvinguyen.
> Logged In: YES 
> user_id=683822
> Ben,
> Thanks for the quick response. Generally speaking, I highly
> appreciate your effort in developing such a wonderful
> open-source package.
> I am interested in developing a PDF to RTF converter.  Its
> main features include keeping all text attributes such as
> strikethrough, underline, font attributes, and spacing.  In the
> past, I successfully developed an application in C++ using 
> XPDF package and added code to do what I want.
> Now I would like to implement these features using PDFBox 
> to deploy the application in a J2EE environment.
> Here's the basic algorithm they use in XPDF.  First, they
> build a linked list of string nodes. These string nodes contain the
> x-y coordinates of text strings, like your TextPosition
> instance; however, their string nodes also contain all
> information about their coordinates, including lower-left X,Y
> and upper-right X,Y.  They call these yMin, yMax and xMin, xMax.
> They store all these string nodes in y-major, then x, order.
> Then they coalesce and merge all string nodes with the
> same Y-coordinate first. That is how I was able to extract the
> text and convert it into RTF while maintaining the same content
> and format as the PDF file.
> I am trying to figure out how to add extra information to your
> TextPosition class so that, later on, I will be able to traverse
> along the major y-axis and build a list of these string nodes.
> If you can provide me with the information needed to obtain all
> the coordinates or the position of a text string, I
> think I will be able to implement these features. I will
> contribute this code to your project.
> I uploaded a header file from XPDF, a sample PDF file which I 
> tried to convert and an RTF file.
> I am not trying to convert a "TABLE" from the PDF file.  I
> understand that this concept does not exist in PDF.
>  
> Thanks,
>  
> Tan V. Nguyen
> [comment on SourceForge]
> Originally sent by benlitchfield.
> Logged In: YES 
> user_id=601708
> Text in a PDF document is drawn at x/y locations, which
> means there is no relationship between pieces of text drawn in a column.  If
> you can propose an algorithm to determine columns of text,
> then I will implement it.  As a side note, there is no such
> thing as a 'table' in a PDF document, only lines drawn between
> two points and text drawn at x/y locations.  The only way
> a 'column' of text could be determined is by analyzing lines on the
> PDF document, which is not an easy thing to do.
> Ben Litchfield
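
For reference, the row-first coalescing described in the quoted SourceForge comment (string nodes carrying xMin/yMin/xMax/yMax, ordered by y and then x, and merged when they share a y coordinate) might be sketched in Java roughly as follows. This is not XPDF's or PDFBox's actual code. `StringNode` and `coalesceRows` are hypothetical names, the `yTolerance` parameter is an added assumption for deciding when two nodes share a row, and y is assumed to increase down the page.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class RowCoalescer {

    /** Hypothetical string node with a bounding box; field names follow the quoted comment. */
    public static final class StringNode {
        final double xMin;
        final double yMin;
        final double xMax;
        final double yMax;
        final String text;

        public StringNode(double xMin, double yMin, double xMax, double yMax, String text) {
            this.xMin = xMin;
            this.yMin = yMin;
            this.xMax = xMax;
            this.yMax = yMax;
            this.text = text;
        }
    }

    /** Merges nodes that share a row (yMin within yTolerance of the row's first node) into one line each. */
    public static List<String> coalesceRows(List<StringNode> nodes, double yTolerance) {
        // Order by y first so that nodes on the same row become neighbours.
        List<StringNode> sorted = new ArrayList<>(nodes);
        sorted.sort(Comparator.comparingDouble((StringNode n) -> n.yMin));

        // Split the y-ordered list into rows.
        List<List<StringNode>> rows = new ArrayList<>();
        for (StringNode node : sorted) {
            List<StringNode> last = rows.isEmpty() ? null : rows.get(rows.size() - 1);
            if (last != null && Math.abs(node.yMin - last.get(0).yMin) <= yTolerance) {
                last.add(node);
            } else {
                List<StringNode> row = new ArrayList<>();
                row.add(node);
                rows.add(row);
            }
        }

        // Within each row, append the text from left to right.
        List<String> lines = new ArrayList<>();
        for (List<StringNode> row : rows) {
            row.sort(Comparator.comparingDouble((StringNode n) -> n.xMin));
            StringBuilder line = new StringBuilder();
            for (StringNode node : row) {
                if (line.length() > 0) {
                    line.append(' ');
                }
                line.append(node.text);
            }
            lines.add(line.toString());
        }
        return lines;
    }
}
```

The sketch only uses xMin and yMin. The xMax and yMax fields are kept because the comment mentions them, and they would be useful for deciding whether two neighbouring nodes are close enough to merge without a space.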
