Posted to dev@pdfbox.apache.org by "John Hewson (JIRA)" <ji...@apache.org> on 2014/06/17 22:07:14 UTC
[jira] [Closed] (PDFBOX-83) Processing horizontally first then
horizontally
[ https://issues.apache.org/jira/browse/PDFBOX-83?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
John Hewson closed PDFBOX-83.
-----------------------------
Resolution: Not a Problem
We already have this kind of sorting: see TextPositionComparator and PDFTextStripper#setSortByPosition.
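For readers who want to see what "horizontally first, then vertically" means in practice, here is a minimal, self-contained sketch. It is not PDFBox's TextPositionComparator. The Fragment class, the yTolerance parameter, and the sample coordinates are all made up for illustration, and the sketch assumes y grows down the page. It groups fragments whose y values are close into one row, orders each row by x, and emits the rows top to bottom.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class RowMajorCoalescer {

    /** Hypothetical positioned text fragment; a stand-in, not PDFBox's TextPosition. */
    public static final class Fragment {
        public final float x;
        public final float y;
        public final String text;

        public Fragment(float x, float y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /**
     * Coalesces fragments into lines: rows are formed from fragments whose y
     * is within yTolerance of the first fragment in the row, and each row is
     * read left to right. Assumes smaller y means higher on the page.
     */
    public static List<String> coalesce(List<Fragment> fragments, float yTolerance) {
        List<Fragment> byY = new ArrayList<>(fragments);
        byY.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        List<String> lines = new ArrayList<>();
        List<Fragment> row = new ArrayList<>();
        for (Fragment f : byY) {
            // Start a new row once the fragment is too far below the current row.
            if (!row.isEmpty() && Math.abs(f.y - row.get(0).y) > yTolerance) {
                lines.add(joinRow(row));
                row.clear();
            }
            row.add(f);
        }
        if (!row.isEmpty()) {
            lines.add(joinRow(row));
        }
        return lines;
    }

    /** Orders one row by x and joins its fragments with single spaces. */
    private static String joinRow(List<Fragment> row) {
        row.sort(Comparator.comparingDouble((Fragment f) -> f.x));
        StringBuilder sb = new StringBuilder();
        for (Fragment f : row) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Fragments supplied column by column, with invented coordinates.
        List<Fragment> fragments = Arrays.asList(
                new Fragment(50f, 100f, "Centric CRM"),
                new Fragment(50f, 112f, "(http://www.centriccrm.com)"),
                new Fragment(200f, 100f, "Free To Use But"),
                new Fragment(200f, 112f, "Restricted/Commercial"),
                new Fragment(350f, 100f, "The Most Advanced Open"),
                new Fragment(350f, 112f, "Source CRM Software."));
        for (String line : coalesce(fragments, 1.0f)) {
            System.out.println(line);
        }
    }
}
```

A fixed tolerance is the simplest possible row test; real documents with mixed font sizes or superscripts would likely need something tied to glyph height instead.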
> Processing horizontally first then horizontally
> -----------------------------------------------
>
> Key: PDFBOX-83
> URL: https://issues.apache.org/jira/browse/PDFBOX-83
> Project: PDFBox
> Issue Type: New Feature
> Components: Text extraction
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552835&aid=1269072
> Originally submitted by tanvinguyen on 2005-08-24 13:11.
> I would like to see the implementation of coalescing
> where all words will be appended horizontally first then
> vertically. If this feature is implemented properly, all the
> fields of a table will be extracted and printed correctly
> as in the original PDF document.
> Sample: Page 2 of PDFBox References. All Content of
> column Project Name will be extracted before Column
> License.
> ===========
> Centric CRM
> (http://www.centriccrm.com)
> Free To Use But
> Restricted/Commercial
> The Most Advanced Open
> Source CRM Software.
> =============
> Thanks,
> -tan
> [attachment on SourceForge]
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552835&aid=1269072&file_id=146953
> HtmlOutputDev.h (text/plain), 8329 bytes
> This is the header file from PDFtoHTML
> [comment on SourceForge]
> Originally sent by tanvinguyen.
> Logged In: YES
> user_id=683822
> I uploaded an RTF file converted from a PDF file using my
> application developed in C++.
> [comment on SourceForge]
> Originally sent by tanvinguyen.
> Logged In: YES
> user_id=683822
> Ben,
> Thanks for the quick response. Generally speaking, I highly
> appreciate your effort in developing such a wonderful
> open-source package.
> I am interested in developing a PDF to RTF converter. Its
> main features include keeping all text attributes such as
> strikethrough, underline, font attributes, and spacing. In the
> past, I successfully developed an application in C++ using
> the XPDF package and added code to do what I want.
> Now I would like to implement these features using PDFBox
> to deploy the application in a J2EE environment.
> Here's the basic algorithm they use in XPDF. First, they
> build a linked list of string nodes. These string nodes
> contain the x-y coordinates of text strings, like your
> TextPosition instances; however, their string nodes also
> contain full coordinate information, including the lower-left
> X,Y and upper-right X,Y. They call these yMin, yMax and
> xMin, xMax. They store all these string nodes sorted
> primarily by y, then by x.
> Then they coalesce and merge all string nodes with the
> same Y-coordinate first, which is why I was able to extract
> the text and convert it into RTF while maintaining the same
> content and format as the PDF file.
> I am trying to figure out how to add extra information to your
> TextPosition class so that, later on, I will be able to
> traverse along the major y-axis and build a list of these
> string nodes. If you can tell me how to obtain all the
> coordinate or position information for a text string, I
> think I will be able to implement these features. I will
> contribute this code to your project.
> I uploaded a header file from XPDF, a sample PDF file which I
> tried to convert and an RTF file.
> I am not trying to convert a "TABLE" from the PDF file. I
> understand that this concept does not exist in PDF.
>
> Thanks,
>
> Tan V. Nguyen
> [comment on SourceForge]
> Originally sent by benlitchfield.
> Logged In: YES
> user_id=601708
> Text in a PDF document is drawn at x/y locations, which
> means there is no relationship between pieces of text drawn
> in a column. If you can propose an algorithm to determine
> columns of text then I will implement it. As a side note,
> there is no such thing as a 'table' in a PDF document, only
> lines drawn between two points and text drawn at x/y
> locations. The only way a 'column' of text could be
> determined is by analyzing lines on the PDF document, which
> is not an easy thing to do.
> Ben Litchfield
--
This message was sent by Atlassian JIRA
(v6.2#6252)