Posted to commits@pdfbox.apache.org by ms...@apache.org on 2016/12/11 09:04:55 UTC

[1/2] pdfbox-docs git commit: PDFBOX-3330: add information about setting the text order for text extraction

Repository: pdfbox-docs
Updated Branches:
  refs/heads/master f34701e19 -> f03ca583e


PDFBOX-3330: add information about setting the text order for text extraction


Project: http://git-wip-us.apache.org/repos/asf/pdfbox-docs/repo
Commit: http://git-wip-us.apache.org/repos/asf/pdfbox-docs/commit/7e9ff660
Tree: http://git-wip-us.apache.org/repos/asf/pdfbox-docs/tree/7e9ff660
Diff: http://git-wip-us.apache.org/repos/asf/pdfbox-docs/diff/7e9ff660

Branch: refs/heads/master
Commit: 7e9ff6603fb5266ca9510406e9d703c21b08ae2b
Parents: f34701e
Author: Maruan Sahyoun <sa...@fileaffairs.de>
Authored: Sun Dec 11 09:57:21 2016 +0100
Committer: Maruan Sahyoun <sa...@fileaffairs.de>
Committed: Sun Dec 11 09:57:21 2016 +0100

----------------------------------------------------------------------
 content/2.0/faq.md | 11 +++++++++++
 1 file changed, 11 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/pdfbox-docs/blob/7e9ff660/content/2.0/faq.md
----------------------------------------------------------------------
diff --git a/content/2.0/faq.md b/content/2.0/faq.md
index d42750f..f1a9739 100644
--- a/content/2.0/faq.md
+++ b/content/2.0/faq.md
@@ -38,6 +38,7 @@ title:   Frequently Asked Questions (FAQ)
 
 ### Text Extraction
 
+ - [Why does the extracted text appear in the wrong sequence?](#textorder)
  - [How come I am not getting any text from the PDF document?](#notext)
  - [How come I am getting gibberish(G38G43G36G51G5) when extracting text?](#gibberish)
  - [What does "java.io.IOException: Can't handle font width" mean?](#fontwidth)
@@ -127,6 +128,16 @@ Make sure that you closed your content stream before saving.
 
 ## Text Extraction
 
+<a name="textorder"></a>
+
+
+## Why does the extracted text appear in the wrong sequence?
+
+By default, text extraction is done in the same sequence as the text in the PDF page content stream.
+PDF is a graphic format, not a text format, and unlike HTML, it has no requirement that text on a page
+be rendered in a certain order. The order is the one that was determined by the software that created the PDF.
+To get text sorted from left to right and top to bottom, use `setSortByPosition(true)`.
+
 <a name="notext"></a>
 
 ### How come I am not getting any text from the PDF document? ###
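
Note on the text-order entry added above: the difference between content stream order and position order can be illustrated with a small standalone sketch. This is not PDFBox code and not how PDFBox implements sorting; `Fragment`, `join`, `sortByPosition` and `sample` are hypothetical names invented for the illustration, and the coordinates are made up. It assumes y grows downward and that fragments on the same line share exactly the same y, which a real extractor cannot rely on.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Conceptual sketch only: NOT PDFBox's implementation of position sorting.
class TextOrderSketch {

    // Hypothetical stand-in for one positioned piece of text on a page.
    // Assumption: y grows downward, so a smaller y is closer to the top.
    static final class Fragment {
        final String text;
        final float x;
        final float y;

        Fragment(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Joins fragments in the order given, separated by single spaces.
    static String join(List<Fragment> fragments) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    // Returns a sorted copy: top to bottom first, then left to right.
    static List<Fragment> sortByPosition(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator
                .comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        return sorted;
    }

    // Made-up fragments, listed in the order a PDF producer might emit them.
    static List<Fragment> sample() {
        return Arrays.asList(
                new Fragment("World", 100f, 50f),
                new Fragment("Hello", 20f, 50f),
                new Fragment("Footer", 20f, 700f),
                new Fragment("Title", 20f, 10f));
    }

    public static void main(String[] args) {
        List<Fragment> stream = sample();
        System.out.println(join(stream));                 // emission order
        System.out.println(join(sortByPosition(stream))); // reading order
    }
}
```

With the sample data, the first line prints `World Hello Footer Title` and the second prints `Title Hello World Footer`. In PDFBox itself none of this has to be written by hand; per the FAQ entry, calling `setSortByPosition(true)` on the text stripper is enough.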


[2/2] pdfbox-docs git commit: PDFBOX-3330: add information about text antialiasing for rendering

Posted by ms...@apache.org.
PDFBOX-3330: add information about text antialiasing for rendering


Project: http://git-wip-us.apache.org/repos/asf/pdfbox-docs/repo
Commit: http://git-wip-us.apache.org/repos/asf/pdfbox-docs/commit/f03ca583
Tree: http://git-wip-us.apache.org/repos/asf/pdfbox-docs/tree/f03ca583
Diff: http://git-wip-us.apache.org/repos/asf/pdfbox-docs/diff/f03ca583

Branch: refs/heads/master
Commit: f03ca583e4645ba5a08c63d9edfb73005fba5f22
Parents: 7e9ff66
Author: Maruan Sahyoun <sa...@fileaffairs.de>
Authored: Sun Dec 11 10:04:38 2016 +0100
Committer: Maruan Sahyoun <sa...@fileaffairs.de>
Committed: Sun Dec 11 10:04:38 2016 +0100

----------------------------------------------------------------------
 content/2.0/faq.md | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/pdfbox-docs/blob/f03ca583/content/2.0/faq.md
----------------------------------------------------------------------
diff --git a/content/2.0/faq.md b/content/2.0/faq.md
index f1a9739..4fce7b0 100644
--- a/content/2.0/faq.md
+++ b/content/2.0/faq.md
@@ -47,7 +47,8 @@ title:   Frequently Asked Questions (FAQ)
 
 ### PDF rendering
 
- - [A drop shadow is missing or at the wrong position when rendering a page](#dropshadow)  
+ - [A drop shadow is missing or at the wrong position when rendering a page](#dropshadow)
+ - [Why are some texts in poor quality and not antialiased?](#textantialias)
 
 ## General Questions
 
@@ -131,7 +132,7 @@ Make sure that you closed your content stream before saving.
 <a name="textorder"></a>
 
 
-## Why does the extracted text appear in the wrong sequence?
+### Why does the extracted text appear in the wrong sequence?
 
 By default, text extraction is done in the same sequence as the text in the PDF page content stream.
 PDF is a graphic format, not a text format, and unlike HTML, it has no requirement that text on a page
@@ -197,4 +198,14 @@ the word "Hello" is drawn.
 
 ### A drop shadow is missing or at the wrong position when rendering a page
 
-Please attach your file in the [PDFBOX-3000](https://issues.apache.org/jira/browse/PDFBOX-3000) issue
+Please attach your file in the [PDFBOX-3000](https://issues.apache.org/jira/browse/PDFBOX-3000) issue.
+
+<a name="textantialias"></a>
+
+### Why are some texts in poor quality and not antialiased?
+
+This is because in some PDFs (e.g. the one in PDFBOX-2814 <https://issues.apache.org/jira/browse/PDFBOX-2814>), text is not
+rendered directly, but as a shaped clipping from a background. Java graphics does not support "soft clipping"
+<https://bugs.openjdk.java.net/browse/JDK-4212743>, and because of that, the edges do not look smooth.
+Soft clipping could be achieved with some extra steps <https://community.oracle.com/blogs/campbell/2006/07/19/java-2d-trickery-soft-clipping>,
+but these would cost additional time and memory. You can get higher quality by rendering at a higher DPI and then downscaling the image.
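
Note on the antialiasing entry added above: the "render at a higher DPI, then downscale" workaround could look roughly like the sketch below. It covers only the downscaling half and uses plain `java.awt`; the class name `DownscaleSketch`, the method `downscale` and the integer `factor` parameter are hypothetical and chosen for illustration. It assumes the oversized image has already been produced, for example with `PDFRenderer.renderImageWithDPI(pageIndex, dpi)` in PDFBox 2.0 at a multiple of the target resolution.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Minimal sketch: shrink an oversampled page image back to the target size.
class DownscaleSketch {

    // Scales src down by an integer factor, e.g. 4 for a page rendered at
    // 288 dpi that should end up at 72 dpi. The result is an opaque RGB image.
    static BufferedImage downscale(BufferedImage src, int factor) {
        if (factor < 1) {
            throw new IllegalArgumentException("factor must be >= 1");
        }
        int w = Math.max(1, src.getWidth() / factor);
        int h = Math.max(1, src.getHeight() / factor);
        BufferedImage dst = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        try {
            // Interpolate between source pixels instead of picking the nearest one,
            // which is what smooths the hard clip edges.
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            g.setRenderingHint(RenderingHints.KEY_RENDERING,
                    RenderingHints.VALUE_RENDER_QUALITY);
            g.drawImage(src, 0, 0, w, h, null);
        } finally {
            g.dispose();
        }
        return dst;
    }
}
```

As the FAQ entry says of soft clipping, this also costs time and memory: an image rendered at four times the DPI holds sixteen times as many pixels. A single bilinear pass is the simplest option; for large factors, scaling down in several smaller steps may give smoother edges.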