Posted to dev@poi.apache.org by bu...@apache.org on 2009/01/20 19:48:11 UTC

DO NOT REPLY [Bug 46568] New: PPTX text extraction works incorrectly, spaces line carriages removed in some cases

https://issues.apache.org/bugzilla/show_bug.cgi?id=46568

           Summary: PPTX text extraction works incorrectly, spaces line
                    carriages removed in some cases
           Product: POI
           Version: 3.5-dev
          Platform: PC
        OS/Version: Windows XP
            Status: NEW
          Severity: critical
          Priority: P2
         Component: POI Overall
        AssignedTo: dev@poi.apache.org
        ReportedBy: sreeni@sendmail.com


The PPTX issue shows up when a document is decomposed and its extracted text
is searched for a string.  For some reason, some whitespace and line breaks are
deleted during extraction.

If you try to match "Friday" where it has been concatenated with another string
(such as "otherFriday"), the match will fail.  Note that a regular expression
match will work, however.  This behavior has been observed in 3 of 8 randomly
selected PPTX files downloaded from the internet.  Document identification
seems to work just fine, so the only users of the new POI engine who would be
affected are those decomposing attachments and searching them for a simple
string (and they would only be affected on PowerPoint 2007 documents).  As
noted above, regular expression matching is a workaround that could be
employed.
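
A minimal sketch of the kind of regular expression workaround mentioned above,
assuming the search phrase spans a position where whitespace was lost.  The
class and method names (TolerantSearch, tolerantPattern, containsTolerant) and
the sample text are hypothetical and not part of POI; the report does not say
which expression was actually used.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of POI: matches a phrase even when the
// whitespace between its words was dropped or altered during extraction.
class TolerantSearch {

    // Quote each word literally and allow zero or more whitespace
    // characters between consecutive words.
    static Pattern tolerantPattern(String phrase) {
        String[] words = phrase.trim().split("\\s+");
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                regex.append("\\s*");
            }
            regex.append(Pattern.quote(words[i]));
        }
        return Pattern.compile(regex.toString());
    }

    static boolean containsTolerant(String text, String phrase) {
        return tolerantPattern(phrase).matcher(text).find();
    }

    public static void main(String[] args) {
        // Assumed example of extractor output with a missing space.
        String extracted = "See you otherFriday";

        // Simple string search fails because the space is gone.
        System.out.println(extracted.contains("other Friday"));          // false

        // The whitespace-tolerant pattern still finds the phrase.
        System.out.println(containsTolerant(extracted, "other Friday")); // true
    }
}
```

This only masks the symptom on the search side; the extracted text itself is
still missing its whitespace.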


-- 
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


--- Comment #1 from sreeni <sr...@sendmail.com>  2009-01-20 10:53:22 PST ---
Created an attachment (id=23143)
 --> (https://issues.apache.org/bugzilla/attachment.cgi?id=23143)
PPTX file to be extracted

Please extract the text from this PPTX to reproduce the problem.  The spaces
and carriage returns are missing from the extracted text.




Yegor Kozlov <ye...@dinom.ru> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED




--- Comment #2 from Yegor Kozlov <ye...@dinom.ru>  2009-04-20 11:06:44 PST ---
Fixed in r766775 ( https://svn.apache.org/viewcvs.cgi?view=rev&rev=766775 )

CTTextLineBreak elements were not properly processed, resulting in missing line
breaks.

Yegor
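
For illustration only, a rough sketch of the idea behind the fix: when walking
a DrawingML paragraph (a:p), a line break element (a:br, the element that
CTTextLineBreak represents) should contribute a newline to the extracted text
instead of being skipped.  This is not the code from r766775; it uses the JDK
DOM parser rather than POI's XMLBeans classes, and the class name
LineBreakSketch and the sample XML are assumptions.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical sketch, not POI code: extracts text from DrawingML
// paragraphs and emits a newline for every a:br element.
class LineBreakSketch {
    static final String A_NS =
            "http://schemas.openxmlformats.org/drawingml/2006/main";

    static boolean isDrawingMl(Node n, String localName) {
        return n.getNodeType() == Node.ELEMENT_NODE
                && A_NS.equals(n.getNamespaceURI())
                && localName.equals(n.getLocalName());
    }

    static String paragraphText(Node paragraph) {
        StringBuilder out = new StringBuilder();
        NodeList children = paragraph.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (isDrawingMl(child, "r")) {
                // Text run: collect the content of its a:t children.
                NodeList runChildren = child.getChildNodes();
                for (int j = 0; j < runChildren.getLength(); j++) {
                    if (isDrawingMl(runChildren.item(j), "t")) {
                        out.append(runChildren.item(j).getTextContent());
                    }
                }
            } else if (isDrawingMl(child, "br")) {
                // Line break: skipping this branch glues the runs together.
                out.append('\n');
            }
        }
        return out.toString();
    }

    // Returns the text of every a:p, each followed by a newline.
    static String extract(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList paragraphs = doc.getElementsByTagNameNS(A_NS, "p");
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < paragraphs.getLength(); i++) {
                out.append(paragraphText(paragraphs.item(i))).append('\n');
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample paragraph: two runs separated by a line break.
        String xml = "<a:p xmlns:a=\"" + A_NS + "\">"
                + "<a:r><a:t>other</a:t></a:r><a:br/>"
                + "<a:r><a:t>Friday</a:t></a:r></a:p>";
        System.out.print(extract(xml));
    }
}
```

Without the a:br branch, the two runs in the sample would come out as
"otherFriday", which matches the symptom described in the report.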
