Posted to dev@poi.apache.org by bu...@apache.org on 2009/08/25 07:50:22 UTC
DO NOT REPLY [Bug 47731] New: Word Extractor considers text copied
from some website as an embedded object
https://issues.apache.org/bugzilla/show_bug.cgi?id=47731
Summary: Word Extractor considers text copied from some website
as an embedded object
Product: POI
Version: 3.2-FINAL
Platform: PC
OS/Version: Windows Server 2003
Status: NEW
Severity: major
Priority: P2
Component: HWPF
AssignedTo: dev@poi.apache.org
ReportedBy: gi.bijlani@sap.com
--- Comment #0 from Gitu <gi...@sap.com> 2009-08-24 22:50:21 PDT ---
Hi,
I copied some text from a web page and pasted it into a Word document.
When I use WordExtractor to extract the content of that document, the
complete content is extracted, but the summary information appears multiple
times.
After investigating, I found that each part of the document is
treated as an embedded object, so the summary is extracted once per embedded
object, i.e. the same value appears that many times.
I would also like to know whether treating HTML content as an embedded object is
valid behaviour.
I have attached a document which can reproduce the scenario.
Many thanks in advance,
Gitu
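For readers on an affected version (the report is against 3.2-FINAL), the repeated summary described above could be collapsed after extraction. The following is only a minimal sketch, assuming the summary is available as a list of text lines and that the repeats are verbatim. `SummaryDedup` and `dedupeSummaryLines` are hypothetical names and not part of POI.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class SummaryDedup {

    // Hypothetical helper, not part of POI: keeps the first occurrence of each
    // non-blank line (after trimming) and drops verbatim repeats, preserving order.
    public static List<String> dedupeSummaryLines(List<String> lines) {
        Set<String> seen = new LinkedHashSet<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                seen.add(trimmed);
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // Made-up sample standing in for a summary repeated once per embedded object.
        List<String> extracted = Arrays.asList(
                "Title = Sample", "Author = Someone",
                "Title = Sample", "Author = Someone");
        for (String line : dedupeSummaryLines(extracted)) {
            System.out.println(line);
        }
    }
}
```

This should only be applied to the summary portion of the output, since it would also drop lines that are legitimately repeated in the document body.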
--
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org
DO NOT REPLY [Bug 47731] Word Extractor considers text copied from
some website as an embedded object
Posted by bu...@apache.org.
Gitu <gi...@sap.com> changed:
What       |Removed     |Added
----------------------------------------------------------------------------
Status     |NEEDINFO    |NEW
CC         |            |yegor@dinom.ru
DO NOT REPLY [Bug 47731] Word Extractor considers text copied from
some website as an embedded object
Posted by bu...@apache.org.
Gitu <gi...@sap.com> changed:
What       |Removed     |Added
----------------------------------------------------------------------------
CC         |            |gi.bijlani@sap.com
DO NOT REPLY [Bug 47731] Word Extractor considers text copied from
some website as an embedded object
Posted by bu...@apache.org.
Sergey Vladimirov <vl...@gmail.com> changed:
What       |Removed     |Added
----------------------------------------------------------------------------
Severity   |major       |enhancement
--- Comment #3 from Sergey Vladimirov <vl...@gmail.com> 2011-07-24 18:55:07 UTC ---
The text extractor extracts all text from the document itself, but not from
embedded OLE objects. Those objects can be other Word documents, Excel
spreadsheets, or vector images.
TextExtractor could be enhanced to extract text from OLE objects, but the
current behaviour is not a bug. Changing importance to
"enhancement".
DO NOT REPLY [Bug 47731] Word Extractor considers text copied from
some website as an embedded object
Posted by bu...@apache.org.
Yegor Kozlov <ye...@dinom.ru> changed:
What       |Removed     |Added
----------------------------------------------------------------------------
Status     |NEW         |NEEDINFO
--- Comment #1 from Yegor Kozlov <ye...@dinom.ru> 2009-08-31 10:05:14 PDT ---
You seem to have forgotten to attach the file. Please re-attach.
Yegor
DO NOT REPLY [Bug 47731] Word Extractor considers text copied from
some website as an embedded object
Posted by bu...@apache.org.
Sergey Vladimirov <vl...@gmail.com> changed:
What       |Removed     |Added
----------------------------------------------------------------------------
Status     |NEW         |RESOLVED
Resolution |            |FIXED
--- Comment #4 from Sergey Vladimirov <vl...@gmail.com> 2011-08-09 12:43:12 UTC ---
Fixed/improved in r1155337, will be part of 3.8-beta4 release.
DO NOT REPLY [Bug 47731] Word Extractor considers text copied from
some website as an embedded object
Posted by bu...@apache.org.
--- Comment #2 from Gitu <gi...@sap.com> 2009-08-31 22:04:02 PDT ---
Created an attachment (id=24197)
This attachment contains text copied from a web page
Attached the document!!
Thanks,
Gitu