Posted to dev@poi.apache.org by bu...@apache.org on 2017/08/29 17:42:19 UTC

[Bug 61470] New: Text with phonetic runs aren't extracted in docx

https://bz.apache.org/bugzilla/show_bug.cgi?id=61470

            Bug ID: 61470
           Summary: Text with phonetic runs aren't extracted in docx
           Product: POI
           Version: unspecified
          Hardware: PC
            Status: NEW
          Severity: normal
          Priority: P2
         Component: XWPF
          Assignee: dev@poi.apache.org
          Reporter: tallison@mitre.org
  Target Milestone: ---

Created attachment 35269
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=35269&action=edit
example file

Over on TIKA-2448, I found that our DOM model is not extracting runs within
"ruby" sections.  This means that neither the primary text ("東京") nor the
phonetic text ("とうきょう") is extracted.

The more general point is that a run (<w:r>) can contain another run...ugh!


  <w:body>
    <w:p>
      <w:r>
        <w:rPr>
           ...
        </w:rPr>
        <w:ruby>
          <w:rt>
            <w:r>
              <w:rPr>
               .....
              </w:rPr>
              <w:t>とうきょう</w:t>
            </w:r>
          </w:rt>
          <w:rubyBase>
            <w:r w:rsidR="001B7DA3">
              <w:rPr>
               ....
              </w:rPr>
              <w:t>東京</w:t>
            </w:r>
          </w:rubyBase>
        </w:ruby>
      </w:r>
    </w:p>
  </w:body>
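
To make the nesting concrete, here is a minimal sketch that pulls both strings out of XML shaped like the snippet above. It uses only the JDK's DOM parser, not POI's XWPF classes. The class name RubyRunSketch, the includePhonetic flag, and the "base (phonetic)" output format are all assumptions made for illustration; this is not how POI implements it.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical helper, not part of POI: walks WordprocessingML with plain DOM.
class RubyRunSketch {
    static final String W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of all w:t elements, descending into nested runs.
    static String extract(String xml, boolean includePhonetic) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            Document doc = f.newDocumentBuilder().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            StringBuilder out = new StringBuilder();
            walk(doc.getDocumentElement(), includePhonetic, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Recurse through w:* elements; w:ruby gets special handling.
    static void walk(Node parent, boolean includePhonetic, StringBuilder out) {
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() != Node.ELEMENT_NODE
                    || !W.equals(c.getNamespaceURI())) {
                continue;
            }
            String name = c.getLocalName();
            if ("t".equals(name)) {
                out.append(c.getTextContent());
            } else if ("ruby".equals(name)) {
                appendRuby(c, includePhonetic, out);
            } else {
                walk(c, includePhonetic, out);
            }
        }
    }

    // Emit the base text first, then (optionally) the phonetic text.
    // The parenthesized format is an arbitrary choice for this sketch.
    static void appendRuby(Node ruby, boolean includePhonetic, StringBuilder out) {
        StringBuilder base = new StringBuilder();
        StringBuilder phonetic = new StringBuilder();
        for (Node c = ruby.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() != Node.ELEMENT_NODE
                    || !W.equals(c.getNamespaceURI())) {
                continue;
            }
            if ("rubyBase".equals(c.getLocalName())) {
                walk(c, includePhonetic, base);
            } else if ("rt".equals(c.getLocalName())) {
                walk(c, includePhonetic, phonetic);
            }
        }
        out.append(base);
        if (includePhonetic && phonetic.length() > 0) {
            out.append(" (").append(phonetic).append(')');
        }
    }

    public static void main(String[] args) {
        // Unicode escapes spell the two strings from the example file.
        String xml = "<w:body xmlns:w=\"" + W + "\"><w:p><w:r><w:ruby>"
            + "<w:rt><w:r><w:t>\u3068\u3046\u304D\u3087\u3046</w:t></w:r></w:rt>"
            + "<w:rubyBase><w:r><w:t>\u6771\u4EAC</w:t></w:r></w:rubyBase>"
            + "</w:ruby></w:r></w:p></w:body>";
        System.out.println(extract(xml, false));
        System.out.println(extract(xml, true));
    }
}
```

The recursion in walk() is what handles the "run inside a run" case: the outer w:r is just another element to descend into, so the inner w:t elements are reached either way.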

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 61470] Text with phonetic runs aren't extracted in docx

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=61470

studio test <te...@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |major
                 OS|All                         |Windows 7



[Bug 61470] Text with phonetic runs aren't extracted in docx

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=61470

Tim Allison <ta...@mitre.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 OS|                            |All

--- Comment #3 from Tim Allison <ta...@mitre.org> ---
Spam?



[Bug 61470] Text with phonetic runs aren't extracted in docx

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=61470

--- Comment #5 from Tim Allison <ta...@mitre.org> ---
Created attachment 35275
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=35275&action=edit
reason to cache...for posterity and a later issue



[Bug 61470] Text with phonetic runs aren't extracted in docx

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=61470

--- Comment #2 from studio test <te...@gmail.com> ---
Created attachment 35271
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=35271&action=edit
Test script



[Bug 61470] Text with phonetic runs aren't extracted in docx

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=61470

Tim Allison <ta...@mitre.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #35275|0                           |1
        is obsolete|                            |

--- Comment #6 from Tim Allison <ta...@mitre.org> ---
Created attachment 35276
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=35276&action=edit
reason to cache for posterity and a later issue

Correct file attached this time.



[Bug 61470] Text with phonetic runs aren't extracted in docx

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=61470

--- Comment #4 from Tim Allison <ta...@mitre.org> ---
Given this example:
162.242.228.174/docs/commoncrawl2/WI/WIFC2FI3QH64A6KHOBEDNQKLN5O5EYSS

I wonder if we should cache the phonetic content as we read through the
document and then dump it at the end.  This would allow a document to be
found via its phonetic text, without interleaving that text into the main
content and wrecking NLP applications.

For another issue...
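
A rough sketch of the caching idea, in plain Java with no POI dependency. Everything here is hypothetical (the class PhoneticCacheSketch, its text/ruby/finish methods, and the choice of a newline before the cached block); it only illustrates keeping the main text uninterrupted while still emitting the phonetic strings once at the end.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical collector, not a POI or Tika class.
class PhoneticCacheSketch {
    private final StringBuilder body = new StringBuilder();
    private final List<String> phoneticCache = new ArrayList<>();

    // Ordinary run text goes straight into the body.
    void text(String s) {
        body.append(s);
    }

    // For a ruby element, only the base text goes inline;
    // the phonetic text is held back until finish().
    void ruby(String base, String phonetic) {
        body.append(base);
        if (phonetic != null && !phonetic.isEmpty()) {
            phoneticCache.add(phonetic);
        }
    }

    // Body first, then the cached phonetic strings on a trailing line.
    String finish() {
        if (phoneticCache.isEmpty()) {
            return body.toString();
        }
        return body + "\n" + String.join(" ", phoneticCache);
    }

    public static void main(String[] args) {
        PhoneticCacheSketch c = new PhoneticCacheSketch();
        c.text("A ");
        c.ruby("\u6771\u4EAC", "\u3068\u3046\u304D\u3087\u3046");
        c.text(" B");
        System.out.println(c.finish());
    }
}
```

With this shape, a search index still sees the phonetic strings, while anything that tokenizes the body gets the base text in reading order.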



[Bug 61470] Text with phonetic runs aren't extracted in docx

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=61470

Tim Allison <ta...@mitre.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|NEW                         |RESOLVED
                 OS|                            |All

--- Comment #1 from Tim Allison <ta...@mitre.org> ---
r1806712

Added extraction of runs within ruby elements; added ability for users to
select whether or not to concatenate phonetic runs; set default toString()
behavior to include phonetic runs.
