Posted to user@poi.apache.org by Som Satpathy <so...@gmail.com> on 2009/04/28 06:06:29 UTC

Advice needed regarding embedded ole extraction

Hi,

I need some advice on extracting embedded OLE objects from Microsoft
documents such as Word and Excel files.

Is there any way to *exclude* the embedded OLE information that is returned
by wordExtractor.getText()?

For example, this is the output I get when I call Apache POI WordExtractor's
getText() on a test Word document that has other documents embedded in it:


Extracted Text--------> We have an excel sheet embedded in this doc. Test
test test test. Blah blah.blah



EMBED Excel.Sheet.8

EMBED PowerPoint.Show.8
EMBED Word.Document.8 \s



EMBED AcroExch.Document.7


I don't want the lines that start with 'EMBED' shown above. Is there any way
to filter them out using the existing Apache POI HWPF API?


Thanks & Regards
Som Ranjan

Re: Advice needed regarding embedded ole extraction

Posted by Som Satpathy <so...@gmail.com>.

Thanks a lot for the suggestion, Mark. I will try the code and see if it
helps. The 'EMBED' + ProgID information was causing a problem when I tried
to add the extracted text, including the embedded information, to an XML
file of mine. wordExtractor.getText() returned some mojibake for the
embedded objects, so the resulting XML was never accepted.
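
One likely reason for the XML being rejected is that the field markers
\u0013, \u0014 and \u0015 are control characters, and XML 1.0 does not allow
them anywhere in a document. The following is a minimal sketch of a filter
that drops every character outside the XML 1.0 Char range. The class and
method names are made up for illustration and are not part of POI.

```java
final class XmlTextCleaner {
    // True if the code point is allowed by the XML 1.0 "Char" production.
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of text without the characters that XML 1.0 forbids.
    static String removeIllegalXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isLegalXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            // Step over one or two chars, depending on the code point.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical extractor output: a field wrapped in control characters.
        String raw = "Blah\u0013 EMBED Excel.Sheet.8 \u0014\u0001\u0015 blah";
        System.out.println(removeIllegalXmlChars(raw));
    }
}
```

This only removes the illegal characters. The visible 'EMBED ...' text is left
in place and has to be stripped separately, and characters such as & and <
still need the usual XML escaping.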

My Word document has an embedded Excel sheet, a PowerPoint presentation and a
PDF; maybe that's why stripFields() didn't work. While working with
stripFields(), I found that it removes only one set of \u0013, \u0014 and
\u0015 at a time.

I still don't understand why wordExtractor.getText() returns unrecognizable
characters for the embedded objects. I would have expected them to be
omitted, since the embedded text can still be read through event listeners.

But thanks for your input; I will investigate further.


Cheers
Som Ranjan

On Sun, May 10, 2009 at 2:32 PM, MSB <ma...@tiscali.co.uk> wrote:


Re: Advice needed regarding embedded ole extraction

Posted by MSB <ma...@tiscali.co.uk>.

Over the last day or so, I have had the opportunity to dig around a little.
First, I made myself a test document by embedding one Word document in
another, using the Insert...Object...Create Object From File menu options to
insert the EMBED field into my test document.

Next, I used WordExtractor to recover the contents of the document and
found that the EMBED field was returned. I then called the stripFields()
method and it worked as Nick suggested it should; the EMBED field was
removed from the paragraph text. It seems, therefore, as though there is
something different about the EMBED fields in your document.

When we were working on 'our' code, Christian and I used a very simple piece
of code to look at the structure of the fields:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.poi.hwpf.extractor.WordExtractor;

public class FieldStructureDump {
    public static void main(String[] args) throws IOException {
        StringBuilder charString = new StringBuilder();
        StringBuilder intString = new StringBuilder();
        StringBuilder hexString = new StringBuilder();

        // Close the stream when done, even if extraction fails.
        try (FileInputStream fis =
                 new FileInputStream(new File("C:\\temp\\embedded.doc"))) {
            WordExtractor we = new WordExtractor(fis);
            // One entry per paragraph, field control characters included.
            for (String item : we.getParagraphText()) {
                for (char aChar : item.toCharArray()) {
                    // Every column is seven characters wide so that the
                    // three rows line up under one another.
                    charString.append(aChar).append("      ");
                    intString.append(String.format("%-7d", (int) aChar));
                    hexString.append(String.format("\\u%04x ", (int) aChar));
                }
            }
        }

        System.out.println("Characters:     [" + charString + " ]");
        System.out.println("Numeric Values: [" + intString + " ]");
        System.out.println("Hex Values:     [" + hexString + " ]");
    }
}

Running that against my test file showed us that the fields had the
following structure;

{ INSTRUCTION } CURRENT VALUE }

The opening and closing braces were in fact control characters with the
Unicode values \u0013, \u0014 and \u0015 respectively. Between \u0013 and
\u0014 was the instruction - EMBED Word.Document.8 for example - and between
\u0014 and \u0015 was the current value, if any. As you no doubt know, fields
can be used to insert a very wide range of values, such as the date the
document was created, and the current value may be stored when the user saves
the file.
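
The structure described above suggests one way to drop fields even when one
field sits inside another: count how many fields are currently open and copy
only the characters seen while that count is zero. The following is a sketch
under that assumption. It is not POI's stripFields() implementation, and
FieldStripper and stripAllFields are hypothetical names.

```java
final class FieldStripper {
    static final char FIELD_BEGIN = '\u0013';     // opens a field, instruction follows
    static final char FIELD_SEPARATOR = '\u0014'; // instruction ends, current value follows
    static final char FIELD_END = '\u0015';       // closes the field

    // Removes every field, including fields nested inside other fields.
    static String stripAllFields(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int depth = 0; // number of fields currently open
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == FIELD_BEGIN) {
                depth++;
            } else if (c == FIELD_END) {
                // A stray end marker outside any field is simply dropped.
                if (depth > 0) {
                    depth--;
                }
            } else if (depth == 0 && c != FIELD_SEPARATOR) {
                // Ordinary text outside all fields is kept.
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical paragraph text with one EMBED field in the middle.
        String raw = "Test test.\u0013 EMBED Word.Document.8 \u0014\u0001\u0015 Blah blah.";
        System.out.println(stripAllFields(raw));
    }
}
```

Note that this discards the current value as well as the instruction, so a
date field, for example, would lose its displayed date. If a field is never
closed, everything after its \u0013 is dropped too.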

If you run the simple code above against your file with the EMBED fields,
it may help to identify whether there are any differences in the field
structure.



-- 
View this message in context: http://www.nabble.com/Advice-needed-regarding-embedded-ole-extraction-tp23269803p23468348.html
Sent from the POI - User mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org

Re: Advice needed regarding embedded ole extraction

Posted by MSB <ma...@tiscali.co.uk>.

This is deeply embarrassing because I completely forgot about a piece of code
that Christian and I worked on some months ago. It was designed to extract
the merge fields from a Word document and it may provide a template that you
could use to address this problem - assuming the problem still exists, that is.

You can download the code from here;

http://rapidshare.com/files/208737669/MailMerge.rar

Just copy and paste the address into a browser, choose the free download
option and, once you have the archive, unzip it into a folder somewhere. The
first thing to do is have a look at the FieldDelimiters class. As Nick's
last reply suggested, Word uses delimiters placed within the file's contents
to indicate that what follows is not text but something 'special'. Christian
and I used the POIFSViewer class that is part of POI to identify the
delimiters that surrounded a field, and you can do something similar to
identify those that surround the OLE insertion. As you will see, I chose to
use the numeric value of the 'special' characters in my searches, whilst I
think that the stripFields() method probably uses their hex value.

Next, have a look at the MergeMasterCheck class because this is where the
action occurs. It uses the field delimiters to identify and extract the
merge fields from the document's text. I guess that you want to do the
reverse - get at the text and leave everything else behind - but it should
be easy enough to modify the existing code to accomplish this.
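
To illustrate the extraction direction described above, here is a rough
sketch that collects the instruction text of each field, that is, whatever
sits between \u0013 and the next \u0014 or \u0015. It is not the code from the
MailMerge archive, which is not reproduced in this thread;
FieldInstructionLister and listInstructions are hypothetical names, and the
delimiter values are the ones reported elsewhere in this thread.

```java
import java.util.ArrayList;
import java.util.List;

final class FieldInstructionLister {
    // Returns the trimmed instruction of every field found in text.
    static List<String> listInstructions(String text) {
        List<String> instructions = new ArrayList<>();
        StringBuilder current = null; // non-null while reading an instruction
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\u0013') {
                // A nested field cuts the outer instruction short.
                if (current != null) {
                    instructions.add(current.toString().trim());
                }
                current = new StringBuilder();
            } else if (c == '\u0014' || c == '\u0015') {
                // Separator or end marker: the instruction is complete.
                if (current != null) {
                    instructions.add(current.toString().trim());
                    current = null;
                }
            } else if (current != null) {
                current.append(c);
            }
        }
        return instructions;
    }

    public static void main(String[] args) {
        // Hypothetical paragraph text containing two EMBED fields.
        String raw = "Intro\u0013 EMBED Excel.Sheet.8 \u0014\u0001\u0015"
                   + " more\u0013 EMBED AcroExch.Document.7 \u0015";
        for (String instruction : listInstructions(raw)) {
            System.out.println(instruction);
        }
    }
}
```

Listing the instructions this way is mainly useful for checking which fields a
document contains before deciding what to strip.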

Sorry it took so long for me to remember about this work and I hope that it
can help now. It had been our intention to submit it to the developers of
POI but I was uncertain about it and simply decided not to - silly in
retrospect I suppose. If you want to discuss it further, just drop me an
email.


-- 
View this message in context: http://www.nabble.com/Advice-needed-regarding-embedded-ole-extraction-tp23269803p23457846.html
Sent from the POI - User mailing list archive at Nabble.com.

Re: Advice needed regarding embedded ole extraction

Posted by Nick Burch <ni...@torchbox.com>.

On Tue, 28 Apr 2009, Som Satpathy wrote:
> Yes I tried using stripFields(). It strips some part of the unwanted 
> text (with the EMBED tag), but some part still remains.

If you pass the problem text through a hexdump, do you see any special
characters surrounding the EMBED text? It's possible that there's another
set of field markers, other than the 0x13-0x15 ones which stripFields
already handles.

(It'd be good to look both before and after stripFields has been called)
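
If a hexdump tool is not to hand, a few lines of Java can report just the
control characters and where they occur. This is a minimal sketch with made-up
names (ControlCharReport, describeControlChars); running it on the text both
before and after stripFields should show which markers survive.

```java
final class ControlCharReport {
    // Returns one line per control character: its index and its hex value.
    static String describeControlChars(String text) {
        StringBuilder report = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // isISOControl covers 0x00-0x1f and 0x7f-0x9f, so tabs and
            // line breaks are reported as well.
            if (Character.isISOControl(c)) {
                report.append(String.format("index %d: 0x%02x\n", i, (int) c));
            }
        }
        return report.toString();
    }

    public static void main(String[] args) {
        // Hypothetical extractor output containing one EMBED field.
        String raw = "Blah\u0013 EMBED PowerPoint.Show.8 \u0014\u0001\u0015";
        System.out.print(describeControlChars(raw));
    }
}
```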

Nick

Re: Advice needed regarding embedded ole extraction

Posted by Som Satpathy <so...@gmail.com>.

Yes I tried using stripFields(). It strips some part of the unwanted text
(with the EMBED tag), but some part still remains.

I suspect the problem might be with the encoding format of the "embedded
object strings" (the ones starting with EMBED tag and ending with embedded
doc's progID).

The stripFields() does not strip all of the encoded text.

Regards
Som Ranjan

On Tue, Apr 28, 2009 at 2:44 PM, Nick Burch <ni...@torchbox.com> wrote:

> On Tue, 28 Apr 2009, Som Satpathy wrote:
>
>> Is there any way by which we can *exclude* embedded ole information which
>> we
>> get on calling *wordExtractor.getText() ?
>>
>
> Did you try stripFields?
>
> http://poi.apache.org/apidocs/org/apache/poi/hwpf/extractor/WordExtractor.html#stripFields(java.lang.String)
>
> Nick

Re: Advice needed regarding embedded ole extraction

Posted by Nick Burch <ni...@torchbox.com>.

On Tue, 28 Apr 2009, Som Satpathy wrote:
> Is there any way by which we can *exclude* embedded ole information which we
> get on calling *wordExtractor.getText() ?

Did you try stripFields?
http://poi.apache.org/apidocs/org/apache/poi/hwpf/extractor/WordExtractor.html#stripFields(java.lang.String)

Nick
