Posted to user@poi.apache.org by Fazekas Imre <Im...@it-services.hu> on 2011/02/18 11:48:05 UTC

how-to extract textual embedded content

Dear all,

I tried to use the POI library to extract text documents from an Excel
file. They are not Word documents, Excel files, or images, just simple text
files embedded in an Excel document.

Could anyone give me a tip on how to extract them?

Thank you in advance!

Kind regards,

Imre Fazekas


RE: how-to extract textual embedded content

Posted by Mark Beardsley <ma...@tiscali.co.uk>.
Thanks for that. I will look at some code as soon as I have the opportunity
but cannot promise when that will be. We have a bit of an emergency here at
the moment as an act of vandalism has left a dipping platform we made in a
parlous state and our client wants it repaired as soon as possible, i.e.
now. That means all day tomorrow I suspect if I am able to get the materials
ordered this evening. Anyway, enough of my problems, I will post as soon as,
and if, I manage to find any information.

Yours

Mark B
-- 
View this message in context: http://apache-poi.1045710.n5.nabble.com/how-to-extract-textual-embedded-content-tp3390878p3395821.html
Sent from the POI - User mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


RE: how-to extract textual embedded content

Posted by Nina <sn...@gmail.com>.
Hi,
I would like to use the XSSF event user model to get all embedded content.
Should I first get all the sheet parts, or the parts of the OPCPackage
container? Getting the sheet parts doesn't seem to help :(... Could anybody
provide sample code for this?
The Apache site has sample code only for the XSSF usermodel :( Kindly help.

--
View this message in context: http://apache-poi.1045710.n5.nabble.com/how-to-extract-textual-embedded-content-tp3390878p4801808.html
Sent from the POI - User mailing list archive at Nabble.com.
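
For the XSSF question above, one possible starting point is sketched below.
It does not use the XSSF event API or any POI class at all: it relies on the
fact that an .xlsx file is a ZIP archive, and it assumes that Excel stored the
embedded objects under the xl/embeddings/ folder, which is typical but not
guaranteed. The names EmbeddedPartLister and listEmbeddedParts are made up for
illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper, not part of POI: lists embedded parts of an .xlsx file. */
final class EmbeddedPartLister {

    /** Assumed location of embedded objects inside the ZIP container. */
    private static final String EMBEDDINGS_PREFIX = "xl/embeddings/";

    /** Returns the names of all ZIP entries stored under xl/embeddings/. */
    static List<String> listEmbeddedParts(InputStream xlsx) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(xlsx)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if (!e.isDirectory() && e.getName().startsWith(EMBEDDINGS_PREFIX)) {
                    names.add(e.getName());
                }
            }
        }
        return names;
    }
}
```

Calling listEmbeddedParts with a stream opened on the workbook returns the
entry names only. Reading the bytes of each entry, and working out which sheet
refers to it, would still have to be done separately.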


RE: how-to extract textual embedded content

Posted by Mark Beardsley <ma...@tiscali.co.uk>.
Not good news I am afraid. I have not yet been able to dig out from the file
the location information for the embedded files. In a previous message, you
indicated that you were able to get at the icons used to represent the file
in the worksheet. To save me time, can I ask how you did this please? It may
be that the location information is tied to the icon and not the file.

Yours

Mark B
-- 
View this message in context: http://apache-poi.1045710.n5.nabble.com/how-to-extract-textual-embedded-content-tp3390878p3398768.html
Sent from the POI - User mailing list archive at Nabble.com.


RE: how-to extract textual embedded content

Posted by Fazekas Imre <Im...@it-services.hu>.
Dear Mark,

Thank you for the time you have spared for this. :)
Yes, my worksheet is simple: a single sheet with text files embedded
into cells. I can save the text files, but interpreting their meaning
also requires finding the cell each one belongs to, and I was unable to
do that. An Excel file may contain several embedded text files; you can
assume about 10 files on average, I guess ...


Best regards,

Imre


-----Original Message-----
From: Mark Beardsley [mailto:markbrdsly@tiscali.co.uk] 
Sent: Monday, February 21, 2011 5:02 PM
To: user@poi.apache.org
Subject: RE: how-to extract textual embedded content


I may get the chance to play around with some code tomorrow, most likely
during the evening here. Can I just check that all you have is a worksheet
with one or more text files embedded into it and that each file is
represented by an icon? This will allow me to create a similar test workbook
to have a play with.

Yours

Mark B
-- 
View this message in context: http://apache-poi.1045710.n5.nabble.com/how-to-extract-textual-embedded-content-tp3390878p3394333.html
Sent from the POI - User mailing list archive at Nabble.com.


RE: how-to extract textual embedded content

Posted by Mark Beardsley <ma...@tiscali.co.uk>.
I may get the chance to play around with some code tomorrow, most likely
during the evening here. Can I just check that all you have is a worksheet
with one or more text files embedded into it and that each file is
represented by an icon? This will allow me to create a similar test workbook
to have a play with.

Yours

Mark B
-- 
View this message in context: http://apache-poi.1045710.n5.nabble.com/how-to-extract-textual-embedded-content-tp3390878p3394333.html
Sent from the POI - User mailing list archive at Nabble.com.


RE: how-to extract textual embedded content

Posted by Fazekas Imre <Im...@it-services.hu>.
OK, let me attach a simple document.

I'm able to extract the content of the embedded text file and the icon
from the sheet, but I cannot find a way to get the position of the
embedded content. :(
Any help is much appreciated!


Best regards,

Imre


-----Original Message-----
From: Mark Beardsley [mailto:markbrdsly@tiscali.co.uk] 
Sent: Monday, February 21, 2011 8:30 AM
To: user@poi.apache.org
Subject: RE: how-to extract textual embedded content


If I am correct - and that is another if - then you will have to get the root
element of the document from the POIFS object, recover a reference to the
list of directory/document entry objects it manages and iterate through them
to locate the anchor or anchors for the embedded objects. I have done this
for pictures as I knew they had an associated anchor but never for embedded
objects as I am unsure whether they do. The problem we hit when getting at
the image's anchor is that while it did contain the location information it
was then impossible to tie a specific anchor back to one image and I fear
you will encounter this problem again. If I have the chance, I will play with
some code to see what happens but I would like to ask whether you could
upload a sample file for me to work with please?

Yours

Mark B
-- 
View this message in context: http://apache-poi.1045710.n5.nabble.com/how-to-extract-textual-embedded-content-tp3390878p3393750.html
Sent from the POI - User mailing list archive at Nabble.com.


RE: how-to extract textual embedded content

Posted by Mark Beardsley <ma...@tiscali.co.uk>.
If I am correct - and that is another if - then you will have to get the root
element of the document from the POIFS object, recover a reference to the
list of directory/document entry objects it manages and iterate through them
to locate the anchor or anchors for the embedded objects. I have done this
for pictures as I knew they had an associated anchor but never for embedded
objects as I am unsure whether they do. The problem we hit when getting at
the image's anchor is that while it did contain the location information it
was then impossible to tie a specific anchor back to one image and I fear
you will encounter this problem again. If I have the chance, I will play with
some code to see what happens but I would like to ask whether you could
upload a sample file for me to work with please?

Yours

Mark B
-- 
View this message in context: http://apache-poi.1045710.n5.nabble.com/how-to-extract-textual-embedded-content-tp3390878p3393750.html
Sent from the POI - User mailing list archive at Nabble.com.


RE: how-to extract textual embedded content

Posted by Fazekas Imre <Im...@it-services.hu>.
Dear Mark,



Thank you for the answer. Yes, it is a pure .xls file I'm working with.
The glass pane example was good, thanks. Anyway, the coordinate of this
floating embedded text file is somehow attached to the cell. If I
resize the cells or scroll the screen, the attached object stays at the
same position relative to the cell. I haven't found any way to get this
coordinate yet... :(




Best regards,

Imre



-----Original Message-----
From: Mark Beardsley [mailto:markbrdsly@tiscali.co.uk] 
Sent: Friday, February 18, 2011 4:59 PM
To: user@poi.apache.org
Subject: RE: how-to extract textual embedded content


No, you were not careless at all. I think - and this is only what I think -
that embedded documents are a little like pictures. By this, I mean that they
are not actually inserted into a cell but they 'float' above the worksheet
and are anchored to it.

To try and explain what I mean, an example is useful. I do not know how
familiar you are with Java's Swing components, those used to create
graphical user interfaces. Each component - a textbox for example - consists
of a series of objects and a couple of these are called panes. One pane is
invisible, lies over the textbox object, glories in the name the glass pane
and you can use it to check whether the user has clicked the mouse cursor
whilst they are within the box for example. Now, imagine that there is a
glass pane positioned above the worksheet and that you can view the rows and
columns through it. Images, and I think embedded documents, are actually
attached to the equivalent of a glass pane and their location expressed in
terms of the cell(s) their corners lie within. Of course, Excel does not
have the glass pane but it serves to explain what I mean by saying that
embedded objects 'float' above the worksheet.

It ought to be possible to get at the information but I am not certain where
it is stored in the file. Also, it will be stored differently for each file
type; the older binary .xls files and the newer OOXML based ones. Did you
mention which file format your application is targeting by the way?

Yours

Mark B
-- 
View this message in context: http://apache-poi.1045710.n5.nabble.com/how-to-extract-textual-embedded-content-tp3390878p3391311.html
Sent from the POI - User mailing list archive at Nabble.com.


RE: how-to extract textual embedded content

Posted by Mark Beardsley <ma...@tiscali.co.uk>.
No, you were not careless at all. I think - and this is only what I think - that embedded
documents are a little like pictures. By this, I mean that they are not
actually inserted into a cell but they 'float' above the worksheet and are
anchored to it. 

To try and explain what I mean, an example is useful. I do not know how
familiar you are with Java's Swing components, those used to create
graphical user interfaces. Each component - a textbox for example - consists
of a series of objects and a couple of these are called panes. One pane is
invisible, lies over the textbox object, glories in the name the glass pane
and you can use it to check whether the user has clicked the mouse cursor
whilst they are within the box for example. Now, imagine that there is a
glass pane positioned above the worksheet and that you can view the rows and
columns through it. Images, and I think embedded documents, are actually
attached to the equivalent of a glass pane and their location expressed in
terms of the cell(s) their corners lie within. Of course, Excel does not
have the glass pane but it serves to explain what I mean by saying that
embedded objects 'float' above the worksheet.
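
To make the idea of an anchor concrete, here is a minimal sketch in plain
Java. It assumes you already have the zero-based column and row indexes of
the cells the object's corners lie within, for example values of the kind
POI's ClientAnchor reports through getCol1(), getRow1(), getCol2() and
getRow2(). The CellAnchor class itself is hypothetical and only turns those
indexes into A1-style cell references.

```java
/** Hypothetical value class: the cells that the corners of a floating object lie within. */
final class CellAnchor {
    final int col1; // zero-based column of the top left corner
    final int row1; // zero-based row of the top left corner
    final int col2; // zero-based column of the bottom right corner
    final int row2; // zero-based row of the bottom right corner

    CellAnchor(int col1, int row1, int col2, int row2) {
        this.col1 = col1;
        this.row1 = row1;
        this.col2 = col2;
        this.row2 = row2;
    }

    /** Converts a zero-based column index to Excel letters (0 -> A, 25 -> Z, 26 -> AA). */
    static String columnLetters(int col) {
        StringBuilder sb = new StringBuilder();
        for (int c = col; c >= 0; c = c / 26 - 1) {
            sb.insert(0, (char) ('A' + c % 26));
        }
        return sb.toString();
    }

    /** A1-style reference of the cell holding the top left corner. */
    String topLeftCell() {
        return columnLetters(col1) + (row1 + 1);
    }

    /** A1-style reference of the cell holding the bottom right corner. */
    String bottomRightCell() {
        return columnLetters(col2) + (row2 + 1);
    }
}
```

With a column index of 3 and a row index of 3 for the top left corner,
topLeftCell() gives D4. The sketch ignores the offsets inside a cell that an
anchor may also carry.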

It ought to be possible to get at the information but I am not certain where
it is stored in the file. Also, it will be stored differently for each file
type; the older binary .xls files and the newer OOXML based ones. Did you
mention which file format your application is targeting by the way?

Yours

Mark B
-- 
View this message in context: http://apache-poi.1045710.n5.nabble.com/how-to-extract-textual-embedded-content-tp3390878p3391311.html
Sent from the POI - User mailing list archive at Nabble.com.


RE: how-to extract textual embedded content

Posted by Fazekas Imre <Im...@it-services.hu>.
Dear Mark,


In my Excel file I see the embedded documents associated with certain
cells: one .txt with cell D4, another one with cell E7.

The application needs this information. Reading the API, I have not
found anything about the cell association of embedded resources. Was I
too careless?




Best regards,

Imre

-----Original Message-----
From: Mark Beardsley [mailto:markbrdsly@tiscali.co.uk] 
Sent: Friday, February 18, 2011 1:15 PM
To: user@poi.apache.org
Subject: RE: how-to extract textual embedded content


Good news! Thanks for letting us know and all the best with your project.

Yours

Mark B
-- 
View this message in context: http://apache-poi.1045710.n5.nabble.com/how-to-extract-textual-embedded-content-tp3390878p3390986.html
Sent from the POI - User mailing list archive at Nabble.com.


RE: how-to extract textual embedded content

Posted by Mark Beardsley <ma...@tiscali.co.uk>.
Good news! Thanks for letting us know and all the best with your project.

Yours

Mark B
-- 
View this message in context: http://apache-poi.1045710.n5.nabble.com/how-to-extract-textual-embedded-content-tp3390878p3390986.html
Sent from the POI - User mailing list archive at Nabble.com.


RE: how-to extract textual embedded content

Posted by Fazekas Imre <Im...@it-services.hu>.
Thank you!

It was a great help, and it is working now! :)


Regards,

Imre

-----Original Message-----
From: Mark Beardsley [mailto:markbrdsly@tiscali.co.uk] 
Sent: Friday, February 18, 2011 12:28 PM
To: user@poi.apache.org
Subject: RE: how-to extract textual embedded content


Would any of this help?

http://poi.apache.org/spreadsheet/quick-guide.html#Embedded

Yours

Mark B
-- 
View this message in context: http://apache-poi.1045710.n5.nabble.com/how-to-extract-textual-embedded-content-tp3390878p3390931.html
Sent from the POI - User mailing list archive at Nabble.com.


RE: how-to extract textual embedded content

Posted by Mark Beardsley <ma...@tiscali.co.uk>.
Would any of this help?

http://poi.apache.org/spreadsheet/quick-guide.html#Embedded

Yours

Mark B
-- 
View this message in context: http://apache-poi.1045710.n5.nabble.com/how-to-extract-textual-embedded-content-tp3390878p3390931.html
Sent from the POI - User mailing list archive at Nabble.com.


RE: how-to extract textual embedded content

Posted by Nick Burch <ni...@alfresco.com>.
On Fri, 18 Feb 2011, Fazekas Imre wrote:
> I found the embedded directories with the "MBD" prefix, but was unable
> to get the content of the file with the available POIDocument classes.

If it isn't an Office file, but is instead a plain text file (or an image, 
pdf etc), then you can't use a POIDocument to read it. You need to use 
POIFS methods yourself.
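
As a rough illustration of what reading it yourself can involve: when Excel
embeds a plain file through its object packager, the MBD directory usually
holds a stream whose name is Ole10Native preceded by the control character
0x01, and that stream wraps the file's bytes. The sketch below unpacks such a
stream once its bytes have been read from POIFS. It uses only the JDK, the
class name Ole10NativeSketch is invented, and the field order is an
assumption based on the layout that POI's own
org.apache.poi.poifs.filesystem.Ole10Native class reads, so treat it as a
starting point rather than a specification.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Hypothetical holder for the pieces of an Ole10Native stream; not a POI class. */
final class Ole10NativeSketch {
    final String label;    // label stored by the packager
    final String fileName; // original file name stored by the packager
    final byte[] data;     // raw bytes of the embedded file

    private Ole10NativeSketch(String label, String fileName, byte[] data) {
        this.label = label;
        this.fileName = fileName;
        this.data = data;
    }

    /** Parses the full contents of the stream, using the assumed field order. */
    static Ole10NativeSketch parse(byte[] stream) {
        ByteBuffer buf = ByteBuffer.wrap(stream).order(ByteOrder.LITTLE_ENDIAN);
        buf.getInt();                          // total size of the remaining fields
        buf.getShort();                        // flags, ignored here
        String label = readCString(buf);
        String fileName = readCString(buf);
        buf.getShort();                        // more flags, ignored
        buf.getShort();                        // unknown field, ignored
        int commandLength = buf.getInt();      // length of the "command" string
        buf.position(buf.position() + commandLength); // skip the command string
        byte[] data = new byte[buf.getInt()];  // payload length, then payload
        buf.get(data);
        return new Ole10NativeSketch(label, fileName, data);
    }

    /** Reads a zero-terminated single-byte string and moves past the terminator. */
    private static String readCString(ByteBuffer buf) {
        int start = buf.position();
        while (buf.get() != 0) {
            // advance to the terminating zero byte
        }
        int length = buf.position() - 1 - start;
        return new String(buf.array(), start, length, StandardCharsets.ISO_8859_1);
    }
}
```

The data field then holds the text file's bytes, which can be written to disk
or decoded with whatever charset the file used. Malformed input is not
handled and would surface as a runtime exception.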

Nick


RE: how-to extract textual embedded content

Posted by Fazekas Imre <Im...@it-services.hu>.
Yes, I did.

I found the embedded directories with the "MBD" prefix, but was unable
to get the content of the file with the available POIDocument classes. 




Regards,

Imre

-----Original Message-----
From: Nick Burch [mailto:nick.burch@alfresco.com] 
Sent: Friday, February 18, 2011 11:52 AM
To: POI Users List
Subject: Re: how-to extract textual embedded content

On Fri, 18 Feb 2011, Fazekas Imre wrote:
> I tried to use the POI library to extract textual document from an excel
> file. It is not a word document, nor an excel file or image, simple text
> files embedded into an excel document.

Have you read http://poi.apache.org/poifs/embeded.html ?

Nick


Re: how-to extract textual embedded content

Posted by Nick Burch <ni...@alfresco.com>.
On Fri, 18 Feb 2011, Fazekas Imre wrote:
> I tried to use the POI library to extract textual document from an excel 
> file. It is not a word document, nor an excel file or image, simple text 
> files embedded into an excel document.

Have you read http://poi.apache.org/poifs/embeded.html ?

Nick
