Posted to user@tika.apache.org by Aram Mirzadeh <ar...@hotmail.com> on 2012/09/16 03:16:57 UTC

Parsing an XHTML document and extracting img files, and a div title

Hi,

I am trying to parse the following large XHTML document using Tika 1.2
and Java 6.  I need to go through it and pull out the <div> tags that
contain the title for each graphic, then rename each graphic's file to
the text of its title <div>.

Here is the sample html:

<div class="s8a6d62e8" style="">Top 10 ARP sources in terms of bits.</div>
<div class="sbeea9846" style="">
      <img style="width: 701px; height: 526px; border: 0px" src="Final
Test Report_3.files\Final Test Report_34.Png"></img>
</div>
<div class="s306f0049" style="">Figure 3 - Top Ten ARP MAC Sources</div>
<div class="s12d95b95" style="">
      <a name="Top Ten ARP MAC Destinations"><br></a>
</div>
<div class="s1a75bf07" style="">Top Ten ARP MAC Destinations</div>
<div class="s8a6d62e8" style="">Top 10 ARP destinations in terms of
bits.</div>
<div class="sbeea9846" style="">
      <img style="width: 701px; height: 526px; border: 0px" src="Final
Test Report_3.files\Final Test Report_35.Png"></img>
</div>
<div class="s306f0049" style="">Figure 4 - Top Ten ARP MAC
Destinations</div>
<div class="s1a75bf07" style="">ARP MAC Conversations</div>
<div class="s8a6d62e8" style="">Conversation ring with ARP endpoints and
conversations.</div>
<div class="sbeea9846" style="">
      <img style="width: 701px; height: 526px; border: 0px" src="Final
Test Report_3.files\Final Test Report_36.Png"></img>
</div>
<div class="s306f0049" style="">Figure 5 - ARP MAC Conversations</div>


What I would like to do is rename:

Final Test Report_3.files\Final Test Report_35.Png --> Figure 4 - Top Ten
ARP MAC Destinations.png -- or some other unique name that I can then
pull out.

Any help or sample code would be appreciated.

Thanks.

Aram
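
A minimal sketch of the pairing described above, assuming the markup always
looks like the sample: each <img> is followed by a <div> whose text starts
with "Figure <n> -". This uses plain java.util.regex rather than any Tika
API, so it only illustrates the matching logic. The class name
FigureRenamer and the method planRenames are hypothetical. It also assumes
the line break inside the src attribute comes from mail wrapping, and
collapses it to a single space.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: maps each image path to a file name built from its caption.
public class FigureRenamer {

    // Captures the src attribute value of an <img> tag.
    private static final Pattern IMG = Pattern.compile(
            "<img\\b[^>]*?\\bsrc=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Captures the text of a <div> that starts with "Figure <n> -".
    private static final Pattern CAPTION = Pattern.compile(
            "<div\\b[^>]*>\\s*(Figure\\s+\\d+\\s*-[^<]*)</div>",
            Pattern.CASE_INSENSITIVE);

    // Returns original src -> proposed new file name, in document order.
    public static Map<String, String> planRenames(String html) {
        Map<String, String> renames = new LinkedHashMap<String, String>();
        Matcher img = IMG.matcher(html);
        Matcher caption = CAPTION.matcher(html);
        while (img.find()) {
            // Search for the first caption that appears after this <img>.
            if (!caption.find(img.end())) {
                break; // no caption left, so nothing more to pair
            }
            String src = collapse(img.group(1));
            String title = collapse(caption.group(1));
            // Replace characters that are not allowed in Windows file names.
            String safe = title.replaceAll("[\\\\/:*?\"<>|]", "_");
            renames.put(src, safe + ".png");
        }
        return renames;
    }

    // Collapses runs of whitespace (including wrapped lines) to one space.
    private static String collapse(String s) {
        return s.replaceAll("\\s+", " ").trim();
    }
}
```

The map returned by FigureRenamer.planRenames(html) could then drive the
actual renames, for example with java.io.File.renameTo. One limitation: an
<img> with no caption of its own would be paired with the next caption
further down the document, so real input may need a stricter check.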