Posted to regexp-user@jakarta.apache.org by Cheryl! <ch...@lyrehcworld.com> on 2003/03/24 22:48:52 UTC
nested tags

I need to parse a Word document that was saved as HTML.  Word creates
a lot of junk markup when saving as HTML, and I have managed to strip
most of it out with string replacements.

One remaining item has me baffled: nested span tags.  Word creates a
lot of nested span tags that contain nothing but whitespace.  Removing
them looks easy enough with a regex, but it quickly proves difficult.

For example, I have the following piece of HTML from the document.  
<span style='font-size:11.0pt;mso-bidi-font-size: 
9.0pt;font-weight:normal'>sentence number one.<span 
style=\"mso-spacerun: yes\">  </span>Sentence number 2.<span 
style=\"mso-spacerun: yes\">  </span>Sentence number 3.<span 
style=\"mso-spacerun: yes\">        </span>Sentence Number 4.<span 
style=\"mso-spacerun: yes\">  </span></span>

Notice the nested span tags that have only blank spaces between the
opening and closing tag.  I want these removed so that the final
version looks like this:

<span style='font-size:11.0pt;mso-bidi-font-size: 
9.0pt;font-weight:normal'>sentence number one.  Sentence number 2. 
 Sentence number 3.  Sentence Number 4.</span>

The problem is that when I search for a pattern like this: <span.*?>\\s*?</span>
it matches from the first (outer) opening span tag through the first
closing span tag.  So it matches this:
<span style='font-size:11.0pt;mso-bidi-font-size: 
9.0pt;font-weight:normal'>sentence number one.<span 
style=\"mso-spacerun: yes\">  </span>

If you think about it, that makes sense: I am saying to match any
characters after the word span in the opening tag until there is a
greater-than sign followed by optional whitespace followed by a closing
span tag.  Nothing stops the .*? from running past the first
greater-than sign, so the match can begin at the outer tag.  It is
doing what I am asking it to do, but not what I want it to do.
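
The behavior described above can be reproduced in a small, self-contained sketch.  It uses java.util.regex instead of the Jakarta Regexp RE class from this post (a substitution made only so the example runs without extra libraries; RE may differ in details).  The class name LazyMatchDemo, the helper firstMatch, and the shortened HTML string are made up for illustration.  The sketch suggests that a reluctant .*? only limits how far a match extends, not where it starts, whereas [^>]* cannot step over the end of an opening tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyMatchDemo {
    // Shortened, hypothetical stand-in for the Word HTML quoted in the post.
    static final String HTML =
        "<span style='font-weight:normal'>sentence number one."
        + "<span style=\"mso-spacerun: yes\">  </span>Sentence number 2.</span>";

    // Returns the first substring matched by regex, or null if nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Reluctant .*? still starts at the leftmost "<span", i.e. the outer tag,
        // and runs on until it reaches a '>' followed by whitespace and "</span>".
        System.out.println(firstMatch("<span.*?>\\s*?</span>", HTML));

        // [^>]* cannot cross a '>', so only a span whose opening tag is
        // directly followed by whitespace and "</span>" can match.
        System.out.println(firstMatch("<span[^>]*>\\s*</span>", HTML));
    }
}
```

With this input, the first call returns everything from the outer opening tag through the first closing tag, and the second returns only the inner whitespace-only span.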

It seems that this should be a simple thing to do, so I assume I am
just not familiar enough with the regex syntax.  Any help on this
matter would be hugely appreciated.  Here are some of the other
patterns I have tried.  Some of them might be far-fetched, but I was
trying anything out of desperation.

         RE r = new RE("(<span(.*>\\s*?)</span>)");
         RE r = new RE("(<span(.*?)>\\s*</span>)");
         RE r = new RE("(<span(.*?)(>\\s*?<)/span>)");
         RE r = new RE("(<span.+?>\\s+?</span>)");
         RE r = new RE("(<span[^<span]*>\\s*</span>)");
         RE r = new RE("<span[^>]*(>\\s*<)/span>");
         // no good. pulls out the double span if nested.
         RE r = new RE("<span.*?>(.*?)</span>");
         RE r = new RE("<span.*?>([^abc]*?)</span>");
         // try to force it to not have a nested span
         RE r = new RE("<span.*?[^<span.*>](></span>)");
         // ok, let's just look for the empty span stuff without opening span
         RE r = new RE(">\\s*</span>");
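
For comparison, here is a minimal sketch of the removal step built on the idea behind the sixth pattern in the list above, the negated character class [^>]*.  It is written against java.util.regex rather than the Jakarta Regexp RE class (an assumption made so that it is self-contained; how RE handles the same pattern is not verified here).  The names EmptySpanStripper and stripEmptySpans and the sample input are hypothetical.  It assumes no attribute value contains a literal '>'.

```java
import java.util.regex.Pattern;

class EmptySpanStripper {
    // [^>]* keeps the match inside a single opening tag (and, unlike '.',
    // also matches the line breaks Word puts inside tags).  Group 1 captures
    // the whitespace between the tags so that it can be kept.
    static final Pattern EMPTY_SPAN =
        Pattern.compile("<span[^>]*>(\\s*)</span>");

    // Replaces every whitespace-only span with the whitespace it contained.
    static String stripEmptySpans(String html) {
        return EMPTY_SPAN.matcher(html).replaceAll("$1");
    }

    public static void main(String[] args) {
        // Hypothetical, shortened input modeled on the sample in the post.
        String html = "<span style='font-size:11.0pt'>sentence number one."
            + "<span\nstyle=\"mso-spacerun: yes\">  </span>Sentence number 2."
            + "<span\nstyle=\"mso-spacerun: yes\">  </span></span>";
        System.out.println(stripEmptySpans(html));
    }
}
```

The outer span survives because its opening tag is followed by text rather than whitespace.  Replacing with "$1" keeps the spaces between sentences, including any trailing spaces before the final closing tag; replacing with an empty string instead would drop them.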

Thanks,

Cheryl!