Posted to regexp-user@jakarta.apache.org by Cheryl! <ch...@lyrehcworld.com> on 2003/03/24 22:48:52 UTC
nested tags
I need to parse a Word doc that was saved as HTML. Word
creates a bunch of junk when saving as HTML, and I have been successful
in getting most of it out through string replacements.
There is one remaining item that has me baffled: nested span tags.
Word seems to create a lot of nested empty span tags. Removing them
seems easy enough with regex, but it soon proves to be difficult.
For example, I have the following piece of HTML from the document.
<span style='font-size:11.0pt;mso-bidi-font-size:
9.0pt;font-weight:normal'>sentence number one.<span
style=\"mso-spacerun: yes\"> </span>Sentence number 2.<span
style=\"mso-spacerun: yes\"> </span>Sentence number 3.<span
style=\"mso-spacerun: yes\"> </span>Sentence Number 4.<span
style=\"mso-spacerun: yes\"> </span></span>
Notice the nested span tags that contain nothing but blank space
between the opening and closing tag. I want these removed so that my
final version looks like this:
<span style='font-size:11.0pt;mso-bidi-font-size:
9.0pt;font-weight:normal'>sentence number one. Sentence number 2.
Sentence number 3. Sentence Number 4.</span>
The problem is that when I search for a pattern like this: <span.*?>\\s*?</span>
it matches from the first opening span tag through the first closing
span tag. So it matches this:
<span style='font-size:11.0pt;mso-bidi-font-size:
9.0pt;font-weight:normal'>sentence number one.<span
style=\"mso-spacerun: yes\"> </span>
If you think about it, that makes sense: I am saying to match any
number of characters after the word span in the opening tag, and to keep
going until there is a greater-than sign followed by optional whitespace
followed by a closing span tag. So it is doing what I am asking it to
do, but not what I want it to do.
It seems that this would be a simple thing to do, so I am assuming that
I am just not familiar enough with the regex commands. Any help on this
matter would be hugely appreciated. Here are some of the other
patterns I have tried. Some of them might be far-fetched, but I was
just trying anything out of desperation.
RE r = new RE("(<span(.*>\\s*?)</span>)");
RE r = new RE("(<span(.*?)>\\s*</span>)");
RE r = new RE("(<span(.*?)(>\\s*?<)/span>)");
RE r = new RE("(<span.+?>\\s+?</span>)");
RE r = new RE("(<span[^<span]*>\\s*</span>)");
RE r = new RE("<span[^>]*(>\\s*<)/span>");
RE r = new RE("<span.*?>(.*?)</span>"); // no good: pulls out the double span if nested
RE r = new RE("<span.*?>([^abc]*?)</span>");
RE r = new RE("<span.*?[^<span.*>](></span>)"); // try to force it to not have a nested span
RE r = new RE(">\\s*</span>"); // ok, let's just look for the empty span stuff without the opening span
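For comparison, here is a hedged sketch of what the cleanup itself might look like. It is written with java.util.regex, not the Jakarta Regexp RE class used in the attempts above, and the names EmptySpanStripper and strip are hypothetical. It assumes that no attribute value inside a span tag contains a greater-than sign, and it replaces each whitespace-only span with a single space.

```java
import java.util.regex.Pattern;

public class EmptySpanStripper {
    // An opening span tag with no '>' inside it, then only whitespace,
    // then the closing tag.
    private static final Pattern EMPTY_SPAN =
            Pattern.compile("<span[^>]*>\\s*</span>");

    // Replaces every whitespace-only span with a single space.
    public static String strip(String html) {
        return EMPTY_SPAN.matcher(html).replaceAll(" ");
    }

    public static void main(String[] args) {
        // Shortened version of the sample HTML from this post.
        String html = "<span style='font-size:11.0pt;font-weight:normal'>sentence number one."
                + "<span style=\"mso-spacerun: yes\"> </span>Sentence number 2."
                + "<span style=\"mso-spacerun: yes\"> </span></span>";
        System.out.println(strip(html));
    }
}
```

Note that a space is left where the last empty span sat, just before the outer closing tag, so the result is close to the desired output shown earlier but would still need that trailing space trimmed.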
Thanks,
Cheryl!