Monday, 18 December 2017

RegEx match open tags except XHTML self-contained tags



While the answers saying that you can't parse HTML with regexes are correct, they don't apply here. The OP just wants to match a single HTML tag, and that is something that can be done with a regular expression.
The suggested regex is wrong, though:
<([a-z]+) *[^/]*?>


If you add something to the regex, backtracking can force it to match silly things like <a >>, because [^/] is too permissive. Also note that <space>*[^/]* is redundant, because [^/]* can also match spaces.
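To make both points concrete, here is a small Python sketch. It is only an illustration: the names ORIGINAL, WITHOUT_SPACE and full are made up for this example, and requiring a whole-string match with fullmatch is my stand-in for "adding something to the regex".

```python
import re

# The regex quoted above, unchanged.
ORIGINAL = re.compile(r"<([a-z]+) *[^/]*?>")
# Hypothetical variant with the " *" dropped, to illustrate the redundancy.
WITHOUT_SPACE = re.compile(r"<([a-z]+)[^/]*?>")


def full(pattern, text):
    """Return True if the pattern matches the entire string."""
    return pattern.fullmatch(text) is not None


# Once the whole string has to match, the lazy [^/]*? is forced to
# swallow the first ">" and the "silly" input is accepted.
print(full(ORIGINAL, "<a >>"))

# A self-closing tag is still rejected, because [^/] cannot cross the "/".
print(full(ORIGINAL, "<br />"))

# Dropping " *" changes nothing for these inputs: [^/]* matches spaces too.
for text in ["<a >>", "<br />", '<a href="x">', "<p>"]:
    print(text, full(ORIGINAL, text) == full(WITHOUT_SPACE, text))
```

The loop only spot-checks a few inputs, so it illustrates the redundancy rather than proving it.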
My suggestion would be


<([a-z]+)[^>]*(?<!/)>
Where (?<! ... ) is (in Perl regexes) the negative look-behind. It reads "a <, then a word, then anything that's not a >, the last of which may not be a /, followed by >".
Note that this allows things like <a/ > (just like the original regex), so if you want something more restrictive, you need to build a regex to match attribute pairs separated by spaces.
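As a quick check of the suggested regex, here is a minimal Python sketch (Python's re module supports the same negative look-behind syntax). The names OPEN_TAG and open_tag_names and the sample string are hypothetical, chosen only for this illustration.

```python
import re

# The suggested regex: "<", a lowercase word, anything that is not ">",
# and the character right before the closing ">" must not be "/".
OPEN_TAG = re.compile(r"<([a-z]+)[^>]*(?<!/)>")


def open_tag_names(html):
    """Return the tag names of all matches found in the string."""
    return [m.group(1) for m in OPEN_TAG.finditer(html)]


sample = '<p>text <a href="http://example.com/x">link</a><br /><hr/></p>'

# Opening tags are found; closing and self-closing tags are skipped.
print(open_tag_names(sample))  # ['p', 'a']

# The caveat from the note above: "<a/ >" is still accepted, because the
# character just before ">" is a space, not "/".
print(OPEN_TAG.fullmatch("<a/ >") is not None)  # True

# Unlike the original regex, "<a >>" can no longer be matched as a whole.
print(OPEN_TAG.fullmatch("<a >>") is not None)  # False
```

Slashes inside attribute values, as in the href above, are fine here because only the character immediately before the closing > is checked.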




