WordPress: Output Clean and Valid HTML Content


Clients generate a lot of garbage HTML in post content: tags such as div, p and a that have not been opened or closed properly. This makes the website fail W3C validation, which, among other things, could affect the site’s ranking and its accessibility. Of course, asking clients to switch to the HTML view to clean up the content is out of the question.

Clean The Content Function: Download

After some research I found htmLawed, a PHP library to purify and filter HTML that works incredibly well. I wrote a small function that hooks it into WordPress so the content coming from each post is filtered automatically. The results are amazing: htmLawed cleans and filters all the bad content and makes the pages validate again right away.

	// htmLawed.php must reside in the theme folder.
	include_once( TEMPLATEPATH . '/htmLawed.php' );

	function clean_the_content( $content )
	{
		// Strip empty <p>, <a> and <span> elements and all <font> tags.
		// \s? allows one optional whitespace character between the tags.
		$szRemoveFilter = array( '~<p[^>]*>\s?</p>~', '~<a[^>]*>\s?</a>~', '~<font[^>]*>~', '~</font>~', '~<span[^>]*>\s?</span>~' );
		$szPostContent = preg_replace( $szRemoveFilter, '', $content );

		// Let htmLawed purify and filter the remaining markup.
		return htmLawed( $szPostContent );
	}
	// Run the cleaner on the post content before it is output.
	add_filter( 'the_content', 'clean_the_content' );
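
As a rough illustration of the regular-expression step alone, the sketch below runs the same kind of patterns outside WordPress and without htmLawed. The helper name `strip_empty_tags` and the sample markup are made up for this example, and the patterns assume `\s?` (one optional whitespace character) between the opening and closing tags.

```php
<?php
// Hypothetical helper: only the regex pre-filter, no htmLawed and no WordPress.
function strip_empty_tags( $html )
{
	$patterns = array(
		'~<p[^>]*>\s?</p>~',       // empty paragraphs
		'~<a[^>]*>\s?</a>~',       // empty links
		'~<font[^>]*>~',           // opening font tags (content is kept)
		'~</font>~',               // closing font tags
		'~<span[^>]*>\s?</span>~', // empty spans
	);
	// With arrays, preg_replace applies each pattern in turn.
	return preg_replace( $patterns, '', $html );
}

// Made-up sample of the kind of markup a visual editor can leave behind.
$dirty = '<p></p><p>Hello <font color="red">world</font><span> </span></p><a href="#"></a>';

echo strip_empty_tags( $dirty ), "\n"; // prints: <p>Hello world</p>
```

Note that this step only deletes empty or unwanted elements. Unclosed or badly nested tags are left for htmLawed to deal with.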

To use this function, simply drop the two files into your theme directory and you are set to go! Cheers!

Original code is licensed under the GNU GPLv2 license.

Reactions (13)

  1. Pingback CSS Brigit | WordPress: Output Clean and Valid HTML Content

  2. Nice idea. I find it quite funny how the default theme doesn’t validate yet they put a link to the validator in the blogroll :)

  3. Pingback WordPress: Output Clean and Valid HTML Code | ScriptRemix.com Scripts

  4. Pingback WordPress: Output Clean and Valid HTML Code

  5. Hi, nice solution guys. This is something that really annoys me with content management systems: you work on a site perfecting every detail, then you hand it over to the client and they wreck it with rubbish content. Obviously part of the designer’s brief is to ensure an end-to-end solution, but sometimes it doesn’t work out that way, so this could be a real time saver when educating the client.

    Don’t suppose anyone has found a similar solution for Drupal? Drupal is even worse than WordPress for this sort of problem.

    Nice work !

  6. FoO Iskandar

    Wow, thanks for the solution … this really helps me :D

  7. @Shane,

    it’s not only the clients who wreck the valid code; unfortunately a lot of modules do that as well, and there this nifty tool won’t help…

    Must I name Joomla!? Those thousands of shitty modules with spelling errors, code mistakes and wrong character sets? ;=(

  8. Pingback Get valid HTML code in all your posts | Eliseos.net

  9. Great post. But most of the time I see errors for ampersands in URLs. Is it possible to also replace these unencoded ampersands with this code?

    Thanks,
    Pierre

  10. OMG! You just saved me from a whole night of headaches of rewriting a client’s broken content into WordPress :)