About Duffbert...

Duffbert's Random Musings is a blog where I talk about whatever happens to be running through my head at any given moment... I'm Thomas Duff, and you can find out more about me here...




HTML cleanup tools?

Category: Software Development

Good morning, everyone...  I'm wondering if anyone has suggestions on HTML "clean up" tools (for lack of a better name).  Let me explain...

We have an application written in PHP and MySQL that has an embedded HTML editor in it.  You can cut and paste text from Microsoft Word documents (which is how we get the changes from the user), and the HTML is automagically generated.  But when you switch over to the HTML view, you see the most horrendous HTML in the world.  Word puts all sorts of MS-specific tags out there, and it makes it a PITA to try and find what you need to update.

Does anyone know of any tools that you can use to paste in HTML code and easily clean it up?  I'm trying to remove the MS Word garbage whenever possible, but it's time-consuming...
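To make the problem concrete, here is a rough JavaScript sketch of the kind of markup a Word paste can produce and a few regex passes that strip the Word-specific parts while keeping ordinary styles. The sample fragment and the function name `stripWordMarkup` are made up for illustration; real Word output varies a lot, so treat this as a starting point under those assumptions, not a complete cleaner.

```javascript
// Hypothetical sample of what a paste from Word can look like.
var pasted =
  '<p class="MsoNormal" style="mso-margin-top-alt: auto; color: red;">' +
  'Hello<o:p></o:p></p>';

// Illustrative helper (not from any library): strips Word-only markup.
function stripWordMarkup(html) {
  return html
    // Remove namespaced Office tags such as <o:p> and </o:p>.
    .replace(/<\/?[a-z][a-z0-9]*:[^>]*>/gi, '')
    // Remove Word's Mso* class attributes.
    .replace(/\s+class="?Mso[^\s">]*"?/gi, '')
    // Remove mso-* declarations but keep the rest of the style.
    .replace(/mso-[^:;"']+:[^;"']*;?\s*/gi, '')
    // Drop style attributes that ended up empty.
    .replace(/\s+style="\s*"/gi, '');
}

console.log(stripWordMarkup(pasted));
// -> <p style="color: red;">Hello</p>
```

Unlike a blanket "delete every style attribute" pass, this keeps declarations such as `color: red;` that the user may have meant to apply.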

Comments

1 - One more option: you might try HomeSite's "CodeSweeper" function if you have that around (it comes with Dreamweaver). You can tell it how you want the HTML to look, for example whether to indent in front of one tag but not another, or whether tags nested inside one another get indented. Pretty cool little tool.

2 - The HTML Tidy tool from the W3C site does a good job and cleans up the Word tags very well.

3 - Macromedia Dreamweaver can also quickly clean up Word HTML.

4 - So far, the winner seems to be Tidy. The HTMLArea process is very similar to what I use now. When I paste in the Word code, it keeps all the garbage MS-Word HTML there. The Dreamweaver stuff appears to be interesting, but I don't have that package.

I downloaded the Tidy UI app from the SourceForge site, and it rocks! Free is good, of course. I took the HTML from the worst page I have and dropped it into the editor (didn't even need to read the instructions). Tidy did the cleanup and formatting automatically. Even if it didn't do ANY cleanup, the formatting feature allowed me to find what I was looking for much quicker than my prior hunting method...

Cool stuff, everyone, and thanks for the help!

5 - I was wondering if anyone knows how to make Dreamweaver auto-clean its template comments when doing a "Put" on the server?

The template feature is a huge time and error saver, but having to download all the resulting HTML files back from the server, then manually open each one and run the cleanup, is a daunting task.

If anyone has an idea or knows of a Dreamweaver extension that can do on-the-fly comment cleanup when doing a "Put", I'd appreciate a line or two.
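One possible workaround for the question above, sketched in JavaScript under some assumptions: rather than hooking into Dreamweaver's "Put", run a small post-processing step over a copy of the generated HTML before it is uploaded. The function name `stripTemplateComments` is hypothetical, and the markers it looks for (`InstanceBegin`, `InstanceBeginEditable`, `TemplateBeginEditable`, `#BeginEditable`, their `End` counterparts and the `Param` variants) are assumed to be the ones your templates write, so check them against your own files. Stripping them detaches a page from its template, which is why this belongs on the upload copy only.

```javascript
// Illustrative helper (not a Dreamweaver API): removes template marker
// comments while leaving ordinary HTML comments alone.
function stripTemplateComments(html) {
  var markers = /<!--\s*(?:#(?:Begin|End)(?:Editable|Template)|(?:Instance|Template)(?:Begin|End|Param)\w*)\b[^>]*-->/g;
  return html.replace(markers, '');
}

// Hypothetical page fragment for demonstration.
var page =
  '<html><!-- InstanceBegin template="/Templates/main.dwt" -->' +
  '<body><!-- InstanceBeginEditable name="content" --><p>Hi</p>' +
  '<!-- InstanceEndEditable --><!-- keep me --></body>' +
  '<!-- InstanceEnd --></html>';

console.log(stripTemplateComments(page));
// -> <html><body><p>Hi</p><!-- keep me --></body></html>
```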

6 - (I think I'm starting to sound like a one-trick pony here) but have you tried copy/pasting the actual page into the htmlarea editor?
http://www.interactivetools.com/products/htmlarea/index.html#demo
I've downloaded the free example, and have been able to copy-paste the webpage itself, hit the <> button and view the raw W3C-compliant HTML. It's worth a shot anyway...

(They should start paying me advert kickbacks)

-Chris

7 - Via PHP, rough but it works:
<?php
// Strips Word-specific markup from pasted HTML.
// Pass $delstyles = true to remove every style attribute as well.
function decraper($html, $delstyles=false) {
    $whitespace = array("\t","\n","\r");
    // Runs of spaces, longest first; each is collapsed to a single space below.
    $spaces = array(str_repeat(' ', 5), str_repeat(' ', 4), str_repeat(' ', 3), str_repeat(' ', 2));
    $html = str_replace($whitespace, '', $html);
    for ($t = 1; $t <= 5; $t++) {
        $html = str_replace($spaces, ' ', $html);
    }
    // &quot; becomes an apostrophe; the remaining entries have no matching
    // replacement, so str_replace() swaps them for an empty string.
    $commoncrap = array('&quot;'
        ,'font-weight: normal;'
        ,'font-style: normal;'
        ,'line-height: normal;'
        ,'font-size-adjust: none;'
        ,'font-stretch: normal;'); //If it is so normal, why they bother?
    $replace = array("'");
    $html = str_replace($commoncrap, $replace, $html);
    // $patterns[n] is replaced by $replacements[n]; index 16 is unused.
    $patterns = array();
    $replacements = array();
    $patterns[0] = '/(<table\s.*)(width=)(\d+%)(\D)/i'; # Fix unquoted non-alphanumeric characters in table tags
    $patterns[1] = '/(<td\s.*)(width=)(\d+%)(\D)/i';
    $patterns[2] = '/(<th\s.*)(width=)(\d+%)(\D)/i';
    $patterns[3] = '/<td( colspan="[0-9]+")?( rowspan="[0-9]+")?( width="[0-9]+")?( height="[0-9]+")?.*?>/i';
    $patterns[4] = '/<tr.*?>/i';
    $patterns[5] = '/<\/st1:address>(<\/st1:\w*>)?<\/p>[\n\r\s]*<p[\s\w="\']*>/i';
    $patterns[6] = '/<o:p.*?>/i';
    $patterns[7] = '/<\/o:p>/i';
    $patterns[8] = '/<o:SmartTagType[^>]*>/i';
    $patterns[9] = '/<st1:[\w\s"=]*>/i';
    $patterns[10] = '/<\/st1:\w*>/i';
    $patterns[11] = '/<p class="MsoNormal"[^>]*>(.*?)<\/p>/i';
    $patterns[12] = '/ style="margin-top: 0cm;"/i';
    $patterns[13] = '/<(\w[^>]*) class=([^ |>]*)([^>]*)/i';
    $patterns[14] = '/<ul(.*?)>/i';
    $patterns[15] = '/<ol(.*?)>/i';
    $patterns[17] = '/<br \/>&nbsp;<br \/>/i';
    $patterns[18] = '/&nbsp;<br \/>/i';
    $patterns[19] = '/<!-.*?>/';
    $patterns[20] = '/\s*style=(""|\'\')/';
    $patterns[21] = '/ style=[\'"]tab-interval:[^\'"]*[\'"]/i';
    $patterns[22] = '/behavior:[^;\'"]*;*(\n|\r)*/i';
    $patterns[23] = '/mso-[^:]*:"[^"]*";/i';
    $patterns[24] = '/mso-[^;\'"]*;*(\n|\r)*/i';
    $patterns[25] = '/\s*font-family:[^;"]*;?/i';
    $patterns[26] = '/margin[^"\';]*;?/i';
    $patterns[27] = '/text-indent[^"\';]*;?/i';
    $patterns[28] = '/tab-stops:[^\'";]*;?/i';
    $patterns[29] = '/border-color: *([^;\'"]*)/i';
    $patterns[30] = '/border-collapse: *([^;\'"]*)/i';
    $patterns[31] = '/page-break-before: *([^;\'"]*)/i';
    $patterns[32] = '/font-variant: *([^;\'"]*)/i';
    $patterns[33] = '/<span [^>]*><br \/><\/span><br \/>/i';
    $patterns[34] = '/" "/';
    $patterns[35] = '/[\t\r\n]/';
    $patterns[36] = '/\s\s/s';
    $patterns[37] = '/ style=""/';
    $patterns[38] = '/<span>(.*?)<\/span>/i';
    $patterns[39] = '/<span>(.*?)<\/span>/i';//twice, nested spans
    $patterns[40] = '/(;\s|\s;)/';
    $patterns[41] = '/;;/';
    $patterns[42] = '/";/';
    $patterns[43] = '/<li(.*?)>/i';
    $replacements[0] = '$1$2"$3"$4';
    $replacements[1] = '$1$2"$3"$4';
    $replacements[2] = '$1$2"$3"$4';
    $replacements[3] = '<td$1$2$3$4>';
    $replacements[4] = '<tr>';
    $replacements[5] = '<br />';
    $replacements[6] = '';
    $replacements[7] = '<br />';
    $replacements[8] = '';
    $replacements[9] = '';
    $replacements[10] = '';
    $replacements[11] = '$1<br />';
    $replacements[12] = '';
    $replacements[13] = '<$1$3';
    $replacements[14] = '<ul>';
    $replacements[15] = '<ol>';
    $replacements[17] = '<br />';
    $replacements[18] = '<br />';
    $replacements[19] = '';
    $replacements[20] = '';
    $replacements[21] = '';
    $replacements[22] = '';
    $replacements[23] = '';
    $replacements[24] = '';
    $replacements[25] = '';
    $replacements[26] = '';
    $replacements[27] = '';
    $replacements[28] = '';
    $replacements[29] = '';
    $replacements[30] = '';
    $replacements[31] = '';
    $replacements[32] = '';
    $replacements[33] = '<br />';
    $replacements[34] = '""';
    $replacements[35] = '';
    $replacements[36] = '';
    $replacements[37] = '';
    $replacements[38] = '$1';
    $replacements[39] = '$1';
    $replacements[40] = ';';
    $replacements[41] = ';';
    $replacements[42] = '"';
    $replacements[43] = '<li>';
    if ($delstyles === true) {
        $patterns[44] = '/ style=".*?"/';
        $replacements[44] = '';
    }
    // Sort by key so each pattern lines up with its replacement.
    ksort($patterns);
    ksort($replacements);
    $html = preg_replace($patterns, $replacements, $html);
    // Collapse any runs of spaces left behind by the replacements.
    for ($t = 1; $t <= 3; $t++) {
        $html = str_replace($spaces, ' ', $html);
    }
    return $html;
}
Or via JavaScript:
function demoroniser(html) {
    // Flatten line breaks and non-breaking spaces to plain spaces.
    html = html.replace(/\r\n/g, ' ').
        replace(/\n/g, ' ').
        replace(/\r/g, ' ').
        replace(/\&nbsp\;/g,' ');
    // Drop class, style and align attributes.
    html = html.replace(/ class=[^\s|>]*/gi,'').
        //replace(/<p [^>]*TEXT-ALIGN: justify[^>]*>/gi,'<p align="justify">').
        replace(/ style=\"[^>]*\"/gi,'').
        replace(/ align=[^\s|>]*/gi,'');
    // Strip any remaining attributes from b, i, li and ul tags.
    html = html.replace(/<b [^>]*>/gi,'<b>').
        replace(/<i [^>]*>/gi,'<i>').
        replace(/<li [^>]*>/gi,'<li>').
        replace(/<ul [^>]*>/gi,'<ul>');
    html = html.replace(/<b>/gi,'<strong>').
        replace(/<\/b>/gi,'</strong>');
    html = html.replace(/<em>/gi,'<i>').
        replace(/<\/em>/gi,'</i>');
    // Remove Office namespace tags and font tags; replace span, div, pre
    // and heading tags with a space.
    html = html.replace(/<\?xml:[^>]*>/g, '').
        replace(/<\/?st1:[^>]*>/g,'').
        replace(/<\/?[a-z]\:[^>]*>/g,'').
        replace(/<\/?font[^>]*>/gi,'').
        replace(/<\/?span[^>]*>/gi,' ').
        replace(/<\/?div[^>]*>/gi,' ').
        replace(/<\/?pre[^>]*>/gi,' ').
        replace(/<\/?h[1-6][^>]*>/gi,' ');
    // Keep removing empty elements until the string stops shrinking.
    var oldlen = html.length + 1;
    while (oldlen > html.length) {
        oldlen = html.length;
        html = html.replace(/<([a-z][a-z]*)> *<\/\1>/gi,' ').
            replace(/<([a-z][a-z]*)> *<([a-z][^>]*)> *<\/\1>/gi,'<$2>');
    }
    // Merge doubled opening and closing tags.
    html = html.replace(/<([a-z][a-z]*)><\1>/gi,'<$1>').
        replace(/<\/([a-z][a-z]*)><\/\1>/gi,'<\/$1>');
    // Collapse runs of spaces into one.
    html = html.replace(/ {2,}/g,' ');
    return html;
}
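
The `while` loop in `demoroniser` deserves a closer look: removing one empty element can leave its parent empty, so the replacement is repeated until the string stops getting shorter. A minimal standalone sketch of that idea follows. The name `removeEmptyElements` is made up for illustration, it replaces with an empty string where the function above inserts a space, and like the original pattern it only handles tags without attributes.

```javascript
// Illustrative helper: repeatedly deletes attribute-free elements that
// contain nothing but spaces, until a pass makes no further change.
function removeEmptyElements(html) {
  var previousLength;
  do {
    previousLength = html.length;
    html = html.replace(/<([a-z][a-z0-9]*)> *<\/\1>/gi, '');
  } while (html.length < previousLength);
  return html;
}

console.log(removeEmptyElements('<p><b><i> </i></b></p><p>text</p>'));
// -> <p>text</p>
```

A single pass would only remove the innermost `<i> </i>` here; the outer `<b>` and `<p>` become empty, and so removable, only on the later passes.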

8 - Great PHP function for cleaning up Microsoft Word HTML from copy and paste.

VX

9 - I know I'm late to the game here, but Dreamweaver is the best I've seen at cleaning up Microsoft code. They actually have that as a feature built in. Very cool. HTML Tidy is nice too, and much cheaper than Dreamweaver... However, if you want the BEST HTML editor ever created... Dreamweaver is IT.
