HTML cleanup tools?
Category Software Development
Good morning, everyone... I'm wondering if anyone has suggestions on HTML "clean up" tools (for lack of a better name). Let me explain...
We have an application written in PHP and MySQL that has an embedded HTML editor in it. You can cut and paste text from Microsoft Word documents (which is how we get the changes from the user), and the HTML is automagically generated. But when you switch over to the HTML view, you see the most horrendous HTML in the world. Word puts all sorts of MS-specific tags out there, and it makes it a PITA to try and find what you need to update.
Does anyone know of any tools that you can use to paste in HTML code and easily clean it up? I'm trying to remove the MS Word garbage whenever possible, but it's time-consuming...



Comments
Posted by jonvon At 11:35:49 On 31/07/2003 | - Website - |
Posted by Ben Langhinrichs At 10:52:14 On 31/07/2003 | - Website - |
Posted by richard At 12:52:59 On 31/07/2003 | - Website - |
I downloaded the Tidy UI app from the SourceForge site, and it rocks! Free is good, of course. I took the HTML from the worst page I have and dropped it into the editor (didn't even need to read the instructions). Tidy did the cleanup and formatting automatically. Even if it didn't do ANY cleanup, the formatting feature allowed me to find what I was looking for much quicker than my prior hunting method...
Cool stuff, everyone, and thanks for the help!
Posted by Tom Duff At 13:42:13 On 31/07/2003 | - Website - |
The template feature is a huge time and error saver, but having to download all the resulting HTML files back from the server, then manually open each one and run the cleanup, is a daunting task.
If anyone has an idea or knows of a Dreamweaver extension that can do on-the-fly comment cleanup when doing a "Put", I'd appreciate a line or two.
Posted by Catalin At 14:17:52 On 24/09/2005 | - Website - |
http://www.interactivetools.com/products/htmlarea/index.html#demo
I've downloaded the free example, and have been able to copy-paste the webpage itself, hit the <> button and view the raw W3C-compliant HTML. It's worth a shot anyway...
(They should start paying me advert kickbacks.)
-Chris
Posted by Chris Toohey At 09:13:17 On 31/07/2003 | - Website - |
<?php
function decraper($html, $delstyles=false) {
$whitespace = array("\t","\n","\r");
$spaces = array('&nbsp;', '  '); // non-breaking-space entities and double spaces
$html = str_replace($whitespace, '', $html);
for ($t = 1; $t <= 5; $t++) {
$html = (str_replace($spaces, ' ', $html));
}
$commoncrap = array('&quot;' // the only entry with a replacement (a single quote); the rest are deleted
,'font-weight: normal;'
,'font-style: normal;'
,'line-height: normal;'
,'font-size-adjust: none;'
,'font-stretch: normal;'); // If it is so normal, why do they bother?
$replace = array("'");
$html = str_replace($commoncrap, $replace, $html);
$patterns = array();
$replacements = array();
$patterns[0] = '/(<table\s.*)(width=)(\d+%)(\D)/i'; # Fix unquoted non-alphanumeric characters in table tags
$patterns[1] = '/(<td\s.*)(width=)(\d+%)(\D)/i';
$patterns[2] = '/(<th\s.*)(width=)(\d+%)(\D)/i';
$patterns[3] = '/<td( colspan="[0-9]+")?( rowspan="[0-9]+")?( width="[0-9]+")?( height="[0-9]+")?.*?>/i';
$patterns[4] = '/<tr.*?>/i';
$patterns[5] = '/<\/st1:address>(<\/st1:\w*>)?<\/p>[\n\r\s]*<p[\s\w="\']*>/i';
$patterns[6] = '/<o:p.*?>/i';
$patterns[7] = '/<\/o:p>/i';
$patterns[8] = '/<o:SmartTagType[^>]*>/i';
$patterns[9] = '/<st1:[\w\s"=]*>/i';
$patterns[10] = '/<\/st1:\w*>/i';
$patterns[11] = '/<p class="MsoNormal"[^>]*>(.*?)<\/p>/i';
$patterns[12] = '/ style="margin-top: 0cm;"/i';
$patterns[13] = '/<(\w[^>]*) class=([^ |>]*)([^>]*)/i';
$patterns[14] = '/<ul(.*?)>/i';
$patterns[15] = '/<ol(.*?)>/i';
$patterns[17] = '/<br \/> <br \/>/i';
$patterns[18] = '/ <br \/>/i';
$patterns[19] = '/<!-.*?>/';
$patterns[20] = '/\s*style=(""|\'\')/';
$patterns[21] = '/ style=[\'"]tab-interval:[^\'"]*[\'"]/i';
$patterns[22] = '/behavior:[^;\'"]*;*(\n|\r)*/i';
$patterns[23] = '/mso-[^:]*:"[^"]*";/i';
$patterns[24] = '/mso-[^;\'"]*;*(\n|\r)*/i';
$patterns[25] = '/\s*font-family:[^;"]*;?/i';
$patterns[26] = '/margin[^"\';]*;?/i';
$patterns[27] = '/text-indent[^"\';]*;?/i';
$patterns[28] = '/tab-stops:[^\'";]*;?/i';
$patterns[29] = '/border-color: *([^;\'"]*)/i';
$patterns[30] = '/border-collapse: *([^;\'"]*)/i';
$patterns[31] = '/page-break-before: *([^;\'"]*)/i';
$patterns[32] = '/font-variant: *([^;\'"]*)/i';
$patterns[33] = '/<span [^>]*><br \/><\/span><br \/>/i';
$patterns[34] = '/" "/';
$patterns[35] = '/[\t\r\n]/';
$patterns[36] = '/\s\s/s';
$patterns[37] = '/ style=""/';
$patterns[38] = '/<span>(.*?)<\/span>/i';
$patterns[39] = '/<span>(.*?)<\/span>/i';//twice, nested spans
$patterns[40] = '/(;\s|\s;)/';
$patterns[41] = '/;;/';
$patterns[42] = '/";/';
$patterns[43] = '/<li(.*?)>/i';
$replacements[0] = '$1$2"$3"$4';
$replacements[1] = '$1$2"$3"$4';
$replacements[2] = '$1$2"$3"$4';
$replacements[3] = '<td$1$2$3$4>';
$replacements[4] = '<tr>';
$replacements[5] = '<br />';
$replacements[6] = '';
$replacements[7] = '<br />';
$replacements[8] = '';
$replacements[9] = '';
$replacements[10] = '';
$replacements[11] = '$1<br />';
$replacements[12] = '';
$replacements[13] = '<$1$3';
$replacements[14] = '<ul>';
$replacements[15] = '<ol>';
$replacements[17] = '<br />';
$replacements[18] = '<br />';
$replacements[19] = '';
$replacements[20] = '';
$replacements[21] = '';
$replacements[22] = '';
$replacements[23] = '';
$replacements[24] = '';
$replacements[25] = '';
$replacements[26] = '';
$replacements[27] = '';
$replacements[28] = '';
$replacements[29] = '';
$replacements[30] = '';
$replacements[31] = '';
$replacements[32] = '';
$replacements[33] = '<br />';
$replacements[34] = '""';
$replacements[35] = '';
$replacements[36] = '';
$replacements[37] = '';
$replacements[38] = '$1';
$replacements[39] = '$1';
$replacements[40] = ';';
$replacements[41] = ';';
$replacements[42] = '"';
$replacements[43] = '<li>';
if($delstyles===true){
$patterns[44] = '/ style=".*?"/';
$replacements[44] = '';
}
ksort($patterns);
ksort($replacements);
$html = preg_replace($patterns, $replacements, $html);
for ($t=1;$t<=3;$t++) {
$html = (str_replace($spaces, ' ', $html));
}
return $html;
}
Or via JavaScript:
function demoroniser(html) {
html = html.replace(/\r\n/g, ' ').
replace(/\n/g, ' ').
replace(/\r/g, ' ').
replace(/&nbsp;/g,' ');
html = html.replace(/ class=[^\s|>]*/gi,'').
//replace(/<p [^>]*TEXT-ALIGN: justify[^>]*>/gi,'<p align="justify">').
replace(/ style=\"[^>]*\"/gi,'').
replace(/ align=[^\s|>]*/gi,'');
html = html.replace(/<b [^>]*>/gi,'<b>').
replace(/<i [^>]*>/gi,'<i>').
replace(/<li [^>]*>/gi,'<li>').
replace(/<ul [^>]*>/gi,'<ul>');
html = html.replace(/<b>/gi,'<strong>').
replace(/<\/b>/gi,'</strong>');
html = html.replace(/<em>/gi,'<i>').
replace(/<\/em>/gi,'</i>');
html = html.replace(/<\?xml:[^>]*>/g, '').
replace(/<\/?st1:[^>]*>/g,'').
replace(/<\/?[a-z]\:[^>]*>/g,'').
replace(/<\/?font[^>]*>/gi,'').
replace(/<\/?span[^>]*>/gi,' ').
replace(/<\/?div[^>]*>/gi,' ').
replace(/<\/?pre[^>]*>/gi,' ').
replace(/<\/?h[1-6][^>]*>/gi,' ');
var oldlen = html.length + 1; // declared locally to avoid leaking a global
while(oldlen > html.length) {
oldlen = html.length;
html = html.replace(/<([a-z][a-z]*)> *<\/\1>/gi,' ').
replace(/<([a-z][a-z]*)> *<([a-z][^>]*)> *<\/\1>/gi,'<$2>');
}
html = html.replace(/<([a-z][a-z]*)><\1>/gi,'<$1>').
replace(/<\/([a-z][a-z]*)><\/\1>/gi,'<\/$1>');
html = html.replace(/ {2,}/g,' '); // collapse runs of spaces; a bare / */ also matches the empty string
return html;
}
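For anyone who only wants the Word-specific markup gone and the rest of the HTML left alone, here is a smaller sketch along the same lines. It is not either of the functions above: the name stripWordMarkup, the list of namespace prefixes (o, st1, w, v) and the sample string are assumptions made for illustration. Being regex-based, it would also touch text content that happens to look like an mso- declaration, so treat it as a starting point rather than a robust cleaner.

```javascript
// Hypothetical helper (not from the thread): removes only Word-specific markup.
function stripWordMarkup(html) {
  return html
    // Drop namespaced Office tags such as <o:p>, </o:p>, <st1:place>.
    .replace(/<\/?(?:o|st1|w|v):[^>]*>/gi, '')
    // Drop mso-* declarations, plus any whitespace that follows them.
    .replace(/mso-[^:;"']+:[^;"']*;?\s*/gi, '')
    // Drop style attributes that are now empty.
    .replace(/\s+style=(["'])\s*\1/gi, '')
    // Drop Word's own class names (MsoNormal, MsoListParagraph, ...).
    .replace(/\s+class=(["']?)Mso\w+\1/gi, '')
    // Collapse runs of whitespace left behind.
    .replace(/\s{2,}/g, ' ');
}

// Made-up sample of the kind of markup Word produces.
var sample = '<p class="MsoNormal" style="mso-margin-top-alt: auto; color: red;">' +
  'Hello<o:p></o:p> <st1:place>Ohio</st1:place></p>';

console.log(stripWordMarkup(sample));
// prints: <p style="color: red;">Hello Ohio</p>
```

Unlike the two functions above, this keeps ordinary style declarations and tags such as span and div, which may matter if the editor's output relies on them.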
Posted by HF At 01:09:25 On 05/01/2006 | - Website - |
VX
Posted by VX At 23:31:37 On 24/09/2006 | - Website - |
Posted by John Roling ("Greyhawk68") At 12:08:51 On 01/08/2003 | - Website - |