Convert HTML into text

If you want to extract the text content of an HTML document (e.g. get rid of all the HTML and JavaScript), try the following code:
<?php
// $document should contain an HTML document.
// This will remove HTML tags, javascript sections
// and white space. It will also convert some
// common HTML entities to their text equivalent.

$search = array ("'<script[^>]*?>.*?</script>'si",  // Strip out javascript
                 "'<[/!]*?[^<>]*?>'si",             // Strip out HTML tags
                 "'([\r\n])[\s]+'",                 // Strip out white space
                 "'&(quot|#34);'i",                 // Replace HTML entities
                 "'&(amp|#38);'i",
                 "'&(lt|#60);'i",
                 "'&(gt|#62);'i",
                 "'&(nbsp|#160);'i",
                 "'&(iexcl|#161);'i",
                 "'&(cent|#162);'i",
                 "'&(pound|#163);'i",
                 "'&(copy|#169);'i");

$replace = array ("",
                  "",
                  "\\1",
                  "\"",
                  "&",
                  "<",
                  ">",
                  " ",
                  chr(161),
                  chr(162),
                  chr(163),
                  chr(169));

$text = preg_replace($search, $replace, $document);

// Convert any remaining numeric entities (&#NNN;) to characters.
// A callback is used because the /e modifier was removed in PHP 7.0.
$text = preg_replace_callback("'&#(\d+);'",
                              function ($m) { return chr((int) $m[1]); },
                              $text);
?>
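
To show where the document comes from and how the snippet is called, here is a minimal sketch that wraps a shortened version of the pattern list in a function. The function name html_to_text and the sample string are assumptions made for illustration, not part of the original snippet. The sketch also decodes &amp; last so that text such as &amp;lt; is not decoded twice, which is a small deviation from the listing above.

```php
<?php
// Hypothetical helper: the name html_to_text is illustrative only.
function html_to_text($document)
{
    $search = array(
        "'<script[^>]*?>.*?</script>'si", // script blocks, including their contents
        "'<[/!]*?[^<>]*?>'si",            // any remaining tags
        "'([\r\n])[\s]+'",                // a line break followed by white space
        "'&(quot|#34);'i",
        "'&(lt|#60);'i",
        "'&(gt|#62);'i",
        "'&(nbsp|#160);'i",
        "'&(amp|#38);'i",                 // decoded last to avoid double decoding
    );
    $replace = array("", "", "\\1", "\"", "<", ">", " ", "&");

    return preg_replace($search, $replace, $document);
}

// The HTML is an inline string here; it could equally be read from a file
// or URL, for example with file_get_contents().
$document = "<html><head><script>var x = 1;</script></head>"
          . "<body><p>Fish &amp; chips</p></body></html>";

echo html_to_text($document); // Fish & chips
```

Saved as a .php file and run from the command line or through a web server, this prints the plain text of the sample document.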

Rating: 4.0 (score out of 5, no. of ratings: 43)
Comments
Comment by Lorelle on 2004-12-20
Great code, but how do you use it? I assume it goes in a PHP file, but where and how do you tell the script which document to parse? Thanks.

Comment by Reks on 2005-04-26
well... is there any similar ways to remove everything within the part of a document?

Comment by reks on 2005-04-26
Sorry for this second post.. My previous question was: Is there a way to remove everything within the <head> and </head>. Forgot that html tags do not work on such forms :)

Comment by rocky karhe on 2005-06-01
But it's not removing HTML completely. Some tags like are still there. How can I improve it to remove all HTML tags?

Comment by Anupam Sarkar on 2005-06-20
It is very much helpful and quite handy as well

Comment by fj on 2005-07-10
Good copy paste from the php manual

Comment by Chris Gibson on 2005-08-07
Can't you just use the functions strip_tags and htmlspecialchars?

Comment by djp on 2005-08-26
html_entity_decode converts all HTML entities to their text form
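
The built-in route suggested in the two comments above might look roughly like the sketch below. It uses strip_tags and html_entity_decode (htmlspecialchars goes the other way and encodes characters, so it is not used here). The function name html_to_text_builtin and the UTF-8 output encoding are assumptions. Because strip_tags keeps the text inside script and style elements, those blocks are removed with a regular expression first.

```php
<?php
// Hypothetical helper: the name html_to_text_builtin is illustrative only.
function html_to_text_builtin($html)
{
    // strip_tags() removes tags but keeps the text between them,
    // so script and style blocks are dropped first.
    $html = preg_replace("'<(script|style)\\b[^>]*>.*?</\\1>'si", "", $html);

    $text = strip_tags($html);

    // Decode named and numeric entities; UTF-8 output is an assumption.
    return html_entity_decode($text, ENT_QUOTES, 'UTF-8');
}

echo html_to_text_builtin("<p>5 &lt; 6 &amp; 7 &gt; 6</p><script>alert(1);</script>");
// 5 < 6 & 7 > 6
```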

Comment by vimal on 2005-09-05
really thanks its so great and so simple to understand

Comment by Nick on 2006-01-30
The line "'([rn])[s]+'" is incorrect; it should be "'([\r\n])[\s]+'". Also, unless you want "start of heading" characters all over your string, change "\1" to " ". I also use an extra search pattern, "'&[A-Za-z]+;'si", with the replacement " ", which gets rid of all HTML character entities.

Comment by Moti on 2006-10-30
to remove 'si",

Comment by Moti on 2006-10-30
to remove style tag use \\\" \\\"\\\']*?>.*?\\\'si\\\",\\\"

Comment by Moti on 2006-10-30
My posts are not being published as they should. To remove the style tag, just change script to style, that's all.
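
Following the "change script to style" suggestion, and the earlier question about removing everything between <head> and </head>, the extra patterns might be sketched as follows. The variable names and the sample document are assumptions. These patterns would need to run before the generic tag-stripping pattern, since they rely on the tags still being present.

```php
<?php
// Hypothetical extra patterns, modelled on the script pattern in the snippet.
$extra = array(
    "'<style\\b[^>]*>.*?</style>'si", // style blocks, including their contents
    "'<head\\b[^>]*>.*?</head>'si",   // everything between <head> and </head>
);

$document = "<html><head><title>T</title><style>p { color: red; }</style></head>"
          . "<body><p>Hello</p></body></html>";

// A single empty replacement string is applied to every pattern.
$cleaned = preg_replace($extra, "", $document);

echo $cleaned; // <html><body><p>Hello</p></body></html>
```

The \b word boundary keeps the head pattern from also matching a <header> tag.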

Comment by keylogger on 2007-02-20
Thank you, this is just what I was looking for! But I will need to add more tags; I'm building a search engine.
