Convert HTML into text

If you want to extract the text content of an HTML document (that is, get rid of all the HTML and JavaScript), try the following code:
<?php
// $document should contain an HTML document.
// This will remove HTML tags, JavaScript sections
// and white space. It will also convert some
// common HTML entities to their text equivalent.

$search = array ("'<script[^>]*?>.*?</script>'si",  // Strip out JavaScript
                 "'<[/!]*?[^<>]*?>'si",             // Strip out HTML tags
                 "'([\r\n])[\s]+'",                 // Strip out white space
                 "'&(quot|#34);'i",                 // Replace HTML entities
                 "'&(amp|#38);'i",
                 "'&(lt|#60);'i",
                 "'&(gt|#62);'i",
                 "'&(nbsp|#160);'i",
                 "'&(iexcl|#161);'i",
                 "'&(cent|#162);'i",
                 "'&(pound|#163);'i",
                 "'&(copy|#169);'i");

$replace = array ("",
                  "",
                  "\\1",
                  "\"",
                  "&",
                  "<",
                  ">",
                  " ",
                  chr(161),
                  chr(162),
                  chr(163),
                  chr(169));

$text = preg_replace($search, $replace, $document);

// Convert remaining numeric entities such as &#65; to characters.
// The /e ("evaluate as php") modifier was removed in PHP 7,
// so a callback is used in its place.
$text = preg_replace_callback("'&#(\d+);'",
                              function ($m) { return chr((int) $m[1]); },
                              $text);
?>
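
For anyone unsure how to call the snippet, here is a minimal sketch that wraps a shortened version of the pattern list in a function and runs it on a small inline document. The function name html_to_text and the sample markup are assumptions made for illustration and are not part of the original tip; in practice $document could just as well come from file_get_contents().

```php
<?php
// html_to_text() is a hypothetical helper name; the original tip is not
// wrapped in a function. Only a subset of the entity patterns is shown.
function html_to_text($document)
{
    $search = array(
        "'<script[^>]*?>.*?</script>'si", // script blocks
        "'<[/!]*?[^<>]*?>'si",            // any remaining tags
        "'([\r\n])[\s]+'",                // line break followed by white space
        "'&(quot|#34);'i",
        "'&(amp|#38);'i",
        "'&(lt|#60);'i",
        "'&(gt|#62);'i",
        "'&(nbsp|#160);'i",
    );
    $replace = array("", "", "\\1", "\"", "&", "<", ">", " ");

    // With array arguments, each pattern is applied in turn to the result
    // of the previous replacement.
    return preg_replace($search, $replace, $document);
}

// Sample input, assumed for illustration only.
$document = "<html><body><script>var x = 1;</script>"
          . "<p>Fish &amp; chips &lt;today&gt;</p></body></html>";

echo html_to_text($document), "\n"; // prints: Fish & chips <today>
```

Save it as a .php file and run it from the command line or through a web server; swapping the sample string for the contents of a real page is the only change needed.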

Rating: 3.7 out of 5 (126 ratings)
Comments
Comment by Lorelle on 2004-12-20
Great code, but how do you use it? As a php file, but where and how do you tell it which document to parse with the script? Thanks.

Comment by Reks on 2005-04-26
Well... is there a similar way to remove everything within the part of a document?

Comment by reks on 2005-04-26
Sorry for this second post.. My previous question was: Is there a way to remove everything within the <head> and </head>. Forgot that html tags do not work on such forms :)
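
One possible approach to the question above, as a sketch rather than a tested recipe: apply one more pattern, modelled on the script pattern from the tip, before stripping the remaining tags. The sample document and variable names are assumptions for illustration.

```php
<?php
// Sample input, assumed for illustration only.
$document = "<html><head><title>Ignore me</title></head>"
          . "<body><p>Keep me</p></body></html>";

// Remove the whole <head>...</head> block. \b stops the pattern from
// also matching tags such as <header>; the s flag lets . span lines.
$withoutHead = preg_replace('~<head\b[^>]*>.*?</head>~si', '', $document);

// Strip whatever tags are left.
$text = strip_tags($withoutHead);

echo $text, "\n"; // prints: Keep me
```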

Comment by rocky karhe on 2005-06-01
But it's not removing HTML completely. Some tags like are still there. How can I improve it to remove all HTML tags?

Comment by Anupam Sarkar on 2005-06-20
It is very much helpful and quite handy as well

Comment by fj on 2005-07-10
Good copy paste from the php manual

Comment by Chris Gibson on 2005-08-07
Can't you just use the functions strip_tags and htmlspecialchars?

Comment by djp on 2005-08-26
html_entity_decode converts all HTML entities to their text form
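
As a rough sketch of the built-in route mentioned in the two comments above: strip_tags() removes the tags and html_entity_decode() turns entities back into characters (htmlspecialchars() goes the other way and encodes them). Because strip_tags() keeps the text between tags, script bodies are removed first with the pattern from the tip. The sample input and the UTF-8 charset are assumptions.

```php
<?php
// Sample input, assumed for illustration only.
$document = "<script>var x = 1;</script><p>Fish &amp; chips &copy; 2005</p>";

// strip_tags() would leave "var x = 1;" behind, so drop script blocks first.
$noScript = preg_replace("'<script[^>]*?>.*?</script>'si", "", $document);

// Remove the remaining tags, then decode named and numeric entities.
$text = html_entity_decode(strip_tags($noScript), ENT_QUOTES, 'UTF-8');

echo $text, "\n"; // prints: Fish & chips © 2005
```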

Comment by vimal on 2005-09-05
really thanks its so great and so simple to understand

Comment by Nick on 2006-01-30
The line "'([rn])[s]+'", is incorrect; it should be "'([\r\n])[\s]+'",. Also, unless you want "start of heading" characters all over your string, change "\1", to " ",. I also use the search pattern "'&[A-Za-z]+;'si", with the replacement " ", which gets rid of all HTML character entities.

Comment by Moti on 2006-10-30
To remove style tags, just change script to style in the script pattern, that's all: "'<style[^>]*?>.*?</style>'si",

Comment by keylogger on 2007-02-20
Thank you, it is just what I was looking for! But I will need to add more tags; I'm building a search engine.

Comment by SLA on 2009-06-27
Thank you! :)
