java – Remove HTML tags from a String

The Question :

435 people think this question is useful

Is there a good way to remove HTML from a Java string? A simple regex like

replaceAll("\\<.*?>", "") 

will work, but things like &amp; won’t be converted correctly, and non-HTML text between two angle brackets will be removed (i.e. whatever matches the .*? in the regex will disappear).

The Answer 1

594 people think this answer is useful

Use an HTML parser instead of regex. This is dead simple with Jsoup.

public static String html2text(String html) {
    return Jsoup.parse(html).text();
}

Jsoup also supports removing HTML tags against a customizable whitelist, which is very useful if you want to allow only e.g. <b>, <i> and <u>.

The Answer 2

281 people think this answer is useful

If you’re writing for Android you can do this…

android.text.Html.fromHtml(instruction).toString()

The Answer 3

86 people think this answer is useful

If the user enters <b>hey!</b>, do you want to display <b>hey!</b> or hey!? If the first, escape less-thans, and html-encode ampersands (and optionally quotes) and you’re fine. A modification to your code to implement the second option would be:

replaceAll("\\<[^>]*>","")

but you will run into issues if the user enters something malformed, like <bhey!</b>.

You can also check out JTidy which will parse “dirty” html input, and should give you a way to remove the tags, keeping the text.

The problem with trying to strip HTML is that browsers have very lenient parsers, more lenient than any library you can find, so even if you do your best to strip all tags (using the replace method above, a DOM library, or JTidy), you will still need to encode any remaining HTML special characters to keep your output safe.
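
For the first option (displaying the user's markup literally), escaping can be sketched roughly as follows. This is a minimal illustration using only the standard library; the class and method names are made up for this example, and a maintained library such as Apache's StringEscapeUtils is probably the safer choice in real code.

```java
// Minimal sketch: escape HTML special characters so input is displayed literally.
// HtmlEscapeSketch and escapeHtml are illustrative names, not part of any library.
final class HtmlEscapeSketch {

    static String escapeHtml(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break; // ampersand first-class: never double-handled in one pass
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break; // quotes matter inside attribute values
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeHtml("<b>hey!</b> & friends"));
    }
}
```

Because the input is walked once, character by character, an ampersand produced by an earlier replacement is never escaped a second time.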

The Answer 4

30 people think this answer is useful

Another way is to use javax.swing.text.html.HTMLEditorKit to extract the text.

import java.io.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;

public class Html2Text extends HTMLEditorKit.ParserCallback {
    StringBuffer s;

    public Html2Text() {
    }

    public void parse(Reader in) throws IOException {
        s = new StringBuffer();
        ParserDelegator delegator = new ParserDelegator();
        // the third parameter is TRUE to ignore charset directive
        delegator.parse(in, this, Boolean.TRUE);
    }

    public void handleText(char[] text, int pos) {
        s.append(text);
    }

    public String getText() {
        return s.toString();
    }

    public static void main(String[] args) {
        try {
            // the HTML to convert
            FileReader in = new FileReader("java-new.html");
            Html2Text parser = new Html2Text();
            parser.parse(in);
            in.close();
            System.out.println(parser.getText());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

ref : Remove HTML tags from a file to extract only the TEXT

The Answer 5

24 people think this answer is useful

I think that the simplest way to filter the HTML tags is:

private static final Pattern REMOVE_TAGS = Pattern.compile("<.+?>");

public static String removeTags(String string) {
    if (string == null || string.length() == 0) {
        return string;
    }

    Matcher m = REMOVE_TAGS.matcher(string);
    return m.replaceAll("");
}

The Answer 6

18 people think this answer is useful

Also very simple using Jericho, and you can retain some of the formatting (line breaks and links, for example).

    Source htmlSource = new Source(htmlText);
    Segment htmlSeg = new Segment(htmlSource, 0, htmlSource.length());
    Renderer htmlRend = new Renderer(htmlSeg);
    System.out.println(htmlRend.toString());

The Answer 7

17 people think this answer is useful

The accepted answer of doing simply Jsoup.parse(html).text() has 2 potential issues (with JSoup 1.7.3):

  • It removes line breaks from the text
  • It converts text &lt;script&gt; into <script>

If you use this to protect against XSS, this is a bit annoying. Here is my best shot at an improved solution, using both JSoup and Apache StringEscapeUtils:

// breaks multi-level escaping, preventing &amp;lt;script&amp;gt; from being rendered as <script>
String replace = input.replace("&amp;", "");
// decode any encoded html, preventing &lt;script&gt; from being rendered as <script>
String html = StringEscapeUtils.unescapeHtml(replace);
// remove all html tags, but maintain line breaks
String clean = Jsoup.clean(html, "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));
// decode html again to convert character entities back into text
return StringEscapeUtils.unescapeHtml(clean);

Note that the last step is because I need to use the output as plain text. If you need only HTML output then you should be able to remove it.

And here is a bunch of test cases (input to output):

{"regular string", "regular string"},
{"<a href=\"link\">A link</a>", "A link"},
{"<script src=\"http://evil.url.com\"/>", ""},
{"&lt;script&gt;", ""},
{"&amp;lt;script&amp;gt;", "lt;scriptgt;"}, // best effort
{"\" ' > < \n \\ é å à ü and & preserved", "\" ' > < \n \\ é å à ü and & preserved"}

If you find a way to make it better, please let me know.

The Answer 8

16 people think this answer is useful

On Android, try this:

String result = Html.fromHtml(html).toString();

The Answer 9

11 people think this answer is useful

HTML escaping is really hard to do right. I’d definitely suggest using library code to do this, as it’s a lot more subtle than you’d think. Check out Apache’s StringEscapeUtils for a pretty good library for handling this in Java.

The Answer 10

8 people think this answer is useful

This should work.

Use this:

  text.replaceAll("<.*?>", " ")

This replaces all the HTML tags with a space. And this:

  text.replaceAll("&.*?;", "")

This removes every entity that starts with "&" and ends with ";", such as &nbsp;, &amp;, &gt; etc.
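
Deleting entities outright also drops the characters they stand for. As a hedged alternative, here is a small standard-library sketch that decodes a handful of named entities plus numeric ones in a single pass. The class name, method names, and the short entity table are assumptions made for illustration; a library decoder covers far more cases.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch: decode a few HTML entities instead of deleting them.
// EntityDecodeSketch is an illustrative name; the entity table is deliberately tiny.
final class EntityDecodeSketch {

    // Matches &name;, &#123; and &#x1F; style references.
    private static final Pattern ENTITY =
            Pattern.compile("&(#\\d+|#x[0-9a-fA-F]+|[a-zA-Z]+);");

    static String decode(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = resolve(m.group(1), m.group());
            // quoteReplacement keeps '$' and '\' in the result literal
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    private static String resolve(String name, String original) {
        switch (name) {
            case "amp":  return "&";
            case "lt":   return "<";
            case "gt":   return ">";
            case "quot": return "\"";
            case "apos": return "'";
            case "nbsp": return "\u00A0"; // non-breaking space
            default:     break;
        }
        try {
            if (name.startsWith("#x")) {
                return new String(Character.toChars(Integer.parseInt(name.substring(2), 16)));
            }
            if (name.startsWith("#")) {
                return new String(Character.toChars(Integer.parseInt(name.substring(1))));
            }
        } catch (IllegalArgumentException e) {
            // malformed or out-of-range code point: keep the original text
        }
        return original; // unknown named entity: leave untouched
    }
}
```

Each reference is decoded exactly once, so doubly escaped input such as &amp;lt; comes out as &lt; and not as a live angle bracket.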

The Answer 11

7 people think this answer is useful

You can simply use Android’s default HTML filter:

    public String htmlToStringFilter(String textToFilter) {
        return Html.fromHtml(textToFilter).toString();
    }

The above method returns your input string with the HTML filtered out.

The Answer 12

6 people think this answer is useful

You might want to replace <br/> and </p> tags with newlines before stripping the HTML to prevent it becoming an illegible mess as Tim suggests.

The only way I can think of to remove HTML tags while leaving non-HTML text between angle brackets is to check against a list of HTML tags. Something along these lines…

replaceAll("\\<[\\s]*tag[^>]*>","")

Then HTML-decode special characters such as &amp;. The result should not be considered to be sanitized.
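
The two ideas above (newlines for <br/> and </p>, then stripping only known tags) can be sketched like this. The tag list, class name, and method name are hypothetical and chosen for illustration. Unlike the regex above, this sketch does not allow whitespace right after the opening bracket, so text such as "a < b or b > c" is left alone.

```java
import java.util.regex.Pattern;

// Minimal sketch: turn <br> and </p> into newlines, then strip only tags from a known list.
// KnownTagStripper and the TAGS list are illustrative assumptions, not a complete HTML tag set.
final class KnownTagStripper {

    // Hypothetical allow-list of tag names to remove; extend as needed.
    private static final String TAGS = "a|b|i|u|p|br|div|span|em|strong";

    private static final Pattern LINE_BREAKS =
            Pattern.compile("<br\\s*/?>|</p\\s*>", Pattern.CASE_INSENSITIVE);

    // The word boundary keeps <b> from matching the start of <blockquote> or <bhey!
    private static final Pattern KNOWN_TAGS =
            Pattern.compile("</?(?:" + TAGS + ")\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    static String strip(String html) {
        String withBreaks = LINE_BREAKS.matcher(html).replaceAll("\n");
        return KNOWN_TAGS.matcher(withBreaks).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("<p>one</p><p>two<br/>three</p>"));
    }
}
```

As noted above, the output of something like this is readable text, not sanitized text.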

The Answer 13

5 people think this answer is useful

Alternatively, one can use HtmlCleaner:

private CharSequence removeHtmlFrom(String html) {
    return new HtmlCleaner().clean(html).getText();
}

The Answer 14

5 people think this answer is useful

Use Html.fromHtml

The HTML tags it supports are:

<a href="…">, <b>, <big>, <blockquote>, <br>, <cite>, <dfn>,
<div align="…">, <em>, <font size="…" color="…" face="…">,
<h1>, <h2>, <h3>, <h4>, <h5>, <h6>,
<i>, <p>, <small>,
<strike>, <strong>, <sub>, <sup>, <tt>, <u>

As per Android’s official documentation, any <img> tags in the HTML will display as a generic replacement image, which your program can then go through and replace with real images.

Html.fromHtml also has an overload that takes an Html.ImageGetter and an Html.TagHandler as arguments, in addition to the text to parse.

Example

String Str_Html=" <p>This is about me text that the user can put into their profile</p> ";

Then

Your_TextView_Obj.setText(Html.fromHtml(Str_Html).toString());

Output

This is about me text that the user can put into their profile

The Answer 15

4 people think this answer is useful

The accepted answer did not work for me for the test case I indicated: the result of “a < b or b > c” is “a b or b > c”.

So, I used TagSoup instead. Here’s a shot that worked for my test case (and a couple of others):

import java.io.IOException;
import java.io.StringReader;
import java.util.logging.Logger;

import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

/**
 * Take HTML and give back the text part while dropping the HTML tags.
 *
 * There is some risk that using TagSoup means we'll permute non-HTML text.
 * However, it seems to work the best so far in test cases.
 *
 * @author dan
 * @see <a href="http://home.ccil.org/~cowan/XML/tagsoup/">TagSoup</a> 
 */
public class Html2Text2 implements ContentHandler {
private StringBuffer sb;

public Html2Text2() {
}

public void parse(String str) throws IOException, SAXException {
    XMLReader reader = new Parser();
    reader.setContentHandler(this);
    sb = new StringBuffer();
    reader.parse(new InputSource(new StringReader(str)));
}

public String getText() {
    return sb.toString();
}

@Override
public void characters(char[] ch, int start, int length)
    throws SAXException {
    // append only the slice of the buffer that belongs to this event
    sb.append(ch, start, length);
}

@Override
public void ignorableWhitespace(char[] ch, int start, int length)
    throws SAXException {
    // append only the reported slice, not the whole buffer
    sb.append(ch, start, length);
}

// The methods below do not contribute to the text
@Override
public void endDocument() throws SAXException {
}

@Override
public void endElement(String uri, String localName, String qName)
    throws SAXException {
}

@Override
public void endPrefixMapping(String prefix) throws SAXException {
}


@Override
public void processingInstruction(String target, String data)
    throws SAXException {
}

@Override
public void setDocumentLocator(Locator locator) {
}

@Override
public void skippedEntity(String name) throws SAXException {
}

@Override
public void startDocument() throws SAXException {
}

@Override
public void startElement(String uri, String localName, String qName,
    Attributes atts) throws SAXException {
}

@Override
public void startPrefixMapping(String prefix, String uri)
    throws SAXException {
}
}

The Answer 16

4 people think this answer is useful

I know this is old, but I was just working on a project that required me to filter HTML and this worked fine:

noHTMLString.replaceAll("\\&.*?\\;", "");

instead of this:

html = html.replaceAll("&nbsp;","");
html = html.replaceAll("&amp;","");

The Answer 17

4 people think this answer is useful

Here’s a slightly more fleshed-out update that tries to handle some formatting for breaks and lists. I used Amaya’s output as a guide.

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Stack;
import java.util.logging.Logger;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class HTML2Text extends HTMLEditorKit.ParserCallback {
    private static final Logger log = Logger
            .getLogger(Logger.GLOBAL_LOGGER_NAME);

    private StringBuffer stringBuffer;

    private Stack<IndexType> indentStack;

    public static class IndexType {
        public String type;
        public int counter; // used for ordered lists

        public IndexType(String type) {
            this.type = type;
            counter = 0;
        }
    }

    public HTML2Text() {
        stringBuffer = new StringBuffer();
        indentStack = new Stack<IndexType>();
    }

    public static String convert(String html) {
        HTML2Text parser = new HTML2Text();
        Reader in = new StringReader(html);
        try {
            // the HTML to convert
            parser.parse(in);
        } catch (Exception e) {
            log.severe(e.getMessage());
        } finally {
            try {
                in.close();
            } catch (IOException ioe) {
                // this should never happen
            }
        }
        return parser.getText();
    }

    public void parse(Reader in) throws IOException {
        ParserDelegator delegator = new ParserDelegator();
        // the third parameter is TRUE to ignore charset directive
        delegator.parse(in, this, Boolean.TRUE);
    }

    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        log.info("StartTag:" + t.toString());
        if (t.toString().equals("p")) {
            if (stringBuffer.length() > 0
                    && !stringBuffer.substring(stringBuffer.length() - 1)
                            .equals("\n")) {
                newLine();
            }
            newLine();
        } else if (t.toString().equals("ol")) {
            indentStack.push(new IndexType("ol"));
            newLine();
        } else if (t.toString().equals("ul")) {
            indentStack.push(new IndexType("ul"));
            newLine();
        } else if (t.toString().equals("li")) {
            IndexType parent = indentStack.peek();
            if (parent.type.equals("ol")) {
                String numberString = "" + (++parent.counter) + ".";
                stringBuffer.append(numberString);
                for (int i = 0; i < (4 - numberString.length()); i++) {
                    stringBuffer.append(" ");
                }
            } else {
                stringBuffer.append("*   ");
            }
            indentStack.push(new IndexType("li"));
        } else if (t.toString().equals("dl")) {
            newLine();
        } else if (t.toString().equals("dt")) {
            newLine();
        } else if (t.toString().equals("dd")) {
            indentStack.push(new IndexType("dd"));
            newLine();
        }
    }

    private void newLine() {
        stringBuffer.append("\n");
        for (int i = 0; i < indentStack.size(); i++) {
            stringBuffer.append("    ");
        }
    }

    public void handleEndTag(HTML.Tag t, int pos) {
        log.info("EndTag:" + t.toString());
        if (t.toString().equals("p")) {
            newLine();
        } else if (t.toString().equals("ol")) {
            indentStack.pop();
            newLine();
        } else if (t.toString().equals("ul")) {
            indentStack.pop();
            newLine();
        } else if (t.toString().equals("li")) {
            indentStack.pop();
            newLine();
        } else if (t.toString().equals("dd")) {
            indentStack.pop();
        }
    }

    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        log.info("SimpleTag:" + t.toString());
        if (t.toString().equals("br")) {
            newLine();
        }
    }

    public void handleText(char[] text, int pos) {
        log.info("Text:" + new String(text));
        stringBuffer.append(text);
    }

    public String getText() {
        return stringBuffer.toString();
    }

    public static void main(String args[]) {
        String html = "<html><body><p>paragraph at start</p>hello<br />What is happening?<p>this is a<br />mutiline paragraph</p><ol>  <li>This</li>  <li>is</li>  <li>an</li>  <li>ordered</li>  <li>list    <p>with</p>    <ul>      <li>another</li>      <li>list        <dl>          <dt>This</dt>          <dt>is</dt>            <dd>sdasd</dd>            <dd>sdasda</dd>            <dd>asda              <p>aasdas</p>            </dd>            <dd>sdada</dd>          <dt>fsdfsdfsd</dt>        </dl>        <dl>          <dt>vbcvcvbcvb</dt>          <dt>cvbcvbc</dt>            <dd>vbcbcvbcvb</dd>          <dt>cvbcv</dt>          <dt></dt>        </dl>        <dl>          <dt></dt>        </dl></li>      <li>cool</li>    </ul>    <p>stuff</p>  </li>  <li>cool</li></ol><p></p></body></html>";
        System.out.println(convert(html));
    }
}

The Answer 18

4 people think this answer is useful

Here is one more variant that removes HTML tags, HTML entities, and runs of two or more spaces in HTML content:

content.replaceAll("(<.*?>)|(&.*?;)|([ ]{2,})", ""); where content is a String.

The Answer 19

3 people think this answer is useful

One more way can be to use com.google.gdata.util.common.html.HtmlToText class like

MyWriter.toConsole(HtmlToText.htmlToPlainText(htmlResponse));

This is not bulletproof code, though: when I run it on Wikipedia entries I also get style info. However, I believe it would be effective for small, simple jobs.

The Answer 20

3 people think this answer is useful

It sounds like you want to go from HTML to plain text.
If that is the case look at www.htmlparser.org. Here is an example that strips all the tags out from the html file found at a URL.
It makes use of org.htmlparser.beans.StringBean.

static public String getUrlContentsAsText(String url) {
    String content = "";
    StringBean stringBean = new StringBean();
    stringBean.setURL(url);
    content = stringBean.getStrings();
    return content;
}

The Answer 21

2 people think this answer is useful

Here is another way to do it:

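public static String removeHTML(String input) {
    StringBuilder s = new StringBuilder();
    boolean inTag = false;

    for (int i = 0; i < input.length(); i++) {
        char c = input.charAt(i);
        if (c == '<') {
            inTag = true;       // start skipping at the opening bracket
        } else if (c == '>' && inTag) {
            inTag = false;      // resume copying after the closing bracket
        } else if (!inTag) {
            s.append(c);        // keep characters outside of tags
        }
    }
    return s.toString();
}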

The Answer 22

2 people think this answer is useful

One could also use Apache Tika for this purpose. By default it preserves whitespaces from the stripped html, which may be desired in certain situations:

InputStream htmlInputStream = ..
HtmlParser htmlParser = new HtmlParser();
HtmlContentHandler htmlContentHandler = new HtmlContentHandler();
htmlParser.parse(htmlInputStream, htmlContentHandler, new Metadata());
System.out.println(htmlContentHandler.getBodyText().trim());

The Answer 23

1 people think this answer is useful

One way to retain new-line info with JSoup is to precede all new line tags with some dummy string, execute JSoup and replace dummy string with “\n”.

String html = "<p>Line one</p><p>Line two</p>Line three<br/>etc.";
String NEW_LINE_MARK = "NEWLINESTART1234567890NEWLINEEND";
for (String tag: new String[]{"</p>","<br/>","</h1>","</h2>","</h3>","</h4>","</h5>","</h6>","</li>"}) {
    html = html.replace(tag, NEW_LINE_MARK+tag);
}

String text = Jsoup.parse(html).text();

text = text.replace(NEW_LINE_MARK + " ", "\n\n");
text = text.replace(NEW_LINE_MARK, "\n\n");

The Answer 24

1 people think this answer is useful

classeString.replaceAll("\\<(/?[^\\>]+)\\>", "\\ ").replaceAll("\\s+", " ").trim()

The Answer 25

0 people think this answer is useful

My 5 cents:

String[] temp = yourString.split("&amp;");
String tmp = "";
if (temp.length > 1) {

    for (int i = 0; i < temp.length; i++) {
        tmp += temp[i] + "&";
    }
    yourString = tmp.substring(0, tmp.length() - 1);
}

The Answer 26

0 people think this answer is useful

To get formatted plain HTML text, you can do this:

String BR_ESCAPED = "&lt;br/&gt;";
Elements el=Jsoup.parse(html).select("body");
el.select("br").append(BR_ESCAPED);
el.select("p").append(BR_ESCAPED+BR_ESCAPED);
el.select("h1").append(BR_ESCAPED+BR_ESCAPED);
el.select("h2").append(BR_ESCAPED+BR_ESCAPED);
el.select("h3").append(BR_ESCAPED+BR_ESCAPED);
el.select("h4").append(BR_ESCAPED+BR_ESCAPED);
el.select("h5").append(BR_ESCAPED+BR_ESCAPED);
String nodeValue=el.text();
nodeValue=nodeValue.replaceAll(BR_ESCAPED, "<br/>");
nodeValue=nodeValue.replaceAll("(\\s*<br[^>]*>){3,}", "<br/><br/>");

To get formatted plain text, change <br/> to \n and change the last line to:

nodeValue=nodeValue.replaceAll("(\\s*\n){3,}", "\n\n");

The Answer 27

0 people think this answer is useful

I know it has been a while since this question was asked, but I found another solution. This is what worked for me:

Pattern REMOVE_TAGS = Pattern.compile("<.+?>");
Source source = new Source(htmlAsString);
Matcher m = REMOVE_TAGS.matcher(source.getTextExtractor().toString());
String clearedHtml = m.replaceAll("");

The Answer 28

0 people think this answer is useful

Worth noting that if you’re trying to accomplish this in a ServiceStack project (C#, not Java), it’s already a built-in string extension:

using ServiceStack.Text;
// ...
"The <b>quick</b> brown <p> fox </p> jumps over the lazy dog".StripHtml();

The Answer 29

0 people think this answer is useful

I often find that I only need to strip out comments and script elements. This has worked reliably for me for 15 years and can easily be extended to handle any element name in HTML or XML:

// delete all comments
response = response.replaceAll("<!--[^>]*-->", "");
// delete all script elements
response = response.replaceAll("<(script|SCRIPT)[^+]*?>[^>]*?</(script|SCRIPT)>", "");

The Answer 30

0 people think this answer is useful

Sometimes the HTML string comes from XML with escaped entities such as &lt;. When using Jsoup, we need to parse it first and then clean it.

Document doc = Jsoup.parse(htmlstrl);
Whitelist wl = Whitelist.none();
String plain = Jsoup.clean(doc.text(), wl);

Using only Jsoup.parse(htmlstrl).text() does not remove the tags in that case.
