Matching HTML tags with Java regular expressions

01 Oct 2023

regex

When working with HTML content in Java, you may encounter situations where you need to parse and extract specific HTML tags from a string. One way to achieve this is by using regular expressions in Java. Regular expressions (regex) provide a powerful and flexible way to match patterns within a string.

Here’s an example code snippet in Java that demonstrates how to use regular expressions to match HTML tags:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlTagMatcher {
    public static void main(String[] args) {
        String htmlContent = "<p>This is a <b>sample</b> HTML <i>string</i>.</p>";

        Pattern pattern = Pattern.compile("<(.*?)>");
        Matcher matcher = pattern.matcher(htmlContent);

        while (matcher.find()) {
            String tag = matcher.group(1);
            System.out.println(tag);
        }
    }
}

In the above code, we define a regular expression pattern using the <(.*?)> syntax. This pattern will match any HTML tag, including the angle brackets. The .*? is a non-greedy quantifier, ensuring that the matching stops at the first closing angle bracket encountered.

We create a Pattern object using Pattern.compile() method and pass our regular expression pattern as a parameter. Then, we create a Matcher object using pattern.matcher() method and provide the HTML content as input.

Next, we iterate through each match using the while (matcher.find()) loop and extract the matched content using matcher.group(1). This will give us the inner content of each HTML tag without the angle brackets.

Finally, we can use the matched tags as per our requirements. In this example, we simply print each matched tag to the console.

Remember that using regular expressions to parse HTML can be error-prone and may not handle all possible scenarios. It’s generally recommended to use a dedicated HTML parsing library like Jsoup for more robust and reliable HTML parsing in Java.

#java #regex