I'm trying to build a regular expression to "reduce" repeated consecutive substrings in a string in Java. For example, for the following input:
The big black dog big black dog is a friendly friendly dog who lives nearby nearby.
I would like to get the following output:
The big black dog is a friendly dog who lives nearby.
Here is the code I have so far:
String input = "The big black dog big black dog is a friendly friendly dog who lives nearby nearby.";
Pattern dupPattern = Pattern.compile("((\\b\\w+\\b\\s)+)\\1+", Pattern.CASE_INSENSITIVE);
Matcher matcher = dupPattern.matcher(input);
while (matcher.find()) {
    input = input.replace(matcher.group(), matcher.group(1));
}
This works for every repeated substring except the one at the end of the sentence:
The big black dog is a friendly dog who lives nearby nearby.
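I know that my regex requires a space after every word in the substring, which means it won't catch the case where a word is followed by a period instead of a space. I can't seem to find a workaround. I've tried using capture groups and changing the regex to look for a space or a period instead of just a space, but that only works when there is a period after each repeated part of the substring ("nearby.nearby.").
Can someone point me in the right direction? Ideally, the input to this method will be short paragraphs rather than single lines.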
Solution:
You can use:
input.replaceAll("([ \\w]+)\\1", "$1");
A complete example:
import java.util.regex.Pattern;

class Ideone
{
    public static void main (String[] args)
    {
        String input = "The big black dog big black dog is a friendly friendly dog who lives nearby nearby.";
        // Group 1 captures a run of spaces and word characters; \1 requires the same run to repeat immediately.
        Pattern dupPattern = Pattern.compile("([ \\w]+)\\1", Pattern.CASE_INSENSITIVE);
        // replaceAll rewrites every non-overlapping match in a single pass, so no find() loop is needed.
        input = dupPattern.matcher(input).replaceAll("$1");
        System.out.println(input); // The big black dog is a friendly dog who lives nearby.
    }
}
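One caveat: because `[ \w]+` can match a single character, the pattern above also collapses doubled letters inside a word ("book" would lose an "o"). If that matters for your paragraphs, a word-boundary variant may be safer. The following is only a sketch under that assumption, not a drop-in replacement; the class name `DuplicatePhraseCollapser` and the method `collapse` are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
class DuplicatePhraseCollapser {

    // \b anchors the phrase to whole words, so doubled letters inside a word are left alone.
    // Group 1 is one or more words separated by whitespace; \s+\1\b requires the
    // same phrase to repeat right after it and to end on a word boundary.
    private static final Pattern REPEATED_PHRASE = Pattern.compile(
            "\\b(\\w+(?:\\s+\\w+)*)\\s+\\1\\b", Pattern.CASE_INSENSITIVE);

    static String collapse(String text) {
        String previous;
        String current = text;
        // Repeat until nothing changes: one pass can leave a pair behind
        // when a phrase occurs three or more times in a row.
        do {
            previous = current;
            current = REPEATED_PHRASE.matcher(previous).replaceAll("$1");
        } while (!current.equals(previous));
        return current;
    }

    public static void main(String[] args) {
        System.out.println(collapse(
                "The big black dog big black dog is a friendly friendly dog who lives nearby nearby."));
        System.out.println(collapse("A good book, hello hello."));
    }
}
```

Since `\s+` also matches line breaks, a phrase repeated across a newline is collapsed as well, which may or may not be what you want for multi-line input.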