I have a large set of sentences (10,000) in a file, one sentence per line. Across the whole set, I want to find which words occur together in a sentence, and how often.
Example sentences:
"Proposal 201 has been accepted by the Chief today.",
"Proposal 214 and 221 are accepted, as per recent Chief decision",
"This proposal has been accepted by the Chief.",
"Both proposal 3 MazerNo and patch 4 have been accepted by the Chief.",
"Proposal 214, ValueMania, has been accepted by the Chief."
I want to produce the following output. I should be able to supply three starting words as program arguments: "Chief, accepted, Proposal"
Chief accepted Proposal 5
Chief accepted Proposal has 3
Chief accepted Proposal has been 3
...
...
for all combinations.
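I know the number of combinations can be very large.
I searched online but could not find anything. I wrote some code, but I can't get my head around it. Perhaps someone who knows this domain can help.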
ReadFileLinesIntoArray rf = new ReadFileLinesIntoArray();
try {
    String[] tmp = rf.readFromFile("c:/scripts/SelectedSentences.txt");
    for (String t : tmp) {
        String[] keys = t.split(" ");
        String[] uniqueKeys;
        int count = 0;
        System.out.println(t);
        uniqueKeys = getUniqueKeys(keys);
        for (String key : uniqueKeys) {
            if (null == key) {
                break;
            }
            for (String s : keys) {
                if (key.equals(s)) {
                    count++;
                }
            }
            System.out.println("Count of [" + key + "] is : " + count);
            count = 0;
        }
    }
} catch (IOException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
// Returns the distinct words of keys in first-seen order; unused trailing slots stay null.
private static String[] getUniqueKeys(String[] keys) {
    String[] uniqueKeys = new String[keys.length];
    uniqueKeys[0] = keys[0];
    int uniqueKeyIndex = 1;
    boolean keyAlreadyExists = false;
    for (int i = 1; i < keys.length; i++) {
        // Compare only against the slots filled so far (j < uniqueKeyIndex, not <=).
        for (int j = 0; j < uniqueKeyIndex; j++) {
            if (keys[i].equals(uniqueKeys[j])) {
                keyAlreadyExists = true;
            }
        }
        if (!keyAlreadyExists) {
            uniqueKeys[uniqueKeyIndex] = keys[i];
            uniqueKeyIndex++;
        }
        keyAlreadyExists = false;
    }
    return uniqueKeys;
}
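Can someone help with the code?
Solution:
You can apply a standard information-retrieval data structure, namely an inverted index. Here is how to do it.
Consider your original sentences. Number them with integer identifiers, like so: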
1. "Proposal 201 has been accepted by the Chief today."
2. "Proposal 214 and 221 are accepted, as per recent Chief decision"
3. "This proposal has been accepted by the Chief."
4. "Both proposal 3 MazerNo and patch 4 have been accepted by the Chief."
5. "Proposal 214, ValueMania, has been accepted by the Chief."
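For each pair of words you encounter in a sentence, add it to an inverted index that maps the pair to a set (a collection of unique items) of sentence identifiers. A sentence with N distinct words yields N-choose-2, that is N(N-1)/2, pairs.
An appropriate Java data structure would be Map<String, Map<String, Set<Integer>>>. Order each pair alphabetically (ignoring case), so that the pair "has" and "Proposal" occurs only as ("has", "Proposal") and never as ("Proposal", "has").
This map will contain the following: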
"has", "Proposal" --> Set(1, 5)
"accepted", "Proposal" --> Set(1, 2, 5)
"accepted", "has" --> Set(1, 3, 5)
etc.
For example, the word pair "has" and "Proposal" has the set (1, 5), meaning the two words were found together in sentences 1 and 5.
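A minimal sketch of how such an index might be built in Java. The class name PairIndex, the tokenizer (split on anything that is not a letter or digit) and the exact tie-breaking rule are my assumptions, not part of the description above. Tokens are kept case-sensitive here so that the result matches the map contents listed above.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class PairIndex {
    // Assumed pair order: alphabetical ignoring case, ties broken case-sensitively.
    static final Comparator<String> ORDER =
            String.CASE_INSENSITIVE_ORDER.thenComparing(Comparator.<String>naturalOrder());

    // Maps (first word, second word) to the ids of the sentences containing both.
    static Map<String, Map<String, Set<Integer>>> build(List<String> sentences) {
        Map<String, Map<String, Set<Integer>>> index = new HashMap<>();
        for (int id = 1; id <= sentences.size(); id++) {
            // Distinct words of this sentence, sorted in the assumed pair order.
            TreeSet<String> words = new TreeSet<>(ORDER);
            for (String token : sentences.get(id - 1).split("[^\\p{L}\\p{N}]+")) {
                if (!token.isEmpty()) {
                    words.add(token);
                }
            }
            // Every word paired with each word that sorts strictly after it.
            for (String a : words) {
                for (String b : words.tailSet(a, false)) {
                    index.computeIfAbsent(a, k -> new HashMap<>())
                         .computeIfAbsent(b, k -> new TreeSet<>())
                         .add(id);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList(
                "Proposal 201 has been accepted by the Chief today.",
                "Proposal 214 and 221 are accepted, as per recent Chief decision",
                "This proposal has been accepted by the Chief.",
                "Both proposal 3 MazerNo and patch 4 have been accepted by the Chief.",
                "Proposal 214, ValueMania, has been accepted by the Chief.");
        Map<String, Map<String, Set<Integer>>> index = build(sentences);
        System.out.println(index.get("has").get("Proposal"));      // [1, 5]
        System.out.println(index.get("accepted").get("Proposal")); // [1, 2, 5]
        System.out.println(index.get("accepted").get("has"));      // [1, 3, 5]
    }
}
```

Because matching is case-sensitive in this sketch, "proposal" in sentences 3 and 4 is a different word from "Proposal". Lower-casing the tokens before indexing would merge them, if that is what you want.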
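Now suppose you want to find the number of co-occurrences of the words in the list "accepted", "has", and "Proposal". Generate all pairs from this list and intersect their respective sets (using Java's Set.retainAll()). The result here ends up as the set (1, 5). Its size is 2, meaning there are two sentences that contain all of "accepted", "has", and "Proposal".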
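The intersection step might look roughly like this. The class CooccurrenceQuery and its methods are hypothetical names of mine, and sampleIndex() simply hard-codes the three entries listed earlier so that the sketch runs on its own. It assumes the index keys each pair in alphabetical order ignoring case, and that the query has at least two words.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class CooccurrenceQuery {
    // Ids of the sentences that contain every word in the query (two words or more).
    static Set<Integer> sentencesContaining(Map<String, Map<String, Set<Integer>>> index,
                                            List<String> words) {
        Set<Integer> result = null;
        for (int i = 0; i < words.size(); i++) {
            for (int j = i + 1; j < words.size(); j++) {
                String a = words.get(i);
                String b = words.get(j);
                if (compare(a, b) > 0) { // put the pair into index order
                    String tmp = a;
                    a = b;
                    b = tmp;
                }
                Map<String, Set<Integer>> inner = index.get(a);
                Set<Integer> ids = (inner == null) ? null : inner.get(b);
                if (ids == null) {
                    return new TreeSet<>(); // this pair never co-occurs
                }
                if (result == null) {
                    result = new TreeSet<>(ids); // copy, so the index is not modified
                } else {
                    result.retainAll(ids);       // set intersection
                }
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    // Assumed pair order: alphabetical ignoring case, ties broken case-sensitively.
    static int compare(String a, String b) {
        int c = a.compareToIgnoreCase(b);
        return c != 0 ? c : a.compareTo(b);
    }

    // Hand-built index holding only the three example entries.
    static Map<String, Map<String, Set<Integer>>> sampleIndex() {
        Map<String, Map<String, Set<Integer>>> index = new HashMap<>();
        index.computeIfAbsent("has", k -> new HashMap<>())
             .put("Proposal", new TreeSet<>(Arrays.asList(1, 5)));
        index.computeIfAbsent("accepted", k -> new HashMap<>())
             .put("Proposal", new TreeSet<>(Arrays.asList(1, 2, 5)));
        index.computeIfAbsent("accepted", k -> new HashMap<>())
             .put("has", new TreeSet<>(Arrays.asList(1, 3, 5)));
        return index;
    }

    public static void main(String[] args) {
        Set<Integer> ids = sentencesContaining(sampleIndex(),
                Arrays.asList("accepted", "has", "Proposal"));
        System.out.println(ids + " -> " + ids.size()); // [1, 5] -> 2
    }
}
```

Copying the first set before calling retainAll() matters, because retainAll() modifies the set it is called on and would otherwise corrupt the index.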
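To generate all pairs, just iterate over the map as needed. To generate all word tuples of size N, you will need to combine iteration with recursion.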
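One possible sketch of the recursive part, under several assumptions of mine: the class TupleCounter and all of its method names are hypothetical, tokens are lower-cased so that "proposal" and "Proposal" count as the same word (which the counts asked for in the question seem to need), the seed has at least two words, and the words used to extend the seed are passed in explicitly as a candidate list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class TupleCounter {
    // Pair index over lower-cased tokens; the first key sorts before the second.
    static Map<String, Map<String, Set<Integer>>> build(List<String> sentences) {
        Map<String, Map<String, Set<Integer>>> index = new HashMap<>();
        for (int id = 1; id <= sentences.size(); id++) {
            TreeSet<String> words = new TreeSet<>();
            String lower = sentences.get(id - 1).toLowerCase(Locale.ROOT);
            for (String token : lower.split("[^\\p{L}\\p{N}]+")) {
                if (!token.isEmpty()) {
                    words.add(token);
                }
            }
            for (String a : words) {
                for (String b : words.tailSet(a, false)) {
                    index.computeIfAbsent(a, k -> new HashMap<>())
                         .computeIfAbsent(b, k -> new TreeSet<>())
                         .add(id);
                }
            }
        }
        return index;
    }

    // Intersects ids with the pair set of (t, word) for every t already in the tuple.
    static Set<Integer> narrow(Map<String, Map<String, Set<Integer>>> index,
                               List<String> tuple, Set<Integer> ids, String word) {
        Set<Integer> result = new TreeSet<>(ids);
        for (String t : tuple) {
            boolean tFirst = t.compareTo(word) < 0;
            Map<String, Set<Integer>> inner = index.get(tFirst ? t : word);
            Set<Integer> pair = (inner == null) ? null : inner.get(tFirst ? word : t);
            if (pair == null) {
                return new TreeSet<>();
            }
            result.retainAll(pair);
        }
        return result;
    }

    // Records the current tuple with its count, then recurses on each later candidate.
    static void extend(Map<String, Map<String, Set<Integer>>> index, List<String> tuple,
                       Set<Integer> ids, List<String> candidates, int from, int maxSize,
                       List<String> out) {
        out.add(String.join(" ", tuple) + " " + ids.size());
        if (tuple.size() >= maxSize) {
            return;
        }
        for (int i = from; i < candidates.size(); i++) {
            String w = candidates.get(i);
            if (tuple.contains(w)) {
                continue;
            }
            Set<Integer> next = narrow(index, tuple, ids, w);
            if (next.isEmpty()) {
                continue; // prune: no longer tuple containing these words can match
            }
            tuple.add(w);
            extend(index, tuple, next, candidates, i + 1, maxSize, out);
            tuple.remove(tuple.size() - 1);
        }
    }

    static List<String> countTuples(List<String> sentences, List<String> seed,
                                    List<String> candidates, int maxSize) {
        Map<String, Map<String, Set<Integer>>> index = build(sentences);
        Set<Integer> ids = new TreeSet<>();
        for (int id = 1; id <= sentences.size(); id++) {
            ids.add(id);
        }
        List<String> tuple = new ArrayList<>();
        for (String w : seed) { // narrow the full id set down to the seed words
            String lower = w.toLowerCase(Locale.ROOT);
            ids = narrow(index, tuple, ids, lower);
            tuple.add(lower);
        }
        List<String> out = new ArrayList<>();
        extend(index, tuple, ids, candidates, 0, maxSize, out);
        return out;
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList(
                "Proposal 201 has been accepted by the Chief today.",
                "Proposal 214 and 221 are accepted, as per recent Chief decision",
                "This proposal has been accepted by the Chief.",
                "Both proposal 3 MazerNo and patch 4 have been accepted by the Chief.",
                "Proposal 214, ValueMania, has been accepted by the Chief.");
        countTuples(sentences, Arrays.asList("Chief", "accepted", "Proposal"),
                Arrays.asList("has", "been"), 5).forEach(System.out::println);
    }
}
```

On the five example sentences this prints "chief accepted proposal 5", "chief accepted proposal has 3", "chief accepted proposal has been 3" and "chief accepted proposal been 4". Passing the whole vocabulary as candidates would enumerate every combination, which, as the question notes, can get very large; the pruning on empty sets keeps that somewhat in check.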