java – Word co-occurrence in sentences

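I have a large set of sentences (10,000) in a file. The file contains one sentence per line. Across the whole set, I want to find which words occur together in a sentence, and how often.

Example sentences: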

"Proposal 201 has been accepted by the Chief today.", 
"Proposal 214 and 221 are accepted, as per recent Chief decision",     
"This proposal has been accepted by the Chief.",
"Both proposal 3 MazerNo and patch 4 have been accepted by the Chief.",     
"Proposal 214, ValueMania, has been accepted by the Chief."};

I want to produce the following output. I should be able to supply three starting words as program arguments: "Chief,accepted,Proposal"

Chief accepted Proposal            5
Chief accepted Proposal has        3
Chief accepted Proposal has been   3

... 
...
for all combinations.

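I know the number of combinations may be large.

I searched online but could not find anything. I wrote some code, but I cannot get my head around the problem. Perhaps someone who knows this domain can help.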

ReadFileLinesIntoArray rf = new ReadFileLinesIntoArray();

            try {
                String[] tmp = rf.readFromFile("c:/scripts/SelectedSentences.txt");
                for (String t : tmp){
                      String[] keys = t.split(" ");
                      String[] uniqueKeys;
                      int count = 0;
                      System.out.println(t);
                      uniqueKeys = getUniqueKeys(keys);
                        for(String key: uniqueKeys)
                        {
                            if(null == key)
                            {
                                break;
                            }           
                            for(String s : keys)
                            {
                                if(key.equals(s))
                                {
                                    count++;
                                }               
                            }
                            System.out.println("Count of ["+key+"] is : "+count);
                            count=0;
                        }
                }
            } catch (IOException e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            }

private static String[] getUniqueKeys(String[] keys) {
        String[] uniqueKeys = new String[keys.length];

        uniqueKeys[0] = keys[0];
        int uniqueKeyIndex = 1;
        boolean keyAlreadyExists = false;

        for (int i = 1; i < keys.length; i++) {
            // Compare only against the slots that have been filled so far.
            for (int j = 0; j < uniqueKeyIndex; j++) {
                if (keys[i].equals(uniqueKeys[j])) {
                    keyAlreadyExists = true;
                    break;
                }
            }

            if (!keyAlreadyExists) {
                uniqueKeys[uniqueKeyIndex] = keys[i];
                uniqueKeyIndex++;
            }
            keyAlreadyExists = false;
        }
        return uniqueKeys;
    }

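Can someone help with the code?

Solution:

You can apply standard information-retrieval data structures, in particular an inverted index. Here is how to do it.

Consider your original sentences. Number them with integer identifiers, like so: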

  1. “Proposal 201 has been accepted by the Chief today.”,
  2. “Proposal 214 and 221 are accepted, as per recent Chief decision”,
  3. “This proposal has been accepted by the Chief.”,
  4. “Both proposal 3 MazerNo and patch 4 have been accepted by the Chief.”,
  5. “Proposal 214, ValueMania, has been accepted by the Chief.”

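For each pair of words you encounter in a sentence, add it to an inverted index that maps the pair to a set of sentence identifiers (a collection of unique items). For a sentence with N distinct words there are N-choose-2, that is N(N-1)/2, pairs.

An appropriate Java data structure would be Map<String, Map<String, Set<Integer>>>. Order the two words of each pair alphabetically, so that the pair "has" and "Proposal" appears only as ("has", "Proposal") and never as ("Proposal", "has").

This map will contain the following: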

"has", "Proposal" --> Set(1, 5)
"accepted", "Proposal" --> Set(1, 2, 5)
"accepted", "has" --> Set(1, 3, 5)
etc.

For example, the word pair "has" and "Proposal" maps to the set (1, 5), meaning both words are found in sentences 1 and 5.
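The index described above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the class name `PairIndex` and its methods are hypothetical, and the tokenization (split on anything that is not a letter or digit, case preserved) is an assumption chosen so that the result matches the sets listed above. Lowercasing the words would additionally merge "Proposal" and "proposal".

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class PairIndex {
    // first word -> second word -> ids of the sentences containing both
    private final Map<String, Map<String, Set<Integer>>> index = new HashMap<>();

    // Assumption: a word is a run of letters or digits; case is preserved.
    static List<String> distinctWords(String sentence) {
        Set<String> words = new LinkedHashSet<>();
        for (String w : sentence.split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return new ArrayList<>(words);
    }

    // Case-insensitive alphabetical order, so "has" sorts before "Proposal".
    static boolean comesFirst(String a, String b) {
        int c = a.compareToIgnoreCase(b);
        return c != 0 ? c < 0 : a.compareTo(b) < 0;
    }

    // Adds every unordered pair of distinct words in the sentence to the index.
    public void add(int sentenceId, String sentence) {
        List<String> words = distinctWords(sentence);
        for (int i = 0; i < words.size(); i++) {
            for (int j = i + 1; j < words.size(); j++) {
                String a = words.get(i);
                String b = words.get(j);
                String first = comesFirst(a, b) ? a : b;
                String second = comesFirst(a, b) ? b : a;
                index.computeIfAbsent(first, k -> new HashMap<>())
                     .computeIfAbsent(second, k -> new TreeSet<>())
                     .add(sentenceId);
            }
        }
    }

    // Returns the ids of the sentences containing both words (empty if none).
    public Set<Integer> sentencesWith(String a, String b) {
        String first = comesFirst(a, b) ? a : b;
        String second = comesFirst(a, b) ? b : a;
        return index.getOrDefault(first, Collections.emptyMap())
                    .getOrDefault(second, Collections.emptySet());
    }

    // Builds the index for the five example sentences.
    static PairIndex sample() {
        PairIndex idx = new PairIndex();
        idx.add(1, "Proposal 201 has been accepted by the Chief today.");
        idx.add(2, "Proposal 214 and 221 are accepted, as per recent Chief decision");
        idx.add(3, "This proposal has been accepted by the Chief.");
        idx.add(4, "Both proposal 3 MazerNo and patch 4 have been accepted by the Chief.");
        idx.add(5, "Proposal 214, ValueMania, has been accepted by the Chief.");
        return idx;
    }

    public static void main(String[] args) {
        PairIndex idx = sample();
        System.out.println(idx.sentencesWith("has", "Proposal"));      // [1, 5]
        System.out.println(idx.sentencesWith("accepted", "Proposal")); // [1, 2, 5]
        System.out.println(idx.sentencesWith("accepted", "has"));      // [1, 3, 5]
    }
}
```

Deduplicating the words of a sentence before pairing keeps a repeated word from being paired with itself, and a TreeSet keeps the identifiers sorted for readable output.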

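Now suppose you want the number of co-occurrences of the words in the list "accepted", "has" and "Proposal". Generate all pairs from this list and intersect their respective sets (using Java's Set.retainAll()). The result here ends up being the set (1, 5). Its size is 2, meaning there are two sentences that contain "accepted", "has" and "Proposal".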
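A minimal sketch of that intersection step, assuming the index already holds the three entries listed earlier. The class `CooccurrenceQuery` and its helper methods are hypothetical names used only for illustration. Note that Set.retainAll() modifies the set it is called on, so the sketch copies the first set instead of intersecting in place inside the index.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class CooccurrenceQuery {
    // Case-insensitive alphabetical order, so "has" sorts before "Proposal".
    static boolean comesFirst(String a, String b) {
        int c = a.compareToIgnoreCase(b);
        return c != 0 ? c < 0 : a.compareTo(b) < 0;
    }

    // Stores the sentence ids for one pair under its alphabetically ordered key.
    static void put(Map<String, Map<String, Set<Integer>>> index,
                    String a, String b, Integer... ids) {
        String first = comesFirst(a, b) ? a : b;
        String second = comesFirst(a, b) ? b : a;
        index.computeIfAbsent(first, k -> new HashMap<>())
             .computeIfAbsent(second, k -> new TreeSet<>())
             .addAll(Arrays.asList(ids));
    }

    static Set<Integer> lookup(Map<String, Map<String, Set<Integer>>> index,
                               String a, String b) {
        String first = comesFirst(a, b) ? a : b;
        String second = comesFirst(a, b) ? b : a;
        return index.getOrDefault(first, Collections.emptyMap())
                    .getOrDefault(second, Collections.emptySet());
    }

    // Intersects the id sets of every pair drawn from words (needs >= 2 words).
    static Set<Integer> sentencesWithAll(Map<String, Map<String, Set<Integer>>> index,
                                         List<String> words) {
        Set<Integer> result = null;
        for (int i = 0; i < words.size(); i++) {
            for (int j = i + 1; j < words.size(); j++) {
                Set<Integer> ids = lookup(index, words.get(i), words.get(j));
                if (result == null) {
                    result = new TreeSet<>(ids); // copy: retainAll mutates its receiver
                } else {
                    result.retainAll(ids);
                }
            }
        }
        return result == null ? Collections.<Integer>emptySet() : result;
    }

    // The three index entries shown in the answer.
    static Map<String, Map<String, Set<Integer>>> sampleIndex() {
        Map<String, Map<String, Set<Integer>>> index = new HashMap<>();
        put(index, "has", "Proposal", 1, 5);
        put(index, "accepted", "Proposal", 1, 2, 5);
        put(index, "accepted", "has", 1, 3, 5);
        return index;
    }

    public static void main(String[] args) {
        Set<Integer> hits = sentencesWithAll(sampleIndex(),
                Arrays.asList("accepted", "has", "Proposal"));
        System.out.println(hits);        // [1, 5]
        System.out.println(hits.size()); // 2
    }
}
```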

To generate all pairs, simply iterate over the map. To generate all word tuples of size N, combine that iteration with recursion.
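One way the recursion might look, as a rough sketch under stated assumptions: start from the seed words, then repeatedly extend the current tuple with one more vocabulary word and narrow the sentence set with a single pair lookup. The class `TupleCounter`, its methods and the `maxSize` limit are hypothetical. Unlike the sets listed in the answer, this sketch lowercases every word, an assumption made because it reproduces the counts 5, 3 and 3 from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class TupleCounter {
    private final Map<String, Map<String, Set<Integer>>> index = new HashMap<>();
    private final Set<String> vocabulary = new TreeSet<>();

    // Assumption: lowercase everything and split on non-alphanumeric characters.
    public void add(int id, String sentence) {
        Set<String> distinct = new LinkedHashSet<>(Arrays.asList(
                sentence.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")));
        distinct.remove("");
        List<String> words = new ArrayList<>(distinct);
        vocabulary.addAll(words);
        for (int i = 0; i < words.size(); i++) {
            for (int j = i + 1; j < words.size(); j++) {
                pair(words.get(i), words.get(j), true).add(id);
            }
        }
    }

    // Returns the id set of a pair; creates the entry only when create is true.
    private Set<Integer> pair(String a, String b, boolean create) {
        String first = a.compareTo(b) < 0 ? a : b;
        String second = a.compareTo(b) < 0 ? b : a;
        if (create) {
            return index.computeIfAbsent(first, k -> new HashMap<>())
                        .computeIfAbsent(second, k -> new TreeSet<>());
        }
        return index.getOrDefault(first, Collections.emptyMap())
                    .getOrDefault(second, Collections.emptySet());
    }

    // Counts the seed tuple and every extension of it up to maxSize words.
    public Map<List<String>, Integer> count(List<String> seed, int maxSize) {
        List<String> tuple = new ArrayList<>();
        for (String s : seed) {
            tuple.add(s.toLowerCase(Locale.ROOT));
        }
        Set<Integer> ids = null;
        for (int i = 0; i < tuple.size(); i++) {
            for (int j = i + 1; j < tuple.size(); j++) {
                Set<Integer> p = pair(tuple.get(i), tuple.get(j), false);
                if (ids == null) ids = new TreeSet<>(p); else ids.retainAll(p);
            }
        }
        Map<List<String>, Integer> out = new LinkedHashMap<>();
        if (ids != null && !ids.isEmpty()) {
            List<String> candidates = new ArrayList<>(vocabulary);
            candidates.removeAll(tuple);
            extend(tuple, ids, candidates, 0, maxSize, out);
        }
        return out;
    }

    // ids holds the sentences containing every word of tuple, so intersecting
    // with one pair set (last tuple word, candidate) is enough to add a word.
    private void extend(List<String> tuple, Set<Integer> ids, List<String> candidates,
                        int from, int maxSize, Map<List<String>, Integer> out) {
        out.put(new ArrayList<>(tuple), ids.size());
        if (tuple.size() >= maxSize) return;
        String anchor = tuple.get(tuple.size() - 1);
        for (int i = from; i < candidates.size(); i++) {
            Set<Integer> narrowed = new TreeSet<>(ids);
            narrowed.retainAll(pair(anchor, candidates.get(i), false));
            if (narrowed.isEmpty()) continue; // prune tuples that never co-occur
            tuple.add(candidates.get(i));
            extend(tuple, narrowed, candidates, i + 1, maxSize, out);
            tuple.remove(tuple.size() - 1);
        }
    }

    static TupleCounter sample() {
        TupleCounter tc = new TupleCounter();
        tc.add(1, "Proposal 201 has been accepted by the Chief today.");
        tc.add(2, "Proposal 214 and 221 are accepted, as per recent Chief decision");
        tc.add(3, "This proposal has been accepted by the Chief.");
        tc.add(4, "Both proposal 3 MazerNo and patch 4 have been accepted by the Chief.");
        tc.add(5, "Proposal 214, ValueMania, has been accepted by the Chief.");
        return tc;
    }

    public static void main(String[] args) {
        sample().count(Arrays.asList("Chief", "accepted", "Proposal"), 5)
                .forEach((tuple, n) -> System.out.println(String.join(" ", tuple) + "  " + n));
    }
}
```

Candidate words are tried in sorted order and each tuple is extended only with later candidates, so every combination is produced once. With the five example sentences, the seed tuple gets a count of 5, and the tuples that add "has", or "been" and "has", each get a count of 3. Without the `maxSize` limit the number of tuples grows very quickly, which is why the sketch also skips any extension whose sentence set is already empty.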
