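Since most (all?) PHP libraries that sanitize HTML, such as HTML Purifier, rely heavily on regular expressions, I thought that writing an HTML sanitizer that uses DOMDocument and related classes would be a worthwhile experiment. I'm at a very early stage, but so far the project shows some promise.
My idea revolves around a class that uses DOMDocument to traverse all the nodes in the supplied markup, compares them against a whitelist, and removes anything that is not on it. (The first implementation is very basic and only removes nodes based on their type, but I hope to make it more sophisticated in future and analyse node attributes, whether links address items in a different domain, and so on.)
My question is: how do I traverse the DOM tree? As I understand it, DOM* objects have a childNodes property, so would I need to recurse over the whole tree? Also, early experiments with DOMNodeLists have shown that you need to be very careful about the order in which you remove things, otherwise you may leave items behind or trigger exceptions.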

    /**
     * Recursively remove elements from the DOM that aren't whitelisted
     * @param DOMNode $elem
     * @return array List of elements removed from the DOM
     * @throws Exception If removal of a node failed then an exception is thrown
     */
    private function cleanNodes (DOMNode $elem)
    {
        $removed = array ();
        if (in_array ($elem -> nodeName, $this -> whiteList)) {
            if ($elem -> hasChildNodes ()) {
                /*
                 * Iterate over the element's children. The reason we go backwards is because
                 * going forwards will cause indexes to change when elements get removed
                 */
                $children = $elem -> childNodes;
                $index = $children -> length;
                while (--$index >= 0) {
                    $removed = array_merge ($removed, $this -> cleanNodes ($children -> item ($index)));
                }
            }
        } else {
            // The element is not on the whitelist, so remove it
            if ($elem -> parentNode -> removeChild ($elem)) {
                $removed [] = $elem;
            } else {
                throw new Exception ('Failed to remove node from DOM');
            }
        }
        return ($removed);
    }
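To try the idea outside a class, here is a minimal standalone sketch of the same backwards walk. The function name removeUnlisted, the sample markup and the whitelist are all made up for illustration; they are not part of the code above. Note that text nodes report '#text' as their nodeName, so the whitelist in this sketch has to include it or all text content would be stripped too.

```php
<?php
// Hypothetical standalone version of the whitelist walk, for illustration only.
function removeUnlisted(DOMNode $node, array $whiteList): array
{
    $removed = [];
    if (!in_array($node->nodeName, $whiteList, true)) {
        // Not whitelisted: detach the node together with its whole subtree.
        $removed[] = $node->parentNode->removeChild($node);
        return $removed;
    }
    if ($node->hasChildNodes()) {
        // Walk the live childNodes list backwards so that removals do not
        // shift the indexes of the children that are still to be visited.
        for ($i = $node->childNodes->length - 1; $i >= 0; $i--) {
            $child = $node->childNodes->item($i);
            $removed = array_merge($removed, removeUnlisted($child, $whiteList));
        }
    }
    return $removed;
}

$doc = new DOMDocument();
$doc->loadHTML('<div><p>Hello <b>world</b></p><script>alert(1)</script></div>');

// Assumed whitelist; '#text' is the nodeName reported by text nodes.
$whiteList = ['div', 'p', 'b', '#text'];
$root = $doc->getElementsByTagName('div')->item(0);
$removed = removeUnlisted($root, $whiteList);

echo $doc->saveHTML($root), "\n";
echo count($removed), " node(s) removed\n";
```

Starting from an element that has a parent (here the div) avoids calling removeChild on a node whose parentNode is null.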
> https://github.com/salathe/spl-examples/wiki/RecursiveDOMIterator

    class RecursiveDOMIterator implements RecursiveIterator {
        /**
         * Current position in the DOMNodeList
         * @var Integer
         */
        protected $_position;

        /**
         * The DOMNodeList with all children to iterate over
         * @var DOMNodeList
         */
        protected $_nodeList;

        /**
         * @param DOMNode $domNode
         * @return void
         */
        public function __construct(DOMNode $domNode) {
            $this->_position = 0;
            $this->_nodeList = $domNode->childNodes;
        }

        /**
         * Returns the current DOMNode
         * @return DOMNode
         */
        public function current() {
            return $this->_nodeList->item($this->_position);
        }

        /**
         * Returns an iterator for the current iterator entry
         * @return RecursiveDOMIterator
         */
        public function getChildren() {
            return new self($this->current());
        }

        /**
         * Returns if an iterator can be created for the current entry.
         * @return Boolean
         */
        public function hasChildren() {
            return $this->current()->hasChildNodes();
        }

        /**
         * Returns the current position
         * @return Integer
         */
        public function key() {
            return $this->_position;
        }

        /**
         * Moves the current position to the next element.
         * @return void
         */
        public function next() {
            $this->_position++;
        }

        /**
         * Rewind the Iterator to the first element
         * @return void
         */
        public function rewind() {
            $this->_position = 0;
        }

        /**
         * Checks if current position is valid
         * @return Boolean
         */
        public function valid() {
            return $this->_position < $this->_nodeList->length;
        }
    }
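A RecursiveIterator like the one above is normally wrapped in SPL's RecursiveIteratorIterator (with the RecursiveIteratorIterator::SELF_FIRST mode if you want parents before their children). If you only need the traversal itself, a generator can do the same depth-first walk with less code. The sketch below is that alternative, not the linked class; walkDom is a hypothetical name, and yield from assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: yields every descendant of $node depth-first,
// parents before their children.
function walkDom(DOMNode $node)
{
    if (!$node->hasChildNodes()) {
        return;
    }
    // Copy the live childNodes list into a plain array first, so the walk
    // over this level is not affected by later changes to the list.
    foreach (iterator_to_array($node->childNodes) as $child) {
        yield $child;
        yield from walkDom($child);
    }
}

$doc = new DOMDocument();
$doc->loadXML('<root><a><b/></a><c>text</c></root>');

$names = [];
foreach (walkDom($doc) as $node) {
    $names[] = $node->nodeName;
}
echo implode(' ', $names), "\n"; // root a b c #text
```

Consume the generator with foreach rather than iterator_to_array with preserved keys, because the keys coming out of nested yield from calls repeat.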
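Another thing to be aware of is that any operation on the DOMDocument immediately affects any live DOMNodeList you are holding, such as childNodes or the result of getElementsByTagName(), and can cause nodes to be skipped while you manipulate them. Lists obtained from XPath queries deserve the same caution, although as far as I can tell they behave as snapshots rather than live lists. For an example, see DOMNode replacement with PHP's DOM classes.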
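The skipping is easy to reproduce with a live list such as the one returned by getElementsByTagName(). The sketch below uses made-up sample markup: it first removes items while indexing forwards over the live list, then repeats the removal over a snapshot taken with iterator_to_array().

```php
<?php
$html = '<ul><li>a</li><li>b</li><li>c</li><li>d</li></ul>';

// 1) Index forwards over the live list while removing from it.
$doc = new DOMDocument();
$doc->loadHTML($html);
$items = $doc->getElementsByTagName('li'); // live: shrinks on every removal
for ($i = 0; $i < $items->length; $i++) {
    $li = $items->item($i);
    $li->parentNode->removeChild($li);
}
// Removing index 0 shifts every remaining item down by one,
// so every second <li> is never visited.
$leftLive = $doc->getElementsByTagName('li')->length;

// 2) Copy the list into a plain array first, then remove.
$doc = new DOMDocument();
$doc->loadHTML($html);
foreach (iterator_to_array($doc->getElementsByTagName('li')) as $li) {
    $li->parentNode->removeChild($li);
}
$leftSnapshot = $doc->getElementsByTagName('li')->length;

echo "left after live loop: $leftLive\n";         // 2
echo "left after snapshot loop: $leftSnapshot\n"; // 0
```

Iterating backwards, as in the question's cleanNodes method, is the other common way around the problem.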