Sanitizing filter broken in 0.90 #72

@gsnedders

http://code.google.com/p/html5lib/issues/detail?id=162

Reported by gdr@garethrees.org, Oct 10, 2010

DESCRIPTION

Consider the following interaction with html5lib 0.90:

    >>> from html5lib import html5parser, serializer, treebuilders, treewalkers
    >>> p = html5parser.HTMLParser(tree = treebuilders.getTreeBuilder('dom'))
    >>> dom = p.parse("""<body onload="sucker()">""") 
    >>> s = serializer.htmlserializer.HTMLSerializer(sanitize = True)
    >>> ''.join(s.serialize(treewalkers.getTreeWalker('dom')(dom)))
    u'<body onload=sucker()>'

This is clearly incorrect: the onload attribute should have been removed by the sanitizer during the serialization.

ANALYSIS

The problem is that there are two sanitizers: a tokenizing sanitizer in html5lib.sanitizer, and a sanitizing filter in html5lib.filters.sanitizer. To avoid duplication of code, both sanitizers inherit from the class HTMLSanitizerMixin and both call that class's sanitize_token method.

Unfortunately, the format of tokens differs between tokenization and filtering. During tokenization, a token looks like this:

    >>> from html5lib import tokenizer
    >>> next(iter(tokenizer.HTMLTokenizer("""<body onload="sucker()">""")))
    {'selfClosing': False, 'data': [[u'onload', u'sucker()']], 'type': 3, 'name': u'body', 'selfClosingAcknowledged': False}

But during filtering, tokens look like this:

    >>> list(iter(treewalkers.getTreeWalker('dom')(dom)))[3]
    {'namespace': u'http://www.w3.org/1999/xhtml', 'type': 'StartTag', 'name': u'body', 'data': [(u'onload', u'sucker()')]}

When the sanitizing filter passes its token to the sanitize_token method of HTMLSanitizerMixin, the token comes back unchanged, because sanitize_token expects 'type' to be an integer (such as 3), not a string (such as 'StartTag').
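
The silent pass-through can be reproduced with a simplified stand-in for sanitize_token. This is only a sketch, not html5lib's actual code: the value 3 for a start tag is taken from the tokenizer output above, while the function body, the rule of dropping attributes whose names start with "on", and the variable names are assumptions made for illustration.

```python
# Simplified model of the mismatch; NOT html5lib's real implementation.
START_TAG = 3  # integer code seen in the tokenizer output above


def sanitize_token(token):
    """Drop on* event-handler attributes from integer-typed start tags."""
    if token["type"] == START_TAG:
        kept = [(name, value) for name, value in token["data"]
                if not name.startswith("on")]
        return dict(token, data=kept)
    # Any other 'type' value, including the string 'StartTag', falls
    # through to here and is returned unchanged.
    return token


# Token shaped like the tokenizer's output (integer type).
tokenizer_token = {"type": 3, "name": "body",
                   "data": [["onload", "sucker()"]]}
# Token shaped like the tree walker's output (string type).
walker_token = {"type": "StartTag", "name": "body",
                "data": [("onload", "sucker()")]}

print(sanitize_token(tokenizer_token)["data"])  # attribute removed
print(sanitize_token(walker_token)["data"])     # attribute survives
```

The comparison against an integer never matches a string, so no error is raised and the filter appears to work while doing nothing.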

OBSERVATION

Having two very similar but subtly different data formats for the same data type is dangerous: how many other incompatibilities are there?
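
One way to make such incompatibilities visible, rather than silent, would be to normalize tokens at the boundary and fail loudly on anything unrecognised. The following is a hypothetical sketch: normalize_type and TYPE_CODES are invented names, and the table contains only the one pairing this report confirms ('StartTag' and 3), so it would need to be completed before real use.

```python
# Hypothetical adapter; names and structure are assumptions, not html5lib API.
# Only the pairing confirmed by the outputs in this report is listed.
TYPE_CODES = {"StartTag": 3}


def normalize_type(token):
    """Return a token whose 'type' is an integer code, or raise."""
    kind = token["type"]
    if isinstance(kind, int):
        return token  # already in the tokenizer's format
    if kind in TYPE_CODES:
        return dict(token, type=TYPE_CODES[kind])
    # Refuse to guess: an unknown format is an error, not a no-op.
    raise ValueError("unrecognised token type: %r" % (kind,))


walker_token = {"type": "StartTag", "name": "body",
                "data": [("onload", "sucker()")]}
print(normalize_type(walker_token)["type"])
```

The design choice here is that an unexpected format raises an exception instead of letting the token through unsanitized.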

WORKAROUND

I am working around this problem as follows. When I need to sanitize a DOM tree, instead of applying the sanitizing filter I do this:

  1. Serialize the DOM to HTML without sanitization.
  2. Re-parse the HTML from step 1, using the sanitizing tokenizer.
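
The two steps above can be sketched as a small helper. To avoid guessing at constructor arguments that this report does not show, the sketch takes both steps as callables: sanitize_by_roundtrip, serialize_unsanitized and parse_with_sanitizer are hypothetical names, and the stand-in functions exist only so the example runs without html5lib installed. With html5lib 0.90, the first callable would presumably wrap the serializer and tree walker from the session above (without sanitize = True), and the second a parser set up to use the sanitizing tokenizer.

```python
def sanitize_by_roundtrip(dom, serialize_unsanitized, parse_with_sanitizer):
    """Apply the two-step workaround to an already-parsed tree."""
    html = serialize_unsanitized(dom)   # step 1: plain serialization
    return parse_with_sanitizer(html)   # step 2: sanitizing re-parse


# Stand-ins for illustration only; they are not html5lib calls.
def fake_serialize(dom):
    return dom["html"]


def fake_parse_with_sanitizer(html):
    return {"html": html.replace(' onload="sucker()"', "")}


result = sanitize_by_roundtrip({"html": '<body onload="sucker()">'},
                               fake_serialize, fake_parse_with_sanitizer)
print(result["html"])
```

The cost of this approach is an extra serialize and parse pass, but it goes through the tokenizing sanitizer, which receives tokens in the format sanitize_token expects.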
