======
README
======

The util.py module provides methods for cleaning up HTML content. Let's test
what these methods can do for us. In the simplest case, as the first example
below shows, the clean method only removes the html and body tags:

  >>> from pprint import pprint
  >>> from p01.editor import util


clean
-----

A simple clean method call with default allowed tags looks like:

  >>> raw = "<html><body><div>Here comes content</div></body></html>"
  >>> print util.clean(raw)
  <div>Here comes content</div>

The method will always convert <br> tags to <br />:

  >>> raw = "<html><body><div>Here<br> comes</div></body></html>"
  >>> print util.clean(raw)
  <div>Here<br /> comes</div>

and bold <b> tags are always converted to <strong> tags:

  >>> raw = "<html><body><div><b>Here</b> comes content</div></body></html>"
  >>> print util.clean(raw)
  <div><strong>Here</strong> comes content</div>

The id attribute is allowed by default:

  >>> raw = '<html><body><div><div id="foo">Here</div> comes</div></body></html>'
  >>> print util.clean(raw)
  <div><div id="foo">Here</div> comes</div>

but any style attribute gets removed:

  >>> raw = '<html><body><div><div style="font-size:11px">Here</div> comes</div></body></html>'
  >>> print util.clean(raw)
  <div><div>Here</div> comes</div>

Malformed markup, such as an unclosed tag, also gets cleaned up:

  >>> raw = "<html><body><div><b>Here</div></body></html>"
  >>> print util.clean(raw)
  <div><strong>Here</strong></div>

And of course <a> tags get rendered with their relevant attributes, and the
query arguments get escaped:

  >>> url = 'http://sv1.refline.ch/100000/0004/index.html?cid=<CID>&amp;lang=<LANG>'
  >>> link = '<A class="apply" href="%s" target="_top">link</a>' % url
  >>> raw = '<html><body><div>%s</div></body></html>' % link
  >>> print util.clean(raw)
  <div><a class="apply" href="http://sv1.refline.ch/100000/0004/index.html?cid=<CID>&amp;lang=<LANG>" target="_top">link</a></div>

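The normalizations shown in this section can be approximated with the standard
library. The following is only a rough sketch written for Python 3 and
``html.parser``; it is not the p01.editor implementation. The names
``SketchCleaner``, ``sketch_clean``, ``RENAME``, ``VOID`` and ``ALLOWED`` are
hypothetical and exist only for illustration:

```python
from html import escape
from html.parser import HTMLParser

# Assumptions for this sketch only (not taken from p01.editor).
RENAME = {'b': 'strong'}   # tags rewritten to another name
VOID = {'br'}              # tags emitted in self-closing form
ALLOWED = {
    'div': ['id', 'class', 'title'],
    'strong': ['id', 'class', 'title'],
    'br': [],
}


class SketchCleaner(HTMLParser):
    """Drop tags and attributes that are not whitelisted, keep the text."""

    def __init__(self):
        HTMLParser.__init__(self, convert_charrefs=True)
        self.out = []        # collected output fragments
        self.open_tags = []  # stack of tags we emitted and still have to close

    def handle_starttag(self, tag, attrs):
        tag = RENAME.get(tag, tag)
        if tag not in ALLOWED:
            return  # e.g. <html> and <body> are dropped, their content stays
        kept = ''.join(
            ' %s="%s"' % (name, escape(value or '', quote=True))
            for name, value in attrs if name in ALLOWED[tag])
        if tag in VOID:
            self.out.append('<%s%s />' % (tag, kept))
        else:
            self.out.append('<%s%s>' % (tag, kept))
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        tag = RENAME.get(tag, tag)
        if tag in VOID or tag not in self.open_tags:
            return
        # Also close tags left open inside this one, e.g. "<div><b>Here</div>".
        while self.open_tags:
            current = self.open_tags.pop()
            self.out.append('</%s>' % current)
            if current == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def sketch_clean(raw):
    """Return ``raw`` with only the whitelisted tags and attributes."""
    parser = SketchCleaner()
    parser.feed(raw)
    parser.close()
    # Close anything the input never closed.
    while parser.open_tags:
        parser.out.append('</%s>' % parser.open_tags.pop())
    return ''.join(parser.out)
```

Under these assumptions, ``sketch_clean("<html><body><div><b>Here</div></body></html>")``
returns ``<div><strong>Here</strong></div>``. The sketch makes no attempt to
reproduce how util.clean treats links or the <CID> and <LANG> markers.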

simpleHTML
----------

We provide some cleanup methods which use a predefined tag white list. Let's
test the simpleHTML method which only uses a small set of HTML tags. First check
the tag white list:

  >>> pprint(util.ALLOWED_TAGS)
  ['a',
   'abbr',
   'acronym',
   'b',
   'br',
   'blockquote',
   'code',
   'div',
   'em',
   'i',
   'li',
   'ol',
   'p',
   'span',
   'strong',
   'ul']

and our default allowed attributes:

  >>> pprint(util.ALLOWED_ATTRIBUTES)
  {'a': ['href', 'target', 'id', 'class', 'name'],
   'abbr': ['id', 'class', 'title'],
   'acronym': ['id', 'class', 'title'],
   'b': ['id', 'class', 'title'],
   'blockquote': ['id', 'class', 'title'],
   'br': [],
   'code': ['id', 'class', 'title'],
   'div': ['id', 'class', 'title'],
   'em': ['id', 'class', 'title'],
   'i': ['id', 'class', 'title'],
   'li': ['id', 'class', 'title'],
   'ol': ['id', 'class', 'title'],
   'p': ['id', 'class', 'title'],
   'span': ['id', 'class', 'title'],
   'strong': ['id', 'class', 'title'],
   'ul': ['id', 'class', 'title']}

simpleHTML - <div>:

  >>> raw = "<html><body><div><div>Here</div> comes</div></body></html>"
  >>> print util.simpleHTML(raw)
  <div><div>Here</div> comes</div>

simpleHTML - <p>:

  >>> raw = "<html><body><div><p>Here</p> comes</div></body></html>"
  >>> print util.simpleHTML(raw)
  <div><p>Here</p> comes</div>

simpleHTML - <strong>:

  >>> raw = "<html><body><div><strong>Here</strong> comes</div></body></html>"
  >>> print util.simpleHTML(raw)
  <div><strong>Here</strong> comes</div>

simpleHTML - <em>:

  >>> raw = "<html><body><div><em>Here</em> comes</div></body></html>"
  >>> print util.simpleHTML(raw)
  <div><em>Here</em> comes</div>

simpleHTML - <ul>/<li>:

  >>> raw = "<html><body><div><ul><li>Here</li></ul> comes</div></body></html>"
  >>> print util.simpleHTML(raw)
  <div><ul><li>Here</li></ul> comes</div>

simpleHTML - <ol>/<li>:

  >>> raw = "<html><body><div><ol><li>Here</li></ol> comes</div></body></html>"
  >>> print util.simpleHTML(raw)
  <div><ol><li>Here</li></ol> comes</div>

simpleHTML - <br> -> <br />:

  >>> raw = "<html><body><div>Here<br> comes</div></body></html>"
  >>> print util.simpleHTML(raw)
  <div>Here<br /> comes</div>

simpleHTML - <b></b> -> <strong></strong>:

  >>> raw = "<html><body><div><b>Here</b> comes content</div></body></html>"
  >>> print util.simpleHTML(raw)
  <div><strong>Here</strong> comes content</div>

simpleHTML - id attribute doesn't get removed:

  >>> raw = '<html><body><div><div id="foo">Here</div> comes</div></body></html>'
  >>> print util.simpleHTML(raw)
  <div><div id="foo">Here</div> comes</div>

simpleHTML - class attribute doesn't get removed:

  >>> raw = '<html><body><div><div class="foo">Here</div> comes</div></body></html>'
  >>> print util.simpleHTML(raw)
  <div><div class="foo">Here</div> comes</div>

simpleHTML - style attribute gets removed:

  >>> raw = '<html><body><div><div style="font-size:11px">Here</div> comes</div></body></html>'
  >>> print util.simpleHTML(raw)
  <div><div>Here</div> comes</div>

simpleHTML - href with our special cid and language markers:

  >>> url = 'http://sv1.refline.ch/100000/0004/index.html?cid=<CID>&amp;lang=<LANG>'
  >>> link = '<A class="apply" href="%s" target="_top">link</a>' % url
  >>> raw = '<html><body><div>%s</div></body></html>' % link
  >>> print util.simpleHTML(raw)
  <div><a class="apply" href="http://sv1.refline.ch/100000/0004/index.html?cid=<CID>&amp;lang=<LANG>" target="_top">link</a></div>

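The per-tag attribute white list shown earlier maps a tag name to the attribute
names it may carry. As a minimal sketch of how such a mapping can be applied,
assuming a hypothetical helper ``filter_attributes`` that is not part of
p01.editor, and using an abbreviated copy of the mapping named ``ALLOWED``:

```python
# Abbreviated copy of the white list, for illustration only.
ALLOWED = {
    'a': ['href', 'target', 'id', 'class', 'name'],
    'div': ['id', 'class', 'title'],
    'br': [],
}


def filter_attributes(tag, attrs, allowed=ALLOWED):
    """Return the (name, value) pairs from ``attrs`` permitted for ``tag``.

    Tags missing from ``allowed`` keep no attributes at all.
    """
    permitted = allowed.get(tag.lower(), [])
    return [(name, value) for name, value in attrs
            if name.lower() in permitted]
```

With this helper, ``filter_attributes('div', [('id', 'foo'), ('style', 'font-size:11px')])``
keeps only the id pair, which mirrors the id and style examples above.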